Hive

This page summarizes Hive, which provides a declarative query language on top of MapReduce.

Homepage : http://hive.apache.org/

HiveQL

Download : http://hive.apache.org/releases.html, http://www.apache.org/dyn/closer.cgi/hive/
License : Apache 2.0
Platform : Java

Hive overview

A data warehousing solution built on Hadoop
Developed at Facebook and released as open source
Uses HiveQL

Building the installation package from source

svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
cd hive-trunk
ant package
ls -alF build/dist/

Installing Hive on CentOS

Prerequisites

Pig 0.11.1
Hadoop 1.1.2
Java 1.7.0_19
CentOS 6.4, 64 bits

Create a hive database in MySQL, then create the Hive metastore tables in it.

mysql -uhive -p hive
    source /appl/hive/src/metastore/scripts/upgrade/mysql/hive-schema-0.10.0.mysql.sql
    show tables;
    exit;

Installation

Download Hive and extract it into the /appl/hive folder.

wget http://apache.mirror.cdnetworks.com/hive/hive-0.12.0/hive-0.12.0.tar.gz
tar zxvf hive-0.12.0.tar.gz
chown -R root:root hive-0.12.0
mv hive-0.12.0 /appl/hive

//--- Copy the JDBC driver
cp /cloudnas/install/mysql-connector-java-5.1.25-bin.jar /appl/hive/lib

vi .bashrc

export HIVE_HOME=/appl/hive
export PATH=$PATH:$HIVE_HOME/bin

Create the HDFS directories used by Hive

hadoop dfs -mkdir /tmp
hadoop dfs -mkdir /user/hive/warehouse
hadoop dfs -chmod g+w /tmp
hadoop dfs -chmod g+w /user/hive/warehouse

vi /appl/hive/conf/hive-site.xml

fs.default.name : NameNode connection information

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 

<configuration>
 <property>
   <name>fs.default.name</name>
   <value>hdfs://cloud001.cloudserver.com:9000</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/hive?useUnicode=true&amp;characterEncoding=UTF-8</value>
 </property>
 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>org.gjt.mm.mysql.Driver</value>
 </property>
 <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hive</value>
 </property>
 <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>???</value>
 </property>
 <property>
   <name>datanucleus.autoCreateSchema</name>
   <value>false</value>
 </property>
 <property>
   <name>datanucleus.fixedDatastore</name>
   <value>true</value>
 </property>
</configuration>

Verifying the service

start-all.sh                 //--- Hadoop must be running first.
hive
    show tables;
    exit;
hive --help

References

Installing Hive and setting up the environment, 2013.1

Hive Architecture

HiveQL

Hive CLI basics

Starting the Hive CLI

hive                        //--- same as hive --service cli

Configuration file executed first when the hive command starts

~/.hiverc

Variable namespaces

hivevar (default, may be omitted), hiveconf, system, env (read-only)

set;                        //--- show all variables
set env:HIVE_HOME;          //--- show the HIVE_HOME environment variable
set hivevar:foo=~;          //--- assign a value to a variable
${variable}                 //--- how to reference a variable on the command line
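
The ${...} references described above can be pictured with a small Java sketch. It is not Hive's own code: the class name VarSubstitution, the nested-map layout, and the choice to leave unknown references untouched are assumptions made only for illustration. The one behaviour taken from the notes above is that a reference without a namespace is looked up in hivevar.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of ${namespace:name} substitution; not Hive's own code.
final class VarSubstitution {
    private static final Pattern REF = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${ns:name} reference; a reference without a namespace uses hivevar.
    static String substitute(String text, Map<String, Map<String, String>> namespaces) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ref = m.group(1);
            int colon = ref.indexOf(':');
            String ns = colon < 0 ? "hivevar" : ref.substring(0, colon);
            String name = colon < 0 ? ref : ref.substring(colon + 1);
            Map<String, String> vars = namespaces.get(ns);
            String value = vars == null ? null : vars.get(name);
            // Assumption of this sketch: unknown references are left as they are.
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? m.group() : value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> namespaces = new HashMap<>();
        Map<String, String> hivevar = new HashMap<>();
        hivevar.put("foo", "bar");
        namespaces.put("hivevar", hivevar);
        System.out.println(substitute("select * from t where c = '${hivevar:foo}' or d = '${foo}'", namespaces));
    }
}
```

Both ${hivevar:foo} and ${foo} resolve to the same value here because hivevar is treated as the default namespace.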

HiveQL : ~.hql, ~.q

hive -f ~.hql
hive 
    source ~.hql;
    exit;

Data

Data types available in Hive tables

string
tinyint, smallint, int, bigint
float, double
boolean, timestamp, binary
struct, map, array : array<string>, map<string, int>, struct<~:string, ~:int>

item = struct('~', '~'); //--- item.name
item = map('name1', 'value1', 'name2', 'value2'); //--- item['name1']
item = array('val1', 'val2'); //--- item[0]; indexes are 0, 1, 2, ...

Data layout

\n : record separator
^A : field separator (\001), Ctrl+A
^B : separator between items in a struct, map, or array (\002), Ctrl+B
^C : separator between a map key and its value (\003), Ctrl+C
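
As a rough illustration of how the default delimiters above nest, the following Java sketch splits one text record into fields, an array, and a map. It is not Hive's SerDe code; the class DefaultDelimiters and the sample record are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the default text delimiters; not Hive's SerDe code.
final class DefaultDelimiters {
    static final String FIELD = "\u0001";      // ^A separates fields
    static final String ITEM = "\u0002";       // ^B separates collection items
    static final String KEY_VALUE = "\u0003";  // ^C separates a map key from its value

    // Splits one record (one line) into its raw field strings.
    static List<String> fields(String record) {
        return Arrays.asList(record.split(FIELD, -1));
    }

    // Splits an array-typed field into its items.
    static List<String> array(String field) {
        return Arrays.asList(field.split(ITEM, -1));
    }

    // Splits a map-typed field into key/value pairs, keeping their order.
    static Map<String, String> map(String field) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : field.split(ITEM, -1)) {
            String[] kv = entry.split(KEY_VALUE, 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // A record with a string field, an array<string> field, and a map<string,string> field.
        String record = "kim" + FIELD + "a" + ITEM + "b" + FIELD
                + "k1" + KEY_VALUE + "v1" + ITEM + "k2" + KEY_VALUE + "v2";
        List<String> f = fields(record);
        System.out.println(f.get(0));
        System.out.println(array(f.get(1)));
        System.out.println(map(f.get(2)));
    }
}
```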

HiveQL DDL (Data Definition Language)

Database management

default : name of the database provided by default
The storage location is set by hive.metastore.warehouse.dir in hive-default.xml
Default hive.metastore.warehouse.dir : /user/hive/warehouse

use ~;                                 //--- select the database to use
create database ~                      //--- creates the /user/hive/warehouse/~.db folder by default
       comment '~'
       location '/user/test/warehouse'
       with dbproperties ('name1' = 'value1', 'name2' = 'value2');
show databases [like '~*'];
describe database ~;
set hive.cli.print.current.db=true;    //--- show the current database in the prompt
drop database if exists ~ [cascade];   //--- cascade : also drops every table in the database

Managed table management

Full table name : dbName.tableName, stored under /user/hive/warehouse/dbName.db/tableName

create table [if not exists] [~.]~ (
    ~ string [comment '~'],
    ~ int
    )
    comment '~'
    row format delimited
        fields terminated by '\001'
        collection items terminated by '\002'
        map keys terminated by '\003'
        lines terminated by '\n'
    stored as textfile
    location '/user/hive/warehouse/~.db/~'
    tblproperties ('name1' = 'value1', 'name2' = 'value2');
create table ~ like ~;                    //--- create a table by copying another table's schema
create table ~ 
    as select ~
         from ~
        group by ~
        order by ~;
show tables [in ~] ['~'];
describe [formatted | extended] ~;
drop table [if exists] ~;

External table management

Dropping an external table does not delete its data.

create external table ~ (
    ~
    )
    location '/data/aaa';

Partitioned table management

Data is stored under /user/hive/warehouse/~.db/~/p1=~/p2=~
p1 and p2 can be used like ordinary fields
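
The directory layout above can be sketched in Java as follows. This is only an illustration of the naming pattern: the class PartitionPath and the sample names (sales, orders, 2013, kr) are hypothetical, values are not escaped, and the default database (which has no .db folder) is not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the partition directory naming pattern.
final class PartitionPath {
    // Builds <warehouse>/<db>.db/<table>/<col>=<value>/... in partition-column order.
    static String of(String warehouse, String db, String table, Map<String, String> partition) {
        StringBuilder path = new StringBuilder(warehouse)
                .append('/').append(db).append(".db/").append(table);
        for (Map.Entry<String, String> e : partition.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps p1 before p2, matching the partitioned by (...) order.
        Map<String, String> partition = new LinkedHashMap<>();
        partition.put("p1", "2013");
        partition.put("p2", "kr");
        System.out.println(of("/user/hive/warehouse", "sales", "orders", partition));
    }
}
```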

//--- strict : a query on a partitioned table must filter on a partition field in its where clause
//--- nonstrict : no such restriction
set hive.mapred.mode=strict;
create table ~ (
    ~
    )
    partitioned by (p1 string, p2 string);
show partitions ~ [partition(p1='~')];

alter table ~ 
      add [if not exists] partition(~=~)    //--- add a partition to the table and bind it to the data location
      location '~';
alter table ~
      drop [if exists] partition(~);

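HiveQL DML (Data Manipulation Language)

Loading and saving data

local : if given, the data is copied from the local file system; if omitted, the data is moved within HDFS
overwrite : if given, every file in the target folder is deleted before loading; if omitted, the data is appended

load data [local] inpath '~'         //--- inpath names a file or a folder.
     [overwrite] into table ~
     partition (~=~, ~=~);

//--- Save data from a Hive table to external files
insert overwrite [local] directory '~'    //--- only overwrite is available when writing to a directory
       select ~;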

Table CRUD : insert

//--- Using overwrite instead of into deletes the existing data before adding the new data.
insert into table ~
       partition (~=~, ~=~)
       select * from ~ where ~;

//--- Dynamic partitions
//--- hive.exec.dynamic.partition = false              //--- set to true to enable dynamic partitioning
//--- hive.exec.dynamic.partition.mode = strict        //--- nonstrict : every partition column may be assigned dynamically
//--- hive.exec.max.dynamic.partitions.pernode = 100   //--- maximum number of dynamic partitions per node
//--- hive.exec.max.dynamic.partitions = 1000          //--- maximum number of dynamic partitions one insert statement may create
//--- hive.exec.max.created.files = 100000             //--- maximum number of files one query may create
insert into table ~              //--- the last fields of the select are mapped to the partition fields when storing the data
       partition (~, ~)
       select ~, ~, ~, ~ from ~;
//--- Static partitions
from ~                           //--- reads the data once and applies several insert statements.
     insert into table ~
            partition (~)
            select * where ~
     insert into table ~
            partition (~)
            select * where ~;

Table CRUD : select

The where clause is evaluated after the joins have been applied.

select ~ as ~,                    //--- `aa.*` (in backquotes) : selects every field whose name starts with aa.
       case
           when condition then '~'
           else '~'
       end as ~
  from ~ as ~
       join ~ on ~ = ~           //--- place the largest table last
       left outer join ~ on ~    //--- returns the left-side records; null where the right side has no match
       left semi join ~ on ~     //--- returns the left-side records that satisfy the condition
       right outer join ~ on ~   //--- returns the right-side records; null where the left side has no match
       full outer join ~ on ~
 where ~ and ~ like '%aa%'       //--- str rlike ~, str regexp ~ : true if str matches the regular expression (~)
 group by ~
 having ~                        //--- filters on the results produced by group by
 order by ~                      //--- sorts the whole data set
 distribute by ~                 //--- complements sort by; rows are sent to reducers by ~
 sort by ~                       //--- sorts only within each node (reducer)
 cluster by ~                    //--- distribute by combined with sort by
 limit ~;

from ~
     select ~
      where ~;

select *                         //--- sample extraction; m = total number of buckets, n = bucket number to fetch (1, 2, ...)
  from ~ tablesample(bucket n out of m on rand()) newName; 
select *                         //--- hive.sample.seednumber = 7383
  from ~ tablesample(0.1 percent) newName;     //--- sample extraction using the seed number
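
The "bucket n out of m" idea can be sketched in Java: each row is assigned to one of m buckets, and only the rows in bucket n are kept. This is a simplified illustration, not Hive's implementation. The class BucketSample is hypothetical and it hashes the row value with Java's hashCode, whereas the query above uses rand() in place of a column hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of bucket sampling; not Hive's own hashing.
final class BucketSample {
    // Returns the 1-based bucket of a value when rows are spread over m buckets.
    static int bucketOf(Object value, int m) {
        return (value.hashCode() & Integer.MAX_VALUE) % m + 1;
    }

    // Keeps only the rows that fall into bucket n out of m.
    static <T> List<T> sample(List<T> rows, int n, int m) {
        List<T> out = new ArrayList<>();
        for (T row : rows) {
            if (bucketOf(row, m) == n) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        // An Integer hashes to its own value, so bucket 1 of 4 holds the multiples of 4.
        System.out.println(sample(rows, 1, 4));
    }
}
```

With m buckets of roughly equal size, each bucket holds about 1/m of the rows, which is why a larger m gives a smaller sample.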

union all : combines the results of two or more selects and returns them together.

select ~ from ~
union all
select ~ from ~;

View

create view [if not exists] ~ [(~, ~, ~)]
       comment '~'
       tblproperties (~)
       as select ~ from ~;
drop view [if exists] ~;

Functions

Kinds of functions

UDF : User-Defined Function
UDAF : User-Defined Aggregate Function
UDTF : User-Defined Table generating Function

Function

show functions;
describe function [extended] ~;

Aggregate functions

bigint count([distinct] ~)
double sum(~), avg(~), min(~), max(~)
double var_pop(~), var_samp(~)             //--- population variance / sample variance
double stddev_pop(~), stddev_samp(~)       //--- population standard deviation / sample standard deviation
double covar_pop(~, ~), covar_samp(~, ~)   //--- population covariance / sample covariance
double corr(~, ~)                          //--- correlation
double percentile(~, p), percentile_approx(~, p, NB)      //--- percentile; p is a double between 0 and 1, NB = 10000
array<double> percentile(~, array(p1, ...)), percentile_approx(~, array(p1, ...), NB)   //--- percentiles
array<struct {'x', 'y'}> histogram_numeric (~, NB)   //--- array of NB histogram bins; x = bin center, y = height
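
The difference between the _pop and _samp variants listed above is only the divisor: n for the population form, n - 1 for the sample form. A minimal Java sketch of the textbook formulas follows; the class name Variance is hypothetical, and the code makes no claim about how Hive computes these internally.

```java
// Hypothetical illustration of population vs. sample variance and standard deviation.
final class Variance {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sum of squared deviations from the mean.
    private static double squaredDeviations(double[] xs) {
        double m = mean(xs);
        double total = 0.0;
        for (double x : xs) {
            total += (x - m) * (x - m);
        }
        return total;
    }

    // Population variance: divide by n.
    static double varPop(double[] xs) {
        return squaredDeviations(xs) / xs.length;
    }

    // Sample variance: divide by n - 1 (needs at least two values).
    static double varSamp(double[] xs) {
        return squaredDeviations(xs) / (xs.length - 1);
    }

    static double stddevPop(double[] xs) {
        return Math.sqrt(varPop(xs));
    }

    static double stddevSamp(double[] xs) {
        return Math.sqrt(varSamp(xs));
    }

    public static void main(String[] args) {
        double[] xs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(varPop(xs) + " " + varSamp(xs));
    }
}
```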

Record functions

records explode(array), explode(map)       //--- turns an array or a map into records
records stack(n, col1, ..., colk)          //--- splits col1 ... colk into n records
tuple json_tuple(jsonStr, p1, ..., pn)
//--- partName : HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:keyName
tuple parse_url_tuple(url, partName1, ..., partNamen)
   select parse_url_tuple(url, 'HOST', 'PATH') as (host, path) from ~;
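
To show what the part names above refer to, the following Java sketch pulls the same pieces out of a URL with java.net.URI. It only approximates parse_url_tuple: the class UrlParts is hypothetical, only five part names are handled, and edge cases may differ from Hive.

```java
import java.net.URI;

// Hypothetical illustration of URL part names; only approximates parse_url_tuple.
final class UrlParts {
    // Returns the requested part of the URL, or null for part names not handled here.
    static String part(String url, String partName) {
        URI uri = URI.create(url);
        switch (partName) {
            case "HOST":
                return uri.getHost();
            case "PATH":
                return uri.getPath();
            case "QUERY":
                return uri.getQuery();
            case "REF":
                return uri.getFragment();
            case "PROTOCOL":
                return uri.getScheme();
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        String url = "http://hive.apache.org/releases.html?x=1#top";
        System.out.println(part(url, "HOST") + " " + part(url, "PATH"));
    }
}
```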

Conversion functions

cast (~ as float)                          //--- converts ~ to float
string regexp_replace(str, regex, replace), regexp_extract(str, regex, index)

//--- Date functions
string from_unixtime(int), to_date(string)
int year(str), month(str), day(str)
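
The date functions above can be mirrored with java.time to make their inputs and outputs concrete. This is a sketch under assumptions: the class DateFunctions is hypothetical, the time zone is fixed to UTC, and timestamps are assumed to look like yyyy-MM-dd HH:mm:ss.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of from_unixtime, to_date, year, month, and day.
final class DateFunctions {
    // Assumption of this sketch: timestamps are rendered in UTC.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Seconds since the epoch -> "yyyy-MM-dd HH:mm:ss".
    static String fromUnixtime(long seconds) {
        return FORMAT.format(Instant.ofEpochSecond(seconds));
    }

    // Keeps only the date part of a "yyyy-MM-dd ..." string.
    static String toDate(String timestamp) {
        return timestamp.substring(0, 10);
    }

    static int year(String date) {
        return LocalDate.parse(toDate(date)).getYear();
    }

    static int month(String date) {
        return LocalDate.parse(toDate(date)).getMonthValue();
    }

    static int day(String date) {
        return LocalDate.parse(toDate(date)).getDayOfMonth();
    }

    public static void main(String[] args) {
        String ts = fromUnixtime(0L);
        System.out.println(ts + " -> " + toDate(ts) + " " + year(ts) + " " + month(ts) + " " + day(ts));
    }
}
```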

Hive manual

Hive help

hive --help
hive --service cli --help

Hive Service

beeline
cli : default, command line interface
help
hiveserver : Thrift server
hiveserver2
hwi : Hive Web Interface
jar : runs an application in the Hive environment
lineage
metastore : service that runs the MetaStore outside Hive to support multiple clients
metatool
orcfiledump
rcfilecat : prints the contents of an RCFile

Using the hive cli

! Linux_shell_command
dfs -help;
dfs -ls /;                         //--- run an HDFS command
set hive.cli.print.header=true;    //--- show the table header

Running the hwi service

vi /appl/hive/conf/hive-site.xml

 <property>
   <name>hive.hwi.listen.host</name>
   <value>0.0.0.0</value>
 </property>
 <property>
   <name>hive.hwi.listen.port</name>
   <value>9999</value>
 </property>
 <property>
   <name>hive.hwi.war.file</name>
   <value>/lib/hive-hwi-0.11.0.war</value>
 </property>

Starting hwi

hive --service hwi

Check the service at http://localhost:9999/hwi/

Running the Thrift server

hive --service hiveserver &
netstat -an | grep LISTEN | grep tcp      //--- check the listening ports; port 10000 is used

Configuring Hive locking with ZooKeeper

vi /appl/hive/conf/hive-site.xml

 <property>
   <name>hive.zookeeper.quorum</name>
   <!-- With several ZooKeeper servers, list them separated by ",". -->
   <value>cloud001.cloudserver.com</value>
 </property>
 <property>
   <name>hive.support.concurrency</name>
   <value>true</value>
 </property>

Using locks in hive

show locks [extended];
lock table ~ exclusive;              //--- set an exclusive lock on the table
unlock table ~;

Hive developer manual

Data input and output

Textfile format

create table ~
stored as textfile;

Sequencefile format (a file made of key/value pairs; convenient when using compression)

create table ~
stored as sequencefile;

RCFile format (provides access by row and by column)

create table ~ (
    ~
    )
    row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        with serdeproperties ('~'='~')  
    stored as
        inputformat  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

Viewing an RCFile

hive --service rcfilecat /user/hive/warehouse/~/~

Storage handlers

create table ~ (                //--- create a Hive table
    key int, name string, price float
    )
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties("hbase.columns.mapping" = ":key,stock:val")
    tblproperties ("hbase.table.name" = "~");

create external table ~ (       //--- link to an existing HBase table
    ~
    )
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties("hbase.columns.mapping" = "cf1:val")
    tblproperties ("hbase.table.name" = "~");

Data processing in Hive

Input Format Object : splits the input data into records

org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat

SerDe (Serializer/Deserializer) : breaks a record into columns, or combines columns into a record

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Output Format Object : stores the records

org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Specifying the data processing classes in the table definition

create table ~ (
    ~
    )
    row format serde '~'              //--- fully qualified class name of the SerDe
        with serdeproperties ('~'='~')  
    stored as
        inputformat '~'               //--- fully qualified class name of the Input Format Object
        outputformat '~';             //--- fully qualified class name of the Output Format Object

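User-defined InputFormat

public class ~ implements InputFormat {
    //--- Divide the input into splits; numSplits is only a hint
    public InputSplit[] getSplits(JobConf jc, int numSplits) throws IOException {
        return new InputSplit[0];   //--- placeholder
    }
    //--- getRecordReader(InputSplit, JobConf, Reporter) must also be implemented
}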

User-defined functions

Writing and deploying a UDF

Creating a UDF

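package ~;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

@Description(name="~", ~)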

public class UDF~ extends UDF {
    public String evaluate(~) {
        return ~;
    }
}

public class UDF~ extends GenericUDF {
    private GenericUDFUtils.ReturnObjectInspectorResolver rtdata;
    private ObjectInspector[] args;

    //--- Validate the input arguments
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return (new GenericUDFUtils.ReturnObjectInspectorResolver(true)).get();
    }

    //--- Execute the function
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object rtVal = null;

        return rtVal;
    }

    //--- Show debugging information
    public String getDisplayString(String[] children) {
        return ~;
    }
}

Compile and build a jar file
Temporary registration in Hive

hive
    add jar ~.jar;
    create temporary function ~      //--- function name
           as '~';                   //--- fully qualified class name

Permanent registration in Hive
vi ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

registerUDF("~", ~.class, false);
registerGenericUDF("~", ~.class);
//--- Rebuild Hive (the hive-exec-*.jar file).

Macros

create temporary macro ~(~ string) macro_body;    //--- function_name(arguments)

Stream management

Kinds of streaming functions : map(~), transform(~), reduce(~)

Creating test data

vi /root/zztemp.txt

123     24
124     25

Writing a bash script for streaming

vi /root/zztemp.bash && chmod 755 /root/zztemp.bash

#!/bin/bash
# Copy each input line to standard output unchanged (tabs are preserved)
while IFS= read -r LINE
do
    echo "$LINE"
done

Registering the bash script in hive and testing it

Instead of zztemp.bash, a program already on Linux, such as /bin/cat, can be run directly.

hive
    create table zztemp (f1 int, f2 int)
           row format delimited fields terminated by '\t';
    load data local inpath 'file:///root/zztemp.txt' into table zztemp;
    add file file:///root/zztemp.bash;        //--- registered programs are removed when the work is finished.

    //--- Passes the f1 and f2 fields of the zztemp table to the standard input of zztemp.bash and reads back the result (newF1, newF2).
    select transform(f1, f2)
     using 'zztemp.bash' as (newF1 int, newF2 int)
      from zztemp;

User-defined hooks

Hook

PreHook
PostHook

User-defined index handler

https://cwiki.apache.org/confluence/display/Hive/IndexDev#CREATE_INDEX

Thrift Client

Connecting to Hive through Thrift

import org.apache.hadoop.hive.service.*;
import org.apache.thrift.protocol.*;
import org.apache.thrift.transport.*;

TTransport transport = new TSocket("localhost", 10000);
TProtocol protocol = new TBinaryProtocol(transport);   //--- the protocol wraps the transport
HiveClient client = new HiveClient(protocol);

transport.open();
client.getClusterStatus();
client.execute("~");
client.getSchema();
client.getQueryPlan();
client.fetchOne();                  //--- or fetchN(n), fetchAll()
transport.close();

References

RHive

Apache Hive tutorial (Korean translation), 2012.03

Querying JSON records via Hive, 2013.06

https://github.com/rcongiu/Hive-JSON-Serde/downloads

http://www.congiu.com/a-json-readwrite-serde-for-hive/

Performance benchmark tests
