Big Data Essentials Series (5): Hive Summary

企鹅号 editor

Published 2018-01-29 11:15:13


Included in the column: Big Data

Hive Summary

I. Essence

Hive provides a unified query and analysis layer: it lets you query, aggregate, and analyze data stored on HDFS by writing SQL statements.

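II. Four Key Characteristics**

• Hive does not store data itself. It relies entirely on HDFS and MapReduce, which gives it scalable storage and compute capacity

• Hive workloads are read-heavy and write-light; classic (non-transactional) tables do not support in-place updates or deletes of rows

• Hive does not define a dedicated data format; the format is specified by the user

• Hive is a SQL parsing engine that translates SQL statements into MR jobs

Example: WordCount written in Hive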
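The WordCount listing itself is not reproduced here. As a rough sketch, assuming a hypothetical table docs with a single string column line that holds one line of text per row, it could be written like this:

```sql
-- Hypothetical input table: one line of text per row.
CREATE TABLE IF NOT EXISTS docs (line STRING);

-- Split each line on spaces, emit one row per word, then count per word.
SELECT word, count(1) AS cnt
FROM (
  SELECT explode(split(line, ' ')) AS word
  FROM docs
) w
GROUP BY word
ORDER BY cnt DESC;
```

Hive compiles the GROUP BY and ORDER BY into MR jobs, which is the translation step described in the last bullet above.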

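III. HQL vs. SQL

IV. Hive Architecture

The Hive architecture can be divided into three layers. From top to bottom they are the user interface, statement translation, and data storage.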

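V. Creating Tables in Hive

1. Decide whether to create a managed (internal) table or an external table:

– create table

When the table is dropped, Hive deletes both the table's metadata and its data.

– create external table

When data is loaded into an external table, it is not moved into Hive's own warehouse directory. Dropping the table deletes only the table's metadata.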
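The difference can be sketched as follows. The table names, columns, and the HDFS path are hypothetical and only serve the illustration:

```sql
-- Managed (internal) table: Hive owns the files under its warehouse directory.
CREATE TABLE managed_log (usrid STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: Hive only records metadata that points at an existing path.
CREATE EXTERNAL TABLE external_log (usrid STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/external_log';  -- hypothetical HDFS directory

DROP TABLE managed_log;   -- removes the metadata and the data files
DROP TABLE external_log;  -- removes the metadata only; the files stay in place
```

An external table is usually the safer choice when other tools also read the same files.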

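2. Partition and Bucket

– A table can be split into partitions, much as a phone's photo album is divided by date into smaller sets of photos. This narrows the range a query has to scan and speeds up retrieval

– A partition can be further divided into buckets with "CLUSTERED BY", and the data within each bucket can be sorted with "SORTED BY". This improves the efficiency of some queries (for example map-side joins) and is commonly used for sampling: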

select * from student tablesample(bucket 1 out of 2 on id);
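For the sampling query above to read whole buckets, the table has to be bucketed on id. A minimal sketch of such a definition, where the column list and the partition column dt are assumptions:

```sql
-- Hypothetical definition of the student table sampled above.
CREATE TABLE student (id INT, name STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 2 BUCKETS;

-- Older Hive releases (before 2.0) need this so inserts honor the bucket layout.
SET hive.enforce.bucketing = true;
```

With two buckets, "bucket 1 out of 2 on id" reads only the first bucket instead of scanning the whole table.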

VI. Hive Optimization***

1. Map optimization

• Increase the number of map tasks:

• Reduce the number of map tasks (merge small files):

• Map-side aggregation (combiner):
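The settings for these three items are not shown. The properties below are ones commonly used for each purpose in MapReduce-based Hive. The values are illustrative assumptions, not recommendations from the original:

```sql
-- Increase the number of map tasks: a hint to Hadoop, not a hard limit.
SET mapred.map.tasks = 10;

-- Reduce the number of map tasks by combining small input files into larger splits.
SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size = 256000000;           -- upper bound per split, in bytes
SET mapred.min.split.size.per.node = 100000000;  -- combine leftovers on a node below this size
SET mapred.min.split.size.per.rack = 100000000;  -- combine leftovers on a rack below this size

-- Map-side aggregation, which plays the role of a combiner.
SET hive.map.aggr = true;
```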

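2. Reduce optimization

• Set the number of reduce tasks:

• Set the amount of data each reduce task handles:

• Avoid query statements that may launch a MapReduce job:

1) group by

2) order by (use distribute by and sort by instead)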
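The points above can be sketched as follows. The property values are illustrative assumptions, and the query borrows the p_t table (usrid, age) defined in the partition example of section VII:

```sql
-- Fix the number of reduce tasks explicitly...
SET mapred.reduce.tasks = 10;

-- ...or let Hive derive it from the input size.
SET hive.exec.reducers.bytes.per.reducer = 1000000000;  -- bytes handled per reducer
SET hive.exec.reducers.max = 999;                        -- upper bound on reducers

-- ORDER BY produces a global order by sending all rows through a single reducer.
SELECT usrid, age FROM p_t ORDER BY age;

-- DISTRIBUTE BY + SORT BY sorts within each reducer, so the work is spread out.
SELECT usrid, age FROM p_t DISTRIBUTE BY usrid SORT BY usrid, age;
```

The second query is not globally ordered. It fits when rows only need to be sorted within each distribution key.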

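3. Join optimization

• Join ON conditions: when the same column of a table is used in every join clause, Hive can run the joins as a single MR job: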

SELECT a.val, b.val, c.val

FROM a

JOIN b ON (a.key = b.key1)

JOIN c ON (a.key = c.key1)

• Join order:

/*+ STREAMTABLE(a) */: a is treated as the large table

/*+ MAPJOIN(b) */: b is treated as the small table

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val

FROM a

JOIN b ON (a.key = b.key1)

JOIN c ON (c.key = b.key1);

4. Data skew optimization

• Universal method:
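The setting for this item is not shown. A commonly cited switch for skewed GROUP BY keys is hive.groupby.skewindata, though whether that is the one the author had in mind is an assumption:

```sql
-- Two-stage aggregation: rows are first spread randomly across reducers for a
-- partial aggregate, then a second job groups by the real key.
SET hive.groupby.skewindata = true;

-- Map-side partial aggregation also reduces what a hot key sends to one reducer.
SET hive.map.aggr = true;
```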

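• Joining a small table with a large table:

Small_table join big_table

• The data contains many placeholder values ('-', '0') or NULLs (rewrite them to random keys so they spread across reducers):

on case when (x.uid = '-' or x.uid = '0' or x.uid is null)

then concat('dp_hive_search',rand()) else x.uid

end = f.user_id;

• Joining two large tables: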

select /*+ MAPJOIN(t12) */ *

from dw_log t11

join (

select /*+ MAPJOIN(t) */ t1.*

from (

select user_id from dw_log group by user_id

) t

join dw_user t1

on t.user_id=t1.user_id

) t12

on t11.user_id=t12.user_id

• count distinct when there are many special values (filter them out, then add 1 so they count as a single value):

select cast(count(distinct user_id)+1 as bigint) as user_cnt

from tab_a

where user_id is not null and user_id <> ''

• Trading space for time:

select day,

count(case when type='session' then 1 else null end) as session_cnt,

count(case when type='user' then 1 else null end) as user_cnt

from (

select day,session_id,type

from (

select day,session_id,'session' as type

from log

union all

select day, user_id, 'user' as type

from log

) t0

group by day,session_id,type

) t1

group by day

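5. Other optimizations

• Partition pruning:

Partition conditions in the WHERE clause take effect early, so there is no need to wrap them in a subquery first. You can join and group by directly.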
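As an illustration, reusing the p_t and w_a tables from the examples in section VII (that w_a has the columns usrid and sex is taken from the export example there):

```sql
-- The dt filter is applied when partitions are selected, so only the
-- dt='20170302' partition of p_t is read before the join.
SELECT a.usrid, b.sex, count(1) AS cnt
FROM p_t a
JOIN w_a b ON a.usrid = b.usrid
WHERE a.dt = '20170302'
GROUP BY a.usrid, b.sex;
```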

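• Cartesian product:

When a join has no ON condition, or an invalid one, Hive can use only one reducer to compute the Cartesian product.

• Union all:

Doing the union all first and then the join or group by can effectively reduce the number of MR passes. Even with multiple SELECTs, only one MR job is needed.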
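A minimal sketch of the union-first pattern, using the w_a and w_b tables from the join example in section VII:

```sql
-- Union the two sources first, then aggregate once over the combined rows.
SELECT usrid, count(1) AS cnt
FROM (
  SELECT usrid FROM w_a
  UNION ALL
  SELECT usrid FROM w_b
) u
GROUP BY usrid;
```

Aggregating each table separately and combining the results afterwards would need more jobs for the same answer.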

• Multi-insert & multi-group by:

From one base table, produce several different result sets along different dimensions in a single pass.

FROM from_statement

INSERT OVERWRITE TABLE table1 [PARTITION (partcol1=val1)] select_statement1 group by key1

INSERT OVERWRITE TABLE table2 [PARTITION(partcol2=val2 )] select_statement2 group by key2
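A concrete instance of the template might look like the following. The target tables uniq_cnt and ip_cnt are hypothetical, and ods_log with its columns is borrowed from the Multi-Count Distinct example:

```sql
-- ods_log is scanned once and feeds both inserts.
FROM ods_log
INSERT OVERWRITE TABLE uniq_cnt PARTITION (dt = '20170301')
  SELECT uniq_id, count(1) WHERE dt = '20170301' GROUP BY uniq_id
INSERT OVERWRITE TABLE ip_cnt PARTITION (dt = '20170301')
  SELECT ip, count(1) WHERE dt = '20170301' GROUP BY ip;
```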

• Automatic merge:

When the output files are smaller than a threshold, Hive launches an additional MR job to merge them.
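The properties that control this behavior are sketched below. The sizes are illustrative assumptions:

```sql
SET hive.merge.mapfiles = true;                -- merge outputs of map-only jobs
SET hive.merge.mapredfiles = true;             -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task = 256000000;      -- target size of merged files, in bytes
SET hive.merge.smallfiles.avgsize = 16000000;  -- merge when the average output file is smaller than this
```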

• Multi-Count Distinct:

select dt, count(distinct uniq_id), count(distinct ip)

from ods_log where dt=20170301 group by dt

• Parallel execution:
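The setting is not shown here. Parallel execution of independent stages is normally switched on with the properties below, and the thread count is an illustrative value:

```sql
SET hive.exec.parallel = true;             -- run independent stages concurrently
SET hive.exec.parallel.thread.number = 8;  -- maximum number of stages run at once
```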

VII. Hive Examples

1. Load local data and run simple statistics

load data [local] inpath "" overwrite into table a1;

2. Joining two tables

select a.*, b.* from w_a a join w_b b on a.usrid=b.usrid;

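3. UDF

• A UDF can be applied directly in a select statement to format the query results before they are output.

• Keep the following points in mind when writing a UDF:

– It must implement an evaluate function

– The evaluate function supports overloading

• The exported jar must be registered with add before the function can be used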
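On the Hive side, registering and calling a UDF can be sketched like this. The jar path, function name, and class name are placeholders, not names from the original:

```sql
-- Register the jar for this session (placeholder path).
ADD JAR /path/to/my_udf.jar;

-- Bind a function name to the class that implements evaluate().
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.udf.MyUpper';

-- Apply it in a select like any built-in function.
SELECT my_upper(usrid) FROM w_a;
```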

4. Loading data with the INSERT command

insert into table test1 partition(c) select * from test2;
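Because partition(c) gives no value, this is a dynamic partition insert. It typically needs the two settings below, assuming test1 is partitioned by c and the last column of test2 supplies its value:

```sql
SET hive.exec.dynamic.partition = true;            -- allow dynamic partition inserts
SET hive.exec.dynamic.partition.mode = nonstrict;  -- allow every partition column to be dynamic

-- The partition column c takes its value from the last column selected from test2.
INSERT INTO TABLE test1 PARTITION (c) SELECT * FROM test2;
```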

5. Inserting data directly from a query

create table test2 as select * from test1;

6. Exporting to files

insert overwrite [local] directory '/home/badou/hive_test/1.txt'

select usrid,sex from w_a;

7. Using partitions

#1. Create the table

create TABLE p_t

(usrid string,

age string

)

partitioned by (dt string)

row format delimited fields terminated by '\t'

lines terminated by '\n';

#2. Load data

load data [local] inpath "" overwrite into table p_t partition(dt='20170302');

#3. Query data

select * from p_t where dt='20170302';

That's all.


This article comes from 企鹅号 - 每天学点java干货媒体

In case of infringement, please contact cloudcommunity@tencent.com for removal.
