Create the person table:
CREATE TABLE `person`( `id` int, `name` string, `address` string);
Insert the following rows:
hive> insert into person values(1, 'lisi', 'beijing');
hive> insert into person values(2, 'zhangsan', 'chengdu');
hive> insert into person values(3, 'wangwu', 'shanghai');
hive> insert into person values(4, 'zhaoliu', 'guangzhou');
hive> insert into person values(5, 'name5', 'beijing');
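order by performs a global sort over the entire query result. To guarantee a total order, all rows pass through a single reducer, so the query can take a long time on large inputs.
hive> select * from person order by id desc;
5 name5 beijing
4 zhaoliu guangzhou
3 wangwu shanghai
2 zhangsan chengdu
1 lisi beijing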
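As a rough illustration of why order by needs a single reducer, the following Python sketch models the query above. It is a toy model, not Hive code: the function name order_by_id_desc is hypothetical, and the rows are the five person rows inserted earlier.

```python
# Toy model of ORDER BY: every row goes to ONE reducer, which sorts them all.
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]


def order_by_id_desc(rows):
    """Global sort: a single 'reducer' sees every row, so the result is totally ordered."""
    single_reducer = list(rows)  # all rows funnel into one place
    return sorted(single_reducer, key=lambda r: r[0], reverse=True)


result = order_by_id_desc(rows)
for row in result:
    print(*row)
```

Because one worker has to receive and sort every row, the cost grows with the full size of the result, which is the bottleneck the text describes.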
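sort by sorts only the rows within each reducer, that is, it performs a local sort. With more than one reducer, the combined output is not guaranteed to be globally ordered.
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/root/sortby-result' select * from person sort by id desc;
# Rows within each output file are sorted by id in descending order.
# The columns appear run together because Hive's default field delimiter (Ctrl-A, \001) is non-printing.
[root@master ~]# cat /root/sortby-result/000000_0
5name5beijing
[root@master ~]# cat /root/sortby-result/000001_0
4zhaoliuguangzhou
3wangwushanghai
2zhangsanchengdu
[root@master ~]# cat /root/sortby-result/000002_0
1lisibeijing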
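The local-sort behavior above can be sketched in Python. This is only a model under stated assumptions: the round-robin assignment of rows to reducers is a hypothetical stand-in (Hive's actual assignment differs, so the per-reducer contents will not match the files shown above), and sort_by_id_desc is an illustrative name.

```python
# Toy model of SORT BY: rows are spread over several reducers,
# and each reducer sorts only the rows it received.
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]
NUM_REDUCERS = 3


def sort_by_id_desc(rows, num_reducers):
    """Return one locally sorted list per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Hypothetical round-robin assignment, used only for illustration.
        partitions[i % num_reducers].append(row)
    return [sorted(p, key=lambda r: r[0], reverse=True) for p in partitions]


partitions = sort_by_id_desc(rows, NUM_REDUCERS)
for n, part in enumerate(partitions):
    print(f"reducer {n}:", [r[0] for r in part])

# Reading the reducer outputs one after another does not give a global order.
concatenated = [r[0] for part in partitions for r in part]
print("concatenated:", concatenated)
```

Each reducer's list is in descending id order, but the concatenation of the lists is not, which is the difference between sort by and order by.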
distribute by controls how the mapper output is partitioned across reducers. Rows with the same distribute by key are guaranteed to be sent to the same reducer.
# Partition by address, then sort by id within each reducer
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/root/distributeby-result' select * from person distribute by address sort by id desc;
[root@master ~]# cat /root/distributeby-result/000000_0
4zhaoliuguangzhou
3wangwushanghai
[root@master ~]# cat /root/distributeby-result/000001_0
5name5beijing
1lisibeijing
[root@master ~]# cat /root/distributeby-result/000002_0
2zhangsanchengdu
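A minimal Python sketch of distribute by address sort by id desc follows. The hash function toy_hash is a hypothetical stand-in for Hive's real partitioning hash, so which address lands on which reducer will not necessarily match the files above; the property being illustrated is only that equal keys always share a reducer.

```python
# Toy model of DISTRIBUTE BY address SORT BY id DESC.
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]
NUM_REDUCERS = 3


def toy_hash(key):
    """Hypothetical deterministic hash (sum of UTF-8 bytes); NOT Hive's hash."""
    return sum(key.encode("utf-8"))


def distribute_by_address_sort_by_id_desc(rows, num_reducers):
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Same address -> same hash -> same reducer.
        partitions[toy_hash(row[2]) % num_reducers].append(row)
    # Each reducer then sorts its own rows by id, descending.
    return [sorted(p, key=lambda r: r[0], reverse=True) for p in partitions]


partitions = distribute_by_address_sort_by_id_desc(rows, NUM_REDUCERS)
for n, part in enumerate(partitions):
    print(f"reducer {n}:", part)
```

Several distinct addresses may still share one reducer, as guangzhou and shanghai do in the transcript; distribute by only guarantees that a single key is never split across reducers.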
Using distribute by and sort by together on the same column is equivalent to cluster by on that column. However, cluster by does not accept asc or desc; it always sorts in ascending order.
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/root/clusterby-result' select * from person cluster by address;
[root@master ~]# cat /root/clusterby-result/000000_0
4zhaoliuguangzhou
3wangwushanghai
[root@master ~]# cat /root/clusterby-result/000001_0
5name5beijing
1lisibeijing
[root@master ~]# cat /root/clusterby-result/000002_0
2zhangsanchengdu
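The equivalence between cluster by and distribute by plus sort by on the same column can be sketched as follows. As before, this is a toy Python model: toy_hash, distribute_and_sort, and cluster_by are hypothetical names, and the hash is not the one Hive uses.

```python
# Toy model: CLUSTER BY col == DISTRIBUTE BY col SORT BY col (ascending only).
rows = [
    (1, "lisi", "beijing"),
    (2, "zhangsan", "chengdu"),
    (3, "wangwu", "shanghai"),
    (4, "zhaoliu", "guangzhou"),
    (5, "name5", "beijing"),
]
NUM_REDUCERS = 3
ADDRESS = 2  # column index of address


def toy_hash(key):
    """Hypothetical deterministic hash (sum of UTF-8 bytes); NOT Hive's hash."""
    return sum(str(key).encode("utf-8"))


def distribute_and_sort(rows, dist_col, sort_col, descending, num_reducers):
    """Partition rows by dist_col, then sort each partition by sort_col."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[toy_hash(row[dist_col]) % num_reducers].append(row)
    return [
        sorted(p, key=lambda r: r[sort_col], reverse=descending)
        for p in partitions
    ]


def cluster_by(rows, col, num_reducers):
    """Same column for both steps, and the sort direction is fixed to ascending."""
    return distribute_and_sort(rows, col, col, False, num_reducers)


clustered = cluster_by(rows, ADDRESS, NUM_REDUCERS)
for n, part in enumerate(clustered):
    print(f"reducer {n}:", part)
```

Note that rows with the same address tie on the sort key. Python's sort is stable and keeps them in input order, whereas in Hive the relative order of tied rows should be treated as unspecified.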
Reposted from: http://tecmb.baihongyu.com/