前两篇,给大家分享了Lateral View、Lateral View Outer、explode和posexplode的用法,但是在群里大家有更加深入的讨论
,讨论posexplode的应用场景,感觉之前的分享不是很尽兴,于是再加一篇
为了更加完成更加复杂的测试,我们把基础数据调整一下,数据如下:
create table test(id int, items array<string>,cnts array<string>);
insert into test values
(1, array('apple', 'banana', 'pear'),array('1','2','3')),
(2,array('watermelon', 'orange'),array('5','6')),
(3,array(null),array(null));
+----------+----------------------------+----------------+
| test.id | test.items | test.cnts |
+----------+----------------------------+----------------+
| 1 | ["apple","banana","pear"] | ["1","2","3"] |
| 2 | ["watermelon","orange"] | ["5","6"] |
| 3 | NULL | NULL |
+----------+----------------------------+----------------+
我们使用两次lateral view explode,可以计算出两列的笛卡尔积,SQL如下
select
id,items,cnts,item,cnt
from test
lateral view explode(items) t_item as item
lateral view explode(cnts) t_cnts as cnt
结果会是:
+-----+----------------------------+----------------+-------------+------+
| id | items | cnts | item | cnt |
+-----+----------------------------+----------------+-------------+------+
| 1 | ["apple","banana","pear"] | ["1","2","3"] | apple | 1 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | apple | 2 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | apple | 3 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | banana | 1 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | banana | 2 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | banana | 3 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | pear | 1 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | pear | 2 |
| 1 | ["apple","banana","pear"] | ["1","2","3"] | pear | 3 |
| 2 | ["watermelon","orange"] | ["5","6"] | watermelon | 5 |
| 2 | ["watermelon","orange"] | ["5","6"] | watermelon | 6 |
| 2 | ["watermelon","orange"] | ["5","6"] | orange | 5 |
| 2 | ["watermelon","orange"] | ["5","6"] | orange | 6 |
+-----+----------------------------+----------------+-------------+------+
上面我们可以看到:item列和cnt两列进行了笛卡尔积计算,在需要计算笛卡尔积的时候,可以这样完成。
可是,如果我们的items和cnts是有对应关系的,即 items中的apple对应cnts中的1,banana对应2,pear对应3,我们就需要有对应的关系,这个时候posexplode就闪亮登场了
select
id,item_idx,item,cnt_idx,cnt
from test
lateral view posexplode(items) t_item as item_idx,item
lateral view posexplode(cnts) t_cnts as cnt_idx,cnt
结果会是:
+-----+-----------+-------------+----------+------+
| id | item_idx | item | cnt_idx | cnt |
+-----+-----------+-------------+----------+------+
| 1 | 0 | apple | 0 | 1 |
| 1 | 0 | apple | 1 | 2 |
| 1 | 0 | apple | 2 | 3 |
| 1 | 1 | banana | 0 | 1 |
| 1 | 1 | banana | 1 | 2 |
| 1 | 1 | banana | 2 | 3 |
| 1 | 2 | pear | 0 | 1 |
| 1 | 2 | pear | 1 | 2 |
| 1 | 2 | pear | 2 | 3 |
| 2 | 0 | watermelon | 0 | 5 |
| 2 | 0 | watermelon | 1 | 6 |
| 2 | 1 | orange | 0 | 5 |
| 2 | 1 | orange | 1 | 6 |
+-----+-----------+-------------+----------+------+
我们可以限定item_idx和cnt_idx 相等,得到对应的数据:
select
id,item_idx,item,cnt_idx,cnt
from test
lateral view posexplode(items) t_item as item_idx,item
lateral view posexplode(cnts) t_cnts as cnt_idx,cnt
where item_idx=cnt_idx
查询结果如下
+-----+-----------+-------------+----------+------+
| id | item_idx | item | cnt_idx | cnt |
+-----+-----------+-------------+----------+------+
| 1 | 0 | apple | 0 | 1 |
| 1 | 1 | banana | 1 | 2 |
| 1 | 2 | pear | 2 | 3 |
| 2 | 0 | watermelon | 0 | 5 |
| 2 | 1 | orange | 1 | 6 |
+-----+-----------+-------------+----------+------+
我们可以看到只留下了对应关系的数据。
然而,此时群里的朋友提出了新的解法,引入了一个新的函数 map_from_arrays(),具体写法如下:
select
id,items,cnts,map_from_arrays(items,cnts) as col3
from test
可是结果出现了意外,语句报错说没有map_from_arrays函数。
于是我去spark集群上进行了测试
可以得到如下结果,可以拿到对应关系结果。
于是大家学到了一个新的方法map_from_arrays