在我的输入XML中,父代中有多个“父代”家庭。在每个父代中,ChildRoots中可以有单个或多个"ChildRoot“系列。在这个层次结构中有4个字段需要解析。在Hive XPATH函数的帮助下,如何在不发生交叉联接的情况下实现这一点?在这种情况下,没有特定的字段来参数化解析的字符串,以便映射到它的右父级(Parents/ parent /ChildRoot/ChildRoot/)。
<Parents>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info1</FieldA>
<FieldB>Field B info1</FieldB>
<FieldC>Field C info1</FieldC>
<FieldD>Field D info1</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info2</FieldA>
<FieldC>Field C info2</FieldC>
<FieldD>Field D info2</FieldD>
</ChildRoot>
<ChildRoot>
<FieldA>Field A info3</FieldA>
<FieldB>Field B info3</FieldB>
<FieldD>Field D info3</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
</Parents>
预期输出:
FieldA FieldB FieldC FieldD
Field A info1 Field B info1 Field C info1 Field D info1
Field A info2 Field C info2 Field D info2
Field A info3 Field B info3 Field D info3
发布于 2021-08-08 05:53:32
使用count() XPATH函数可以计算每个级别上的元素数量。例如,XPATH_INT(xml, 'count(/Parents/Parent)')
提供了2
。使用横向视图posexplode为每个级别生成行,并使用分解位置参数化XPATH,因此所有这些XPATH都将具有一致的位置索引,并且不会发生交叉连接。
在这个例子中,<Parent>
只有一个子<ChildRoots>
,如果它是每个父<ChildRoots>
中唯一的一个,你可以删除计数和分解的计算,这将会简化代码。我的代码允许每个<Parent>
有多个<ChildRoots>
。
演示:
with mytable as (
select '<Parents>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info1</FieldA>
<FieldB>Field B info1</FieldB>
<FieldC>Field C info1</FieldC>
<FieldD>Field D info1</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info2</FieldA>
<FieldC>Field C info2</FieldC>
<FieldD>Field D info2</FieldD>
</ChildRoot>
<ChildRoot>
<FieldA>Field A info3</FieldA>
<FieldB>Field B info3</FieldB>
<FieldD>Field D info3</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
</Parents>
' as xml
)
select --Get ChildRoot count and generate rows with position index, build final XPATH
--xml, parent_pos, ChildRoots_pos,
--cr.pos+1 as ChildRoot_pos, --this expression used in xpath as is:
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldA/text()')) as FieldA,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldB/text()')) as FieldB,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldC/text()')) as FieldC,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldD/text()')) as FieldD
from
( --Get ChildRoots count and generate rows with position index
select xml, parent_pos, crs.pos+1 as ChildRoots_pos
--XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)')) as ChildRoots_cnt
from
( --Get Parent count and generate rows with position index
select xml, p.pos+1 as parent_pos
--XPATH_INT(xml, 'count(/Parents/Parent)') as Parent_cnt
from mytable lateral view outer posexplode(split(space(XPATH_INT(xml, 'count(/Parents/Parent)')-1),'')) p as pos, x
) p lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)'))-1),'')) crs as pos, x
) crs lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot)'))-1),'')) cr as pos, x
结果:
fielda fieldb fieldc fieldd
Field A info1 Field B info1 Field C info1 Field D info1
Field A info2 Field C info2 Field D info2
Field A info3 Field B info3 Field D info3
发布于 2021-08-07 12:28:30
所需字段可按其名称查找,如下所示
xmllint --xpath '/Parents/Parent[1]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info1</FieldA><FieldB>Field B info1</FieldB><FieldC>Field C info1</FieldC><FieldD>Field D info1</FieldD>
xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info2</FieldA><FieldC>Field C info2</FieldC><FieldD>Field D info2</FieldD>
xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[2]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info3</FieldA><FieldB>Field B info3</FieldB><FieldD>Field D info3</FieldD>
https://stackoverflow.com/questions/68694975
复制相似问题