Q: Automatic detection/parsing of repeated elements ("row objects") in XML

Stack Overflow user

Asked 2017-02-14 14:17:34

Answers 1 · Views 75 · Followers 0 · Votes 0

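I'm trying to write a generic XML parser for feeds whose schema is unknown. Basically, I want to make a best guess about where the "rows" are in the XML document. Here are two example feeds: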

Feed 1, example:

<xml>
  <some-container-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
      <attribute-3>value</attribute-3>
      <attribute-4>value</attribute-4>
    </some-row-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
      <attribute-3>value</attribute-3>
      <attribute-4>value</attribute-4>
    </some-row-tag>
    ...
  </some-container-tag>
</xml>

Feed 2, example:

<xml>
  <some-container-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
      <attribute-3>value</attribute-3>
      <attribute-4>value</attribute-4>
      <optional-nested-attribute-set>
         ...
      </optional-nested-attribute-set>
    </some-row-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
      <attribute-3>value</attribute-3>
      <attribute-4>value</attribute-4>
      <optional-nested-attribute-set>
         ...
      </optional-nested-attribute-set>
    </some-row-tag>
    ...
  </some-container-tag>
  <some-other-container-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
      <attribute-3>value</attribute-3>
      <attribute-4>value</attribute-4>
      <optional-nested-attribute-set>
         ...
      </optional-nested-attribute-set>
    </some-row-tag>
  </some-other-container-tag>
</xml>

What I've done so far is walk the structure and map each XPath to a count. For example, the first feed would look like this:

xml => 1
xml/some-container-tag => 1
xml/some-container-tag/some-row-tag => n
xml/some-container-tag/some-row-tag/attribute-1 => n
xml/some-container-tag/some-row-tag/attribute-2 => n
xml/some-container-tag/some-row-tag/attribute-3 => n
xml/some-container-tag/some-row-tag/attribute-4 => n
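For illustration, the path-to-count mapping above could be produced roughly like this. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `count_paths` and the shortened two-row feed are assumptions made for the example, not part of the original question.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_paths(root):
    """Map each slash-separated tag path to the number of elements found at it."""
    counts = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        counts[path] += 1
        for child in elem:  # iterating an Element yields its direct children
            walk(child, path)

    walk(root, "")
    return counts


# Hypothetical, shortened version of Feed 1 with two rows (n = 2).
FEED_1 = """
<xml>
  <some-container-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
    </some-row-tag>
    <some-row-tag>
      <attribute-1>value</attribute-1>
      <attribute-2>value</attribute-2>
    </some-row-tag>
  </some-container-tag>
</xml>
""".strip()

counts = count_paths(ET.fromstring(FEED_1))
for path, n in counts.items():
    print(f"{path} => {n}")
```

Note that the paths here are plain tag paths without predicates, so all siblings with the same tag collapse into one entry, which is what makes the repeat count visible.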

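My idea now is that the "base unit" (the row level) will be the lowest-level non-leaf node, although I've run into problems vetting that idea (developed separately here).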

Of course, Feed 2 is "much more complicated", since there may be nested attributes (essentially sub-arrays) and there may be two parent lists.
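To make the "lowest-level non-leaf node" idea concrete, here is one possible reading of it as a sketch. The function name `deepest_non_leaf_paths`, the sample values, and the `sub-1` child element are hypothetical; the original feeds elide the contents of the nested set.

```python
import xml.etree.ElementTree as ET


def deepest_non_leaf_paths(root):
    """Return the tag paths of the deepest elements that still have child elements."""
    best_depth = -1
    best = set()

    def walk(elem, prefix, depth):
        nonlocal best_depth, best
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        if len(elem) > 0:  # element has children, so it is a non-leaf
            if depth > best_depth:
                best_depth, best = depth, {path}
            elif depth == best_depth:
                best.add(path)
        for child in elem:
            walk(child, path, depth + 1)

    walk(root, "", 0)
    return best


# Flat rows, as in Feed 1.
flat = ET.fromstring(
    "<xml><some-container-tag>"
    "<some-row-tag><attribute-1>value</attribute-1></some-row-tag>"
    "<some-row-tag><attribute-1>value</attribute-1></some-row-tag>"
    "</some-container-tag></xml>"
)

# Rows with a nested set, as in Feed 2 (sub-1 is a made-up child).
nested = ET.fromstring(
    "<xml><some-container-tag>"
    "<some-row-tag><attribute-1>value</attribute-1>"
    "<optional-nested-attribute-set><sub-1>value</sub-1></optional-nested-attribute-set>"
    "</some-row-tag>"
    "</some-container-tag></xml>"
)

flat_guess = deepest_non_leaf_paths(flat)
nested_guess = deepest_non_leaf_paths(nested)
print(flat_guess)
print(nested_guess)
```

On the flat feed this picks the row element, but as soon as a row contains a nested set with children, the guess moves down to the nested set, which is presumably the kind of problem the heuristic runs into.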

What would be a good-enough general approach here?

parsing

xml-parsing

xml

algorithm

1 Answer

Stack Overflow user

Answered 2017-02-14 14:44:17

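Your problem is that you're trying to convert a multi-dimensional tree structure into a two-dimensional tabular one. Without a schema, you have no good way to make sure your assumptions are correct, but if you have to do it, you'll need to settle on some assumptions.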

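You could handle it by depth in the hierarchy rather than by counting the nodes at a particular depth (nothing says all leaf nodes will be at the same depth, which is the problem you're running into right now):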

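Depth 0 (root tag) indicates a new data collection
Depth 1 (some-container-tag) indicates a new 2-D table
Depth 2 (some-row-tag) indicates a new row in the 2-D table
Depth 3+ indicates entries in that row, which could themselves have sub-entries. Maybe these are represented as CSV strings, or as pointers to another array/table-like data structure, but if you start adding things like that, you're not really dealing with a 2-D structure anymore.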
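The depth rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `tables_by_depth` and `flatten`, the sample values, and the `sub-1` element are hypothetical, and nested entries are kept as nested dicts (one of several possible representations).

```python
import xml.etree.ElementTree as ET


def flatten(entry):
    """Leaf entry -> its text; nested entry -> dict of sub-entries (no longer strictly 2-D)."""
    if len(entry) == 0:
        return (entry.text or "").strip()
    # Simplification: repeated sub-entry tags would overwrite each other here.
    return {child.tag: flatten(child) for child in entry}


def tables_by_depth(root):
    """Depth 0 = collection, depth 1 = table, depth 2 = row, depth 3+ = row entries."""
    tables = {}
    for container in root:                       # depth 1: one table per container tag
        rows = tables.setdefault(container.tag, [])
        for row_elem in container:               # depth 2: one row per child
            rows.append({entry.tag: flatten(entry) for entry in row_elem})  # depth 3+
    return tables


# Hypothetical, shortened version of Feed 2.
FEED_2 = """
<xml>
  <some-container-tag>
    <some-row-tag>
      <attribute-1>a</attribute-1>
      <optional-nested-attribute-set><sub-1>x</sub-1></optional-nested-attribute-set>
    </some-row-tag>
    <some-row-tag>
      <attribute-1>b</attribute-1>
    </some-row-tag>
  </some-container-tag>
  <some-other-container-tag>
    <some-row-tag>
      <attribute-1>c</attribute-1>
    </some-row-tag>
  </some-other-container-tag>
</xml>
""".strip()

tables = tables_by_depth(ET.fromstring(FEED_2))
for name, rows in tables.items():
    print(name, rows)
```

This hard-codes the assumption that rows always sit at depth 2, so it handles the two parent lists in Feed 2 without any counting, but it would misplace rows in a feed that wraps them in an extra level.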

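All of this really depends on what you ultimately need to do with the data, and on what kinds of assumptions are valid in the language you choose to process it in. Either way, you're probably better off analyzing by depth than parsing by count. Also, if this really is schema-less, you may want to think about how to handle attributes that appear in the XML.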
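As one possible way to deal with XML attributes, they could be folded into the row next to the child elements. In this sketch the `@` prefix for attribute keys is just a convention chosen for the example, and the `id` attribute and the name `row_with_attributes` are hypothetical.

```python
import xml.etree.ElementTree as ET


def row_with_attributes(row_elem):
    """Build a row dict from XML attributes (keys prefixed with '@') and leaf children."""
    row = {f"@{name}": value for name, value in row_elem.attrib.items()}
    for entry in row_elem:
        row[entry.tag] = (entry.text or "").strip()
    return row


# Hypothetical row carrying an XML attribute in addition to child elements.
row_elem = ET.fromstring(
    '<some-row-tag id="7"><attribute-1>value</attribute-1></some-row-tag>'
)
row = row_with_attributes(row_elem)
print(row)
```

The prefix keeps an attribute from colliding with a child element of the same name; whether that distinction matters depends on what the data is used for.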

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/42228486
