Parsing an HTML page into a JSON representation of the DOM tree

Stack Overflow user
Asked on 2019-03-05 02:40:32
Answers: 1 · Views: 330 · Followers: 0 · Votes: 1

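I have a file named index.html that contains several div tags. I am trying to extract the content of every div tag in the page, but I only get the content of the first div. I need the content of all the divs present in the HTML page.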

Here is my code:

<?php

    // Function to get the contents of an attribute of an HTML tag
    function get_attribute_contents($element) {
        $obj_attribute = array ();
        foreach ( $element->attributes as $attribute ) {
            $obj_attribute [$attribute->name] = $attribute->value;
        }
        return $obj_attribute;
    }

    // Function to get contents of a child element of an HTML tag
    function get_child_contents($element) {
        $obj_child = array ();
        foreach ( $element->childNodes as $subElement ) {
            if ($subElement->nodeType != XML_ELEMENT_NODE) {
                if (trim ( $subElement->wholeText ) != "") {
                    $obj_child ["value"] = $subElement->wholeText;
                }
            } else {
                if ($subElement->getAttribute ( 'id' )) {
                    $obj_child [$subElement->tagName . "#" . $subElement->getAttribute ( 'id' )] = get_tag_contents ( $subElement );
                } else {
                    $obj_child [$subElement->tagName] = get_tag_contents ( $subElement );
                }
            }
        }
        return $obj_child;
    }

    // Function to get the contents of an HTML tag
    function get_tag_contents($element) {
        $obj_tag = array ();
        if (get_attribute_contents ( $element )) {
            $obj_tag ["attributes"] = get_attribute_contents ( $element );
        }
        if (get_child_contents ( $element )) {
            $obj_tag ["child_nodes"] = get_child_contents ( $element );
        }

        return $obj_tag;
    }

    // Function to convert a DOM element to an object
    function element_to_obj($element) {
        $object = array ();
        $tag = $element->tagName;
        $object [$tag] = get_tag_contents ( $element );
        return $object;
    }

    // Function to parse an HTML string and convert its root element to an object
    function html_to_obj($html) {
        $dom = new DOMDocument ();
        $dom->loadHTML ( $html );
        return element_to_obj ( $dom->documentElement );
    }

    // Reading the contents of an HTML file
    $html = file_get_contents ( 'index.html' );
    header ( "Content-Type: text/plain" );

    // Converting the HTML to JSON
    $output = json_encode ( html_to_obj ( $html ) );

    // Writing the JSON output to an external file
    $file = fopen ( "js_output.json", "w" );
    fwrite ( $file, $output );
    fclose ( $file );

    echo "HTML to JSON conversion has been completed.\n";
    echo "Please refer to js_output.json to view the JSON output.";
?>

The HTML file is:

<div class="issue-message">
    Rename this package name to match the regular expression
    '^[a-z]+(\.[a-z][a-z0-9]*)*$'.
    <button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
<div class="issue-message">
    Replace this use of System.out or System.err by a logger.
    <button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule  Details"></button>
</div>
<div class="issue-message">
    Replace this use of System.out or System.err by a logger.
    <button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule  Details"></button>
</div>
<div class="issue-message">
    Rename this package name to match the regular expression
    '^[a-z]+(\.[a-z][a-z0-9]*)*$'.
    <button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
<div class="issue-message">
    Replace this use of System.out or System.err by a logger.
    <button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule  Details"></button>
</div>

As the output of the code for the file above, I get a JSON conversion of the content of only the first div tag:

{
  "html": {
    "child_nodes": {
      "body": {
        "child_nodes": {
          "p": {
            "child_nodes": {
              "value": "Issues found:"
            }
          },
          "div": {
            "attributes": {
              "class": "issue-message"
            },
            "child_nodes": {
              "value": "This block of commented-out lines of code should be removed.",
              "button": {
                "attributes": {
                  "class": "button-link issue-rule icon-ellipsis-h little-spacer-left",
                  "aria-label": "Rule Details"
                }
              }
            }
          }
        }
      }
    }
  }
}
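
A likely cause, judging from the code in the question: `get_child_contents` stores each child under a key built from its tag name (`$obj_child[$subElement->tagName]`), and PHP array keys are unique, so sibling elements with the same tag and no `id` overwrite one another and only one `div` entry survives. The `"value"` key is overwritten in the same way when an element has several text nodes. Below is a minimal sketch of one way around this, assuming an output shape in which children are kept in an ordered list rather than keyed by tag name. The function `node_to_array` and the keys `tag`, `text`, and `children` are hypothetical names chosen for illustration; they are not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): converts a DOM node into
// a nested array whose children are stored in an ordered list, so repeated
// tags such as several <div> siblings are all preserved.
function node_to_array(DOMNode $node): ?array {
    if ($node->nodeType === XML_TEXT_NODE) {
        $text = trim($node->nodeValue);
        // Drop whitespace-only text nodes.
        return $text === '' ? null : ['text' => $text];
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return null; // skip comments, processing instructions, etc.
    }

    $result = ['tag' => $node->nodeName];
    foreach ($node->attributes as $attribute) {
        $result['attributes'][$attribute->name] = $attribute->value;
    }
    foreach ($node->childNodes as $child) {
        $converted = node_to_array($child);
        if ($converted !== null) {
            $result['children'][] = $converted; // append instead of keying by tag
        }
    }
    return $result;
}

// Sample input made up for this sketch: two sibling divs with the same class.
$html = '<div class="issue-message">First <button aria-label="Rule Details"></button></div>'
      . '<div class="issue-message">Second</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$tree = node_to_array($dom->documentElement);
echo json_encode($tree, JSON_PRETTY_PRINT), "\n";
```

With this shape, both `div` elements appear as separate entries in the `children` list of `body`. If only the `div` elements are of interest, iterating over `$dom->getElementsByTagName('div')` and converting each one is an alternative to walking the tree from the root.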
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/54989542
