
Using PHP DOMDocument, how to get all the code that exists between "h1 h2 h3 h4 h5 h6" in the DOM?

Stack Overflow user
Asked on 2022-11-18 14:16:26
Answers: 1 · Views: 33 · Followers: 0 · Votes: 1

Using DOMDocument, how do I get all the code that exists between the "h1 h2 h3 h4 h5 h6" elements in the DOM? I need the HTML content between the "h1 h2 h3 h4 h5 h6" headings.

$html = <<<'HTML'

 txt1
<h2>h2 txt2</h2>
        txt3<br>
        txt4<br>
        txt5

      <h3>h3 txt6</h3>
        txt7
 
      <h3>h3 txt8</h3>
        txt9<br>

<h2>h2 txt10</h2>
 txt11

<h2>h2 txt12</h2>
 txt13
 
HTML;
$query = '//*[not(contains("h1 h2 h3 h4 h5 h6 html body", name()))]';

Output:

        string(1) "p"
        string(6) "txt1"
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""

txt1 txt3

txt4

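txt5 txt7 ...

Text that is not wrapped in a tag is not included. How can I get it as well?

Full example: with if(1) (1 = test), the query selects what is NOT one of (h1 h2 h3 ...).

With if(0), $query = '//*[contains("h1 h2 h3 h4 h5 h6", name())]'; this generates a TOC from the headings, but I also need the HTML between the headings.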

<?php



$html = <<<'HTML'

 txt1
<h2>h2 txt2</h2>
        txt3<br>
        txt4<br>
        txt5

      <h3>h3 txt6</h3>
        txt7
 
      <h3>h3 txt8</h3>
        txt9<br>

<h2>h2 txt10</h2>
 txt11

<h2>h2 txt12</h2>
 txt13
 
HTML;


libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);



#$query = '//*[contains("h1 h2 h3 h4 h5 h6", name())]';


    # 1 = test
if(1){

$query = '//*[not(contains("h1 h2 h3 h4 h5 h6 html body", name()))]';
$nodes = $xp->query($query);
//Using DOMDocument, ? how to get all code that exists between within "h1 h2 h3 h4 h5 h6" DOM?
//I need the html content between the "h1 h2 h3 h4 h5 h6" + I can query DOM "h1 h2 h3 h4 h5 h6" elements  
   
    
    echo '<pre>';    
    #var_dump($nodes); exit;
    foreach($nodes as $node) {
        echo '<hr>';
        var_dump($node->localName);
        var_dump($node->nodeValue);
    }
    

echo '</pre>'; // close the <pre> opened before the loop

$vardumpis= <<<'VARDU'

        string(1) "p"
        string(6) "txt1"
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""
        
VARDU;

exit; 
}
    # end test
    
$query = '//*[contains("h1 h2 h3 h4 h5 h6", name())]';
$nodes = $xp->query($query);

//generate TOC from headlines result1:
$currentLevel = ['level' => 0, 'count' => 0];
$stack = [];
$format = '<li>%s</li>';
$result1 = '';



foreach($nodes as $node) {
    $level = (int)$node->tagName[1]; // extract the digit after h
  
    while($level < $currentLevel['level']) {
        $currentLevel = array_pop($stack);
        $result1 .= '</ul>';
    }
    
    if ($level === $currentLevel['level']) {
        $currentLevel['count']++;
    } else {
        $stack[] = $currentLevel;
        $currentLevel = ['level' => $level, 'count' => 1];
        $result1 .= '<ul>';
    }
    
    $result1 .= sprintf($format, $node->nodeValue);    
}
$result1 .= str_repeat('</ul>', count($stack));


//THIS is what I need  result2:

$target2 = <<<'TARG'
txt1<br>
</ul><h2>h2 txt2</h2><ul>
            txt3<br>
            txt4<br>
            txt5

            <h3>h3 txt6</h3><ul>
            txt7

            </ul><h3>h3 txt8</h3><ul>
            txt9
</ul>
</ul><h2>h2 txt10</h2><ul>
txt11

</ul><h2>h2 txt12</h2><ul>
txt13
</ul>

TARG;


file_put_contents('toc15.htm', 'This I have: TOC result1:<br>'. $result1 .'<br><br><hr>This I need: target2 with content between headlines tags <br>'. $target2);


//help php DOM:  https://3v4l.org/aDSrK  https://schlitt.info/opensource/blog/0704_xpath.html#node-relations  https://www.php.net/manual/en/class.domdocument.php  https://schlitt.info/opensource/blog/0704_xpath.html#node-relations      https://www.abdulibrahim.com/php-scraping-using-dom-and-xpath-tutorial/#xpath_conditions      https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/

What I have: TOC result1:

h2 txt2
______h3 txt6
______h3 txt8
h2 txt10
h2 txt12

1 Answer

Stack Overflow user

Answered on 2022-11-18 18:25:21

The string value of the whole document can be obtained with this simple XPath expression: string(/)
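As a minimal sketch of that first suggestion (the shortened `$html` sample and the variable names are assumptions for illustration, not the asker's exact input), `string(/)` can be run through `DOMXPath::evaluate()`, since `query()` only returns node lists:

```php
<?php
// Shortened sample input (an assumption; the question's markup is longer).
$html = '<h2>h2 txt2</h2>txt3<br>txt4<h3>h3 txt6</h3>txt7';

libxml_use_internal_errors(true);   // keep HTML parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// evaluate() can return a typed result (here a string), unlike query().
$text = $xp->evaluate('string(/)');

var_dump($text);
```

Note that the string value is one flat string: the text of every node is concatenated with no separators and no tags, so this alone does not tell which text belongs to which heading.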

All text nodes in the document are selected by: //text()

So, with:

$query = '//text()';

I get:

NULL
string(6) "txt1
"
NULL
string(7) "h2 txt2"
NULL
string(14) "
        txt3"
NULL
string(14) "
        txt4"
NULL
string(24) "
        txt5

      "
NULL
string(7) "h3 txt6"
NULL
string(25) "
        txt7
 
      "
NULL
string(7) "h3 txt8"
NULL
string(14) "
        txt9"
NULL
string(4) "

"
NULL
string(8) "h2 txt10"
NULL
string(12) "
 txt11
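The NULL lines in the dump come from `var_dump($node->localName)`: text nodes have no local name. The dump also contains whitespace-only strings. A hedged sketch of one way to tidy this up, assuming a shortened sample input and a hypothetical `$texts` array, is to filter with `normalize-space()` and look at the parent element instead:

```php
<?php
// Shortened sample input (an assumption for illustration).
$html = "<h2>h2 txt2</h2>\n  txt3<br>\n  <h3>h3 txt6</h3>";

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// normalize-space() returns '' for whitespace-only text; an empty string
// is false inside a predicate, so those text nodes are skipped.
$texts = [];
foreach ($xp->query('//body//text()[normalize-space()]') as $node) {
    // A text node has no localName, but its parent element has a name.
    $texts[] = [$node->parentNode->nodeName, trim($node->nodeValue)];
}

print_r($texts);
```

Each entry pairs the name of the enclosing element with the trimmed text, which makes it possible to tell heading text apart from the loose text between headings.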

So now the identified text has to be written to the file.

At least this gets you a step further.
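One possible way to take that next step is sketched below. It is not the answerer's code: it assumes every heading is a direct child of `<body>` (as in the question's sample), uses a shortened input, and the `$sections` structure is a hypothetical name. It walks `//body/node()`, which matches text nodes as well as elements, and starts a new section at each heading:

```php
<?php
// Shortened sample input (an assumption; headings are direct children of <body>).
$html = <<<'HTML'
<p>txt1</p>
<h2>h2 txt2</h2>
txt3<br>
txt4
<h3>h3 txt6</h3>
txt7
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// One entry per heading; the first entry collects content before any heading.
$sections = [['heading' => null, 'level' => 0, 'html' => '']];

// node() matches elements AND text nodes, unlike *.
foreach ($xp->query('//body/node()') as $node) {
    $isHeading = $node instanceof DOMElement
        && preg_match('/^h([1-6])$/', $node->nodeName, $m);

    if ($isHeading) {
        // A heading opens a new section; the digit after "h" is its level.
        $sections[] = [
            'heading' => trim($node->textContent),
            'level'   => (int) $m[1],
            'html'    => '',
        ];
    } else {
        // saveHTML($node) serializes just this node, tags included.
        $sections[count($sections) - 1]['html'] .= $dom->saveHTML($node);
    }
}

foreach ($sections as $s) {
    printf("[h%d] %s => %s\n", $s['level'], $s['heading'] ?? '(intro)', trim($s['html']));
}
```

Because each section keeps its heading level, the same loop that builds the nested `<ul>` TOC in the question could emit the stored `html` after each `<li>`. Headings nested inside other elements would need a different traversal.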

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/74491002
