Q: Using PHP DOMDocument, how do I get all the content that sits between the "h1 h2 h3 h4 h5 h6" elements in the DOM?

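Stack Overflow user

Asked 2022-11-18 14:16:26

Answers: 1 · Views: 33 · Followers: 0 · Votes: 1

Using DOMDocument, how do I get all the content that sits between the "h1 h2 h3 h4 h5 h6" elements in the DOM? I need the HTML content between the "h1 h2 h3 h4 h5 h6" headings.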

$html = <<<'HTML'

 txt1
<h2>h2 txt2</h2>
        txt3<br>
        txt4<br>
        txt5

      <h3>h3 txt6</h3>
        txt7
 
      <h3>h3 txt8</h3>
        txt9<br>

<h2>h2 txt10</h2>
 txt11

<h2>h2 txt12</h2>
 txt13
 
HTML;

$query = '//*[not(contains("h1 h2 h3 h4 h5 h6 html body", name()))]';

Output:

        string(1) "p"
        string(6) "txt1"
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""

txt1 txt3

txt4

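txt5 txt7 ...

Text that is not wrapped in a tag is not included. How can I capture it?

Full example: with if(1) (1 = test), the query is the negated one: NOT (h1 h2 h3 ...).

With if(0), $query = '//*[contains("h1 h2 h3 h4 h5 h6", name())]'; this generates a TOC from the headings, but I also need the HTML between the headings.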
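One caveat about both queries: contains("h1 h2 ...", name()) is a substring test, so any element whose name happens to be a substring of the list (for example <b>, which occurs inside "body") is treated as if it were listed. Below is a minimal sketch of an exact-name alternative using the self:: axis. The sample markup and the $names helper are made up for illustration.

```php
<?php
// Hypothetical sample markup, chosen so that <b> exposes the substring issue.
$html = '<html><body><h2>title</h2><b>bold</b><h3>sub</h3></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Hypothetical helper: collect the element names matched by an XPath query.
$names = function (string $query) use ($xp): array {
    $out = [];
    foreach ($xp->query($query) as $node) {
        $out[] = $node->nodeName;
    }
    return $out;
};

// Substring test from the question: "b" is found inside "body",
// so <b> is filtered out together with the headings.
$loose = $names('//*[not(contains("h1 h2 h3 h4 h5 h6 html body", name()))]');

// Exact-name tests: each self::NAME step matches only that element name.
$headings = $names('//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]');
$others   = $names('//*[not(self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6 or self::html or self::body)]');

var_dump($loose, $headings, $others);
```

The exact-name form is longer, but it does not depend on which element names happen to be substrings of the list.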

<?php



$html = <<<'HTML'

 txt1
<h2>h2 txt2</h2>
        txt3<br>
        txt4<br>
        txt5

      <h3>h3 txt6</h3>
        txt7
 
      <h3>h3 txt8</h3>
        txt9<br>

<h2>h2 txt10</h2>
 txt11

<h2>h2 txt12</h2>
 txt13
 
HTML;


libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);



#$query = '//*[contains("h1 h2 h3 h4 h5 h6", name())]';


    # 1 = test
if(1){

$query = '//*[not(contains("h1 h2 h3 h4 h5 h6 html body", name()))]';
$nodes = $xp->query($query);
// Goal: get all content that sits between the "h1 h2 h3 h4 h5 h6" elements in the DOM.
// The "h1 h2 h3 h4 h5 h6" elements themselves can already be queried; the HTML between them is what is missing.
   
    
    echo '<pre>';    
    #var_dump($nodes); exit;
    foreach($nodes as $node) {
        echo '<hr>';
        var_dump($node->localName);
        var_dump($node->nodeValue);
    }
    

echo '</pre>';

$vardumpis= <<<'VARDU'

        string(1) "p"
        string(6) "txt1"
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""
        -----
        string(2) "br"
        string(0) ""
        
VARDU;

exit; 
}
    # end test
    
$query = '//*[contains("h1 h2 h3 h4 h5 h6", name())]';
$nodes = $xp->query($query);

//generate TOC from headlines result1:
$currentLevel = ['level' => 0, 'count' => 0];
$stack = [];
$format = '<li>%s</li>';
$result1 = '';



foreach($nodes as $node) {
    $level = (int)$node->tagName[1]; // extract the digit after h
  
    while($level < $currentLevel['level']) {
        $currentLevel = array_pop($stack);
        $result1 .= '</ul>';
    }
    
    if ($level === $currentLevel['level']) {
        $currentLevel['count']++;
    } else {
        $stack[] = $currentLevel;
        $currentLevel = ['level' => $level, 'count' => 1];
        $result1 .= '<ul>';
    }
    
    $result1 .= sprintf($format, $node->nodeValue);    
}
$result1 .= str_repeat('</ul>', count($stack));


//THIS is what I need  result2:

$target2 = <<<'TARG'
txt1<br>
</ul><h2>h2 txt2</h2><ul>
            txt3<br>
            txt4<br>
            txt5

            <h3>h3 txt6</h3><ul>
            txt7

            </ul><h3>h3 txt8</h3><ul>
            txt9
</ul>
</ul><h2>h2 txt10</h2><ul>
txt11

</ul><h2>h2 txt12</h2><ul>
txt13
</ul>

TARG;


file_put_contents('toc15.htm', 'This I have: TOC result1:<br>'. $result1 .'<br><br><hr>This I need: target2 with content between headlines tags <br>'. $target2);


//help php DOM:  https://3v4l.org/aDSrK  https://schlitt.info/opensource/blog/0704_xpath.html#node-relations  https://www.php.net/manual/en/class.domdocument.php  https://schlitt.info/opensource/blog/0704_xpath.html#node-relations      https://www.abdulibrahim.com/php-scraping-using-dom-and-xpath-tutorial/#xpath_conditions      https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/

What I have: TOC result1:

h2 txt2
______h3 txt6
______h3 txt8
h2 txt10
h2 txt12

php

dom

domxpath

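1 Answer

Stack Overflow user

Answered 2022-11-18 18:25:21

The string value of the whole document can be obtained with this simple XPath expression: string(/)

All text nodes in the document are selected by: //text()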
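As a minimal sketch of these two expressions (the sample markup and variable names are made up for illustration): DOMXPath::evaluate() returns the typed result of string(/), while DOMXPath::query() returns a node list for //text().

```php
<?php
// Hypothetical sample markup, not the asker's full document.
$html = '<html><body><p>txt1</p><h2>h2 txt2</h2>txt3<br>txt4</body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// string(/) yields a single string: the concatenation of all text nodes.
$all = $xp->evaluate('string(/)');

// //text() yields each text node separately, including bare text
// that has no element of its own (txt3 and txt4 here).
$texts = [];
foreach ($xp->query('//text()') as $textNode) {
    $texts[] = $textNode->nodeValue;
}

var_dump($all);
var_dump($texts);
```

Text nodes have no local name, so a loop that dumps $node->localName prints NULL for each of them.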

So, with:

$query = '//text()';

I get:

NULL
string(6) "txt1
"
NULL
string(7) "h2 txt2"
NULL
string(14) "
        txt3"
NULL
string(14) "
        txt4"
NULL
string(24) "
        txt5

      "
NULL
string(7) "h3 txt6"
NULL
string(25) "
        txt7
 
      "
NULL
string(7) "h3 txt8"
NULL
string(14) "
        txt9"
NULL
string(4) "

"
NULL
string(8) "h2 txt10"
NULL
string(12) "
 txt11

So now the identified text only has to be written to the file.

That gets you at least a little further.
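Going one step further toward the goal in the question, here is a minimal sketch that groups the HTML found between headings. It assumes the headings are direct children of <body>, as in the question's sample. splitByHeadings() and the layout of the section arrays are hypothetical names, not part of DOMDocument.

```php
<?php
// Hypothetical helper: split the <body> children into sections,
// one per heading, keeping the HTML that follows each heading.
function splitByHeadings(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate loose HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Section 0 collects anything that appears before the first heading.
    $sections = [['level' => 0, 'heading' => null, 'html' => '']];

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $sections;
    }

    foreach ($body->childNodes as $node) {
        if ($node instanceof DOMElement
            && preg_match('/^h([1-6])$/i', $node->nodeName, $m)) {
            // A heading opens a new section.
            $sections[] = [
                'level'   => (int) $m[1],
                'heading' => trim($node->textContent),
                'html'    => '',
            ];
            continue;
        }
        // saveHTML($node) serializes bare text nodes and elements alike.
        $sections[count($sections) - 1]['html'] .= $dom->saveHTML($node);
    }

    foreach ($sections as &$section) {
        $section['html'] = trim($section['html']);
    }
    unset($section);

    return $sections;
}

$sections = splitByHeadings('<p>txt1</p><h2>h2 txt2</h2>txt3<br>txt4<h3>h3 txt6</h3>txt7');
foreach ($sections as $s) {
    printf("[h%d] %s => %s\n", $s['level'], $s['heading'] ?? '(intro)', $s['html']);
}
```

Each section carries its heading level, so the same stack-based loop that builds the TOC in the question could emit the section HTML next to each list item. Headings nested inside other elements would need a recursive walk, which this sketch does not attempt.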

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/74491002
