Q: How do I parse HTML from a JavaScript variable with Jsoup in Java?

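Stack Overflow user

Asked 2013-07-29 18:51:50

Answers: 2, Views: 13.2K, Followers: 0, Votes: 2

I'm using Jsoup to parse HTML files and pull all the visible text out of the elements. The problem is that some bits of HTML held in JavaScript variables are obviously ignored. What would be the best way to get those bits out?

Example: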

<!DOCTYPE html>
<html>
<head>
    <script>
        var html = "<span>some text</span>";
    </script>
</head>
<body>
    <p>text</p>
</body>
</html>

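In this example, Jsoup extracts only the text from the p tag, which is what it is supposed to do. How do I extract the text from the span in var html? The solution has to work on thousands of different pages, so I can't rely on the JavaScript variable having the same name.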

java

javascript

html

jsoup

2 Answers

Stack Overflow user

Answered 2013-07-29 19:42:34

You can use Jsoup to parse all the <script> tags into DataNode objects.

DataNode: a data node, for contents of style, script tags etc., where contents should not show in text().

 Elements scriptTags = doc.getElementsByTag("script");

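This gives you all the elements with the tag <script>.

You can then use the getWholeData() method to extract the contents of each node.

String getWholeData(): get the data contents of this node.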

for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}

Jsoup API - DataNode
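getWholeData() returns the raw JavaScript source, not the markup inside it, so one more step is needed to reach the text in the question's span. The following is only a minimal sketch of one way to do that step, not part of the answer above: it scans script source for quoted string literals and keeps the ones that appear to contain a tag. The class name ScriptHtmlLiterals, the method htmlLiterals and the "contains a tag" heuristic are all assumptions made for illustration. It uses only the standard library and does not understand JavaScript comments, template literals or string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class ScriptHtmlLiterals {

    // A double- or single-quoted JavaScript string literal,
    // allowing backslash-escaped characters inside it.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"((?:[^\"\\\\]|\\\\.)*)\"|'((?:[^'\\\\]|\\\\.)*)'", Pattern.DOTALL);

    // Heuristic (an assumption): a literal "looks like HTML" if it contains a tag.
    private static final Pattern LOOKS_LIKE_TAG = Pattern.compile("<[a-zA-Z][^>]*>");

    /** Returns the string literals in the script source that appear to contain markup. */
    static List<String> htmlLiterals(String scriptSource) {
        List<String> result = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo only the simplest escapes: \" \' and \/
            String unescaped = body.replaceAll("\\\\([\"'/])", "$1");
            if (LOOKS_LIKE_TAG.matcher(unescaped).find()) {
                result.add(unescaped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The script body from the question's example page.
        String script = "var html = \"<span>some text</span>\";";
        System.out.println(htmlLiterals(script));
    }
}
```

Because it does not look at variable names, this kind of scan fits the "thousands of different pages" constraint in the question. Each string returned for node.getWholeData() could then be passed to Jsoup.parseBodyFragment(fragment).text() to get the visible text.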

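Votes: 5

Stack Overflow user

Answered 2013-11-02 12:16:53

I'm not entirely sure about the answer, but I have seen a similar case here before.

Based on that answer, you can probably get the text by combining Jsoup with some manual parsing.

I have only adapted that code to your specific case: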

Document doc = ...
Element script = doc.select("script").first(); // Get the script part


Pattern p = Pattern.compile("(?is)html = \"(.+?)\""); // Regex for the value of the html
Matcher m = p.matcher(script.html()); // you have to use html here and NOT text! Text will drop the 'html' part


while( m.find() )
{
    System.out.println(m.group()); // the whole match, including the 'html = ' prefix and the quotes
    System.out.println(m.group(1)); // value only
}
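The snippet above needs a parsed Document to run. As a self-contained illustration, the sketch below applies the same regular expression directly to the script text from the question's example. The class name HtmlVarRegexDemo and the helper firstValue are hypothetical names introduced here; only the pattern comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern is taken from the answer.
final class HtmlVarRegexDemo {

    // Lazily captures whatever sits between the quotes after `html = `.
    static final Pattern HTML_VAR = Pattern.compile("(?is)html = \"(.+?)\"");

    /** Returns the captured value of the first match, or null if nothing matches. */
    static String firstValue(String scriptSource) {
        Matcher m = HTML_VAR.matcher(scriptSource);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Script body copied from the example page in the question.
        String script = "\n        var html = \"<span>some text</span>\";\n    ";
        Matcher m = HTML_VAR.matcher(script);
        while (m.find()) {
            System.out.println(m.group());   // html = "<span>some text</span>"
            System.out.println(m.group(1));  // <span>some text</span>
        }
    }
}
```

Two limits are worth keeping in mind. The pattern is tied to a variable literally named html, so firstValue returns null for a script such as var markup = "<b>x</b>";. The lazy (.+?) also stops at the first double quote, so a value containing escaped quotes would be cut short.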

Hope this helps.

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/17922129
