Q: How do I most reliably preserve HTML entities when processing HTML documents with Mojo::DOM?

Stack Overflow user

Asked 2019-03-13 05:25:20

Answers: 2 | Views: 308 | Followers: 0 | Votes: 6

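I'm using Mojo::DOM to identify and print out phrases (that is, strings of text between selected HTML tags) in hundreds of HTML documents that I extracted from existing content in the Movable Type content management system.

I'm writing those phrases out to a file so they can be translated into other languages, as follows: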

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {

        print_phrase($phrase); # utility function to write out the phrase to a file

    }

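When Mojo::DOM encounters embedded HTML entities (such as &trade; and &nbsp;), it converts them into decoded characters instead of passing them along as written. I want the entities to be passed through exactly as written.

I realize I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is: "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."

If that is the case, I either have to work out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I have to use some other technique to preserve the encoded HTML entities.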
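One way to sidestep testing the encoding up front is to rely on the undef return value quoted above. The following is only a minimal sketch under that assumption: decode_or_keep is a hypothetical helper name, and the two sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode);

# Hypothetical helper: decode UTF-8 bytes, or keep the input unchanged
# when decode() reports failure by returning undef.
sub decode_or_keep {
    my ($input) = @_;
    my $chars = decode('UTF-8', $input);
    return defined $chars ? $chars : $input;
}

my $bytes = "caf\xC3\xA9";     # UTF-8 encoded bytes
my $chars = "Acme\x{2122}";    # already Perl characters (contains a wide character)

printf "%d characters\n", length decode_or_keep($bytes);
printf "%d characters\n", length decode_or_keep($chars);
```

This is a heuristic rather than a real encoding test: a string that is already decoded but whose characters happen to form valid UTF-8 byte sequences would be decoded a second time. It also does nothing to keep entities as written, which is the actual goal here.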

How do I most reliably preserve encoded HTML entities when processing HTML documents with Mojo::DOM?

perl

html-entities

mojolicious

movabletype

2 Answers

Stack Overflow user

Accepted answer

Answered 2019-04-10 10:46:04

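Through testing, my colleagues and I were able to determine that Mojo::DOM->new() was automatically decoding ampersand characters (&), which made it impossible to preserve HTML entities as written. To get around this, we added the following subroutine to double encode ampersands: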

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand or &amp; characters.
    #
    ##########

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}
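To make the effect of the substitution concrete, here is a small standalone sketch. It repeats encode_amp() so that it runs on its own, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Same substitution as encode_amp() above, repeated so this sketch is self-contained.
sub encode_amp {
    my ($text) = @_;
    $text .= '';           # turn undef into an empty string
    $text =~ s!&!&amp;!g;  # every ampersand becomes &amp;
    return $text;
}

# Hypothetical sample inputs: a named entity, an already encoded ampersand, a bare ampersand
for my $sample ('Acme&trade; Widgets', 'Fish &amp; Chips', 'R&D') {
    printf "%-22s => %s\n", $sample, encode_amp($sample);
}
```

Every ampersand gains one extra level of encoding, so when the parser decodes the input once, the text nodes end up holding the entities exactly as they were written in the source.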

Later in the script, when we instantiate a new Mojo::DOM object, we pass $page->text through encode_amp().

    $dom = Mojo::DOM->new(encode_amp($page->text));

##########
#
# Break down the Body into phrases. This is done by listing the tags and tag combinations that
# surround each block of text that we're looking to capture.
#
# Note that "h2 b" is an important tag combination for capturing major headings on pages
# in this theme. The tags "span" and "a" are also.
#
# We added caption and th to support tables.
#
# We added li and li a to support ol (ordered lists) and ul (unordered lists).
#
# We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
# https://stackoverflow.com/questions/55130871/how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-wit#comment97006305_55131737
#
#
# Original set of selectors in $dom->find() below is as follows:
#   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
#
##########

    print FILE "\n\t### Body\n\n";        

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
        map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {           

        print_phrase($phrase);

    }

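The code block above incorporates the earlier suggestion from @Grinnz, as shown in the comments on this question. Thanks also to @Robert for his answer, which made some good observations about how Mojo::DOM works.

This code definitely works for my application.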
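For readers who want to try the approach without the surrounding script, here is a self-contained sketch of the same pipeline. The sample HTML and the shortened selector list are assumptions made for illustration, and print_phrase() is replaced by a plain print.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Same helper as in the answer above
sub encode_amp {
    my ($text) = @_;
    $text .= '';
    $text =~ s!&!&amp;!g;
    return $text;
}

# Hypothetical sample input, standing in for $page->text
my $html = '<h1>Acme&trade; Widgets</h1>'
         . '<p>Fish &amp; chips, <strong>fresh&nbsp;daily</strong></p>';

my $dom = Mojo::DOM->new(encode_amp($html));

# Collect the text nodes under each selected element, dropping duplicates
my @phrases = $dom->find('h1, p, p strong')
    ->map('descendant_nodes')->map('each')
    ->grep(sub { $_->type eq 'text' })
    ->map('content')->uniq->each;

print "$_\n" for @phrases;
```

The call to uniq matters because nested selectors such as "p" and "p strong" both reach the same text node, which would otherwise be reported twice.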

Votes: 0

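Stack Overflow user

Answered 2019-03-13 06:44:47

It looks like XML entities get replaced when you map to text, but when you work with the nodes and use their content, the entities are preserved. Here is a minimal example: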

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

This prints:

this &amp; &quot;that&quot;

If you want to keep your loop and map, replace map('text') with map('content'), like this:

for my $phrase ($dom->find('p')->map('content')->each) {
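A caveat worth checking before relying on this: as far as I can tell, content re-renders the parsed tree, so only the characters that must be escaped in markup (such as & and ") come back as entities, while other named entities stay decoded. The sketch below compares the two methods on a made-up input; the sample string is an assumption for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

binmode(STDOUT, ':encoding(UTF-8)');   # the decoded trademark sign is a wide character

# Hypothetical sample input with one named entity and one XML-special entity
my $p = Mojo::DOM->new('<p>Acme&trade; &amp; Co</p>')->at('p');

my $text    = $p->text;      # entities decoded to characters
my $content = $p->content;   # markup re-rendered from the parsed tree

print "text:    $text\n";
print "content: $content\n";
```

In this sketch the ampersand comes back as &amp; in the content, but &trade; does not come back in either result, so map('content') alone is not enough when entities such as &trade; or &nbsp; must survive as written.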

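If you have nested tags and want to find only the text (printing the contents of those nested tags, but not the tag names), you need to walk the DOM tree: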

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {    
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}

This prints:

this & 
"
that
"
done

Votes: 3

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/55130871
