Matching plain text with HTML content

Stack Overflow user

Asked 2017-06-19 15:12:54

1 answer, 106 views, 0 followers, 2 votes

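I need to match plain text against HTML content and, once a match is found, extract the matching HTML unchanged (I need exactly the same HTML content). I can do this with Java's regex utilities in many scenarios, but it fails in the scenarios below.

Here is the sample code I use to match text against an HTML string: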

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {

    // HTML source; note that the apostrophes are encoded as &#39;
    String text = "A crusader for the rights of the weaker sections of the Association&#39;s (ADA&#39;s),choice as the presidential candidate is being seen as a political masterstroke.";
    // Search text, with every space replaced by ".*" so that anything may sit between words
    String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");

    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    // Check all occurrences
    while (matcher.find()) {

        System.out.print("Start index: " + matcher.start());
        System.out.print(" End index: " + matcher.end());
        System.out.println(" Found: " + matcher.group());

    }
}

It fails in the following edge cases:

Case 1:

Source: "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke."

Text to match: "A crusader for the rights of the weaker sections of the Association's (ADA's)"

Expected output: "A crusader for the rights of the weaker sections of the Association's (ADA's)"
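Two things probably contribute to the failure in Case 1, judging from the sample code above: the source string still contains the entity `&#39;` where the search text has a plain `'`, and the parentheses in `(ADA's)` are read as a regex group instead of literal characters. The sketch below is one way to address both, by quoting each piece with `Pattern.quote` and letting an apostrophe match its entity forms. The class name `EntityAwareMatch` and the helpers `toPattern` and `extract` are hypothetical and not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityAwareMatch {

    // Build a pattern from plain text: words are matched literally,
    // runs of whitespace match any whitespace, and an apostrophe also
    // matches its entity forms &#39; and &apos;.
    static Pattern toPattern(String plain) {
        StringBuilder regex = new StringBuilder();
        for (String word : plain.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            // Quote each piece between apostrophes so ( ) . etc. stay literal
            String[] parts = word.split("'", -1);
            for (int i = 0; i < parts.length; i++) {
                if (i > 0) {
                    regex.append("(?:'|&#39;|&apos;)");
                }
                regex.append(Pattern.quote(parts[i]));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Return the matched slice of the source, or null if nothing matches
    static String extract(String plain, String source) {
        Matcher m = toPattern(plain).matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "A crusader for the rights of the weaker sections of the "
                + "Association&#39;s (ADA&#39;s),choice as the presidential candidate "
                + "is being seen as a political masterstroke.";
        String plain = "A crusader for the rights of the weaker sections of the "
                + "Association's (ADA's)";
        System.out.println(extract(plain, source));
    }
}
```

Because the result comes from `Matcher.group()`, it is a slice of the source itself, so the entities are returned exactly as they appear there. This sketch does not deal with tags between words, which Cases 2 and 3 require.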

Case 2:

Source:

"<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>"

Text to match: "see (HTML Content Sample.)"

Expected output: "see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)"

Case 3:

Source: "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"

Text to match: "Initial history includes the following: Documentation of"

Expected output: "Initial history includes the following:</p>\n<p>Documentation of"

string

java

regex

1 Answer

Stack Overflow user

Answered 2017-06-23 14:27:50

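I recently came up with a regular expression that matches HTML tags, with support for quoted attributes and for escaped quotes inside quoted attributes: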

<([^'">]|"([^\\"]|\\"?)+"|'([^\\']|\\'?)+')+>
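To see what this expression accepts, here is a small sketch that compiles it in Java and lists the tags it finds. The class name `TagRegexDemo`, the helper `findTags`, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagRegexDemo {

    // The tag expression from the answer, escaped as a Java string literal
    static final Pattern TAG = Pattern.compile(
            "<([^'\">]"
            + "|\"([^\\\\\"]|\\\\\"?)+\""
            + "|'([^\\\\']|\\\\'?)+')+>");

    // Collect every tag found in the given HTML, in order of appearance
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The '>' inside the quoted title must not end the tag early
        String html = "<p>see <a href=\"x.html\" title=\"a > b\">link</a></p>";
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
    }
}
```

Note that the quoted alternatives use `+`, so an empty attribute value such as `alt=""` is not accepted by the expression as written.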

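I think the simplest way to search for plain text inside HTML while preserving the HTML is to turn the plain text into a pattern that ignores tags at word boundaries.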

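import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Usage: htmlSearch("ab cd").matcher("<b>ab</b> <i>cd</i>").matches();
public static Pattern htmlSearch(String plain) {
    // A single HTML tag; quoted attribute values may contain '>' and escaped quotes
    String tag = "<([^'\">]"
               + "|\"([^\\\\\"]|\\\\\"?)+\""
               + "|'([^\\\\']|\\\\'?)+')+>";
    String tags = "(" + tag + ")*";
    StringBuilder regex = new StringBuilder();
    // Split the plain text into whitespace runs, words, numbers and symbols
    Matcher token = Pattern.compile("(\\s+)|[A-Za-z]+|\\d+|\\S").matcher(plain);
    while (token.find()) {
        String t = token.group();
        if (token.group(1) != null) {
            // Wherever whitespace is found, accept whitespace, &nbsp; or tags
            regex.append("(\\s|&nbsp;|").append(tag).append(")*");
            continue;
        }
        // Handle special characters, which may appear as entities in the HTML
        String literal;
        switch (t) {
            case "<":  literal = "(<|&lt;|&#60;)"; break;
            case ">":  literal = "(>|&gt;|&#62;)"; break;
            case "&":  literal = "(&|&amp;|&#38;)"; break;
            case "'":  literal = "('|&apos;|&#39;)"; break;
            case "\"": literal = "(\"|&quot;|&#34;)"; break;
            default:   literal = Pattern.quote(t); // escape regex metacharacters
        }
        // Check for tags before and after every word, number and symbol
        regex.append(tags).append(literal).append(tags);
    }
    return Pattern.compile(regex.toString());
}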
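To extract the matching HTML rather than only test for a match, one option is `Matcher.find()` followed by `group()`. The sketch below applies that to Case 3 from the question. So that it runs on its own, it bundles a condensed builder based on the same idea; the class name `HtmlSearchDemo` and the helpers `build` and `extract` are hypothetical, and entity handling is left out for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSearchDemo {

    // One HTML tag, using the tag expression from the answer
    static final String TAG = "<([^'\">]"
            + "|\"([^\\\\\"]|\\\\\"?)+\""
            + "|'([^\\\\']|\\\\'?)+')+>";

    // Condensed builder: tags are allowed around every token, and whitespace
    // in the plain text matches any mix of whitespace, &nbsp; and tags
    static Pattern build(String plain) {
        StringBuilder regex = new StringBuilder();
        Matcher token = Pattern.compile("(\\s+)|[A-Za-z]+|\\d+|\\S").matcher(plain);
        while (token.find()) {
            if (token.group(1) != null) {
                regex.append("(\\s|&nbsp;|").append(TAG).append(")*");
            } else {
                regex.append("(").append(TAG).append(")*")
                     .append(Pattern.quote(token.group()))
                     .append("(").append(TAG).append(")*");
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Return the matched slice of the HTML, or null if nothing matches
    static String extract(String plain, String html) {
        Matcher m = build(plain).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "Initial history includes the following:</p>\n<p>Documentation of "
                + "<li>Aliquam tincidunt mauris eu risus.</li>";
        String plain = "Initial history includes the following: Documentation of";
        System.out.println(extract(plain, html));
    }
}
```

Case 2 would not match with this pattern as written: the HTML there has a space before `</a>` that the plain text lacks, and the pattern only allows tags, not extra whitespace, between adjacent tokens.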

Votes: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/44634179
