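I need to match plain text against HTML content. Once a match is found, I need to extract the matched HTML unchanged, because I need exactly the same HTML content. The Java regex utilities handle many scenarios, but they fail in the scenarios below.
Below is the sample code used to match the text against the HTML string: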
public static void main(String[] args) {
    String text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
    String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    // Check all occurrences
    while (matcher.find()) {
        System.out.print("Start index: " + matcher.start());
        System.out.print(" End index: " + matcher.end());
        System.out.println(" Found: " + matcher.group());
    }
}
It fails in the following edge cases:
Case 1:
Source text: "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke."
Text to match: "A crusader for the rights of the weaker sections of the Association's (ADA's)"
Expected output: "A crusader for the rights of the weaker sections of the Association's (ADA's)"
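For Case 1, two details of the sample code seem to explain the result: the trailing space in the search string becomes a trailing `.*`, which greedily consumes the rest of the line, and the unescaped parentheses in `(ADA's)` form a regex group instead of matching literal parentheses. The following is a minimal sketch of one way around both, assuming the source is plain text with no tags in it; the class `QuotedSearch` and the helper `literalWords` are hypothetical names introduced here, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedSearch {
    // Hypothetical helper: quote every word so regex metacharacters such as
    // '(' and ')' are matched literally, and join the words with \s+.
    static Pattern literalWords(String plain) {
        StringBuilder regex = new StringBuilder();
        // trim() drops the trailing space that would otherwise add a pattern at the end
        for (String word : plain.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(word));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        String text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
        String plain = "A crusader for the rights of the weaker sections of the Association's (ADA's) ";
        Matcher matcher = literalWords(plain).matcher(text);
        if (matcher.find()) {
            // prints: Found: A crusader for the rights of the weaker sections of the Association's (ADA's)
            System.out.println("Found: " + matcher.group());
        }
    }
}
```

This sketch does not skip HTML tags, so it does not address Cases 2 and 3.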
Case 2:
Source text:
"<ul>
<li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
<li>Aliquam tincidunt mauris eu risus.</li>
<li>Vestibulum auctor dapibus neque.</li>
see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>"
Text to match: "see (HTML Content Sample.)"
Expected output: "see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)"
Case 3:
Source text: "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"
Text to match: "Initial history includes the following: Documentation of"
Expected output: "Initial history includes the following:</p>\n<p>Documentation of"
Posted on 2017-06-23 14:27:50
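I recently came up with a regular expression that matches HTML tags, with support for quoted attributes and escaped quotes inside quoted attributes:
<([^'">]|"([^\\"]|\\"?)+"|'([^\\']|\\'?)+')+>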
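As a rough illustration of how that tag expression behaves, here is a small sketch that writes it as a Java string literal and lists the tags it finds. The class `TagFinder`, the helper `findTags`, and the sample input are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFinder {
    // The tag expression from the text, with backslashes and double quotes
    // escaped for a Java string literal
    static final Pattern TAG = Pattern.compile(
            "<([^'\">]|\"([^\\\\\"]|\\\\\"?)+\"|'([^\\\\']|\\\\'?)+')+>");

    // Hypothetical helper: collect every tag found in the input, in order
    static List<String> findTags(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The '>' inside the quoted title value does not end the tag
        String html = "<a title=\"x > y\" href='z'>link</a>";
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
        // prints:
        // <a title="x > y" href='z'>
        // </a>
    }
}
```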
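I think the simplest way to search for plain text in HTML while preserving the HTML is to turn the plain text into a pattern that ignores tags at word boundaries.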
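// Usage: htmlSearch("ab cd").matcher("<b>ab</b> <i>cd</i>").matches();
// Requires java.util.regex.Matcher and java.util.regex.Pattern
public static Pattern htmlSearch(String plain) {
    // One HTML tag; quoted attribute values may contain '>' and escaped quotes
    String tag = "<(?:[^'\">]"
            + "|\"(?:[^\\\\\"]|\\\\\"?)+\""
            + "|'(?:[^\\\\']|\\\\'?)+')+>";
    String tags = "(?:" + tag + ")*";
    // Wherever the plain text has spaces, accept any mix of whitespace, &nbsp; and tags
    String gap = "(?:\\s|&nbsp;|" + tag + ")*";
    StringBuilder regex = new StringBuilder();
    // Split the plain text into whitespace runs, words, numbers and single symbols
    Matcher token = Pattern.compile("\\s+|[A-Za-z]+|\\d+|\\S").matcher(plain);
    while (token.find()) {
        String t = token.group();
        if (t.trim().isEmpty()) {
            regex.append(gap);
            continue;
        }
        // Check for tags before and after every word, number and symbol
        regex.append(tags);
        switch (t) {
            // Handle special characters, which may also appear as HTML entities
            case "<":  regex.append("(?:<|&lt;|&#60;)"); break;
            case ">":  regex.append("(?:>|&gt;|&#62;)"); break;
            case "&":  regex.append("(?:&|&amp;|&#38;)"); break;
            case "'":  regex.append("(?:'|&apos;|&#39;)"); break;
            case "\"": regex.append("(?:\"|&quot;|&#34;)"); break;
            // Everything else is matched literally
            default:   regex.append(Pattern.quote(t));
        }
        regex.append(tags);
    }
    return Pattern.compile(regex.toString());
}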
https://stackoverflow.com/questions/44634179