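I have a string that is essentially a page's worth of text.
A sample would be: "Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes".
What I want to extract is the first occurrence of "2-digit number + minutes" after the substring "Reading Time". My real string is much larger and has numbers scattered throughout, so I would like to use a regex for this, but I don't know where to start.
Example:
Input: "Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes"
Output: "15 minutes"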
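Posted on 2021-08-05 09:27:28
Here is a one-liner (with the text stored in s). It assumes a two-digit number followed by a single space before "minutes":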
print(s[s.find("Reading Time") + s[s.find("Reading Time") : len(s)].find("minutes") - 3 : s.find("Reading Time") + s[s.find("Reading Time") : len(s)].find("minutes") + 7])
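Since the question asks for a regex, the same extraction can also be sketched with Python's standard re module. This is only a minimal sketch: the helper name first_reading_time is made up for illustration, and the pattern assumes English text with the literal phrase "Reading Time" and a number of exactly two digits.

```python
import re

# Hypothetical helper, not part of the original answer.
def first_reading_time(text):
    """Return the first 'NN minutes' found after 'Reading Time', or None."""
    # .*? is lazy, so the match stops at the first two-digit number
    # that is followed by "minutes"; re.DOTALL lets it span line breaks.
    pattern = re.compile(r"Reading Time.*?\b(\d{2}\s*minutes)\b", re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None

s = "Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes"
print(first_reading_time(s))  # 15 minutes
```

Numbers that appear before "Reading Time" (such as the 12 above) are ignored because the pattern only starts matching at that phrase. Passing re.IGNORECASE as well would be one way to accept "reading time" in lowercase.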
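Posted on 2021-08-05 09:22:28
This is a bit different from regex, but why not take advantage of a more powerful Python NLP library for this?
Below is an example using spaCy's Matcher (https://spacy.io/usage/rule-based-matching). If you can accept the extra dependency, it should be more flexible and easier to use than a regex: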
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "reading"},  # we require 'reading time' to be in the pattern
           {"LOWER": "time"},
           {"OP": "*"},           # there may be some stuff in between (optional)
           {"LIKE_NUM": True},    # then we look for a number and 'minutes'
           {"LOWER": "minutes"}]
matcher.add("duration", [pattern])

# some tests; only two of them should produce any output
tests = ["Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes",
         "I wonder if this will take a reading time of more than 15 or 17 minutes in the end",
         "Will it take us more than 50 minutes?",
         "I don't have anything like 'reading time'",
         "spaCy rocks!"]

# print results for each example
for test in tests:
    doc = nlp(test)
    matches = matcher(doc)
    for match_id, start, end in matches:
        print(doc[end-2:end])  # just get the final two tokens
By adjusting pattern, you should be able to match sentences according to your needs.
https://stackoverflow.com/questions/68663630