Q: Extracting text from *.txt files in R

Stack Overflow user

Asked 2018-12-05 00:51:38

Answers 3, Views 911, Followers 0, Votes 1

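I have used the Expressions app for Mac to confirm that my regex works, but I cannot find a command to extract the information from a text file. I have 2,500 text files, and I need to extract the date from each document in order to populate a dataset. FYI, "date" is the first variable to be extracted; there will be others. The files vary in format and contain multiple dates. I am only interested in the first date in each document. Some documents have the date on a new line, while others start with the word "date" or "Date".

Example of each text document: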

Bangor
dorset
LL56 43r

date:         10 july 2009
take notice:  the blah blah blah text goes here and there's lots of it.
action:

The regex that works:

"\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}"

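Each text document is visible in the RStudio environment as a single-element character vector. I would like to extract the text "as is", so something like...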

> strapply(NoFN, ("\\d{1,2}\\.?:january|february|march|april|may|june|july|august|september|october|november|december\\.\\d{4}")[[1]]
> [1] 10 july 2009

Obviously, this does not actually work!

Many thanks! Ian

regex

text

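3 Answers

Stack Overflow user

Accepted answer

Posted 2018-12-05 01:04:00

Your regex is not suitable for R as written, because inside an R string literal each \ character must itself be escaped.

The regex should be: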

"\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}"
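As a small illustration of why the backslashes are doubled, the sketch below inspects a shortened pattern. The variable name `pattern` and the shortened regex are only for illustration and are not part of the original answer.

```r
# In R source code, "\\d" is a two-character string: one backslash, then "d".
pattern <- "\\d{4}"

# nchar() counts the characters actually stored in the string.
n_chars <- nchar(pattern)

# cat() prints the string as the regex engine receives it, with single backslashes.
cat(pattern, "\n")

# The stored pattern matches a four-digit year.
has_year <- grepl(pattern, "10 july 2009")
```

Writing the pattern with a single backslash, as one would in a standalone regex tester, is rejected by the R parser as an unrecognized escape.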

If you use the stringr package and have the text loaded into txt, you can do the following:

library(stringr)

txt = "Bangor dorset LL56 43r\n date: 10 july 2009 \n take notice: the blah blah blah text goes here and there's lots of it. action:"

str_match(string = txt, pattern = "\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}")

        [,1]          
[1,] "10 july 2009"
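The question asks for the first date from each of roughly 2,500 files. A minimal base R sketch of that step might look like the following. The folder name `letters`, the helper `first_date`, and the use of `ignore.case = TRUE` are assumptions made for illustration, not something stated in the thread.

```r
# Same date pattern as above, built from the built-in month.name constant.
date_pattern <- paste0(
  "\\d{1,2}\\s+(?:", paste(tolower(month.name), collapse = "|"), ")\\s+\\d{4}"
)

# Hypothetical helper: return the first date found in one file, or NA if none.
first_date <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  # regexpr() reports only the first match; -1 means no match.
  m <- regexpr(date_pattern, txt, perl = TRUE, ignore.case = TRUE)
  if (m[[1]] == -1L) NA_character_ else regmatches(txt, m)
}

# Hypothetical folder name; replace with the real location of the files.
files <- list.files("letters", pattern = "\\.txt$", full.names = TRUE)

# One row per file, with the first date kept exactly as written in the text.
dates <- data.frame(
  file = basename(files),
  date = vapply(files, first_date, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Files without a recognizable date end up with `NA` in the `date` column, which makes them easy to review by hand.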

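Votes: 2

Stack Overflow user

Posted 2018-12-05 01:04:16

I believe this is the answer. It uses the built-in variable month.name and, unlike the attempt in the question, it groups the months with ().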

txt <- "\n date: 10 july 2009 \n take notice: the blah blah blah text goes here and there's lots of it. action:"

pattern <- paste(tolower(month.name), collapse = "|")
pattern <- paste0("(", pattern, ")")
pattern <- paste("[[:digit:]]{1,2}[[:space:]]*", pattern, "[[:digit:]]{4}")

m <- regexpr(pattern, txt)
regmatches(txt, m)
#[1] "10 july 2009"
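Note that `paste()` joins its arguments with a single space by default, so the pattern above matches the spaces around the month through those literal separators. A variant that spells out the whitespace explicitly, and also shows the difference between the first match and all matches, could look like this sketch. The second date in the sample text and the names `months`, `first`, and `all_dates` are added here for illustration only.

```r
# Sample text with several spaces after "date:" and a second, later date.
txt <- "\n date:         10 july 2009 \n take notice: reply by 11 may 2010"

months <- paste(tolower(month.name), collapse = "|")

# paste0() adds no separators, so whitespace is matched only where written.
pattern <- paste0(
  "[[:digit:]]{1,2}[[:space:]]+(", months, ")[[:space:]]+[[:digit:]]{4}"
)

# regexpr() finds only the first match in the string.
first <- regmatches(txt, regexpr(pattern, txt))

# gregexpr() finds every match, should the other dates be needed later.
all_dates <- regmatches(txt, gregexpr(pattern, txt))[[1]]
```

Because only the first date per document is wanted, `regexpr()` is sufficient; `gregexpr()` is shown only for contrast.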

Votes: 0

Stack Overflow user

Posted 2018-12-05 04:34:02

Thanks everyone, this works a treat!

library(stringr)

txt = "Bangor dorset LL56 43r\n date: 10 july 2009 \n take notice: the blah blah blah text goes here and there's lots of it. action:"

str_match(string = txt, pattern = "\\d{1,2}\\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\\s+\\d{4}")

        [,1]          
[1,] "10 july 2009"

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/53617865
