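I want to extract the sentences that contain a specific word from a text file consisting of multiple paragraphs.
For example: Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.
Now, from this paragraph, I need to extract every sentence that contains the word "India".
I tried the substr and substring commands in R, but they did not help. Can anyone help me solve this?
Thanks in advance.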
Posted on 2017-12-05 05:56:26
You can use grep like this:
text <- c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.")
text <- unlist(strsplit(text, "\\."))
text[grep(pattern = "India", text, ignore.case = TRUE)]
[1] "Digital India is an initiative by the Government of India ...
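The split-then-filter idea above can also be wrapped in a small function. The following is only a sketch under stated assumptions: `extract_sentences` is a hypothetical helper name (not from the answer), sentences are assumed to end in `.`, `!` or `?` followed by whitespace, and `word` is assumed to contain no regex metacharacters. Abbreviations such as "Dr." would be split incorrectly.

```r
# Hypothetical helper (not part of the original answer): return the
# sentences of `x` that contain `word` as a whole word.
extract_sentences <- function(x, word, ignore.case = TRUE) {
  # Split on whitespace that follows ., ! or ?; the lookbehind keeps the
  # punctuation attached to its sentence.
  sentences <- unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
  # \\b limits the match to the whole word, so "Indian" does not count.
  pattern <- paste0("\\b", word, "\\b")
  sentences[grepl(pattern, sentences, ignore.case = ignore.case, perl = TRUE)]
}

text <- "Digital India is an initiative by the Government of India. It was launched on 1 July 2015. Indian summer is a weather phenomenon."
extract_sentences(text, "India")
```

For a text file, the lines could be read with `readLines()` and passed in as `x`; the path is whatever your file is called. Note that a sentence that wraps across two lines of the file would be treated as two pieces unless the lines are pasted together first.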
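Posted on 2018-01-29 11:30:57
Using regular expressions together with grep (or most likely any pattern-matching function in R) gives you finer control over what is extracted from a given input string. That said, base-R regmatches (in conjunction with regexpr) or str_extract_all from stringr can help you accomplish this particular task without explicitly splitting the input vector beforehand.
For example, any sentence containing the word "India" can easily be extracted with the following expression. Note that, for the sake of illustration, I have added another sentence that contains a derived form of "India".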
text = "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi."
text = paste(text, "Indian summer is a periodically recurring weather phenomenon in Central Europe.")
library(stringr)
str_extract_all(text, "([:alnum:]+\\s)*India[[:alnum:]\\s]*\\.")[[1]]
[1] "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity."
[2] "Indian summer is a periodically recurring weather phenomenon in Central Europe."
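For readers without stringr, roughly the same result can be sketched in base R with regmatches. Two things here are my assumptions rather than part of the answer: gregexpr is used instead of regexpr so that all matches are returned, not just the first, and the pattern is simplified to "any run of non-period characters around India, up to the next period".

```r
text <- "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi."
text <- paste(text, "Indian summer is a periodically recurring weather phenomenon in Central Europe.")

# [^.]* cannot cross a period, so each match stays inside one sentence.
pattern <- "[^.]*India[^.]*\\."

# gregexpr() locates every match; regmatches() extracts them as a list
# with one element per input string.
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Drop the space left over from the sentence boundary.
matches <- trimws(matches)
matches
```

As with the stringr version, this also picks up the "Indian summer" sentence, because the pattern matches "India" as a substring. Writing `\\bIndia\\b` in the pattern would restrict it to the whole word.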
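There are plenty of good tutorials on regular expressions on the web, so I will not go into detail here. To decipher the expression above, "Regular Expressions in R" might be a good starting point.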
https://stackoverflow.com/questions/47646664