Q: Extracting text in R using the stringi package

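Stack Overflow user

Asked 2016-12-27 17:44:47

1 answer, 82 views, 0 followers, 1 vote

I have the text below and need to extract the words that come before and after specific keywords.

Example: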

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)

The actual output is:

[[1]]
[1] "engineering plastics"

[[2]]
[1] "iso 9001"

[[3]]
[1] "office automation"

Desired output:

[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of

Basically, I need to extract the text before and after the specific words I mention.

text-extraction

stringr

stringi

1 Answer

Stack Overflow user

Answered 2017-02-18 05:42:23

Here are some ideas to get started:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
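# Up to 10 space-delimited tokens before each keyword and up to 10 after it.
# The trailing group puts the space first, because the keyword is followed by a space.
pattern <- stri_paste("([^ ]+ ){0,10}", words, "( [^ ]+){0,10}")
# overlap is an option of the fixed-pattern search, not of the regex engine, so it is dropped here
stri_extract_all_regex(sometext, pattern, case_insensitive=TRUE)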

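Explanation: I add a simple regular expression before and after each word you are looking for:

"([^ ]+ ){0,10}"

This means:

[^ ]+ : any character except a space, repeated as many times as possible
then a single space
( ... ){0,10} : all of this repeated up to ten times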
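
To see how the pieces of this pattern behave, here is a minimal sketch on a short string. The input text and the keyword KEY are made up for illustration and are not part of the original data; only the repetition limit is varied.

```r
library(stringi)

# Hypothetical short input, used only to illustrate the pattern
txt <- "one two three KEY four five six"

# At most 2 space-delimited tokens directly before the keyword
before <- stri_extract_first_regex(txt, "([^ ]+ ){0,2}KEY")
print(before)  # "two three KEY"

# With a limit of 10, every token before the keyword fits into the window
wide <- stri_extract_first_regex(txt, "([^ ]+ ){0,10}KEY")
print(wide)    # "one two three KEY"
```

The limit in {0,n} is therefore the maximum number of context words that are kept, not a required count.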

This is very simple and naive (for example, it treats every "&" or ">" as a word), but it works.
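
One possible refinement, sketched here under assumptions rather than taken from the original answer, is to split on any whitespace (\s) instead of only the space character, so that newlines also separate words, and to make the window size a parameter. The helper name extract_context and its argument n are hypothetical, and the keywords are assumed to contain no regex metacharacters.

```r
library(stringi)

# Hypothetical helper: returns each keyword together with up to `n`
# whitespace-delimited tokens on either side of it.
# Assumes the keywords contain no regex metacharacters.
extract_context <- function(text, words, n = 3) {
  pattern <- stri_paste("(\\S+\\s+){0,", n, "}", words, "(\\s+\\S+){0,", n, "}")
  # First match per keyword; NA where a keyword does not occur
  stri_extract_first_regex(text, pattern, case_insensitive = TRUE)
}

txt <- "the unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold"
res <- extract_context(txt, "engineering plastics", n = 3)
print(res)
# "globally expanding its engineering plastics centered on polycarbonate"
```

Like the pattern above, this still counts tokens such as "&" or ">" as words; it only changes what counts as a separator and how many neighbours are kept.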

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/41342732
