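I want to grab the 2-3 words before and after the word "help".
I have text like the following:
....features and lots of greenery to help soothe the nerves...blah blah...cozy up in their plush blankets to help relax the nerves
This is what I did: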
x <- paste("(\\S+\\s+|^)(\\S+\\s+|)(\\S+\\s+|)", treatSym[i], ".?(\\s+\\S+|)(\\s+\\S+|$)(\\s+\\S+|$)", sep="")
matching <- gregexpr(x,text)
regmatches(text, matching, invert = FALSE)
I get the error below, I guess because length(matching) = 2. However, it works fine when there is only one match.
Error in regmatches(text, matching, invert = FALSE) :
‘x’ and ‘m’ must have the same length
Is there a better way to extract the 2-3 words before and after a keyword?
Posted on 2016-05-13 01:29:52
n is a vector of length 2 giving the number of words to take before and after the keyword.
n <- c(2, 2)
x <- "....features and lots of greenery to help soothe the nerves...blah blah...cozy up in their plush blankets to help relax the nerves"
pat <- sprintf('(?:[a-z]+ ){%s}help(?: [a-z]+){%s}', n[1], n[2])
m <- gregexpr(pat, x, perl = TRUE)
regmatches(x, m)[[1]]
# [1] "greenery to help soothe the" "blankets to help relax the"
As a function:
f <- function(string, keyword, n = c(2,2)) {
  # exactly n[1] words before and n[2] words after:
  # pat <- sprintf('(?:[a-z]+ ){%s}%s(?: [a-z]+){%s}', n[1], keyword, n[2])
  # up to n[1] words before and up to n[2] words after:
  pat <- sprintf('(?:[a-z]+ ){0,%s}%s(?: [a-z]+){0,%s}', n[1], keyword, n[2])
  m <- gregexpr(pat, string, perl = TRUE)
  regmatches(string, m)[[1]]
}
f(x, 'help', c(1, 2))
# [1] "to help soothe the" "to help relax the"
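f() above takes a single string and returns the first list element. For a character vector of several texts, a minimal sketch (the texts vector and the hits name are made up for illustration) keeps the whole list instead. gregexpr() returns one list element per input string, and regmatches() expects its two arguments to have the same length, which is the condition named in the error message in the question.

```r
# Illustrative input: three separate strings, the last without the keyword
texts <- c("lots of greenery to help soothe the nerves",
           "plush blankets to help relax the nerves",
           "no keyword here")

# Same pattern shape as in f(): up to 2 words before and after "help"
pat <- "(?:[a-z]+ ){0,2}help(?: [a-z]+){0,2}"

m <- gregexpr(pat, texts, perl = TRUE)  # list with length(texts) elements
hits <- regmatches(texts, m)            # list: one character vector per string

hits[[1]]  # matches found in the first string
hits[[3]]  # character(0), since the third string has no match
```

Passing the same vector to both gregexpr() and regmatches() keeps the lengths aligned, and strings without a match simply yield an empty character vector.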
Posted on 2016-05-13 01:10:09
Another option is to split the text into words, find the indices of "help", and take the 2 or 3 words before and after each occurrence.
library(magrittr)
library(stringi)
library(SOfun) ### https://github.com/mrdwab/SOfun
x <- "....features and lots of greenery to help soothe the nerves...blah blah...cozy up in their plush blankets to help relax the nerves"
Option 1: get just the surrounding words
### Remove ... and split words
temp <- stri_replace_all_regex(pattern = "[[:punct:]]", replacement = " ", str = x) %>%
    stri_split_fixed(pattern = " ") %>%
    unlist %>%
    .[nchar(.) > 0]
data.frame(word = temp, stringsAsFactors = FALSE) %>%
    getMyRows(pattern = grep(pattern = "help", x = .$word), range = -3:3) %>%
    lapply(function(ana){ana[-grep(pattern = "help", x = ana)]})
#[[1]]
#[1] "of" "greenery" "to" "soothe" "the" "nerves"
#
#[[2]]
#[1] "plush" "blankets" "to" "relax" "the" "nerves"
If you want to see which words were selected for each "help", you can try the following.
Option 2: build a data frame
temp <- stri_replace_all_regex(pattern = "[[:punct:]]", replacement = " ", str = x) %>%
    stri_split_fixed(pattern = " ") %>%
    unlist %>%
    .[nchar(.) > 0]
data.frame(word = temp, stringsAsFactors = FALSE) %>%
    getMyRows(pattern = grep(pattern = "help", x = .$word), range = -3:3) %>%
    lapply(function(ana){ana[-grep(pattern = "help", x = ana)]}) -> temp
do.call(rbind,
        lapply(temp, function(y){
            data.frame(word = y,
                       ind = c(-3:-1, 1:3),
                       stringsAsFactors = FALSE)}
        )
)
# ind indicates relative positions of the words. words with negative
# numbers are on left side of help. Words with positive numbers on right.
# word ind
#1 of -3
#2 greenery -2
#3 to -1
#4 soothe 1
#5 the 2
#6 nerves 3
#7 plush -3
#8 blankets -2
#9 to -1
#10 relax 1
#11 the 2
#12 nerves 3
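The split-and-index idea from this answer can also be sketched in base R alone, without stringi, magrittr or SOfun. The helper name words_around and its window argument are hypothetical and not part of the answer. Unlike grep(), this sketch matches the keyword exactly with ==.

```r
x <- "....features and lots of greenery to help soothe the nerves...blah blah...cozy up in their plush blankets to help relax the nerves"

# Replace punctuation with spaces, split on whitespace, drop empty tokens
words <- strsplit(gsub("[[:punct:]]", " ", x), "\\s+")[[1]]
words <- words[nzchar(words)]

# Hypothetical helper: for each exact occurrence of `keyword`, return the
# words up to `window` positions before and after it (keyword excluded),
# clipped at the start and end of the text
words_around <- function(words, keyword, window = 3) {
  hits <- which(words == keyword)
  lapply(hits, function(i) {
    idx <- seq(max(1, i - window), min(length(words), i + window))
    words[setdiff(idx, i)]
  })
}

res <- words_around(words, "help", window = 3)
res
```

For this input, res holds the same two word vectors as the Option 1 output above. The clipping with max() and min() keeps the indices valid when the keyword sits near the beginning or end of the text.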
Posted on 2016-05-13 00:39:12
You can use the quanteda package to do something similar.
my.string <- "....features and lots of greenery to help soothe the nerves...blah blah...cozy up in their plush blankets to help relax the nerves"
library(quanteda)
kwic(my.string, "help", window = 3, valuetype = "fixed")
contextPre keyword contextPost
[text1, 11] of greenery to [ help ] soothe the nerves
[text1, 30] plush blankets to [ help ] relax the nerves
https://stackoverflow.com/questions/37199262