文章/答案/技术大牛

发布

社区首页 >问答首页 >匹配并放入类别-R的正则表达式

问匹配并放入类别-R的正则表达式
EN

Stack Overflow用户

提问于 2018-11-07 12:09:51

回答 2查看 355关注 0票数 0

我有三个矢量。一个包含文本或实际单词/句子(文本)，一个向量包含我要搜索的单词(xreg)，第三个向量(类别)包含每个文本如果找到匹配应该属于的类别。以下是三个向量：

text <- c("Sole Service here", "Freedom to Include","Freedom to Incl","Premier Reg",
"Bankhall","Bankhall","Premier Regiona","St James Play",
"Premier Regional","Health online","Premier Regional",
"Tenet","Health on line","Tenet","Nations","Woolwich",
"Premier Regional","Lifesearch","Nations","Bankhall",
"Premier Regional","Sole Service her","Lifesearch",
"Premier Regional","Sole Service","Nations",
"Sole Service","First Money service","Sole Service",
"Nations wide","Sole Service","Premier Region")

text <- tolower(text)

xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
          "nations","woolwich","life","money")

categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")

我想根据‘xreg’向量中的搜索词搜索‘文本’向量。然后，在找到匹配项后，我想将这些单词放入“类别”向量中提到的类别中。

比如，查找单词‘state’，在有匹配的地方，记下该单词的索引，或者简单地用单词创建一个数据框架，然后单独一个列来说明它应该属于的类别。在“独家”的情况下，把它放在“SS”类别中。“自由”把它归为“FD”类等等。

到目前为止，解决方案：，我可以逐个搜索每个关键字，它会告诉我它在哪里找到匹配的索引。

 reg_func <- function(x){grep(x,text)  
    }
    reg_func("sole")
reg_func("freedom")

这将为每个匹配的单词提供索引，然后我可以使用这些索引来更新类别。有什么办法能让我做得更快吗？而不是一次只搜索一个词？谢谢

regex

string

rstudio

回答 2

Stack Overflow用户

回答已采纳

发布于 2018-11-07 13:41:35

你可以这样做：

资料：(修改为在1. entry中有双匹配，最后一项没有匹配)

text <- c("Sole Service here, premier", "Freedom to Include","Freedom to Incl","Premier Reg",
          "Bankhall","Bankhall","Premier Regiona","St James Play",
          "Premier Regional","Health online","Premier Regional",
          "Tenet","Health on line","Tenet","Nations","Woolwich",
          "Premier Regional","Lifesearch","Nations","Bankhall",
          "Premier Regional","Sole Service her","Lifesearch",
          "Premier Regional","Sole Service","Nations",
          "Sole Service","First Money service","Sole Service",
          "Nations wide","Sole Service","Premier Region", "no match in here!!!")

#text <- tolower(text) # not needed, use ignore.case = T later

xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
          "nations","woolwich","life","money")

categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")

代码：

names(categories) = xreg  # create named vector

ans <- data.frame(text = I(text)) # create a data.frame where you store it all.

ans$xreg_m<-
apply(
    sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
      )
ans$xreg_m[!lengths(ans$xreg_m)] <- NA  # if no match is found. character(0) is returned. I want to have NA instead. character(0) has a length of 0. I'm using this knowledge to find them.

ans$categories_m<-
    sapply(ans$xreg_m, function(x) unique(unname( categories[x] )))

结果：

#                         text        xreg_m categories_m
#1  Sole Service here, premier sole, premier       SS, PR
#2          Freedom to Include       freedom           FD
#3             Freedom to Incl       freedom           FD
#4                 Premier Reg       premier           PR
#5                    Bankhall      bankhall           BK
#6                    Bankhall      bankhall           BK
#7             Premier Regiona       premier           PR
#8               St James Play         james           JM
#9            Premier Regional       premier           PR
#10              Health online        health           HT
#11           Premier Regional       premier           PR
#12                      Tenet            NA           NA
#13             Health on line        health           HT
#14                      Tenet            NA           NA
#15                    Nations       nations           NT
#16                   Woolwich      woolwich           WW
#17           Premier Regional       premier           PR
#18                 Lifesearch          life           LF
#19                    Nations       nations           NT
#20                   Bankhall      bankhall           BK
#21           Premier Regional       premier           PR
#22           Sole Service her          sole           SS
#23                 Lifesearch          life           LF
#24           Premier Regional       premier           PR
#25               Sole Service          sole           SS
#26                    Nations       nations           NT
#27               Sole Service          sole           SS
#28        First Money service         money           MY
#29               Sole Service          sole           SS
#30               Nations wide       nations           NT
#31               Sole Service          sole           SS
#32             Premier Region       premier           PR
#33        no match in here!!!            NA           NA

票数 1

Stack Overflow用户

发布于 2018-11-12 15:46:25

解释@Andre Elrico答案中使用的函数

apply(
  sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)

# Apply each xreg pattern to the text vector and see if there's a match  
# result is TRUE or FALSE gives each index where there is a match
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)})

结果

      sole freedom premier bankhall james health tennet nations woolwich  life money
[1,]  TRUE   FALSE    TRUE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[2,] FALSE    TRUE   FALSE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[3,] FALSE    TRUE   FALSE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[4,] FALSE   FALSE    TRUE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[5,] FALSE   FALSE   FALSE     TRUE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[6,] FALSE   FALSE   FALSE     TRUE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE

# Now apply each xreg element to the TRUE's from the previous result 
# and see which element of xreg it matches with
apply(
  sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)

结果

[[1]]
[1] "sole"    "premier"

[[2]]
[1] "freedom"

[[3]]
[1] "freedom"

[[4]]
[1] "premier"

[[5]]
[1] "bankhall"

[[6]]
[1] "bankhall"

现在，获取每个匹配项的类别(Regex)

sapply(ans$xreg_m, function(x) unique(unname( categories[x] )))

上面写着：

# Take each element of xreg_m (our matched terms) and 
# see which element in the categories vector it matches with 
#  Then unname the result so you only get the category

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/53189208

复制

相似问题

问匹配并放入类别-R的正则表达式
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问匹配并放入类别-R的正则表达式EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问匹配并放入类别-R的正则表达式
EN