Stemming by `hunspell` dictionary

Stack Overflow user
Asked 2020-04-08 18:40:39
Answers: 1 · Views: 268 · Followers: 0 · Votes: 0

I took the following custom stemming function from Stemming Words:

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }

  stem
}

It stems words using the hunspell dictionary (via the corpus package).

I tried this function on the following sentences.

sentences<-c("We're taking proactive steps to tackle ...",                     
             "A number of measures we are taking to support ...",            
             "We caught him committing an indecent act.")

Then I did the following:

library(qdap)
library(tm)

sentences <- iconv(sentences, "latin1", "ASCII", sub="")

sentences <- gsub('http\\S+\\s*', '', sentences)

sentences <- bracketX(sentences,bracket='all')
sentences <- gsub("[[:punct:]]", "",sentences)

sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)

# Stemming
library(corpus)

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}

sentences=text_tokens(sentences, stemmer = stem_hunspell)

sentences = lapply(sentences, removeWords, stopwords('en'))
sentences = lapply(sentences, stripWhitespace)

I can't explain the results:

[[1]]
[1] ""       "taking" "active" "step"   ""       "tackle"

[[2]]
[1] ""        "numb"    ""        "measure" ""        ""        "taking"  ""       
[9] "support"

[[3]]
[1] ""           "caught"     ""           "committing" ""           "decent"    
[7] "act"  

For example, why do "commit" and "take" appear in their -ing forms? And why did "number" become "numb"?

Stack Overflow user

Accepted answer

Answered 2020-04-08 19:44:36

I think the answer is mostly that this is simply how hunspell stems. We can check this with a simpler example:

hunspell::hunspell_stem("taking")
#> [[1]]
#> [1] "taking"
hunspell::hunspell_stem("committing")
#> [[1]]
#> [1] "committing"
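
The "numb" result appears to come from the custom stemmer keeping the last of several candidates. The sketch below assumes the default en_US dictionary that ships with hunspell. It lists every candidate per word and compares the first with the last. The names `pick`, `first_stem` and `last_stem` are illustrative helpers, not part of any package.

```r
library(hunspell)

words <- c("taking", "committing", "number", "proactive", "indecent")

# hunspell_stem() returns a list with one character vector of candidates per word
candidates <- hunspell_stem(words)

# Hypothetical helper: pick a candidate by position, or NA when there is none
pick <- function(stems, from_end = FALSE) {
  if (length(stems) == 0) return(NA_character_)
  if (from_end) stems[[length(stems)]] else stems[[1]]
}

first_stem <- vapply(candidates, pick, character(1))
last_stem  <- vapply(candidates, pick, character(1), from_end = TRUE)

# One row per word: how many candidates there are, and which one each rule picks
comparison <- data.frame(
  word = words,
  n_candidates = lengths(candidates),
  first = first_stem,
  last = last_stem,
  stringsAsFactors = FALSE
)
print(comparison)
```

Rows where `first` and `last` differ are the words for which the "use the last one" rule changes the outcome.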

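The -ing form is the only option hunspell offers. That doesn't make much sense to me either, and my suggestion would be to use a different stemmer. While you're at it, I think you could also benefit from switching from tm to quanteda: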

library(quanteda)
sentences <- c("We're taking proactive steps to tackle ...",                     
               "A number of measures we are taking to support ...",            
               "We caught him committing an indecent act.")

tokens(sentences, remove_numbers = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_wordstem()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we'r"     "take"     "proactiv" "step"     "to"       "tackl"    "."       
#> [8] "."        "."       
#> 
#> text2 :
#>  [1] "a"       "number"  "of"      "measur"  "we"      "are"     "take"   
#>  [8] "to"      "support" "."       "."       "."      
#> 
#> text3 :
#> [1] "we"     "caught" "him"    "commit" "an"     "indec"  "act"    "."

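In my opinion the workflow is much cleaner, and the results make more sense to me. quanteda uses the SnowballC package for stemming here, which you could integrate into a tm workflow if you want. The tokens object holds the texts in the same order as the input object, but tokenized (i.e. split into words).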
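
Calling SnowballC directly might look like the sketch below. It assumes the SnowballC package is installed and uses a deliberately crude base-R tokenizer (`strsplit` on whitespace) as a stand-in for whatever tokenization an existing workflow already does. The variable names are illustrative.

```r
library(SnowballC)

sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")

# Crude cleanup: lower-case, then keep only letters, apostrophes and spaces
clean <- gsub("[^a-z' ]", " ", tolower(sentences))

# Split on whitespace (a stand-in for a real tokenizer)
toks <- strsplit(trimws(clean), "\\s+")

# wordStem() is vectorised over a character vector of single words
stemmed <- lapply(toks, wordStem, language = "english")
print(stemmed)
```

tm's `stemDocument()` wraps the same SnowballC stemmer, so it is an alternative when the text already lives in a tm corpus.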

If you still want to use hunspell, you can use the following function, which fixes some of the problems you ran into ("number" now comes out correctly):

stem_hunspell <- function(toks) {

  # look up the term in the dictionary
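  # keep the first candidate stem per type ("[" with index 1 yields NA if there is none)
  stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1, FUN.VALUE = character(1))

  # if there is no stem (NA or empty string), use the original term
  no_stem <- is.na(stems) | nchar(stems) == 0
  stems[no_stem] <- types(toks)[no_stem]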

  tokens_replace(toks, types(toks), stems, valuetype = "fixed")

}

tokens(sentences, remove_numbers = TRUE) %>% 
  tokens_tolower() %>%
  stem_hunspell()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we're"  "taking" "active" "step"   "to"     "tackle" "."      "."     
#> [9] "."     
#> 
#> text2 :
#>  [1] "a"       "number"  "of"      "measure" "we"      "are"     "taking" 
#>  [8] "to"      "support" "."       "."       "."      
#> 
#> text3 :
#> [1] "we"         "caught"     "him"        "committing" "an"        
#> [6] "decent"     "act"        "."
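
The same "first candidate" rule can also be kept inside the original corpus workflow. This is a minimal sketch, assuming the corpus and hunspell packages with the default dictionary. `stem_hunspell_first` is a hypothetical name for a variant of the question's stemmer that returns the first stem instead of the last.

```r
library(corpus)

# Hypothetical variant of the question's stemmer: first candidate, not last
stem_hunspell_first <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  # no stems: keep the original term; otherwise take the first candidate
  if (length(stems) == 0) term else stems[[1]]
}

sentences <- c("A number of measures we are taking to support ...")

# text_tokens() applies the stemmer function to each token type
toks <- text_tokens(sentences, stemmer = stem_hunspell_first)
print(toks)
```

Words such as "taking" still come back unchanged, because hunspell offers no other stem for them.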
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/61098772
