I took the following custom stemming function from Stemming Words:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}

It uses the hunspell dictionary for stemming (via the corpus package).
I tried the function on the sentences below.
sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")

Then I ran the following:
library(qdap)
library(tm)
sentences <- iconv(sentences, "latin1", "ASCII", sub="")
sentences <- gsub('http\\S+\\s*', '', sentences)
sentences <- bracketX(sentences,bracket='all')
sentences <- gsub("[[:punct:]]", "",sentences)
sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)
# Stemming
library(corpus)
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
sentences <- lapply(sentences, removeWords, stopwords('en'))
sentences <- lapply(sentences, stripWhitespace)

I cannot explain the results:
[[1]]
[1] "" "taking" "active" "step" "" "tackle"
[[2]]
[1] "" "numb" "" "measure" "" "" "taking" ""
[9] "support"
[[3]]
[1] "" "caught" "" "committing" "" "decent"
[7] "act"

For example, why do "commit" and "take" appear in their -ing forms? And why did "number" become "numb"?
Posted on 2020-04-08 19:44:36
I think the answer is mainly that this is simply how hunspell stems words. We can check this with a simpler example:
hunspell::hunspell_stem("taking")
#> [[1]]
#> [1] "taking"
hunspell::hunspell_stem("committing")
#> [[1]]
#> [1] "committing"

The -ing form is the only option hunspell offers. That does not make much sense to me either, so my suggestion would be to use a different stemmer. While you are at it, I think you would also benefit from switching from tm to quanteda:
library(quanteda)
sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")
tokens(sentences, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_wordstem()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we'r" "take" "proactiv" "step" "to" "tackl" "."
#> [8] "." "."
#>
#> text2 :
#> [1] "a" "number" "of" "measur" "we" "are" "take"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we"     "caught" "him"    "commit" "an"     "indec"  "act"    "."

The workflow is much cleaner in my opinion, and the results make more sense to me. quanteda uses the SnowballC package for stemming here, which can also be integrated into a tm workflow if you want. A tokens object contains the texts in the same order as the input object, but tokenized (i.e., split into words).
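If you want to see the Snowball behaviour in isolation (for example, before wiring it into a tm pipeline), you can call SnowballC directly on the words the question asks about. A minimal sketch:

```r
library(SnowballC)

# Snowball (Porter-style) stemming of the problematic words
wordStem(c("taking", "committing", "number"), language = "english")
#> [1] "take"   "commit" "number"
```

Note that, unlike hunspell, Snowball strips the -ing suffixes and leaves "number" intact, which matches the quanteda output above.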
If you still want to use hunspell, you can use the function below, which fixes some of the problems you might run into ("number" is now correct):
stem_hunspell <- function(toks) {
  # look up each unique token type in the dictionary
  stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1, FUN.VALUE = character(1))
  # if there are no stems, use the original term
  stems[nchar(stems) == 0] <- types(toks)[nchar(stems) == 0]
  tokens_replace(toks, types(toks), stems, valuetype = "fixed")
}
tokens(sentences, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  stem_hunspell()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we're" "taking" "active" "step" "to" "tackle" "." "."
#> [9] "."
#>
#> text2 :
#> [1] "a" "number" "of" "measure" "we" "are" "taking"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "committing" "an"
#> [6] "decent"     "act"        "."

https://stackoverflow.com/questions/61098772
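The key idea in that helper is that tokens_replace() maps each unique token type to a replacement in one pass, so every distinct word only has to be stemmed once, no matter how often it occurs. A minimal sketch with a made-up replacement (the words are chosen purely for illustration):

```r
library(quanteda)

# a tiny corpus with a repeated word form
toks <- tokens("numbers and more numbers")

# replace the type "numbers" with "number" everywhere it occurs;
# valuetype = "fixed" means exact string matching, no glob/regex
tokens_replace(toks, "numbers", "number", valuetype = "fixed")
```

Both occurrences of "numbers" come back as "number", even though the mapping was specified only once.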