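I am using the tm package to run a text analysis on repair data. I read the data into a data frame, converted it to a Corpus object, and cleaned it with several transformations (tolower, stripWhitespace, stop-word removal, and so on).
I kept a copy of the Corpus object to use for stemCompletion.
I ran stemDocument through tm_map, and the words in my object were stemmed.
The results were as expected.
When I run stemCompletion through tm_map, it does not work and fails with the error below.
Error in UseMethod("words") : no applicable method for 'words' applied to an object of class "character"
I ran traceback() and got the following steps: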
> traceback()
9: FUN(X[[1L]], ...)
8: lapply(dictionary, words)
7: unlist(lapply(dictionary, words))
6: unique(unlist(lapply(dictionary, words)))
5: FUN(X[[1L]], ...)
4: lapply(X, FUN, ...)
3: mclapply(content(x), FUN, ...)
2: tm_map.VCorpus(c, stemCompletion, dictionary = c_orig)
1: tm_map(c, stemCompletion, dictionary = c_orig)
How can I resolve this error?
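Posted on 2014-08-19 19:39:13
I received the same error when using tm v0.6. I suspect it occurs because stemCompletion is not among the default transformations in this version of the tm package: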
> getTransformations
function ()
c("removeNumbers", "removePunctuation", "removeWords", "stemDocument",
"stripWhitespace")
<environment: namespace:tm>
Now, the tolower function has the same problem, but it can be made to work by wrapping it in content_transformer. I tried a similar approach for stemCompletion but was not successful.
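As a rough illustration of the content_transformer remark, the base R sketch below imitates the idea with toy stand-ins. The names `toy_document`, `toy_content_transformer`, and the `words` generic defined here are hypothetical and are not tm's actual implementation. It shows one plausible reading of the error in the question, offered as an assumption rather than a confirmed diagnosis: once a document has been reduced to a bare character vector, an S3 generic that has no method for class "character" can no longer dispatch on it.

```r
# Toy stand-ins (hypothetical, NOT tm's real classes or functions).
toy_document <- function(text) {
  structure(list(content = text), class = "toy_document")
}

# An S3 generic with a method only for toy_document, i.e. a generic that
# has no method for plain character vectors.
words <- function(x) UseMethod("words")
words.toy_document <- function(x) strsplit(x$content, " ", fixed = TRUE)[[1]]

# Simplified imitation of the idea behind a content transformer: apply a
# character function to the content and hand back the document object.
toy_content_transformer <- function(fun) {
  function(doc, ...) {
    doc$content <- fun(doc$content, ...)
    doc
  }
}

doc <- toy_document("Diamond Shamrock Corp")

# Lower-casing the content directly yields a bare character vector,
# so the document class is lost ...
bare <- tolower(doc$content)
# ... and method dispatch fails on it (the error message is captured here).
err <- tryCatch(words(bare), error = function(e) conditionMessage(e))
print(err)

# Wrapping the function keeps the document class, so dispatch still works.
kept <- toy_content_transformer(tolower)(doc)
print(words(kept))
```

If this reading is right, checking `class()` of the first document in both the stemmed corpus and the dictionary corpus is a quick way to see whether an earlier cleaning step has already turned the documents into plain character vectors.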
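Note that even though stemCompletion is not a default transformation, it still works when fed stemmed words manually:
> stemCompletion("compani",dictCorpus)
compani
"companies"
To continue my work, I manually split each document in the corpus on single spaces, fed the pieces through stemCompletion, and pasted them back together with the following (clunky and inelegant!) function: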
stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
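}
Here, dictCorpus is just a copy of the cleaned corpus, taken before it is stemmed. The extra stripWhitespace is specific to my corpus, but is probably harmless for a general corpus. You may want to change the type option from "shortest" to another value, depending on your needs.
For a complete example, let's set up a dummy corpus using the crude data from the tm package: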
> data("crude")
> docs = Corpus(VectorSource(crude))
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
> docs <- tm_map(docs, PlainTextDocument)
> dictCorpus <- docs
> docs <- tm_map(docs, stemDocument)
> # Define modified stemCompletion function
> stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}
> # Original doc in crude data
> crude[[1]]
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
"The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
Reuter
> # Stemmed example in crude data
> docs[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduct bring post price west texa intermedi dlrs barrel copani said price reduct today
made light fall oil product price weak crude oil market compani spokeswoman said diamond
latest line us oil compani cut contract post price last two day cite weak oil market reuter
> # Stem-completed example in crude data
> stemCompletion_mod(docs[[1]],dictCorpus)
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil dlrs barrel
reduction brings posted price west texas intermediate dlrs barrel NA said price reduction today
made light fall oil product price weak crude oil market companies spokeswoman said diamond
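latest line us oil companies cut contract posted price last two day cited weak oil market reuter
Note: this example is odd because the misspelled word "copany" is mapped as follows in the process: "copany" -> "copani" -> "NA". I am not sure how to correct for this.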
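One possible way to deal with the NA, offered as a suggestion rather than something from the original post, is to fall back to the stem whenever no completion is found. The sketch below uses base R only. `fill_missing_completions` is a hypothetical helper, and `completed` stands in for the result of a stemCompletion call. The `startsWith` check reflects the assumption that completion works by prefix matching, in which case "copani" can never match the misspelled dictionary word "copany".

```r
# Hypothetical helper (not part of tm): keep the original stem wherever the
# completion step returned NA.
fill_missing_completions <- function(stems, completed) {
  completed <- unname(completed)
  missing <- is.na(completed)
  completed[missing] <- stems[missing]
  completed
}

# Values copied from the example above; `completed` stands in for the
# result of a stemCompletion() call.
stems     <- c("reduct", "copani", "compani")
completed <- c("reduction", NA, "companies")

# The stem is not a prefix of the misspelled dictionary word, which would
# explain the missing completion if completion is prefix-based (assumption).
print(startsWith("copany", "copani"))

print(fill_missing_completions(stems, completed))
```

With this fallback the misspelled word stays in the text as its stem instead of turning into the literal string "NA" when the words are pasted back together.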
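To run stemCompletion_mod over the entire corpus, I just use sapply (or parSapply with the snow package).
Perhaps someone with more experience than me can suggest a simpler modification that gets stemCompletion to work in v0.6 of the tm package.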
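Posted on 2014-09-08 21:54:58
I had success with the following workflow:
Use content_transformer to apply an anonymous function to each document of the corpus; inside it, split the document into words, run stemCompletion on them, and use paste to join the words back into a document. POC demo code: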
tm_map(c, content_transformer(function(x, d)
paste(stemCompletion(strsplit(stemDocument(x), ' ')[[1]], d), collapse = ' ')), d)
PS: using c as a variable name to store the corpus is not a good idea, because of base::c.
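The same workflow can be tried end to end on a tiny made-up corpus. This is a minimal sketch, assuming a tm installation in which `VCorpus`, `tm_map`, `content_transformer`, and `stemCompletion` behave as documented. `complete_document` is a hypothetical helper, and the corpus and dictionary are invented for illustration. The documents are assumed to be stemmed already, so `stemDocument` is not called here.

```r
library(tm)

# Hypothetical helper (not part of tm): complete the stems of one document.
complete_document <- function(text, dictionary) {
  # Collapse multi-line content into a single string first.
  text <- paste(text, collapse = " ")
  stems <- unlist(strsplit(text, " ", fixed = TRUE))
  # Drop empty strings left behind by leading or repeated spaces.
  stems <- stems[nzchar(stems)]
  completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
  paste(completed, collapse = " ")
}

# Toy stemmed corpus and dictionary, invented for illustration.
stemmed <- VCorpus(VectorSource(c("compani cut price", "oil price fall")))
dictionary <- c("companies", "cut", "prices", "oil", "falling")

# content_transformer applies the helper to each document's content and
# returns a document again, so the result is still a corpus.
completed <- tm_map(stemmed, content_transformer(complete_document), dictionary)
print(content(completed[[1]]))
print(content(completed[[2]]))
```

Passing a plain character vector as the dictionary keeps the example small; a copy of the unstemmed corpus can be passed instead, as in the answers above.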
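Posted on 2015-05-23 15:30:17
Thanks, cdxsza. Your method works for me.
A note to everyone who is going to use stemCompletion: the function completes an empty string with a word from the dictionary, which is unexpected. See the example below, where the first "monday" is produced for the empty string caused by the blank at the start of the string.
stemCompletion(unlist(strsplit(" mond tues ", " ")), dictionary=c("monday", "tuesday"))
[1] "monday" "monday" "tuesday"
This is easy to fix by removing the empty strings "" before calling stemCompletion, as shown below.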
stemCompletion2 <- function(x, dictionary) {
x <- unlist(strsplit(as.character(x), " "))
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
See slide 12 of http://www.rdatamining.com/docs/RDataMining-slides-text-mining.pdf for a detailed example.
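The empty-string behavior described in this answer can be reproduced in base R without tm. The sketch below shows where the empty string comes from and, under the assumption that completion is a prefix match against the dictionary, why an empty stem would match every entry.

```r
# Splitting on a single space leaves an empty string for the leading blank.
parts <- unlist(strsplit(" mond tues ", " ", fixed = TRUE))
print(parts)

dictionary <- c("monday", "tuesday")

# Assumption: completion is a prefix match. An empty prefix matches every
# dictionary entry, so "" would pick up a word such as "monday".
print(startsWith(dictionary, ""))

# Dropping empty strings before completion avoids the spurious match.
print(parts[nzchar(parts)])
```

Filtering with `nzchar` has the same effect as the `x[x != ""]` line in stemCompletion2.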
Regards
Yanchang Zhao
RdataMining.com
https://stackoverflow.com/questions/25206049