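The behavior of the tm package changed between versions 0.6-2 and 0.7-x. In the new version, DocumentTermMatrix no longer preserves intra-word dashes. Is this a bug, or is there a new option to enforce the old behavior? Below is an example using two tm versions installed under different library paths. I am running R 3.3.3.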
> string1 <- "big data data analysis machine learning project management"
> string2 <- "big-data data-analysis machine-learning project-management"
>
> two_strings <- c(string1, string2)
>
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.6-2")
> myCorpus <- Corpus(VectorSource(two_strings))
> dtm_0.6 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.6)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project project-management
1 1 1 0
2 0 0 1
So in the old version 0.6-2, the dashes in the second string are correctly preserved. With the new version 0.7-3:
> detach("package:tm", unload=TRUE)
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> dtm_0.7 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
I tried to force the dashes to be preserved as follows, but to no avail:
> dtm_test <- DocumentTermMatrix(myCorpus,
+ control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE)))
> inspect(dtm_test)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
Any suggestions? Thanks!
Posted on 2018-01-03 17:05:32
The answer comes from the tm author himself, Ingo Feinerer. Thanks! Reproduced here:
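Starting with 0.7, the default corpus is a "SimpleCorpus" (if supported; this depends on the source). See ?SimpleCorpus
This triggers a certain behavior (see ?TermDocumentMatrix).
Use VCorpus instead of Corpus to enforce the old behavior: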
inspect(TermDocumentMatrix(Corpus(VectorSource(two_strings))))
inspect(TermDocumentMatrix(VCorpus(VectorSource(two_strings))))
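To make the difference between the two constructors visible without reading the full inspect() output, the following is a minimal sketch, assuming tm 0.7 or later is installed. It checks which class each constructor returns and lists the terms that appear only with the VCorpus. The variable names (simple_corpus, volatile_corpus, and so on) are illustrative and not part of the original answer.

```r
library(tm)  # assumption: tm >= 0.7 is installed

two_strings <- c("big data data analysis machine learning project management",
                 "big-data data-analysis machine-learning project-management")

# Corpus() on a VectorSource is expected to give a SimpleCorpus in tm >= 0.7
simple_corpus   <- Corpus(VectorSource(two_strings))
volatile_corpus <- VCorpus(VectorSource(two_strings))

is_simple  <- inherits(simple_corpus, "SimpleCorpus")
is_vcorpus <- inherits(volatile_corpus, "VCorpus")
print(c(SimpleCorpus = is_simple, VCorpus = is_vcorpus))

# Compare the vocabularies produced by the two corpus types
terms_simple  <- Terms(DocumentTermMatrix(simple_corpus))
terms_vcorpus <- Terms(DocumentTermMatrix(volatile_corpus))

# Terms that survive only when the corpus is a VCorpus (the hyphenated ones)
print(setdiff(terms_vcorpus, terms_simple))
```

If the set difference is non-empty, the two corpus types are being tokenized differently, which matches the behavior reported in the question.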
Returning to the question's example, now using VCorpus:
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> myVCorpus <- VCorpus(VectorSource(two_strings))
> dtm_0.7 <- DocumentTermMatrix(myVCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project
1 1 1
2 0 0
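The control option tried in the question can be retested on a VCorpus. This is a hedged sketch, assuming tm 0.7 or later and assuming, as described in ?termFreq, that the removePunctuation control entry accepts a list of arguments for removePunctuation(). The sample strings with trailing punctuation are made up for illustration and do not come from the original post.

```r
library(tm)  # assumption: tm >= 0.7 is installed

# Hypothetical sample text with trailing punctuation (not from the original post)
punct_strings <- c("big data, data analysis!",
                   "big-data, data-analysis!")

punct_vcorpus <- VCorpus(VectorSource(punct_strings))

# Same control list as in the question, but applied to a VCorpus
dtm_kept <- DocumentTermMatrix(
  punct_vcorpus,
  control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
)
terms_kept <- Terms(dtm_kept)
print(terms_kept)

# The underlying transformation can also be checked on a plain string
cleaned <- removePunctuation("big-data,", preserve_intra_word_dashes = TRUE)
print(cleaned)
```

With a VCorpus the options are handed to termFreq, so the list form of removePunctuation should take effect; with a SimpleCorpus the question's output suggests the dashes are already split before this option can help.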
https://stackoverflow.com/questions/48060546