So I have a very large term-document matrix:
> class(ph.DTM)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
> ph.DTM
A term-document matrix (109996 terms, 262811 documents)
Non-/sparse entries: 3705693/28904453063
Sparsity : 100%
Maximal term length: 191
Weighting : term frequency (tf)
How can I get the row sum (frequency) of each term? I tried:
> apply(ph.DTM, 1, sum)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
Obviously, I know about removeSparseTerms:
ph.DTM2 <- removeSparseTerms(ph.DTM, 0.99999)
This shrinks the dimensions a bit:
> ph.DTM2
A term-document matrix (28842 terms, 262811 documents)
Non-/sparse entries: 3612620/7576382242
Sparsity : 100%
Maximal term length: 24
Weighting : term frequency (tf)
But I still can't apply any matrix-related function to it:
> as.matrix(ph.DTM2)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
How can I get simple row sums on this object? Thanks!
Posted on 2014-02-21 07:01:42
OK, after more Googling I found the slam package, which supports this:
ph.DTM3 <- rollup(ph.DTM, 2, na.rm=TRUE, FUN = sum)
This works.
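To make the rollup call concrete, here is a minimal sketch on a toy matrix. The matrix `m` and its values are made up for illustration; only `simple_triplet_matrix`, `rollup`, and `as.matrix` come from slam. It assumes the second margin holds the documents, as in a term-document matrix.

```r
library(slam)

# Hypothetical toy matrix: 2 terms (rows) x 3 documents (columns).
m <- simple_triplet_matrix(
  i = c(1L, 1L, 2L),
  j = c(1L, 3L, 2L),
  v = c(4, 1, 2),
  nrow = 2L, ncol = 3L
)

# Collapse margin 2 (documents) by summing; the result stays sparse.
totals <- rollup(m, 2, na.rm = TRUE, FUN = sum)
dim(totals)  # one row per term, a single column left

# A one-column result is small enough to densify safely.
term_freq <- as.vector(as.matrix(totals))
```

Because the sum never leaves the triplet representation, no dense terms-by-documents array is allocated along the way.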
Posted on 2015-06-16 18:14:53
As @badpanda mentioned in one of the comments, slam now has row_sums and col_sums functions for sparse arrays:
slam::row_sums(dtm, na.rm = T)
slam::col_sums(tdm, na.rm = T)
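Which of the two gives term frequencies depends on the orientation of the matrix. A small sketch, assuming a term-document layout (terms in rows, documents in columns) like the one in the question; the matrix `tdm` and its contents are hypothetical:

```r
library(slam)

# Hypothetical toy term-document matrix: 3 terms (rows) x 4 documents (columns).
tdm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),
  j = c(1L, 3L, 2L, 1L, 4L),
  v = c(2, 1, 5, 1, 3),
  nrow = 3L, ncol = 4L,
  dimnames = list(Terms = c("apple", "banana", "cherry"),
                  Docs  = c("d1", "d2", "d3", "d4"))
)

term_freq  <- row_sums(tdm, na.rm = TRUE)  # one total per term
doc_length <- col_sums(tdm, na.rm = TRUE)  # one total per document

# Most frequent terms first.
sort(term_freq, decreasing = TRUE)
```

For a document-term matrix the roles swap: col_sums would give the per-term totals and row_sums the per-document totals.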
Posted on 2014-02-21 07:41:25
I think:
rowSums(as.matrix(ph.DTM))
would also work.
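Note that as.matrix is the call that failed in the question, so this route probably only helps for matrices small enough to densify. A rough sketch of the arithmetic, using the dimensions printed for ph.DTM2 in the question; the 8 bytes per cell is an assumption (double storage), and the variable names are just for illustration:

```r
# Dimensions reported for ph.DTM2 in the question.
n_terms <- 28842
n_docs  <- 262811

# Double arithmetic, so the product itself does not overflow.
cells <- n_terms * n_docs
cells > .Machine$integer.max   # more cells than a 32-bit integer can count

# The same product in integer arithmetic is what yields the NA in the error.
overflow <- suppressWarnings(28842L * 262811L)
is.na(overflow)

# Approximate size of a dense copy in GiB, assuming 8-byte doubles.
dense_gib <- cells * 8 / 1024^3
```

Even without the overflow, a dense copy at this size would need tens of GiB, which is why the sparse-aware sums are the safer choice here.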
https://stackoverflow.com/questions/21921422