My n-gram tokenizer isn't working properly. The unigram tokenizer seems to work fine, but as soon as I apply the bigram tokenizer to the corpus, it returns the same list of single words as the unigram tokenizer. The code is below.
library(tm)     # Corpus, DirSource, TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

## Loading the data may be part of the problem
blogs <- readLines("en_US.blogs.txt",
encoding = "UTF-8", skipNul=TRUE)
news <- readLines("en_US.news.txt",
encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("en_US.twitter.txt",
encoding = "UTF-8", skipNul=TRUE)
blogs_sample <- SampleData(blogs, 0.01)   # SampleData() is a sampling helper defined elsewhere (not shown)
writeLines(blogs_sample, "blogs_sample.txt")
news_sample <- SampleData(news, 0.01)
writeLines(news_sample, "news_sample.txt")
twitter_sample <- SampleData(twitter, 0.01)
writeLines(twitter_sample, "twitter_sample.txt")

This may be part of the problem, because when I use DirSource from the tm package, I'm not sure what the actual corpus looks like.
corpus <- Corpus(DirSource("/Users/calvin.hutto/Desktop/R/Coursera Capstone/final/en_US/sample", encoding = "UTF-8"),
                 readerControl = list(language = "en_US"))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_1 <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

So when I inspect the head of the bigram TDM and the unigram TDM, they both show the same list of single words.
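For reference, a bigram tokenizer should emit overlapping two-word strings rather than single words. Here is a minimal base-R sketch of that expected output shape (purely illustrative; this is not the RWeka implementation):

```r
# Minimal n-gram builder, just to show what a bigram tokenizer
# is expected to return for a single line of text
make_ngrams <- function(x, n) {
  toks <- unlist(strsplit(tolower(x), "\\s+"))
  if (length(toks) < n) return(character(0))
  vapply(seq_len(length(toks) - n + 1),
         function(i) paste(toks[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("the quick brown fox", 2)
# "the quick" "quick brown" "brown fox"
```

If the bigram TDM's terms look like `make_ngrams(x, 1)` output instead, the custom tokenizer is being ignored somewhere in the pipeline.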
Any help would be appreciated!
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS:
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.7-1 NLP_0.1-10
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 digest_0.6.12 crayon_1.3.2 SnowballC_0.5.1 slam_0.1-40 bitops_1.0-6 R6_2.2.2
[8] magrittr_1.5 swirl_2.4.3 httr_1.2.1 stringi_1.1.5 testthat_1.0.2 tools_3.4.0 stringr_1.2.0
[15] RCurl_1.95-4.8 yaml_2.1.14 parallel_3.4.0 compiler_3.4.0

Posted on 2017-07-28 23:46:35
That looks complicated. How about a simpler approach?
require(readtext)
require(quanteda)
mycorpus <- corpus(readtext("/Users/calvin.hutto/Desktop/R/Coursera Capstone/final/en_US/sample/*.txt"))
mydfm <- dfm(mycorpus, ngrams = 1:2, remove_punct = TRUE)
head(mydfm)

I can't show the output because I don't have your data, but this should work fine.
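One caveat: in more recent quanteda releases the `ngrams` argument was removed from `dfm()`, and n-gram construction moved to a separate tokens step. A self-contained sketch of the equivalent pipeline (the sample texts are made up, since I don't have your data):

```r
library(quanteda)

# Toy two-document corpus standing in for the sampled files
txt <- c(d1 = "The quick brown fox.",
         d2 = "Jumped over the lazy dog.")

# Tokenize, then build unigrams and bigrams before creating the dfm
toks  <- tokens(corpus(txt), remove_punct = TRUE)
mydfm <- dfm(tokens_ngrams(toks, n = 1:2))

head(featnames(mydfm))  # bigram features appear as "quick_brown", etc.
```

`tokens_ngrams()` joins the words of each n-gram with `_` by default, so bigrams are easy to spot among the features.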
https://stackoverflow.com/questions/45271003