Q: Using ngrams with hash_vectorizer in text2vec

Stack Overflow user

Asked 2017-12-14 22:38:44

1 answer, 451 views, 0 followers, score 1

When I try to create ngrams with the hash_vectorizer function in text2vec, I notice that it does not change the dimensions of my dtm; only the values change.

h_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L))
dtm_train = create_dtm(it_train, h_vectorizer)
dim(dtm_train)

With the code above, the dimensions stay the same whether the ngram window is 2-10 or 9-10.

vocab = create_vocabulary(it_train, ngram = c(1L, 4L))
ngram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, ngram_vectorizer)

With the code above, the dimensions do change, but I would also like to use hash_vectorizer because it saves space. How do I go about using it?

hash

text-mining

text2vec

1 Answer

Stack Overflow user

Answered 2017-12-14 23:40:47

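When you use hashing, you have to set the size of the output matrix in advance. You did so by setting hash_size = 2 ^ 14. This size stays the same regardless of the ngram window specified in the model. The counts in the output matrix do change, however.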
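A minimal sketch of that point, assuming two made-up sample strings in place of your it_train data. The helper count_columns is a hypothetical name introduced only for this illustration; it reports the total number of columns and the number of columns that actually contain a count.

```r
library(text2vec)

# Sample texts are an assumption; substitute your own iterator (e.g. it_train).
txt <- c("a string string", "and another string")
it <- itoken(txt, progressbar = FALSE)

# Hypothetical helper: build a hashed dtm for a given ngram window and
# report its width plus the number of columns holding at least one count.
count_columns <- function(ngram) {
  dtm <- create_dtm(it, hash_vectorizer(hash_size = 2^14, ngram = ngram))
  c(columns = ncol(dtm), used = sum(Matrix::colSums(dtm) > 0))
}

uni <- count_columns(c(1L, 1L))  # unigrams only
bi  <- count_columns(c(1L, 2L))  # unigrams + bigrams
print(rbind(uni, bi))
```

Both rows report 2^14 = 16384 columns. Only the number of occupied columns can grow when the ngram window is widened, which is why dim() alone does not reveal the effect of the ngram setting.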

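(In response to the comment below:) Below you will find a minimal example with two very simple strings that demonstrates the different outputs of two different ngram windows used in hash_vectorizer. For the bigram case I have added the output matrix of vocab_vectorizer for comparison. As you will see, you have to set a hash size large enough to account for all terms. If it is too small, the hash values of individual terms may collide.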
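To get a feeling for how large "large enough" is, here is a rough back-of-the-envelope sketch in base R. It assumes an idealised hash that spreads terms uniformly at random over the buckets, which a real hash function only approximates; the function names expected_occupied and expected_lost are made up for this illustration.

```r
# Expected number of occupied buckets when n_terms distinct terms are
# hashed uniformly at random into hash_size buckets:
#   hash_size * (1 - (1 - 1 / hash_size)^n_terms)
expected_occupied <- function(n_terms, hash_size) {
  hash_size * (1 - (1 - 1 / hash_size)^n_terms)
}

# Expected number of terms that end up sharing a bucket with an earlier term.
expected_lost <- function(n_terms, hash_size) {
  n_terms - expected_occupied(n_terms, hash_size)
}

expected_lost(4, 2^2)      # 4 terms, 4 buckets: 1.265625 terms lost on average
expected_lost(4, 2^4)      # same terms, 16 buckets: noticeably fewer
expected_lost(1e5, 2^14)   # large vocabulary, small table: most terms collide
expected_lost(1e5, 2^18)   # same vocabulary, larger table
```

Under this assumption, adding ngrams increases the number of distinct terms, so a hash_size that was comfortable for unigrams may become too small once bigrams or longer ngrams are included.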

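Your comment that you would always have to compare the outputs of the vocab_vectorizer approach and the hash_vectorizer approach points in the wrong direction, because you would then lose the efficiency/memory advantage of hashing, which comes from not having to generate a vocabulary. Depending on your data and the desired output, hashing may trade accuracy (and interpretability of the terms in the dtm) against efficiency. Whether hashing is reasonable therefore depends on your use case (it is especially suited to document-level classification tasks on large collections).

I hope this gives you a rough idea of hashing and of what you can and cannot expect from it. You might also check some posts on hashing on Quora or Wikipedia, or refer to the detailed original sources listed on text2vec.org.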

library(text2vec)
txt <- c("a string string", "and another string")

it = itoken(txt, progressbar = FALSE)


#the following four examples demonstrate the effect of the size of the hash
#and the use of signed hashes (i.e. a secondary hash function that assigns each term a sign, reducing the effect of collisions)
vectorizer_small = hash_vectorizer(2 ^ 2, c(1L, 1L)) #unigrams only
hash_dtm_small = create_dtm(it, vectorizer_small)
as.matrix(hash_dtm_small)
#    [,1] [,2] [,3] [,4]
# 1    2    0    0    1
# 2    1    2    0    0  #collision of the hash values of and / another

vectorizer_small_signed = hash_vectorizer(2 ^ 2, c(1L, 1L), signed_hash = TRUE) #unigrams only
hash_dtm_small = create_dtm(it, vectorizer_small_signed)
as.matrix(hash_dtm_small)
#     [,1] [,2] [,3] [,4]
# 1    2    0    0    1
# 2    1    0    0    0 #and / another still collide in column 2, but with opposite signs, so their counts cancel to 0

vectorizer_medium = hash_vectorizer(2 ^ 3, c(1L, 1L)) #unigrams only
hash_dtm_medium = create_dtm(it, vectorizer_medium)
as.matrix(hash_dtm_medium)
#    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# 1    0    0    0    1    2    0    0    0
# 2    0    1    0    0    1    1    0    0 #no collision, all terms represented by hash values


vectorizer_medium = hash_vectorizer(2 ^ 3, c(1L, 1L), signed_hash = TRUE) #unigrams only
hash_dtm_medium = create_dtm(it, vectorizer_medium)
as.matrix(hash_dtm_medium)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# 1    0    0    0    1    2    0    0    0
# 2    0   -1    0    0    1    1    0    0 #no collision, all terms represented as hash values
                                            #in addition the second hash function assigned a negative sign to the term in column 2


#the following three examples demonstrate the difference between
#two hash vectorizers, one with unigrams only and one allowing for bigrams,
#and one vocab vectorizer with bigrams
vectorizer = hash_vectorizer(2 ^ 4, c(1L, 1L)) #unigrams only
hash_dtm = create_dtm(it, vectorizer)
as.matrix(hash_dtm)
#    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# 1    0    0    0    0    0    0    0    0    0     0     0     1     2     0     0     0
# 2    0    0    0    0    0    0    0    0    0     1     0     0     1     1     0     0

vectorizer2 = hash_vectorizer(2 ^ 4, c(1L, 2L)) #unigrams + bigrams
hash_dtm2 = create_dtm(it, vectorizer2)
as.matrix(hash_dtm2)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# 1    1    0    0    1    0    0    0    0    0     0     0     1     2     0     0     0
# 2    0    0    0    0    0    1    1    0    0     1     0     0     1     1     0     0

v <- create_vocabulary(it, c(1L, 2L))
vectorizer_v = vocab_vectorizer(v) #unigrams + bigrams
v_dtm = create_dtm(it, vectorizer_v)
as.matrix(v_dtm)
#   a_string and_another a another and string_string another_string string
# 1        1           0 1       0   0             1              0      2
# 2        0           1 0       1   1             0              1      1


sum(Matrix::colSums(as.matrix(hash_dtm)) > 0)
#[1] 4   - these are the four unigrams a, string, and, another
sum(Matrix::colSums(hash_dtm2) > 0)
#[1] 8   - these are the four unigrams as above plus the 4 bigrams string_string, a_string, and_another, another_string 
sum(Matrix::colSums(v_dtm) > 0)
#[1] 8 - same as hash_dtm2
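
The behaviour of the two 2 ^ 2 examples above can be reproduced by hand with a toy model in base R. This is only a sketch: the bucket and sign assignments below are hypothetical values chosen to mirror the matrices shown above (which of "and" / "another" gets the negative sign is an assumption), and they are not computed by the hash functions text2vec uses internally.

```r
# Hypothetical bucket index per term for a table of size 4:
# "and" and "another" share bucket 2, i.e. they collide.
bucket <- c(a = 4L, string = 1L, and = 2L, another = 2L)

# Hypothetical sign per term, as a second hash function might assign it.
term_sign <- c(a = 1, string = 1, and = -1, another = 1)

# Toy version of the hashing trick: add 1 (or the term's sign) to the
# bucket of every token in the document.
hash_row <- function(tokens, size, signed = FALSE) {
  row <- numeric(size)
  for (tok in tokens) {
    weight <- if (signed) term_sign[[tok]] else 1
    row[bucket[[tok]]] <- row[bucket[[tok]]] + weight
  }
  row
}

doc2 <- c("and", "another", "string")
unsigned_row <- hash_row(doc2, 4)                 # 1 2 0 0
signed_row   <- hash_row(doc2, 4, signed = TRUE)  # 1 0 0 0
print(rbind(unsigned_row, signed_row))
```

In the unsigned case the two colliding terms add up to a count of 2; in the signed case they cancel to 0. Either way the collision itself is still there, which is why a larger hash size is the actual remedy.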

Score: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/47815933
