Q: Specifying "good"/"bad" cases in an LDA model (using gensim in Python)

Stack Overflow user

Asked 2016-08-09 21:43:39

Answers: 1, Views: 354, Followers: 0, Votes: 0

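I am trying to analyse news snippets in order to identify periods of crisis. To do so, I have downloaded news articles from the past 7 years and have them available. I am now applying an LDA (Latent Dirichlet Allocation) model to this dataset in order to identify the countries that show signs of an economic crisis.

My code is based on a blog post by Jordan Barber (https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html). Here is my code so far: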

import csv

# create doc_set: one text block per CSV row, taken from the first column
# (a separate variable named `list` would shadow the built-in, so it is avoided)
doc_set = []

with open('Testfile.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row:  # skip empty rows
            doc_set.append(row[0])

#import plugins - need to install gensim and stop_words manually for fresh python install
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()


# list for tokenized documents in loop
texts = []

# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)


# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=10)

print(ldamodel.print_topics(num_topics=5, num_words=5))

# map topics to documents; each row is a list of (topic_id, probability) pairs
doc_lda = ldamodel[corpus]

with open('doc_lda.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in doc_lda:
        writer.writerow(row)

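Essentially, I identify a number of topics (5 in the code above; still to be checked) and, in the final lines, assign each news article a score indicating the probability that the article relates to each of these topics. For now, I can only assess manually and qualitatively whether a given topic is crisis-related, which is a bit unfortunate. What I would much rather do is tell the algorithm whether an article was published during a crisis and use this additional information to identify topics for my "crisis years" and my "non-crisis years". Simply splitting my dataset to consider only my "bad" periods (i.e. crisis years only) will not work in my view, because I would still need to pick manually which topics are actually crisis-related and which would show up anyway (sports news, ...).

So, is there a way to adapt the code to a) incorporate the "crisis" vs. "non-crisis" information and b) automatically choose the optimal number of topics/words to optimise the model's predictive power?

Thanks a lot in advance!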

python

python-2.7

lda

gensim

1 Answer

Stack Overflow user

Answered 2016-08-10 18:13:06

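First, some suggestions on your specific questions:

a) incorporate the "crisis" vs. "non-crisis" information

To do this with a standard LDA model, I would probably look at the mutual information between the document-topic proportions and whether the document falls in a crisis or non-crisis period.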
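As a rough illustration of the mutual-information idea, here is a minimal sketch in plain Python. It assumes the document-topic proportions have already been turned into a dense list of rows (one probability per topic) and that each document carries a 0/1 crisis label. The function names, the `threshold` used to binarise the proportions, and the sample numbers are all assumptions made for this example; they are not part of gensim or of the original answer.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi


def topic_crisis_mi(doc_topics, crisis_labels, threshold=0.2):
    """Score each topic by the MI between 'topic is prominent in the
    document' (proportion >= threshold, a hypothetical cut-off) and the
    crisis label of that document."""
    num_topics = len(doc_topics[0])
    scores = {}
    for k in range(num_topics):
        prominent = [row[k] >= threshold for row in doc_topics]
        scores[k] = mutual_information(prominent, crisis_labels)
    return scores


# Made-up proportions for four documents and two topics.
doc_topics = [[0.9, 0.1], [0.7, 0.3], [0.1, 0.9], [0.05, 0.95]]
crisis_labels = [1, 1, 0, 0]
scores = topic_crisis_mi(doc_topics, crisis_labels)
```

Topics with a high score separate crisis from non-crisis documents well; topics that appear regardless of period (sports news, for instance) should score close to zero. Binarising the proportions is the simplest option; a finer binning would keep more information at the cost of needing more documents per bin.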

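b) automatically choose the optimal number of topics/words to optimise the model's predictive power?

If you want to do this properly, try a range of settings for the number of topics, and use the topic model to predict crisis/non-crisis for held-out documents (documents that were not included when fitting the topic model).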
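The held-out procedure could be sketched roughly as follows, assuming Python 3, tokenised documents like the `texts` list in the question, and one 0/1 crisis label per document. The names `pick_num_topics`, `lda_topic_vectors` and `nearest_centroid_accuracy` are hypothetical helpers written for this example, and the nearest-centroid classifier is only a stand-in for whatever classifier you prefer. The gensim part is imported lazily so the selection loop can also be tried with a different featuriser.

```python
import random


def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Label each held-out vector with the class of the closest centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[c])))

    hits = sum(predict(x) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def lda_topic_vectors(train_texts, test_texts, num_topics):
    """Fit LDA on the training texts only; return dense topic vectors."""
    from gensim import corpora, models  # requires gensim to be installed

    dictionary = corpora.Dictionary(train_texts)
    train_bow = [dictionary.doc2bow(t) for t in train_texts]
    lda = models.LdaModel(train_bow, num_topics=num_topics,
                          id2word=dictionary, passes=10, random_state=0)

    def dense(tokens):
        vec = [0.0] * num_topics
        pairs = lda.get_document_topics(dictionary.doc2bow(tokens),
                                        minimum_probability=0.0)
        for topic_id, prob in pairs:
            vec[topic_id] = prob
        return vec

    return [dense(t) for t in train_texts], [dense(t) for t in test_texts]


def pick_num_topics(texts, labels, candidates,
                    featurize=lda_topic_vectors, test_fraction=0.25, seed=0):
    """Return (best_k, scores), where scores maps k to held-out accuracy."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    train_texts = [texts[i] for i in train_idx]
    test_texts = [texts[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]

    scores = {}
    for k in candidates:
        train_X, test_X = featurize(train_texts, test_texts, k)
        scores[k] = nearest_centroid_accuracy(train_X, train_y, test_X, test_y)
    return max(scores, key=scores.get), scores
```

A call such as `pick_num_topics(texts, crisis_labels, [5, 10, 20, 50])` would then compare the candidate settings. A single random split is noisy on a small corpus, so repeating the split or using cross-validation would give a steadier picture.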

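There are many topic model variants that effectively choose the number of topics for you ("non-parametric" models). It turns out that the Mallet implementation with hyperparameter optimisation effectively does the same thing, so I would suggest using that. Give it a large number of topics: hyperparameter optimisation will leave many topics with very few words assigned to them, and those topics are just noise.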
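Once a model has been fitted with a generous number of topics, the near-empty ones can be flagged with a few lines of plain Python. The sketch below assumes dense document-topic rows and the token count of each document. The helper names and the `min_share` cut-off of 1% are assumptions made for illustration, not part of Mallet or gensim. (As a loose analogue in gensim, `LdaModel` accepts `alpha='auto'` to learn an asymmetric document-topic prior; whether that matches Mallet's behaviour closely is not something the answer claims.)

```python
def topic_token_mass(doc_topics, doc_lengths):
    """Expected number of tokens per topic: sum over documents of
    (document length * topic proportion)."""
    num_topics = len(doc_topics[0])
    mass = [0.0] * num_topics
    for row, length in zip(doc_topics, doc_lengths):
        for k, proportion in enumerate(row):
            mass[k] += length * proportion
    return mass


def split_noise_topics(doc_topics, doc_lengths, min_share=0.01):
    """Separate topics into (kept, noise) by their share of all tokens.
    min_share is a hypothetical cut-off; tune it for your corpus."""
    mass = topic_token_mass(doc_topics, doc_lengths)
    total = sum(mass)
    kept = [k for k, m in enumerate(mass) if m / total >= min_share]
    noise = [k for k, m in enumerate(mass) if m / total < min_share]
    return kept, noise


# Made-up proportions for two documents and three topics.
doc_topics = [[0.6, 0.395, 0.005], [0.5, 0.5, 0.0]]
doc_lengths = [100, 200]
kept, noise = split_noise_topics(doc_topics, doc_lengths)
```

The number of topics in `kept` then serves as a rough estimate of how many topics the data actually supports.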

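And some general comments:

There are many topic model variants, in particular several that incorporate time. These may be a good choice, since they handle topics that change over time better than standard LDA does, although standard LDA is a good starting point.

A model I particularly like uses Pitman-Yor word priors (a better match for Zipf-distributed words than a Dirichlet), accounts for burstiness within topics, and provides clues about junk topics: https://github.com/wbuntine/topic-models

Votes: 0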

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/38852506
