I have been fascinated by the LDA model for a long time, and while studying the gensim library I recently found a set of meaningful, fairly complete LDA tutorials, which led to this series. The series covers three models: Latent Dirichlet Allocation, Author-Topic Model, and Dynamic Topic Models.
pyLDA series model | Description | Features |
---|---|---|
ATM model (Author-Topic Model) | Adds supervised 'author' information, modeling each author's preference over topics; drawbacks: chained topics, intruded words, random topics, and unbalanced topics (see Mimno and co-authors 2011) | Author topic preferences, word topic preferences, similar-author recommendation, visualization |
LDA model (Latent Dirichlet Allocation) | Topic model | Document topic preferences, word topic preferences, topic content display, topic-term matrix |
DTM model (Dynamic Topic Models) | Adds a time dimension, so topics change over time | Time-topic term matrix, topic-time term matrix, document topic preferences, prediction for new documents, document similarity across time and topic attributes |
This post is a quick walkthrough of the standard LDA workflow:
Material | Explanation | Example |
---|---|---|
corpus | The bag-of-words corpus: one list of (token_id, count) tuples per document, familiar to any gensim user | [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2)]] |
id2word | Mapping from each word ID to its token, built from the dictionary: id2word = dictionary.id2token | {0: ' 0', 1: ' American nstitute of Physics 1988 ', 2: ' Fig', 3: ' The', 4: '1 1', 5: '2 2', 6: '2 3', 7: 'CA 91125 ', 8: 'CONCLUSIONS '} |
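For context, here is a minimal sketch (using hypothetical toy documents) of how corpus and id2word are usually built with gensim's Dictionary before training:

from gensim.corpora import Dictionary

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response']]

dictionary = Dictionary(texts)                         # token <-> id mappings
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words per document

# id2token is built lazily; indexing the dictionary once populates it
_ = dictionary[0]
id2word = dictionary.id2token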
Based on the official documentation: models.ldamodel – Latent Dirichlet Allocation
class gensim.models.ldamodel.LdaModel(corpus=None, num_topics=100, id2word=None, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, ns_conf=None, minimum_phi_value=0.01, per_word_topics=False, callbacks=None, dtype=<type 'numpy.float32'>)
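A minimal training sketch (the parameter values are illustrative, and corpus/dictionary are reused from the sketch above):

from gensim.models import LdaModel

# train a small LDA model; eval_every=1 logs convergence after every update
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                 passes=20, iterations=400, eval_every=1, random_state=42)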
I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model, look for a line in the log that looks something like this:
2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations
If you set passes = 20, you will see this line 20 times. Make sure that by the final passes most of the documents have converged, so choose both passes and iterations high enough for this to happen.
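The logging setup this advice assumes is roughly the following; note that the convergence line above is emitted at DEBUG level:

import logging

# send gensim's training progress (including the convergence line) to the console
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.DEBUG)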
Main function: get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
model.get_document_topics(corpus[0])
>>> [(1, 0.13500942), (3, 0.18280579), (4, 0.1801268), (7, 0.50190312)]
Returns the topic distribution of one document (here the document at index 0 in corpus); topics whose probability falls below minimum_probability are omitted.
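With per_word_topics=True, the same call also returns per-word topic information, roughly as follows:

# per_word_topics=True makes get_document_topics return a 3-tuple
doc_topics, word_topics, phi_values = model.get_document_topics(
    corpus[0], per_word_topics=True)
# doc_topics:  (topic_id, probability) pairs, as above
# word_topics: for each word id, the topics it is most likely assigned to
# phi_values:  for each word id, its (topic_id, phi) relevance values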
get_term_topics(word_id, minimum_probability=None)
The get_term_topics method returns the topics most relevant to a given word in the dictionary, called as model.get_term_topics(word_id, minimum_probability=None). The ATM model offers this function as well.
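A quick usage sketch, assuming the queried token exists in your dictionary:

word_id = dictionary.token2id['computer']  # 'computer' is an assumed token
model.get_term_topics(word_id)             # -> list of (topic_id, probability) pairs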
model.get_topics()
get_topics returns the full topic-term matrix: a numpy array of shape (num_topics, vocabulary_size) in which each row is one topic's probability distribution over the vocabulary (its output is shown under "# Function 2" below).
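A small sanity-check sketch of that matrix:

topic_term = model.get_topics()
print(topic_term.shape)      # (num_topics, vocabulary_size)
print(topic_term[0].sum())   # each row is a distribution, so this is ~1.0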
get_topic_terms(topicid, topn=10)
Given a topic ID, get_topic_terms returns that topic's most significant words as (word_id, probability) pairs.
# Function 1
model.get_topic_terms(1, topn=10)
>>> [(774, 0.019700538013351386),
(3215, 0.0075965808303036916),
(3094, 0.0067132528809042526),
(514, 0.0063925849599646822),
(2739, 0.0054527647598129206),
(341, 0.004987335769043616),
(752, 0.0046566448210636699),
(1218, 0.0046234352422933724),
(186, 0.0042132891022475458),
(829, 0.0041800479706789939)]
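To make the word ids above readable, you can map them back through the model's id2word, for example:

# translate (word_id, probability) pairs into (word, probability) pairs
[(model.id2word[wid], prob) for wid, prob in model.get_topic_terms(1, topn=10)]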
# Function 2: output of model.get_topics()
>>> array([[ 9.57974777e-05, 6.17130780e-07, 6.34938224e-07, ...,
6.17080048e-07, 6.19691132e-07, 6.17090716e-07],
[ 9.81065671e-05, 3.12945042e-05, 2.80837858e-04, ...,
7.86879291e-07, 7.86479617e-07, 7.86592758e-07],
[ 4.57734625e-05, 1.33555568e-05, 2.55108081e-05, ...,
5.31796854e-07, 5.32000122e-07, 5.31934336e-07],
Displaying the results:
# Function 1
model.print_topic(1, topn=10)
>>> '0.025*"image" + 0.010*"object" + 0.008*"distance" + 0.007*"recognition" + 0.005*"pixel" + 0.004*"cluster" + 0.004*"class" + 0.004*"transformation" + 0.004*"constraint" + 0.004*"map"'
# Function 2
model.print_topics(num_topics=20, num_words=10)
[(0,
'0.008*"gaussian" + 0.007*"mixture" + 0.006*"density" + 0.006*"matrix" + 0.006*"likelihood" + 0.005*"noise" + 0.005*"component" + 0.005*"prior" + 0.005*"estimate" + 0.004*"log"'),
(1,
'0.025*"image" + 0.010*"object" + 0.008*"distance" + 0.007*"recognition" + 0.005*"pixel" + 0.004*"cluster" + 0.004*"class" + 0.004*"transformation" + 0.004*"constraint" + 0.004*"map"'),
(2,
'0.011*"visual" + 0.010*"cell" + 0.009*"response" + 0.008*"field" + 0.008*"motion" + 0.007*"stimulus" + 0.007*"direction" + 0.005*"orientation" + 0.005*"eye" + 0.005*"frequency"')]
Given a topic number, show_topic returns the topic's significant words with their probabilities; show_topics renders each topic as a weighted-word expression:
show_topic(topicid, topn=10)
>>> [('action', 0.013790729946622874),
('control', 0.013754026606322274),
('policy', 0.010037394726575378),
('q', 0.0087439205722043382),
('reinforcement', 0.0087102831394097746),
('optimal', 0.0074764680531377312),
('robot', 0.0057665635437760083),
('controller', 0.0053787501576589725)]
# second
model.show_topics(num_topics=10)
>>> [(0,
'0.014*"action" + 0.014*"control" + 0.010*"policy" + 0.009*"q" + 0.009*"reinforcement" + 0.007*"optimal" + 0.006*"robot" + 0.005*"controller" + 0.005*"dynamic" + 0.005*"environment"'),
(1,
'0.020*"image" + 0.008*"face" + 0.007*"cluster" + 0.006*"signal" + 0.005*"source" + 0.005*"matrix" + 0.005*"filter" + 0.005*"search" + 0.004*"distance" + 0.004*"o_o"')]
Topic coherence metric:
top_topics = model.top_topics(corpus)
tc = sum([t[1] for t in top_topics])
Here top_topics returns one entry per topic (10 topics in this run), each a pair of (the topic's top words with probabilities, the topic's coherence score):
[([(0.0081142522, 'gaussian'), (0.0029860872, 'hidden')],
-0.83264680887371556),
([(0.010487712, 'layer'), (0.0023913214, 'solution')],
-0.96372771081309494)]
...
Here tc is the sum of the coherence scores over all topics; you can also compute the average (num_topics is the model's number of topics):
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)
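As an alternative sketch, gensim's CoherenceModel computes coherence directly; the 'u_mass' measure only needs the corpus, while measures such as 'c_v' also require the tokenized texts:

from gensim.models import CoherenceModel

# u_mass coherence over the training corpus; the dictionary is taken from the model
cm = CoherenceModel(model=model, corpus=corpus, coherence='u_mass')
print('u_mass coherence: %.4f' % cm.get_coherence())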