[Paper Recommendations] Six Recent Topic Model Papers — Convergence Rates, Large-Scale LDA, Deep Topic Modeling, Optimization, Emotion Intensity, Generalized Dynamic Topic Models

[Overview] The Zhuanzhi content team has compiled six recent papers on topic models and introduces them below. Enjoy!

1. Convergence Rates of Latent Topic Models Under Relaxed Identifiability Conditions


Author: Yining Wang

Affiliation: Carnegie Mellon University

Abstract: In this paper we study the frequentist convergence rate of the latent Dirichlet allocation (Blei et al., 2003) topic model. We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in the Wasserstein distance metric at a rate of $n^{-1/4}$, without assuming separability or non-degeneracy of the underlying topics and/or the existence of more than three words per document, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretic perspective. We also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.
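The headline result can be stated compactly. The following is a hedged paraphrase of the abstract's claim (the notation is ours, not the paper's, and the exact regularity conditions are in the paper): writing $\hat{G}_n$ for the maximum likelihood estimator fitted on $n$ documents and $[G_0]$ for the finite equivalence class of true parameters,

```latex
% Paraphrase of the abstract's rate claim; notation is ours, not the paper's.
\[
  \min_{G \in [G_0]} W\!\left(\hat{G}_n,\, G\right) \;=\; O_P\!\left(n^{-1/4}\right),
\]
% with a matching worst-case lower bound, so the exponent -1/4 cannot be improved.
```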

Source: arXiv, March 18, 2018

Link:

http://www.zhuanzhi.ai/document/791bd044fe3db3e6de1b4c3a328de3ea

2. CuLDA_CGS: Solving Large-scale LDA Problems on GPUs


Authors: Xiaolong Xie, Yun Liang, Xiuhong Li, Wei Tan

Affiliation: Peking University

Abstract: Latent Dirichlet allocation (LDA) is a popular topic model. Given that the input corpus of LDA algorithms consists of millions to billions of tokens, the LDA training process is very time-consuming, which may prevent the use of LDA in many scenarios, e.g., online services. GPUs have benefited modern machine learning algorithms and big data analysis, as they provide high memory bandwidth and computation power. Therefore, many frameworks, e.g., TensorFlow, Caffe, and CNTK, support using GPUs to accelerate popular data-intensive machine learning algorithms. However, we observe that existing LDA solutions on GPUs are not satisfactory. In this paper, we present CuLDA_CGS, a GPU-based efficient and scalable approach to accelerating large-scale LDA problems. CuLDA_CGS is designed to solve LDA problems efficiently at high throughput. To this end, we first carefully design a workload partition and synchronization mechanism to exploit the benefits of multiple GPUs. We then offload the LDA sampling process to each individual GPU, optimizing it from the sampling-algorithm, parallelization, and data-compression perspectives. Evaluations show that, compared with state-of-the-art LDA solutions, CuLDA_CGS outperforms them by a large margin (up to 7.3X) on a single GPU, and achieves an extra 3.0X speedup on 4 GPUs. The source code is publicly available at https://github.com/cuMF/CuLDA_CGS.
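For readers who want a concrete baseline, below is a minimal single-threaded Python reference of the collapsed Gibbs sampler (CGS) that CuLDA_CGS accelerates. This is a textbook sketch, not the authors' GPU implementation; their workload partitioning, sampling optimizations, and data compression live in the linked repository.

```python
import numpy as np

def lda_cgs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Single-threaded collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of token ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))            # document-topic counts
    nkw = np.zeros((K, V))                    # topic-word counts
    nk = np.zeros(K)                          # per-topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):            # initialize counts
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                   # remove token from counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z = k | rest)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                   # add token back under new topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

The inner resampling loop is exactly the part that is hard to parallelize (topic counts are shared mutable state), which is why the paper's partitioning and synchronization design matters.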

Source: arXiv, March 13, 2018

Link:

http://www.zhuanzhi.ai/document/34a1e75e4ab744eec51bb1b8096a13b4

3. WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling


Authors: Hao Zhang, Bo Chen, Dandan Guo, Mingyuan Zhou

Affiliations: Xidian University, The University of Texas at Austin

Abstract: To train an inference network jointly with a deep generative topic model, making it both scalable to big corpora and fast in out-of-sample prediction, we develop Weibull hybrid autoencoding inference (WHAI) for deep latent Dirichlet allocation, which infers posterior samples via a hybrid of stochastic-gradient MCMC and autoencoding variational Bayes. The generative network of WHAI has a hierarchy of gamma distributions, while the inference network of WHAI is a Weibull upward-downward variational autoencoder, which integrates a deterministic-upward deep neural network and a stochastic-downward deep generative model based on a hierarchy of Weibull distributions. The Weibull distribution can closely approximate a gamma distribution, has an analytic Kullback-Leibler divergence from it, and admits a simple reparameterization via uniform noise, which together make it efficient to compute the gradients of the evidence lower bound with respect to the parameters of the inference network. The effectiveness and efficiency of WHAI are illustrated with experiments on big corpora.
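The reparameterization the abstract mentions is the standard inverse-CDF trick for the Weibull distribution, which is easy to verify in a few lines. A minimal sketch (ours, not the authors' code), assuming the shape/scale parameterization whose mean is $\lambda\Gamma(1+1/k)$:

```python
import numpy as np
from scipy.special import gamma as Gamma

def weibull_rsample(k, lam, rng):
    """Reparameterized Weibull(shape=k, scale=lam) draw, as used by WHAI's
    inference network: all randomness sits in u, so gradients flow to k, lam."""
    u = rng.uniform(size=np.shape(lam))
    return lam * (-np.log(u)) ** (1.0 / k)

# Sanity check: the Weibull(k, lam) mean is lam * Gamma(1 + 1/k).
rng = np.random.default_rng(0)
k, lam = 1.5, 2.0
x = weibull_rsample(k, lam * np.ones(100_000), rng)
print(x.mean(), lam * Gamma(1 + 1 / k))  # the two should agree closely
```

Because the sample is a deterministic, differentiable function of (k, lam) given u, the ELBO gradient can be estimated with low variance, unlike for a raw gamma draw.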

Source: arXiv, March 4, 2018

Link:

http://www.zhuanzhi.ai/document/bc25b1fdf3ff6db4ac6ba4fa28c63ac1

4. Application of Rényi and Tsallis Entropies to Topic Modeling Optimization


Author: Sergei Koltcov

Abstract: This full-length article (draft version) discusses the problem of choosing the number of topics in topic modeling. We propose that Rényi and Tsallis entropies can be used to identify the optimal number of topics in large textual collections. We also report the results of numerical experiments on semantic stability for four topic models, which show that semantic stability plays a very important role in the topic-number problem. The calculation of the Rényi and Tsallis entropies is based on a thermodynamic approach.
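The deformed entropies themselves are one-liners; the paper's contribution is the thermodynamics-style criterion built on top of them. A minimal sketch of the standard Rényi and Tsallis entropies (the scan over topic counts below is a hypothetical illustration on random topic-word matrices, not the paper's exact free-energy procedure):

```python
import numpy as np

def renyi_entropy(p, q):
    """Rényi entropy H_q(p) = log(sum_i p_i^q) / (1 - q), for q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

# Hypothetical usage: scan candidate topic counts K and track the entropy of
# the flattened topic-word matrix (a stand-in for the paper's criterion, which
# looks for an extremum of a free-energy-like quantity over K).
for K in (5, 10, 20, 40):
    phi = np.random.default_rng(K).dirichlet(np.full(1000, 0.01), size=K)
    p = phi.ravel() / K  # renormalize the K rows into one distribution
    print(K, renyi_entropy(p, q=2), tsallis_entropy(p, q=2))
```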

Source: arXiv, March 1, 2018

Link:

http://www.zhuanzhi.ai/document/816c7644baa708ae678d14b7f8abdf28

5. Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions


Authors: Jing Peng, Anna Feldman, Ekaterina Vylomova

Affiliations: Bauman State Technical University, Montclair State University

Abstract: We describe an algorithm for the automatic classification of idiomatic and literal expressions. Our starting point is that words in a given text segment, such as a paragraph, that are high-ranking representatives of a common topic of discussion are less likely to be part of an idiomatic expression. Our additional hypothesis is that the contexts in which idioms occur are typically more affective, and we therefore incorporate a simple analysis of the intensity of the emotions expressed by the contexts. We investigate the bag-of-words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, latent Dirichlet allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that they usually present different semantics than the words used in the local topic. We treat idioms as semantic outliers, and the identification of a semantic shift as outlier detection. This topic representation thus allows us to differentiate idioms from literals using local semantic contexts. Our results are encouraging.
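A minimal sketch of the topic-outlier idea using scikit-learn's LDA. This is our paraphrase of the approach, not the authors' pipeline, and it omits the emotion-intensity features; `topic_outlier_score` and its scoring rule are hypothetical stand-ins:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_outlier_score(paragraphs, target_phrase, n_topics=10):
    """Score how far a target phrase sits from each paragraph's dominant
    topic; higher scores suggest a semantic outlier, i.e. a possible idiom."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(paragraphs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    doc_topic = lda.transform(X)              # per-paragraph topic mixture
    vocab = vec.vocabulary_
    scores = []
    for d in range(len(paragraphs)):
        k = doc_topic[d].argmax()             # dominant local topic
        ids = [vocab[w] for w in target_phrase.lower().split() if w in vocab]
        # Low probability of the phrase's words under the local topic marks
        # the phrase as non-compositional relative to its context.
        p = topic_word[k, ids].mean() if ids else 0.0
        scores.append(-np.log(p) if ids and p > 0 else np.nan)
    return scores
```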

Source: arXiv, February 27, 2018

Link:

http://www.zhuanzhi.ai/document/3a2e1b8fb8dfebf67b9d077c7064302e

6. Scalable Generalized Dynamic Topic Models


Authors: Patrick Jähnichen, Florian Wenzel, Marius Kloft, Stephan Mandt

Affiliation: Humboldt-Universität zu Berlin

Abstract: Dynamic topic models (DTMs) model the evolution of prevalent themes in literature, online media, and other forms of text over time. DTMs assume that word co-occurrence statistics change continuously and therefore impose continuous stochastic-process priors on their model parameters. These dynamical priors make inference much harder than in regular topic models and also limit scalability. In this paper, we present several new results around DTMs. First, we extend the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs). This allows us to explore topics that develop smoothly over time, have long-term memory, or are temporally concentrated (for event detection). Second, we show how to perform scalable approximate inference in these models based on ideas from stochastic variational inference and sparse Gaussian processes. This way we can fit a rich family of DTMs to massive data. Our experiments on several large-scale datasets show that our generalized model allows us to find interesting patterns that were not accessible by previous approaches.
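The key modeling move is swapping the Wiener-process prior of classic DTMs for an arbitrary GP kernel. A minimal sketch (ours, not the paper's inference code) showing how the kernel choice controls topic dynamics: min(s, t) recovers the Wiener prior, an RBF kernel gives smooth trajectories, and an Ornstein-Uhlenbeck kernel gives short memory:

```python
import numpy as np

def sample_gp_topic_path(times, kernel, n_words=5, seed=0):
    """Draw one topic's word-probability trajectory from a GP prior:
    each word's natural parameter evolves as a GP over time, and a
    softmax maps the parameters to a word distribution at each step."""
    rng = np.random.default_rng(seed)
    T = len(times)
    K = kernel(times[:, None], times[None, :]) + 1e-8 * np.eye(T)  # jittered
    L = np.linalg.cholesky(K)
    f = L @ rng.standard_normal((T, n_words))      # T x n_words GP draws
    e = np.exp(f - f.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # word distribution per time

times = np.linspace(0.01, 1.0, 50)
wiener = lambda s, t: np.minimum(s, t)                        # classic DTM prior
rbf = lambda s, t: np.exp(-(s - t) ** 2 / (2 * 0.1 ** 2))     # smooth topics
ou = lambda s, t: np.exp(-np.abs(s - t) / 0.1)                # short memory
for kern in (wiener, rbf, ou):
    print(sample_gp_topic_path(times, kern)[0])   # word distribution at t=0.01
```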

Source: arXiv, March 21, 2018

Link:

http://www.zhuanzhi.ai/document/b0d86c66d398e7a43f7a4b445251e1ae

Originally published on the WeChat official account Zhuanzhi (Quan_Zhuanzhi) on March 29, 2018.
