1. Topic Modeling on Health Journals with Regularized Variational Inference
Authors: Robert Giaquinto, Arindam Banerjee
Abstract: Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona, where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model's personas to be distinct. Our results show significant improvements over competing topic models, particularly after regularization, and highlight the DAP model's unique ability to capture common journeys shared by different authors.
2. Latent nested nonparametric priors
Authors: Federico Camerlenghi, David B. Dunson, Antonio Lijoi, Igor Prünster, Abel Rodríguez
Abstract: Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
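The "add, then normalise" construction described in this abstract can be illustrated with a toy finite-atom stand-in. This is only a sketch under strong assumptions: a real completely random measure has infinitely many atoms, and the atom labels, the Gamma(shape, 1) jump sizes, and the function names here are hypothetical choices for illustration, not the paper's model or sampler.

```python
import random

def gamma_jumps(atoms, shape, rng):
    """Finite stand-in for a random measure: one Gamma(shape, 1) jump per atom."""
    return {a: rng.gammavariate(shape, 1.0) for a in atoms}

def latent_nested_sketch(shared_atoms, group_atoms, shape=1.0, seed=0):
    """Return one probability vector per group over shared and group-specific atoms."""
    rng = random.Random(seed)
    common = gamma_jumps(shared_atoms, shape, rng)   # component shared by all groups
    measures = []
    for atoms in group_atoms:
        own = gamma_jumps(atoms, shape, rng)         # group-specific component
        combined = dict(common)
        for a, w in own.items():                     # add the two measures atom by atom
            combined[a] = combined.get(a, 0.0) + w
        total = sum(combined.values())
        measures.append({a: w / total for a, w in combined.items()})  # normalise
    return measures

# Two groups that share atoms "s1" and "s2" and each have one atom of their own.
p1, p2 = latent_nested_sketch(["s1", "s2"], [["a1"], ["b1"]])
```

Because both groups reuse the same shared jumps, the relative weight of `s1` to `s2` is identical in `p1` and `p2`, while the group-specific atoms let the two distributions differ.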
3. Between an Arena and a Sports Bar: Online Chats of eSports Spectators
Authors: Ilya Musabirov, Denis Bulygin, Paul Okopny, Ksenia Konstantinova
Abstract: ESports tournaments, such as Dota 2's The International (TI), attract millions of spectators to watch broadcasts on online streaming platforms, to communicate, and to share their experience and emotions. Unlike traditional streams, tournament broadcasts lack a streamer figure to which spectators can appeal directly. Using topic modelling and cross-correlation analysis of more than three million messages from 86 games of TI7, we uncover the main topical and temporal patterns of communication. First, we disentangle contextual meanings of emotes and memes, which play a salient role in communication, and show a meta-topics semantic map of streaming slang. Second, our analysis shows a prevalence of event-driven game communication during tournament broadcasts and particular topics associated with the event peaks. Third, we show that "copypasta" cascades and other related practices, while occupying a significant share of messages, are strongly associated with periods of lower in-game activity. Based on the analysis, we propose design ideas to support different modes of spectators' communication.
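As a rough illustration of the cross-correlation analysis mentioned in this abstract, the sketch below computes the Pearson correlation between two equally spaced time series at a range of lags. The per-minute counts are made-up data, and the function name and lag convention are assumptions for this example, not the authors' pipeline.

```python
from statistics import mean, pstdev

def cross_correlation(x, y, max_lag):
    """Return {lag: Pearson correlation between x[t] and y[t + lag]}."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        # Align the two series so that x[t] is paired with y[t + lag].
        if lag >= 0:
            xs, ys = x[:len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:len(y) + lag]
        sx, sy = pstdev(xs), pstdev(ys)
        if sx == 0 or sy == 0:
            result[lag] = 0.0  # correlation is undefined for a constant window
            continue
        mx, my = mean(xs), mean(ys)
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)
        result[lag] = cov / (sx * sy)
    return result

# Hypothetical per-minute counts: in-game events, and chat messages that
# repeat the same pattern two minutes later.
events = [0, 0, 5, 1, 0, 0, 7, 2, 0, 0, 0, 4, 1, 0, 0, 0]
chat = [0, 0] + events[:-2]
cc = cross_correlation(events, chat, max_lag=4)
best_lag = max(cc, key=cc.get)  # lag with the strongest positive correlation
```

In this toy data the chat series is the event series delayed by two steps, so the correlation peaks at a lag of 2. A peak at a positive lag suggests chat activity following game events; a strongly negative correlation would instead suggest messages clustering in quieter periods.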
4. Knowledge-based Word Sense Disambiguation using Topic Models
Authors: Devendra Singh Chaplot, Ruslan Salakhutdinov
Abstract: Word Sense Disambiguation (WSD) is an open problem in Natural Language Processing which is particularly challenging and useful in the unsupervised setting, where all the words in any given text need to be disambiguated without using any labeled data. Typically, WSD systems use the sentence or a small window of words around the target word as the context for disambiguation because their computational complexity scales exponentially with the size of the context. In this paper, we leverage the formalism of topic models to design a WSD system that scales linearly with the number of words in the context. As a result, our system is able to utilize the whole document as the context for a word to be disambiguated. The proposed method is a variant of Latent Dirichlet Allocation in which the topic proportions for a document are replaced by synset proportions. We further utilize the information in WordNet by assigning a non-uniform prior to the synset distribution over words and a logistic-normal prior to the document distribution over synsets. We evaluate the proposed method on the Senseval-2, Senseval-3, SemEval-2007, SemEval-2013 and SemEval-2015 English All-Word WSD datasets and show that it outperforms the state-of-the-art unsupervised knowledge-based WSD system by a significant margin.
5. Topic Compositional Neural Language Model
Authors: Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, Lawrence Carin
Abstract: We propose a Topic Compositional Neural Language Model (TCNLM), a novel method designed to simultaneously capture both the global semantic meaning and the local word ordering structure in a document. The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence. In order to train the MoE model efficiently, a matrix factorization method is applied, by extending each weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. The degree to which each member of the ensemble is used is tied to the document-dependent probability of the corresponding topics. Experimental results on several corpora show that the proposed approach outperforms both a pure RNN-based model and other topic-guided language models. Further, our model yields sensible topics, and also has the capacity to generate meaningful sentences conditioned on given topics.
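The idea of tying each expert's contribution to the document's topic probabilities can be sketched as a weighted blend of per-topic weight matrices. This is a naive illustration only: it stores one full matrix per topic, which is exactly the cost the factorization mentioned in the abstract is meant to avoid, and the helper names, the tiny 2x2 matrices, and the vanilla tanh RNN cell are all assumptions made for this example.

```python
import math

def mix_weights(topic_probs, experts):
    """Blend per-topic weight matrices: W(t) = sum_k t[k] * experts[k]."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(t * W[i][j] for t, W in zip(topic_probs, experts))
             for j in range(cols)]
            for i in range(rows)]

def rnn_step(W, U, x, h):
    """One vanilla RNN step: h_new = tanh(W x + U h)."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x)))
                      + sum(U[i][j] * h[j] for j in range(len(h))))
            for i in range(len(h))]

# Two hypothetical topic experts for the input-to-hidden weights.
experts = [
    [[2.0, 0.0], [0.0, 2.0]],   # expert for topic 0
    [[0.0, 0.0], [0.0, 0.0]],   # expert for topic 1
]
topic_probs = [0.5, 0.5]        # document-level topic probabilities
W = mix_weights(topic_probs, experts)
U = [[0.0, 0.0], [0.0, 0.0]]    # hidden-to-hidden weights, zero for simplicity
h = rnn_step(W, U, x=[0.3, -0.1], h=[0.0, 0.0])
```

A document dominated by one topic would push `topic_probs` toward a one-hot vector, so the blended matrix approaches that topic's expert.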
6. Multilingual Topic Models
Authors: Kriste Krstovski, Michael J. Kurtz, David A. Smith, Alberto Accomazzi
Abstract: Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.
This article is shared from the WeChat official account Zhuanzhi (Quan_Zhuanzhi); compiled by the Zhuanzhi content team.