
Showing papers on "Latent Dirichlet allocation published in 2014"


Proceedings ArticleDOI
06 Oct 2014
TL;DR: In this paper, the authors propose a parameter server framework for distributed machine learning problems, where both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices.
Abstract: We propose a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance. To demonstrate the scalability of the proposed framework, we show experimental results on petabytes of real data with billions of examples and parameters on problems ranging from Sparse Logistic Regression to Latent Dirichlet Allocation and Distributed Sketching.

1,034 citations
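The push/pull interaction between workers and server nodes that the paper describes can be sketched in miniature. The following is a toy, single-process Python stand-in (all names hypothetical; no networking, asynchrony, or fault tolerance) that only illustrates the idea of sparse shared parameters updated by multiple workers:

```python
from collections import defaultdict

class ToyParameterServer:
    """Single-process stand-in for a server node: it holds globally
    shared parameters as a sparse vector (key -> float)."""
    def __init__(self):
        self.params = defaultdict(float)

    def push(self, grads, lr=0.1):
        # Workers push (possibly sparse) gradient contributions.
        for key, g in grads.items():
            self.params[key] -= lr * g

    def pull(self, keys):
        # Workers pull only the keys they need (sparse access pattern).
        return {k: self.params[k] for k in keys}

# Two "workers" push sparse gradients touching overlapping keys.
server = ToyParameterServer()
server.push({"w1": 1.0, "w3": -2.0})
server.push({"w1": 0.5})
print(server.pull(["w1", "w3"]))  # w1 has accumulated both updates
```

In the real system, pushes and pulls are asynchronous network messages and the consistency model governs how stale a worker's pulled parameters may be.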


Proceedings ArticleDOI
01 Jan 2014
TL;DR: LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation, is built using a combination of R and D3; the authors also propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation.
Abstract: We present LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation that is built using a combination of R and D3. Our visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. First, we propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation, in which we define the relevance of a term to a topic. Second, we present results from a user study that suggest that ranking terms purely by their probability under a topic is suboptimal for topic interpretation. Last, we describe LDAvis, our visualization system that allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model.

836 citations
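The term-relevance idea at the heart of LDAvis can be written down directly: the relevance of a term to a topic is a λ-weighted blend of the term's topic probability and its lift (topic probability over corpus-wide probability). A minimal sketch, where λ=0.6 follows the paper's user-study recommendation and the toy probabilities are invented for illustration:

```python
import math

def relevance(phi_wt, p_w, lam=0.6):
    """Relevance of term w to topic t: lam weights the term's
    probability under the topic, (1 - lam) weights its lift.
    lam=1 ranks purely by p(w|t); lam=0 ranks purely by lift."""
    return lam * math.log(phi_wt) + (1 - lam) * math.log(phi_wt / p_w)

# A frequent, stopword-like term vs. a rarer but topic-specific term.
common = relevance(phi_wt=0.02, p_w=0.019)     # high prob, lift near 1
specific = relevance(phi_wt=0.01, p_w=0.0005)  # lower prob, lift of 20
print(specific > common)  # True: lift rewards topic-specific terms
```

This is why ranking purely by p(w|t) is suboptimal: globally frequent terms crowd the top of every topic unless lift is blended in.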


Journal ArticleDOI
TL;DR: In this article, the authors propose a unified framework for this purpose using unsupervised latent Dirichlet allocation, which enables marketers to track dimensions' importance over time and allows for dynamic mapping of competitive brand positions on those dimensions over time.
Abstract: Online chatter, or user-generated content, constitutes an excellent emerging source for marketers to mine meaning at a high temporal frequency. This article posits that this meaning consists of extracting the key latent dimensions of consumer satisfaction with quality and ascertaining the valence, labels, validity, importance, dynamics, and heterogeneity of those dimensions. The authors propose a unified framework for this purpose using unsupervised latent Dirichlet allocation. The sample of user-generated content consists of rich data on product reviews across 15 firms in five markets over four years. The results suggest that a few dimensions with good face validity and external validity are enough to capture quality. Dynamic analysis enables marketers to track dimensions' importance over time and allows for dynamic mapping of competitive brand positions on those dimensions over time. For vertically differentiated markets (e.g., mobile phones, computers), objective dimensions dominate and are similar acr...

562 citations


Proceedings ArticleDOI
01 Apr 2014
TL;DR: This work explores the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provides recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
Abstract: Topic models based on latent Dirichlet allocation and related methods are used in a range of user-focused tasks including document navigation and trend analysis, but evaluation of the intrinsic quality of the topic model and topics remains an open research area. In this work, we explore the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provide recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.

493 citations
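Automatic topic evaluation of the kind studied here typically scores a topic's top terms by how often they co-occur in the corpus. A minimal sketch of one standard measure, UMass coherence — this particular formula is one common variant, not necessarily the exact metric implemented in the paper's toolkit:

```python
import math
from itertools import combinations

def umass_coherence(topic_terms, docs):
    """UMass-style coherence for one topic's ranked top terms: for
    each ordered pair, the log of (co-document frequency + 1) over
    the earlier term's document frequency. Scores are <= 0; values
    closer to 0 mean the top terms actually co-occur in documents."""
    def doc_freq(*terms):
        return sum(1 for d in docs if all(t in d for t in terms))
    score = 0.0
    for earlier, later in combinations(topic_terms, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

docs = [{"apple", "fruit", "sweet"}, {"apple", "fruit"},
        {"car", "engine"}, {"car", "wheel", "engine"}]
print(umass_coherence(["apple", "fruit"], docs))   # terms co-occur
print(umass_coherence(["apple", "engine"], docs))  # never co-occur: lower
```

Whole-model evaluation then aggregates such per-topic scores, e.g. by averaging across all topics.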


Journal ArticleDOI
TL;DR: This paper proposes a novel way of short text topic modeling, referred to as the biterm topic model (BTM), which learns topics by directly modeling the generation of word co-occurrence patterns in the corpus, making the inference effective with the rich corpus-level information.
Abstract: Short texts are popular on today’s web, especially with the emergence of social media. Inferring topics from large-scale short texts has become a critical but challenging task for many content analysis applications. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, and their inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel way of short text topic modeling, referred to as the biterm topic model (BTM). BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, making the inference effective with the rich corpus-level information. To cope with large-scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-world short text collections show that BTM can discover more prominent and coherent topics, and significantly outperforms the state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms in both time efficiency and topic learning.

452 citations
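The core move in BTM — modeling corpus-level word co-occurrence patterns instead of per-document mixtures — starts from extracting biterms: unordered word pairs drawn from each short text and pooled over the whole corpus. A minimal sketch (whitespace tokenization and unlimited pairing within a text are simplifying assumptions; BTM implementations typically restrict biterms to a context window):

```python
from itertools import combinations
from collections import Counter

def extract_biterms(short_texts):
    """Extract the unordered word pairs ("biterms") that BTM models
    directly, pooled across the whole corpus rather than kept per
    document. Each pair is stored in sorted order so (a, b) == (b, a)."""
    biterms = Counter()
    for text in short_texts:
        words = text.lower().split()
        for w1, w2 in combinations(words, 2):
            biterms[tuple(sorted((w1, w2)))] += 1
    return biterms

tweets = ["cheap phone deal", "phone deal today"]
b = extract_biterms(tweets)
print(b[("deal", "phone")])  # 2: the pair appears in both short texts
```

Because biterms are aggregated corpus-wide, even three-word texts contribute usable co-occurrence evidence — which is exactly what per-document inference in LDA lacks on short texts.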


Posted Content
TL;DR: This paper studies Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently.
Abstract: Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.

303 citations
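The constrained setting described here is commonly handled by weighting the usual expected-improvement acquisition by the modeled probability that the constraint is satisfied, since the constraint is evaluated independently of the objective. A hedged sketch of that acquisition, assuming Gaussian posteriors for both objective and constraint and a feasibility threshold of 0 — this illustrates the general recipe, not the paper's exact formulation:

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def constrained_ei(mu, sigma, best, c_mu, c_sigma):
    """Constraint-weighted expected improvement for minimization:
    standard EI under the objective's Gaussian posterior (mu, sigma),
    multiplied by the probability that the independently modeled,
    noisy constraint value falls below its threshold of 0."""
    z = (best - mu) / sigma
    ei = sigma * (z * normal_cdf(z) + normal_pdf(z))
    prob_feasible = normal_cdf((0.0 - c_mu) / c_sigma)
    return ei * prob_feasible

# A point with a good predicted objective but a likely-violated
# constraint scores below a slightly worse, almost surely feasible one.
risky = constrained_ei(mu=0.2, sigma=0.1, best=0.5, c_mu=1.0, c_sigma=0.5)
safe = constrained_ei(mu=0.3, sigma=0.1, best=0.5, c_mu=-1.0, c_sigma=0.5)
print(safe > risky)
```

The multiplicative form is what lets objective and constraint evaluations be decoupled: each factor only needs its own model's posterior.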


Proceedings ArticleDOI
24 Aug 2014
TL;DR: This work examined the use of latent variable models to decompose free-text hospital notes into meaningful features, and found that latent topic-derived features were effective in determining patient mortality under three timelines: in-hospital, 30-day post-discharge, and 1-year post-discharge mortality.
Abstract: Accurate knowledge of a patient's disease state and trajectory is critical in a clinical setting. Modern electronic healthcare records contain an increasingly large amount of data, and the ability to automatically identify the factors that influence patient outcomes stands to greatly improve the efficiency and quality of care. We examined the use of latent variable models (viz. Latent Dirichlet Allocation) to decompose free-text hospital notes into meaningful features, and the predictive power of these features for patient mortality. We considered three prediction regimes: (1) baseline prediction, (2) dynamic (time-varying) outcome prediction, and (3) retrospective outcome prediction. In each, our prediction task differs from the familiar time-varying situation whereby data accumulates; since fewer patients have long ICU stays, as we move forward in time fewer patients are available and the prediction task becomes increasingly difficult. We found that latent topic-derived features were effective in determining patient mortality under three timelines: in-hospital, 30-day post-discharge, and 1-year post-discharge mortality. Our results demonstrated that the latent topic features important in predicting hospital mortality are very different from those that are important in post-discharge mortality. In general, latent topic features were more predictive than structured features, and a combination of the two performed best. The time-varying models that combined latent topic features and baseline features had AUCs that reached 0.85, 0.80, and 0.77 for in-hospital, 30-day post-discharge, and 1-year post-discharge mortality, respectively. Our results agreed with other work suggesting that the first 24 hours of patient information are often the most predictive of hospital mortality.
Retrospective models that used a combination of latent topic features and structured features achieved AUCs of 0.96, 0.82, and 0.81 for in-hospital, 30-day, and 1-year mortality prediction. Our work focuses on the dynamic (time-varying) setting because models from this regime could facilitate an ongoing severity stratification system that helps direct care-staff resources and inform treatment strategies.

206 citations
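The AUCs reported above are areas under the ROC curve: for a scored binary outcome, the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch of that rank-based computation, with invented toy labels and scores:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    formulation: the fraction of positive/negative pairs where the
    positive is scored higher, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))  # 0.75
```

An AUC of 0.5 corresponds to random scoring; the 0.96 reported for the retrospective in-hospital model means almost every deceased/surviving patient pair is ranked correctly.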


Journal ArticleDOI
TL;DR: This paper investigates methods, including LDA and its extensions, for separating a set of scientific publications into several clusters and explores potential scientometric applications of such text analysis capabilities.
Abstract: Topic modeling is a type of statistical model for discovering the latent "topics" that occur in a collection of documents through machine learning. Currently, latent Dirichlet allocation (LDA) is a popular and common modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents that contain academic papers from several different fields and see whether papers in the same field will be clustered together. We explore potential scientometric applications of such text analysis capabilities.

184 citations
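Clustering publications with LDA, as studied here, usually means fitting topics and then assigning each document to its dominant topic. A compact, illustrative collapsed Gibbs sampler in pure Python — the toy corpus and hyperparameters are invented for demonstration, and a real scientometric study would use an optimized library:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. docs: list of token
    lists. Returns per-document topic counts, usable for clustering
    documents by their dominant topic."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dt = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_tw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_t = [0] * n_topics                                # topic totals
    z = []                                              # assignment per token
    for di, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            n_dt[di][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
        z.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]  # remove token, then resample its topic
                n_dt[di][t] -= 1; n_tw[t][w] -= 1; n_t[t] -= 1
                weights = [(n_dt[di][k] + alpha) *
                           (n_tw[k][w] + beta) / (n_t[k] + V * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights)[0]
                z[di][wi] = t
                n_dt[di][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    return n_dt

docs = [["gene", "dna", "cell"], ["dna", "gene", "protein"],
        ["stock", "market", "trade"], ["market", "stock", "price"]]
counts = lda_gibbs(docs, n_topics=2)
clusters = [max(range(2), key=lambda k: c[k]) for c in counts]
print(clusters)  # documents sharing vocabulary tend to share a topic
```

Evaluating whether papers from the same field land in the same cluster, as the paper does, then reduces to comparing these dominant-topic labels against the known field labels.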


Journal ArticleDOI
TL;DR: In this paper, the authors applied active learning with two criteria (certainty and uncertainty) and several enhancements in both clinical medicine and social science (specifically, public health) areas, and compared the results in both.

141 citations


Journal ArticleDOI
TL;DR: A Latent Dirichlet Allocation (LDA) based model is proposed, Foreground and Background LDA (FB-LDA), to distill foreground topics and filter out longstanding background topics to interpret sentiment variations.
Abstract: Millions of users share their opinions on Twitter, making it a valuable platform for tracking and analyzing public sentiment. Such tracking and analysis can provide critical information for decision making in various domains. Therefore, it has attracted attention in both academia and industry. Previous research mainly focused on modeling and tracking public sentiment. In this work, we move one step further to interpret sentiment variations. We observed that emerging topics (named foreground topics) within the sentiment variation periods are highly related to the genuine reasons behind the variations. Based on this observation, we propose a Latent Dirichlet Allocation (LDA) based model, Foreground and Background LDA (FB-LDA), to distill foreground topics and filter out longstanding background topics. These foreground topics can give potential interpretations of the sentiment variations. To further enhance the readability of the mined reasons, we select the most representative tweets for foreground topics and develop another generative model called Reason Candidate and Background LDA (RCB-LDA) to rank them with respect to their “popularity” within the variation period. Experimental results show that our methods can effectively find foreground topics and rank reason candidates. The proposed models can also be applied to other tasks such as finding topic differences between two sets of documents.

136 citations


Journal ArticleDOI
TL;DR: A new approach is proposed for using sources of metadata about votes to estimate the degree to which those votes are about common issues, applying latent Dirichlet allocation to discover the extent to which different issues were at stake in different cases and estimating justice preferences within each of those issues.
Abstract: Item response theory models for roll-call voting data provide political scientists with parsimonious descriptions of political actors' relative preferences. However, models using only voting data tend to obscure variation in preferences across different issues due to identification and labeling problems that arise in multidimensional scaling models. We propose a new approach to using sources of metadata about votes to estimate the degree to which those votes are about common issues. We demonstrate our approach with votes and opinion texts from the U.S. Supreme Court, using latent Dirichlet allocation to discover the extent to which different issues were at stake in different cases and estimating justice preferences within each of those issues. This approach can be applied using a variety of unsupervised and supervised topic models for text, community detection models for networks, or any other tool capable of generating discrete or mixture categorization of subject matter from relevant vote-specific metadata.

Proceedings Article
21 Jun 2014
TL;DR: This paper proposes a suite of models for clustering high-dimensional data on a unit sphere based on von Mises-Fisher distribution and for discovering more intuitive clusters than existing approaches and develops fast variational methods as well as collapsed Gibbs sampling techniques for posterior inference.
Abstract: This paper proposes a suite of models for clustering high-dimensional data on a unit sphere based on the von Mises-Fisher (vMF) distribution and for discovering more intuitive clusters than existing approaches. The proposed models include a) a Bayesian formulation of the vMF mixture that enables information sharing among clusters, b) a Hierarchical vMF mixture that provides multiscale shrinkage and a tree-structured view of the data, and c) a Temporal vMF mixture that captures the evolution of clusters in temporal data. For posterior inference, we develop fast variational methods as well as collapsed Gibbs sampling techniques for all three models. Our experiments on six datasets provide strong empirical support in favour of vMF-based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation.
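Clustering on the unit sphere, as vMF mixtures do, can be previewed with its simplest relative, spherical k-means: assign points by cosine similarity and re-normalize mean directions. This sketch omits what the vMF models add (per-cluster concentration parameters and full Bayesian inference), and the naive first-k initialization is only for determinism:

```python
import math

def spherical_kmeans(points, k, iters=10):
    """Cosine-similarity clustering on the unit sphere. Naive
    deterministic init: the first k points become the centers."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    points = [normalize(p) for p in points]
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the most cosine-similar center.
        labels = [max(range(k),
                      key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
                  for p in points]
        # Re-estimate each center as the normalized mean direction.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = normalize([sum(dim) for dim in zip(*members)])
    return labels

# Two directions near the x-axis, two near the y-axis.
pts = [(1, 0.1), (1, 0.2), (0.1, 1), (0.2, 1)]
print(spherical_kmeans(pts, k=2))  # [0, 0, 1, 1]
```

Directional magnitude is discarded by the normalization, which is exactly the regime where vMF-based models are more natural than Euclidean K-means.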

Journal ArticleDOI
TL;DR: This study proposes a novel approach to CP pattern discovery by modeling CPs using mixtures of an extension to the Latent Dirichlet Allocation family that jointly models various treatment activities and their occurring time stamps in CPs.

Proceedings Article
21 Jun 2014
TL;DR: This paper presents theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages.
Abstract: Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA's behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This paper seeks to address this gap, by providing a systematic analysis of factors which characterize the LDA's performance. We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results we provide practical guidance on how to identify suitable data sets for topic models, and how to specify particular model parameters.

Journal ArticleDOI
TL;DR: A first step towards evaluating topic models in the analysis of software evolution is taken by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit.

Journal ArticleDOI
TL;DR: A novel weakly supervised cybercriminal network mining method is developed to facilitate cybercrime forensics; this is the first successful application of a probabilistic generative model to mining cybercriminal networks from online social media.
Abstract: There has been a rapid growth in the number of cybercrimes that cause tremendous financial loss to organizations. Recent studies reveal that cybercriminals tend to collaborate or even transact cyber-attack tools via the "dark markets" established in online social media. Accordingly, it presents unprecedented opportunities for researchers to tap into these underground cybercriminal communities to develop better insights about collaborative cybercrime activities so as to combat the ever-increasing number of cybercrimes. The main contribution of this paper is the development of a novel weakly supervised cybercriminal network mining method to facilitate cybercrime forensics. In particular, the proposed method is underpinned by a probabilistic generative model enhanced by a novel context-sensitive Gibbs sampling algorithm. Evaluated based on two social media corpora, our experimental results reveal that the proposed method significantly outperforms the Latent Dirichlet Allocation (LDA) based method and the Support Vector Machine (SVM) based method by 5.23% and 16.62% in terms of Area Under the ROC Curve (AUC), respectively. It also achieves comparable performance as the state-of-the-art Partially Labeled Dirichlet Allocation (PLDA) method. To the best of our knowledge, this is the first successful application of a probabilistic generative model to mining cybercriminal networks from online social media.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A semi-supervised model, called mixing population and individual property PU learning (MPIPUL), is proposed, in which spy examples and their similarity weights are incorporated into an SVM (Support Vector Machine) to build an accurate classifier that outperforms the state-of-the-art baselines.
Abstract: Deceptive review detection has attracted significant attention from both business and research communities. However, due to the difficulty of the human labeling needed for supervised learning, the problem remains highly challenging. This paper proposes a novel angle on the problem by modeling it as PU (positive-unlabeled) learning. A semi-supervised model, called mixing population and individual property PU learning (MPIPUL), is proposed. Firstly, some reliable negative examples are identified from the unlabeled dataset. Secondly, some representative positive examples and negative examples are generated based on LDA (Latent Dirichlet Allocation). Thirdly, for the remaining unlabeled examples (we call them spy examples), which cannot be explicitly identified as positive or negative, two similarity weights are assigned, which express the probability of a spy example belonging to the positive class and the negative class. Finally, spy examples and their similarity weights are incorporated into an SVM (Support Vector Machine) to build an accurate classifier. Experiments on a gold-standard dataset demonstrate the effectiveness of MPIPUL, which outperforms the state-of-the-art baselines.

Journal ArticleDOI
TL;DR: This paper proposes TWILITE, a recommendation system for Twitter using probabilistic modeling based on latent Dirichlet allocation, which recommends top-K users to follow and top-K tweets to read for a user, and develops an inference algorithm based on the variational EM algorithm for learning model parameters.

Proceedings ArticleDOI
Bing Xiang1, Liang Zhou
01 Jun 2014
TL;DR: The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.
Abstract: In this paper, we present multiple approaches to improve sentiment analysis on Twitter data. We first establish a state-of-the-art baseline with a rich feature set. Then we build a topic-based sentiment mixture model with topic-specific data in a semi-supervised training framework. The topic information is generated through topic modeling based on an efficient implementation of Latent Dirichlet Allocation (LDA). The proposed sentiment model outperforms the top system in the task of Sentiment Analysis in Twitter in SemEval-2013 in terms of averaged F scores.

Journal ArticleDOI
TL;DR: The key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context.
Abstract: Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.
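A recurring detail in the text-extractor configurations such studies vary is how source-code identifiers are tokenized into natural-language-like terms before indexing. A sketch of the usual camelCase/PascalCase/snake_case splitting — the regex below is one common recipe, not this paper's exact extractor:

```python
import re

def split_identifier(name):
    """Split a source-code identifier into lowercase terms for an
    LDA corpus: break on underscores and non-word characters, then
    on case humps, keeping runs of capitals (acronyms) together."""
    parts = re.split(r"[_\W]+", name)
    words = []
    for p in parts:
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", p)
    return [w.lower() for w in words if w]

print(split_identifier("parseHTTPResponse_v2"))
# ['parse', 'http', 'response', 'v', '2']
```

Choices like whether to also keep the original unsplit identifier, and whether comments and literals enter the corpus at all, are exactly the configuration dimensions whose accuracy effects the study measures.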

Book ChapterDOI
01 Jan 2014
TL;DR: This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain.
Abstract: Text is a very important type of data within the biomedical domain. For example, patient records contain large amounts of text which has been entered in a non-standardized format, consequently posing a lot of challenges to processing of such data. For the clinical doctor the written text in the medical findings is still the basis for decision making – neither images nor multimedia data. However, the steadily increasing volumes of unstructured information need machine learning approaches for data mining, i.e. text mining. This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain. Finally, we provide some open problems and future challenges, particularly from the clinical domain, that we expect to stimulate future research.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes SupDocNADE, a supervised extension of DocNADE that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model, and shows how to employ SupDocNADE to learn a joint representation from image visual words, annotation words and class label information.
Abstract: Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to deal with multimodal data, such as in image annotation tasks. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for text document modeling. In this work, we show how to successfully apply and extend this model to multimodal data, such as simultaneous image classification and annotation. Specifically, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model and show how to employ SupDocNADE to learn a joint representation from image visual words, annotation words and class label information. We also describe how to leverage information about the spatial position of the visual words for SupDocNADE to achieve better performance in a simple, yet effective manner. We test our model on the LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA and a Spatial Matching Pyramid (SPM) approach.

Journal ArticleDOI
13 Feb 2014-PLOS ONE
TL;DR: A novel variant of Latent Dirichlet Allocation topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes and produces superior models to all three baseline strategies.
Abstract: The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation carried out through log-likelihood on held-out data and topic coherence of produced topics and qualitative assessment of topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
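Baseline (iii) above — removing redundant paragraphs before running LDA — can be approximated with a simple near-duplicate filter. This sketch uses word-set Jaccard similarity with an invented threshold, which is only one plausible way to detect copy-pasted text, not necessarily the paper's released implementation:

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two paragraphs."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / (len(a | b) or 1)

def drop_redundant_paragraphs(record, threshold=0.8):
    """Walk a record's paragraphs in order and keep only those whose
    similarity to every already-kept paragraph is below the threshold;
    copy-pasted text is a near-duplicate of an earlier note, so it is
    dropped before the record is handed to LDA."""
    kept = []
    for para in record:
        if all(jaccard(para, k) < threshold for k in kept):
            kept.append(para)
    return kept

record = ["patient stable on meds",
          "patient stable on meds",          # copy-pasted earlier note
          "new fever reported overnight"]
print(drop_redundant_paragraphs(record))     # duplicate paragraph dropped
```

Red-LDA's contribution is that it models the redundancy inside the generative process instead of relying on such a hard, threshold-based pre-filter.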

Journal ArticleDOI
TL;DR: An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive intelligence analysis and reveals emerging hot spots of LTE technology, and finds that major companies in this field have been focused on different technological fields with different competitive positions.
Abstract: An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive intelligence analysis. After part-of-speech tagging and defining the noun phrase extraction rules, technological words have been extracted from patent titles and abstracts. This allows us to go one step further and perform patent analysis at the content level. The LDA model is then used for identifying underlying topic structures based on latent relationships among the extracted technological words. This helps us to review research hot spots and directions in subclasses of patented technology in a certain field. To extend the traditional LDA model, another institution-topic probability level is added to the original LDA model. Direct competing enterprises' distribution probability and their technological positions are identified in each topic. A case study is then carried out on one of the core patented technologies in next-generation telecommunications: LTE. This empirical study reveals emerging hot spots of LTE technology, and finds that major companies in this field have focused on different technological fields with different competitive positions.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: This paper proposes an LDA-based opinion model named Twitter Opinion Topic Model (TOTM) for opinion mining and sentiment analysis, which leverages hashtags, mentions, emoticons and strong sentiment words that are present in tweets in its discovery process.
Abstract: Aspect-based opinion mining is widely applied to review data to aggregate or summarize opinions of a product, and the current state-of-the-art is achieved with Latent Dirichlet Allocation (LDA)-based model. Although social media data like tweets are laden with opinions, their "dirty" nature (as natural language) has discouraged researchers from applying LDA-based opinion model for product review mining. Tweets are often informal, unstructured and lacking labeled data such as categories and ratings, making it challenging for product opinion mining. In this paper, we propose an LDA-based opinion model named Twitter Opinion Topic Model (TOTM) for opinion mining and sentiment analysis. TOTM leverages hashtags, mentions, emoticons and strong sentiment words that are present in tweets in its discovery process. It improves opinion prediction by modeling the target-opinion interaction directly, thus discovering target specific opinion words, neglected in existing approaches. Moreover, we propose a new formulation of incorporating sentiment prior information into a topic model, by utilizing an existing public sentiment lexicon. This is novel in that it learns and updates with the data. We conduct experiments on 9 million tweets on electronic products, and demonstrate the improved performance of TOTM in both quantitative evaluations and qualitative analysis. We show that aspect-based opinion analysis on massive volume of tweets provides useful opinions on products.

Journal ArticleDOI
TL;DR: An unsupervised dependency analysis-based approach is presented to extract Appraisal Expression Patterns (AEPs) from reviews, which represent the manner in which people express opinions regarding products or services and can be regarded as a condensed representation of the syntactic relationship between aspect and sentiment words.
Abstract: With the considerable growth of user-generated content, online reviews are becoming extremely valuable sources for mining customers' opinions on products and services. However, most of the traditional opinion mining methods are coarse-grained and cannot understand natural languages. Thus, aspect-based opinion mining and summarization are of great interest in academic and industrial research. In this paper, we study an approach to extract product and service aspect words, as well as sentiment words, automatically from reviews. An unsupervised dependency analysis-based approach is presented to extract Appraisal Expression Patterns (AEPs) from reviews, which represent the manner in which people express opinions regarding products or services and can be regarded as a condensed representation of the syntactic relationship between aspect and sentiment words. AEPs are high-level, domain-independent types of information, and have excellent domain adaptability. An AEP-based Latent Dirichlet Allocation (AEP-LDA) model is also proposed. This is a sentence-level, probabilistic generative model which assumes that all words in a sentence are drawn from one topic – a generally true assumption, based on our observation. The model also assumes that every review corpus is composed of several mutually corresponding aspect and sentiment topics, as well as a background word topic. The AEP information is incorporated into the AEP-LDA model for mining aspect and sentiment words simultaneously. The experimental results on reviews of restaurants, hotels, MP3 players, and cameras show that the AEP-LDA model outperforms other approaches in identifying aspect and sentiment words.

Journal ArticleDOI
Shulong Tan, Jiajun Bu, Xuzhen Qin, Chun Chen, Deng Cai 
TL;DR: This paper proposes a Bayesian hierarchical approach based on Latent Dirichlet Allocation (LDA) to transfer user interests cross domains or media, and combines multi-type media information: media descriptions, user-generated text data and ratings.

Proceedings Article
23 Jul 2014
TL;DR: In this article, Bayesian optimization for constrained problems is studied in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently.
Abstract: Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.
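The constrained setting described in this abstract is commonly handled by weighting an acquisition function such as expected improvement by the posterior probability that the constraint holds. A minimal sketch under the assumption of independent Gaussian posteriors for the objective and the constraint (all function names are ours, not the paper's):

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expected_improvement(mu, sigma, best):
    """EI for minimization, given a Gaussian posterior N(mu, sigma^2) on f(x)
    and the best feasible objective value observed so far."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)

def constrained_ei(mu_f, sigma_f, best, mu_c, sigma_c):
    """Weight EI by the posterior probability that the constraint c(x) <= 0 holds,
    using an (assumed) independent Gaussian posterior N(mu_c, sigma_c^2) on c(x)."""
    p_feasible = normal_cdf(-mu_c / sigma_c)
    return expected_improvement(mu_f, sigma_f, best) * p_feasible
```

In use, the next evaluation point is the candidate maximizing `constrained_ei`; a point that looks promising on the objective but is probably infeasible is down-weighted toward zero.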

Posted Content
TL;DR: In this article, the authors perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results which are not accurate in inferring the most suitable model parameters.
Abstract: Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state-of-the-art in topic classification. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results which are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm which displays high reproducibility, high accuracy, and high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.
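For background, the standard LDA inference this paper critiques is often done with collapsed Gibbs sampling. A minimal, unoptimized sampler (not the paper's network-based algorithm; hyperparameters are chosen for illustration):

```python
import random
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of token-id lists.
    Returns doc-topic and topic-word count matrices."""
    rng = random.Random(seed)
    n_dk = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(n_topics)                 # topic totals
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k); n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional over topics for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                p = p / p.sum()
                k = min(int(np.searchsorted(np.cumsum(p), rng.random())), n_topics - 1)
                z[d][i] = k                  # resample and restore counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```

The paper's point is that samplers and variational optimizers like this can land far from the best parameters; the counts above only characterize one posterior sample.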

Journal ArticleDOI
TL;DR: A variational approach for fitting the mixture of latent trait models is developed; it is shown to yield intuitive clustering results and a much better fit than either latent class analysis or latent trait analysis alone.
Abstract: Model-based clustering methods for continuous data are well established and commonly used in a wide range of applications. However, model-based clustering methods for categorical data are less standard. Latent class analysis is a commonly used method for model-based clustering of binary data and/or categorical data, but due to an assumed local independence structure there may not be a correspondence between the estimated latent classes and groups in the population of interest. The mixture of latent trait analyzers model extends latent class analysis by assuming a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates group structure and the continuous latent trait accommodates dependence within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. We develop a variational approach for fitting the mixture of latent trait models and this provides an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much better fit than either latent class analysis or latent trait analysis alone.
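The latent class baseline mentioned in the abstract (independent Bernoulli items within each class, i.e. the local-independence assumption) can be fit with a short EM loop. A minimal sketch, not the paper's variational algorithm for the full mixture of latent trait analyzers:

```python
import numpy as np

def latent_class_em(X, n_classes, iters=100, seed=0):
    """EM for a latent class model on an (n x m) binary data matrix X:
    within each class, the m items are independent Bernoullis."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)              # class weights
    theta = rng.uniform(0.25, 0.75, size=(n_classes, m))  # item probabilities
    for _ in range(iters):
        # E-step: responsibilities r[i, g] from log p(x_i | class g) + log pi[g]
        log_r = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_r -= log_r.max(axis=1, keepdims=True)         # stabilize before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate weights and item probabilities
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, r
```

When groups in the data violate local independence, this model needs extra spurious classes to fit them, which is the motivation for adding the continuous latent trait.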