
Showing papers on "Latent Dirichlet allocation published in 2015"


Book ChapterDOI
01 Jan 2015
TL;DR: This chapter presents a comprehensive survey of neighborhood-based methods for the item recommendation problem, and the main benefits of such methods, as well as their principal characteristics, are described.
Abstract: Among collaborative recommendation approaches, methods based on nearest-neighbors still enjoy a huge amount of popularity, due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. This chapter presents a comprehensive survey of neighborhood-based methods for the item recommendation problem. In particular, the main benefits of such methods, as well as their principal characteristics, are described. Furthermore, this document addresses the essential decisions that are required while implementing a neighborhood-based recommender system, and gives practical information on how to make such decisions. Finally, the problems of sparsity and limited coverage, often observed in large commercial recommender systems, are discussed, and a few solutions to overcome these problems are presented.

701 citations


Posted Content
TL;DR: This work observes that the Paragraph Vector method performs significantly better than other methods, proposes a simple improvement to enhance embedding quality, and shows that, much like word embeddings, vector operations on Paragraph Vectors can yield meaningful semantic results.
Abstract: Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can yield meaningful semantic results.
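
The comparison described above can be reproduced at small scale with gensim's Doc2Vec, an implementation of Paragraph Vectors. The sketch below is illustrative only; the toy corpus and parameters are assumptions, not the paper's settings.

```python
# A minimal Doc2Vec (Paragraph Vectors) sketch: learn document vectors and
# rank documents by similarity, as in the document similarity benchmarks above.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "machine learning methods for text analysis",
    "deep learning for natural language processing",
    "the history of the roman empire",
]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Rank the other documents by cosine similarity to document 0.
print(model.dv.most_similar(0))
```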

325 citations


Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) is presented which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data.
Abstract: We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors known as word embeddings (WE) from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be represented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR using benchmarking CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by the combination of the WE-based approach and a unigram language model. We also report on significant improvements in ad-hoc IR tasks of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA).
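
A common instance of the compositional step in contribution (2) is to build a document vector by averaging its word vectors. The sketch below illustrates that idea only; the random embedding matrix is a stand-in, not vectors from the BWESG model.

```python
# Compose document embeddings from word embeddings by averaging, then compare
# a query and a document by cosine similarity in the shared vector space.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"bank": 0, "loan": 1, "river": 2}
embeddings = rng.normal(size=(len(vocab), 100))  # placeholder word vectors

def doc_embedding(tokens, vocab, embeddings):
    """Average the vectors of in-vocabulary tokens."""
    vecs = [embeddings[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(embeddings.shape[1])

query = doc_embedding(["bank", "loan"], vocab, embeddings)
doc = doc_embedding(["river", "bank"], vocab, embeddings)
print(query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc)))
```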

303 citations


Journal ArticleDOI
TL;DR: This article extends two Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

276 citations


Journal ArticleDOI
TL;DR: Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors, suggesting that NMF may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
Abstract: Highlights: We evaluate the coherence and generality of topic descriptors found by LDA and NMF. Six new and existing corpora were specifically compiled for this evaluation. A new coherence measure using word2vec-modeled term vector similarity is proposed. NMF regularly produces more coherent topics, where term weighting is influential. NMF may be more suitable for topic modeling of niche or non-mainstream corpora. In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, it might be argued that adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. Nevertheless, a number of studies have proposed measures for analyzing such coherence, where these have been largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) being somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that this may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
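
A coherence measure in the spirit of the word2vec-based one proposed here can be sketched as the mean pairwise cosine similarity of a topic's top terms in an embedding space. The random stand-in vectors below would in practice be pre-trained word2vec vectors.

```python
# Topic coherence as mean pairwise similarity of top terms in a word2vec space.
from itertools import combinations

import numpy as np
from gensim.models import KeyedVectors

# Stand-in vectors; in practice, load pre-trained word2vec KeyedVectors.
terms = ["match", "team", "season", "player", "league"]
wv = KeyedVectors(vector_size=20)
wv.add_vectors(terms, np.random.default_rng(0).normal(size=(len(terms), 20)))

def topic_coherence(top_terms, wv):
    """Average cosine similarity over all pairs of in-vocabulary top terms."""
    in_vocab = [t for t in top_terms if t in wv]
    pairs = list(combinations(in_vocab, 2))
    if not pairs:
        return float("nan")
    return float(np.mean([wv.similarity(a, b) for a, b in pairs]))

print(topic_coherence(terms, wv))
```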

247 citations


Journal ArticleDOI
TL;DR: Recent literature in the search for trends in business intelligence applications for the banking industry is analyzed, showing that credit in banking is clearly the main application trend, particularly predicting risk and thus supporting credit approval or denial.
Abstract: Highlights: A recent review on the application of business intelligence to the banking domain. Coverage of the last twelve years of scientific literature on those subjects. Usage of text mining and latent Dirichlet allocation to analyze articles. New insights and future research trends which may benefit the banking business. This paper analyzes recent literature in the search for trends in business intelligence applications for the banking industry. Searches were performed in relevant journals, resulting in 219 articles published between 2002 and 2013. To analyze such a large number of manuscripts, text mining techniques were used in pursuit of relevant terms in both the business intelligence and banking domains. Moreover, latent Dirichlet allocation modeling was used to group articles into several relevant topics. The analysis was conducted using a dictionary of terms belonging to both the banking and business intelligence domains. This procedure allowed for the identification of relationships between terms and the topics grouping the articles, enabling hypotheses regarding research directions to emerge. To confirm such hypotheses, relevant articles were collected and scrutinized, allowing the text mining procedure to be validated. The results show that credit in banking is clearly the main application trend, particularly predicting risk and thus supporting credit approval or denial. There is also a relevant interest in bankruptcy and fraud prediction. Customer retention seems to be associated, although weakly, with targeting, justifying bank offers to reduce churn. In addition, a large number of articles focused more on business intelligence techniques and their applications, using the banking industry merely for evaluation, and thus without clearly establishing benefits for the banking business. By identifying these current research topics, this study also highlights opportunities for future research.
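
The LDA step in this kind of literature analysis can be sketched in a few lines with scikit-learn; the abstracts and topic count below are illustrative assumptions.

```python
# Fit LDA over article abstracts and group articles by their dominant topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "credit risk scoring models for loan approval",
    "fraud detection in banking transactions",
    "customer churn prediction and retention offers",
]
X = CountVectorizer(stop_words="english").fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)
print(doc_topics.argmax(axis=1))  # dominant topic per article
```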

244 citations


Journal ArticleDOI
TL;DR: Different models, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc., are discussed.
Abstract: Topic models provide a convenient way to analyze large amounts of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first discusses methods of topic modeling, four of which are considered under this category: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category, topic evolution models, covers models that treat time as an important factor. Within this category, different models are discussed, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.
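
As an illustration of the first surveyed method, LSA amounts to a truncated SVD of a term-document matrix; a minimal scikit-learn sketch with a toy corpus:

```python
# Latent semantic analysis: truncated SVD of a TF-IDF matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "topics evolve over time in scientific literature",
    "dynamic topic models track topic evolution",
    "cats and dogs are common household pets",
]
X = TfidfVectorizer().fit_transform(docs)
doc_vecs = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(doc_vecs)  # documents in the latent "topic" space
```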

243 citations


Journal ArticleDOI
TL;DR: This paper presents Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs, proposes a similarity metric to measure the similarity of life styles between users, and calculates users' impact in terms of life styles with a friend-matching graph.
Abstract: Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate to reflect a user's preferences on friend selection in real life. In this paper, we present Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs. By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity. Inspired by text mining, we model a user's daily life as life documents, from which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm. We further propose a similarity metric to measure the similarity of life styles between users, and calculate users' impact in terms of life styles with a friend-matching graph. Upon receiving a request, Friendbook returns a list of people with the highest recommendation scores to the query user. Finally, Friendbook integrates a feedback mechanism to further improve the recommendation accuracy. We have implemented Friendbook on Android-based smartphones, and evaluated its performance in both small-scale experiments and large-scale simulations. The results show that the recommendations accurately reflect the preferences of users in choosing friends.
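
The life-style extraction and comparison pipeline described here can be sketched with gensim: LDA over toy "life documents" followed by cosine similarity between the resulting topic distributions. The activity vocabulary is a hypothetical stand-in for the sensor-derived one.

```python
# LDA over "life documents", then cosine similarity between users' life styles.
from gensim import corpora, models
from gensim.matutils import cossim

life_docs = [
    ["gym", "run", "salad", "gym", "sleep"],
    ["run", "gym", "salad", "bike", "sleep"],
    ["bar", "concert", "taxi", "sleep", "bar"],
]
dictionary = corpora.Dictionary(life_docs)
bows = [dictionary.doc2bow(d) for d in life_docs]
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, random_state=0)

styles = [lda[b] for b in bows]  # per-user life-style distributions
print(cossim(styles[0], styles[1]))  # similar life styles
print(cossim(styles[0], styles[2]))  # dissimilar life styles
```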

241 citations


Journal ArticleDOI
TL;DR: A heuristic approach based on analysis of the variation of statistical perplexity during topic modelling is proposed to estimate the most appropriate number of topics, and the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector.
Abstract: Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach. Based on analysis of the variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with the number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics.
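
A sketch of the RPC heuristic: fit LDA over a grid of topic counts, compute held-out perplexity, and take the rate of change between consecutive counts. The toy count matrices below stand in for a real document-term matrix.

```python
# Rate of perplexity change (RPC) across candidate numbers of topics.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(80, 200))  # toy document-term counts
X_test = rng.integers(0, 5, size=(20, 200))

topic_counts = [5, 10, 20, 40]
perplexities = []
for k in topic_counts:
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    perplexities.append(lda.perplexity(X_test))

# RPC_i = |P_i - P_{i-1}| / (T_i - T_{i-1}); the heuristic inspects how this
# rate changes across topic counts to pick an appropriate number of topics.
p, t = np.array(perplexities), np.array(topic_counts, dtype=float)
rpc = np.abs(np.diff(p)) / np.diff(t)
print(dict(zip(topic_counts[1:], rpc)))
```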

230 citations


Journal ArticleDOI
TL;DR: A semantic allocation level (SAL) multifeature fusion strategy based on PTM, namely, SAL-PTM (SAL-pLSA and SAL-LDA), for HSR imagery is proposed, and the experimental results confirmed that SAL-PTM is superior to the single-feature methods and CAT-PTM in the scene classification of HSR imagery.
Abstract: Scene classification has been proved to be an effective method for high spatial resolution (HSR) remote sensing image semantic interpretation. The probabilistic topic model (PTM) has been successfully applied to natural scenes by utilizing a single feature (e.g., the spectral feature); however, it is inadequate for HSR images due to the complex structure of the land-cover classes. Although several studies have investigated techniques that combine multiple features, the different features are usually quantized after simple concatenation (CAT-PTM). Unfortunately, due to the inadequate fusion capacity of k-means clustering, the words of the visual dictionary obtained by CAT-PTM are highly correlated. In this paper, a semantic allocation level (SAL) multifeature fusion strategy based on PTM, namely, SAL-PTM (SAL-pLSA and SAL-LDA), for HSR imagery is proposed. In SAL-PTM: (1) the complementary spectral, texture, and scale-invariant-feature-transform features are effectively combined; (2) the three features are extracted and quantized separately by k-means clustering, which can provide appropriate low-level feature descriptions for the semantic representations; and (3) the latent semantic allocations of the three features are captured separately by PTM, which follows the core idea of PTM-based scene classification. The probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) models were compared to test the effect of different PTMs for HSR imagery. A U.S. Geological Survey data set and the UC Merced data set were utilized to evaluate SAL-PTM in comparison with the conventional methods. The experimental results confirmed that SAL-PTM is superior to the single-feature methods and CAT-PTM in the scene classification of HSR imagery.
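
The SAL quantization step, quantizing each feature type separately rather than after concatenation, can be sketched as one k-means dictionary per feature. The arrays below are random stand-ins for spectral, texture, and SIFT descriptors.

```python
# Build a separate k-means visual dictionary per feature type.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = {
    "spectral": rng.normal(size=(500, 4)),
    "texture": rng.normal(size=(500, 8)),
    "sift": rng.normal(size=(500, 128)),
}

# Separate dictionaries avoid the highly correlated words produced by
# quantizing a simple concatenation (CAT) of heterogeneous features.
dictionaries = {
    name: KMeans(n_clusters=32, n_init=10, random_state=0).fit(feats)
    for name, feats in features.items()
}
words = {name: km.predict(features[name]) for name, km in dictionaries.items()}
print({name: w[:5] for name, w in words.items()})
```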

217 citations


Journal ArticleDOI
TL;DR: A novel multiscale and hierarchical framework is introduced, which describes the classification of TLS point clouds of cluttered urban scenes, and novel features of point clusters are constructed by employing the latent Dirichlet allocation (LDA).
Abstract: The effective extraction of shape features is an important requirement for the accurate and efficient classification of terrestrial laser scanning (TLS) point clouds. However, the challenge of how to obtain robust and discriminative features from noisy and varying density TLS point clouds remains. This paper introduces a novel multiscale and hierarchical framework, which describes the classification of TLS point clouds of cluttered urban scenes. In this framework, we propose multiscale and hierarchical point clusters (MHPCs). In MHPCs, point clouds are first resampled into different scales. Then, the resampled data set of each scale is aggregated into several hierarchical point clusters, where the point cloud of all scales in each level is termed a point-cluster set. This representation not only accounts for the multiscale properties of point clouds but also well captures their hierarchical structures. Based on the MHPCs, novel features of point clusters are constructed by employing the latent Dirichlet allocation (LDA). An LDA model is trained according to a training set. The LDA model then extracts a set of latent topics, i.e., a feature of topics, for a point cluster. Finally, to apply the introduced features for point-cluster classification, we train an AdaBoost classifier in each point-cluster set and obtain the corresponding classifiers to separate the TLS point clouds with varying point density and data missing into semantic regions. Compared with other methods, our features achieve the best classification results for buildings, trees, people, and cars from TLS point clouds, particularly for small and moving objects, such as people and cars.
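
The feature/classifier combination described above, per-cluster word histograms turned into LDA topic proportions and fed to AdaBoost, can be sketched as follows; the count matrix and labels are random stand-ins for real TLS point-cluster data.

```python
# LDA topic proportions as features for an AdaBoost point-cluster classifier.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
counts = rng.integers(0, 10, size=(200, 50))  # point clusters x local "words"
labels = rng.integers(0, 4, size=200)         # toy building/tree/person/car labels

topics = LatentDirichletAllocation(n_components=10, random_state=0)
features = topics.fit_transform(counts)       # latent-topic feature per cluster

clf = AdaBoostClassifier(random_state=0).fit(features, labels)
print(clf.score(features, labels))
```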

Book ChapterDOI
01 Jan 2015
TL;DR: In this article, the authors provide an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic analysis method today, and illustrate how to employ LDA on a textual data set.
Abstract: Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories. This chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Next, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, we discuss the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.
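
In the spirit of the tutorial introduction described here, a minimal end-to-end LDA run over a toy "software repository" corpus might look like the following; the corpus and parameter choices are illustrative assumptions.

```python
# Fit LDA on commit messages and print the top words of each topic.
from gensim import corpora, models

commits = [
    "fix null pointer exception in parser",
    "refactor parser error handling",
    "add login page to web ui",
    "update ui styles for login form",
]
texts = [c.split() for c in commits]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])
```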

Proceedings ArticleDOI
07 Jun 2015
TL;DR: In this article, an interleaved text/image deep learning system is presented to extract and mine the semantic interactions of radiology images and reports from a national research hospital's picture archiving and communication system.
Abstract: Despite tremendous progress in computer vision, effective learning on very large-scale (> 100K patients) medical image databases has been vastly hindered. We present an interleaved text/image deep learning system to extract and mine the semantic interactions of radiology images and reports from a national research hospital's picture archiving and communication system. Instead of using full 3D medical volumes, we focus on a collection of representative ∼216K 2D key images/slices (selected by clinicians for diagnostic reference) with text-driven scalar and vector labels. Our system interleaves between unsupervised learning (e.g., latent Dirichlet allocation, recurrent neural net language models) on document- and sentence-level texts to generate semantic labels and supervised learning via deep convolutional neural networks (CNNs) to map from images to label spaces. Disease-related key words can be predicted for radiology images in a retrieval manner. We have demonstrated promising quantitative and qualitative results. The large-scale datasets of extracted key images and their categorization, embedded vector labels and sentence descriptions can be harnessed to alleviate the deep learning “data-hungry” obstacle in the medical domain.

Journal ArticleDOI
TL;DR: The overall design of the system provides satisfactory performance in identifying ADR-related posts for post-marketing drug surveillance and points out a potentially fruitful direction for building other early warning systems that need to filter big data from social media networks.

Proceedings ArticleDOI
05 Jan 2015
TL;DR: A novel video descriptor, referred to as Histogram of Oriented Tracklets, for recognizing abnormal situations in crowded scenes is presented, which quantizes orientation and magnitude in a 2-dimensional histogram that encodes the motion patterns expected in each cuboid.
Abstract: This paper presents a novel video descriptor, referred to as Histogram of Oriented Tracklets, for recognizing abnormal situations in crowded scenes. Unlike standard approaches that use optical flow, which estimates motion vectors only from two successive frames, we built our descriptor over long-range motion trajectories, which are called tracklets in the literature. Following the standard procedure, we divided video sequences into spatio-temporal cuboids within which we collected statistics on the tracklets passing through them. In particular, we quantized orientation and magnitude in a 2-dimensional histogram which encodes the motion patterns expected in each cuboid. We classify frames as normal and abnormal by using Latent Dirichlet Allocation and Support Vector Machines. We evaluated the effectiveness of the proposed descriptors on three datasets: UCSD, Violence in Crowds and UMN. The experiments demonstrated (i) very promising results in abnormality detection, (ii) setting a new state-of-the-art on two of them, and (iii) outperforming former descriptors based on optical flow, dense trajectories and the social force model.
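
The core of the descriptor, a joint 2-D histogram over tracklet orientation and magnitude within one spatio-temporal cuboid, can be sketched with NumPy; the tracklet displacements below are random stand-ins.

```python
# HOT-style descriptor: 2-D histogram of tracklet orientation and magnitude.
import numpy as np

rng = np.random.default_rng(0)
dx, dy = rng.normal(size=500), rng.normal(size=500)  # tracklet displacements

orientation = np.arctan2(dy, dx)  # in [-pi, pi]
magnitude = np.hypot(dx, dy)

hist, _, _ = np.histogram2d(
    orientation, magnitude,
    bins=[8, 4],
    range=[[-np.pi, np.pi], [0, magnitude.max()]],
)
descriptor = (hist / hist.sum()).ravel()  # normalized per-cuboid descriptor
print(descriptor.shape)                   # (32,) feature vector
```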

Proceedings Article
25 Jan 2015
TL;DR: Experiments conducted on three real datasets show that both learning more effective representation and learning from relational data are beneficial steps to take to advance the state of the art.
Abstract: Tag recommendation has become one of the most important ways of organizing and indexing online resources like articles, movies, and music. Since tagging information is usually very sparse, effective learning of the content representation for these resources is crucial to accurate tag recommendation. Recently, models proposed for tag recommendation, such as collaborative topic regression and its variants, have demonstrated promising accuracy. However, a limitation of these models is that, by using topic models like latent Dirichlet allocation as the key component, the learned representation may not be compact and effective enough. Moreover, since relational data exist as an auxiliary data source in many applications, it is desirable to incorporate such data into tag recommendation models. In this paper, we start with a deep learning model called stacked denoising autoencoder (SDAE) in an attempt to learn more effective content representation. We propose a probabilistic formulation for SDAE and then extend it to a relational SDAE (RSDAE) model. RSDAE jointly performs deep representation learning and relational learning in a principled way under a probabilistic framework. Experiments conducted on three real datasets show that both learning more effective representation and learning from relational data are beneficial steps to take to advance the state of the art.

01 Jan 2015
TL;DR: This chapter proposes that conversational modeling has the potential to radically alter the understanding and practice of citizenship and discusses two experimental platforms that take different approaches to this problem.
Abstract: If we think of the smart city as a reading environment, we can use it to change what it means to be a citizen, to improve how public topics are addressed, and to democratize how decisions are made. The starting point is text, supplemented with the various other kinds of data that can be gathered through digital means. In this chapter, we discuss two experimental platforms that take different approaches. First is the Data Stories project, where we have been sequencing text from various dynamic sources through a thematic clustering algorithm (Latent Dirichlet Allocation), feeding those thematic clusters into a narrative generator, then putting those results into a storyboarding system. Using the output, we can examine patterns emerging from a variety of text streams, such as Twitter, Facebook, news feeds, and so on. More importantly, however, we can allow people to manipulate the parameters, so that using the same text stream can produce multiple simultaneous valid outputs, depending on the perspective that the reader wishes to take on the feed. Providing a method for encouraging this kind of interpretive or hermeneutic inquiry is a promising strategy for supporting civil discourse. Our second project, Conversational Modeling, is building on previous research to investigate the various ways in which discussions, which occur sequentially through time, can be profitably modeled as 3-D objects of various kinds. These models can subsequently be used for recollection, communication, and analysis, but they may also have a generative potential. As a means of dealing with the structure and substance of discussions in civil society, we propose that conversational modeling has the potential to radically alter our understanding and practice of citizenship.

Book ChapterDOI
01 Jan 2015
TL;DR: This chapter gives an introduction to music recommender systems research, highlighting the distinctive characteristics of music, as compared to other kinds of media, and pointing to the most important challenges faced by music recommendation research.
Abstract: This chapter gives an introduction to music recommender systems research. We highlight the distinctive characteristics of music, as compared to other kinds of media. We then provide a literature survey of content-based music recommendation, contextual music recommendation, hybrid methods, and sequential music recommendation, followed by an overview of evaluation strategies and commonly used data sets. We conclude by pointing to the most important challenges faced by music recommendation research.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: Experiments using a state-of-the-art LVCSR system showed adaptation could yield perplexity reductions of 8% relative over the baseline RNNLM and small but consistent word error rate reductions.
Abstract: Recurrent neural network language models (RNNLMs) have recently become increasingly popular for many applications including speech recognition. In previous research RNNLMs have normally been trained on well-matched in-domain data. The adaptation of RNNLMs remains an open research area to be explored. In this paper, genre and topic based RNNLM adaptation techniques are investigated for a multi-genre broadcast transcription task. A number of techniques including Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation and Hierarchical Dirichlet Processes are used to extract show-level topic information. These were then used as additional input to the RNNLM during training, which can facilitate unsupervised test time adaptation. Experiments using a state-of-the-art LVCSR system trained on 1000 hours of speech and more than 1 billion words of text showed adaptation could yield perplexity reductions of 8% relative over the baseline RNNLM and small but consistent word error rate reductions.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: A Markov Random Field regularized Latent Dirichlet Allocation model, which defines a MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label, and can accommodate the subtlety that whether two words are similar depends on which topic they appear in.
Abstract: This paper studies how to incorporate external word correlation knowledge to improve the coherence of topic modeling. Existing topic models assume words are generated independently and lack the mechanism to utilize the rich similarity relationships among words to learn coherent topics. To solve this problem, we build a Markov Random Field (MRF) regularized Latent Dirichlet Allocation (LDA) model, which defines a MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label. Under our model, the topic assignment of each word is not independent, but rather affected by the topic labels of its correlated words. Similar words have a better chance to be put into the same topic due to the regularization of the MRF; hence the coherence of topics can be boosted. In addition, our model can accommodate the subtlety that whether two words are similar depends on which topic they appear in, which allows words with multiple senses to be put into different topics properly. We derive a variational inference method to infer the posterior probabilities and learn model parameters, and present techniques to deal with the hard-to-compute partition function in the MRF. Experiments on two datasets demonstrate the effectiveness of our model.

Journal ArticleDOI
TL;DR: This paper introduces an alternative semi-probabilistic approach, called additive regularization of topic models (ARTM), which regularizes an ill-posed problem of stochastic matrix factorization by maximizing a weighted sum of the log-likelihood and additional criteria.
Abstract: Probabilistic topic modeling of text collections has been recently developed mainly within the framework of graphical models and Bayesian inference. In this paper we introduce an alternative semi-probabilistic approach, which we call additive regularization of topic models (ARTM). Instead of building a purely probabilistic generative model of text we regularize an ill-posed problem of stochastic matrix factorization by maximizing a weighted sum of the log-likelihood and additional criteria. This approach enables us to combine probabilistic assumptions with linguistic and problem-specific requirements in a single multi-objective topic model. In the theoretical part of the work we derive the regularized EM-algorithm and provide a pool of regularizers, which can be applied together in any combination. We show that many models previously developed within Bayesian framework can be inferred easier within ARTM and in some cases generalized. In the experimental part we show that a combination of sparsing, smoothing, and decorrelation improves several quality measures at once with almost no loss of the likelihood.
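
Schematically, and in notation that may differ from the paper's own, the ARTM objective replaces pure likelihood maximization with a regularized criterion over the stochastic word-topic matrix Φ and topic-document matrix Θ:

```latex
\max_{\Phi,\Theta}\;
  \sum_{d \in D} \sum_{w \in d} n_{dw}
    \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
  \;+\; \sum_{i} \tau_i\, R_i(\Phi,\Theta)
\quad \text{s.t.} \quad
  \phi_{wt} \ge 0,\ \sum_{w} \phi_{wt} = 1,\qquad
  \theta_{td} \ge 0,\ \sum_{t} \theta_{td} = 1,
```

where each R_i is a regularizer (e.g., sparsing, smoothing, decorrelation) with weight τ_i.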

Proceedings ArticleDOI
16 Sep 2015
TL;DR: A bag-of-words product-of-experts model and a recurrent neural network are used as models of reviews to improve collaborative filtering performance, although the greater modeling power of the recurrent network appears to undermine its ability to act as a regularizer of the product representations.
Abstract: Recent work has shown that collaborative filter-based recommender systems can be improved by incorporating side information, such as natural language reviews, as a way of regularizing the derived product representations. Motivated by the success of this approach, we introduce two different models of reviews and study their effect on collaborative filtering performance. While the previous state-of-the-art approach is based on a latent Dirichlet allocation (LDA) model of reviews, the models we explore are neural network based: a bag-of-words product-of-experts model and a recurrent neural network. We demonstrate that the increased flexibility offered by the product-of-experts model allowed it to achieve state-of-the-art performance on the Amazon review dataset, outperforming the LDA-based approach. However, interestingly, the greater modeling power offered by the recurrent neural network appears to undermine the model's ability to act as a regularizer of the product representations.

Proceedings ArticleDOI
01 Jun 2015
TL;DR: An unsupervised topic model for short texts that performs soft clustering over distributed representations of words using Gaussian mixture models whose components capture the notion of latent topics and which outperforms LDA on short texts through both subjective and objective evaluation.
Abstract: We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs) whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) need aggregation of short messages to avoid data sparsity in short documents, our framework works on large amounts of raw short texts (billions of words). In contrast with other topic modeling frameworks that use word co-occurrence statistics, our framework uses a vector space model that overcomes the issue of sparse word co-occurrence patterns. We demonstrate that our framework outperforms LDA on short texts through both subjective and objective evaluation. We also show the utility of our framework in learning topics and classifying short texts on Twitter data for English, Spanish, French, Portuguese and Russian.
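
The modelling idea, mixture components over word vectors playing the role of latent topics, can be sketched with scikit-learn; the embedding matrix is a random stand-in for trained word embeddings.

```python
# Gaussian mixture over word vectors; components act as latent topics.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(1000, 50))  # placeholder word embeddings

gmm = GaussianMixture(n_components=20, covariance_type="diag",
                      random_state=0).fit(word_vectors)

# Soft topic memberships per word; a short text's topic distribution can be
# obtained by averaging the posteriors of its words.
posteriors = gmm.predict_proba(word_vectors)
print(posteriors[0].round(2))
```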

Proceedings ArticleDOI
01 Jun 2015
TL;DR: A novel framework for argument tagging based on topic modeling is proposed and it is shown that using Non-Negative Matrix Factorization instead of Latent Dirichlet Allocation achieves better results for argument classification, close to the results of a supervised classifier.
Abstract: Argumentation mining and stance classification were recently introduced as interesting tasks in text mining. In this paper, a novel framework for argument tagging based on topic modeling is proposed. Unlike other machine learning approaches for argument tagging, which often require a large set of labeled data, the proposed model is minimally supervised and merely a one-to-one mapping between the pre-defined argument set and the extracted topics is required. These extracted arguments are subsequently exploited for stance classification. Additionally, a manually annotated corpus for stance classification and argument tagging of online news comments is introduced and made available. Experiments on our collected corpus demonstrate the benefits of using topic modeling for argument tagging. We show that using Non-Negative Matrix Factorization instead of Latent Dirichlet Allocation achieves better results for argument classification, close to the results of a supervised classifier. Furthermore, the statistical model that leverages automatically-extracted arguments as features for stance classification shows promising results.
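
The minimally supervised step described above can be sketched as NMF over TF-IDF vectors, followed by a manual one-to-one mapping from the extracted topics to the pre-defined argument set; the comments below are illustrative.

```python
# NMF topics over comments; each comment's dominant topic is then mapped
# (manually, one-to-one) to a pre-defined argument label.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "the new tax hurts small businesses",
    "small businesses cannot afford this tax",
    "the policy protects the environment for future generations",
]
X = TfidfVectorizer(stop_words="english").fit_transform(comments)

W = NMF(n_components=2, random_state=0).fit_transform(X)
print(W.argmax(axis=1))  # dominant topic per comment, to be mapped to arguments
```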

Proceedings ArticleDOI
13 Oct 2015
TL;DR: A novel framework, namely multi-query expansions, is proposed to retrieve semantically robust landmarks in two steps, along with a novel technique to generate a robust yet compact pattern set from the multi-query photos.
Abstract: Given a query photo issued by a user (q-user), landmark retrieval is to return a set of photos with their landmarks similar to those of the query, while the existing studies on landmark retrieval focus on exploiting geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users may convey different geometry information depending on the viewpoints and/or angles, and may subsequently yield very different results. In fact, dealing with the landmarks with shapes caused by the photography of q-users is often nontrivial and has never been studied. Motivated by this, in this paper we propose a novel framework, namely multi-query expansions, to retrieve semantically robust landmarks in two steps. Firstly, we identify the top-k photos regarding the latent topics of a query landmark to construct a multi-query set so as to remedy its possible shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Secondly, we propose a novel technique to generate a robust yet compact pattern set from the multi-query photos. To ensure redundancy-free results and enhance efficiency, we adopt existing minimum-description-length-principle based pattern mining techniques to remove similar query photos from the (k+1) selected query photos. Then, a landmark retrieval rule is developed to calculate the ranking scores between the mined pattern set and each photo in the database, which are ranked to serve as the final ranking list of landmark retrieval. Extensive experiments are conducted on real-world landmark datasets, validating the significantly higher accuracy of our approach.

Journal ArticleDOI
TL;DR: Experiments show that the explicit topic model, which incorporates pre-existing knowledge, outperforms traditional feature selection methods and other existing methods by a large margin and the identification task can be completed better.
Abstract: The essential work of feature-specific opinion mining is centered on the product features. Previous related research work has often taken into account explicit features but ignored implicit features. However, implicit feature identification, which can help us better understand the reviews, is an essential aspect of feature-specific opinion mining. This paper is mainly centered on implicit feature identification in Chinese product reviews. We propose that, based on the explicit synonymous feature group and the sentences which contain explicit features, several Support Vector Machine (SVM) classifiers can be established to classify the non-explicit sentences. Nevertheless, instead of simply using traditional feature selection methods, we believe an explicit topic model in which each topic is pre-defined could perform better. In this paper, we first extend a popular topic modeling method, called Latent Dirichlet Allocation (LDA), to construct an explicit topic model. Then some types of prior knowledge, such as must-links, cannot-links and relevance-based prior knowledge, are extracted and incorporated into the explicit topic model automatically. Experiments show that the explicit topic model, which incorporates pre-existing knowledge, outperforms traditional feature selection methods and other existing methods by a large margin, and the identification task can be completed better.

Journal ArticleDOI
TL;DR: Light is shed on the theory that underlies text mining methods and guidance is provided for researchers who seek to apply these methods.
Abstract: The amount of textual data that is available for researchers and businesses to analyze is increasing at a dramatic rate. This reality has led IS researchers to investigate various text mining techniques. This essay examines four text mining methods that are frequently used in order to identify their characteristics and limitations. The four methods that we examine are (1) latent semantic analysis, (2) probabilistic latent semantic analysis, (3) latent Dirichlet allocation, and (4) correlated topic model. We review these four methods and compare them with topic detection and spam filtering to reveal their peculiarity. Our paper sheds light on the theory that underlies text mining methods and provides guidance for researchers who seek to apply these methods.

Proceedings Article
21 Feb 2015
TL;DR: A novel bilevel optimization formulation is given to identify the optimal poisoning attack on latent Dirichlet allocation (LDA), and an efficient solution (up to local optima) is presented using a descent method and implicit functions.
Abstract: Latent Dirichlet allocation (LDA) is an increasingly popular tool for data analysis in many domains. If LDA output affects decision making (especially when money is involved), there is an incentive for attackers to compromise it. We ask the question: how can an attacker minimally poison the corpus so that LDA produces topics that the attacker wants the LDA user to see? Answering this question is important to characterize such attacks, and to develop defenses in the future. We give a novel bilevel optimization formulation to identify the optimal poisoning attack. We present an efficient solution (up to local optima) using a descent method and implicit functions. We demonstrate poisoning attacks on LDA with extensive experiments, and discuss possible defenses.
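
Schematically, and in illustrative notation rather than the paper's own, the bilevel formulation pits an outer attacker objective against the inner LDA fit on the poisoned corpus:

```latex
\min_{\delta}\;
  \bigl\| \hat{\Phi}(D+\delta) - \Phi^{*} \bigr\|^{2}
  \;+\; \lambda \,\|\delta\|_{1}
\quad \text{s.t.} \quad
  \hat{\Phi}(D+\delta) \in
  \arg\max_{\Phi}\; \mathcal{L}_{\mathrm{LDA}}(\Phi;\, D+\delta),
```

where D is the corpus, δ the attacker's perturbation, Φ* the attacker's target topics, and Φ̂ the topics the LDA learner actually fits.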

Journal ArticleDOI
TL;DR: This work shows how a common text-mining method (latent Dirichlet allocation, or topic modeling) and statistical tests familiar to ecologists can be used to investigate trends and identify potential research gaps in the scientific literature, increasing scientists' capacity for research synthesis.
Abstract: Keeping track of conceptual and methodological developments is a critical skill for research scientists, but this task is increasingly difficult due to the high rate of academic publication. As a crisis discipline, conservation science is particularly in need of tools that facilitate rapid yet insightful synthesis. We show how a common text-mining method (latent Dirichlet allocation, or topic modeling) and statistical tests familiar to ecologists (cluster analysis, regression, and network analysis) can be used to investigate trends and identify potential research gaps in the scientific literature. We tested these methods on the literature on ecological surrogates and indicators. Analysis of topic popularity within this corpus showed a strong emphasis on monitoring and management of fragmented ecosystems, while analysis of research gaps suggested a greater role for genetic surrogates and indicators. Our results show that automated text analysis methods need to be used with care, but can provide information that is complementary to that given by systematic reviews and meta-analyses, increasing scientists' capacity for research synthesis.

Journal ArticleDOI
TL;DR: The proposed topic-to-question generation approach can significantly outperform the state-of-the-art results, and the use of syntactic tree kernels is proposed for the automatic judgment of the syntactic correctness of the questions.
Abstract: This paper is concerned with automatic generation of all possible questions from a topic of interest. Specifically, we consider that each topic is associated with a body of texts containing useful information about the topic. Then, questions are generated by exploiting the named entity information and the predicate argument structures of the sentences present in the body of texts. The importance of the generated questions is measured using Latent Dirichlet Allocation by identifying the subtopics which are closely related to the original topic in the given body of texts and applying the Extended String Subsequence Kernel to calculate their similarity with the questions. We also propose the use of syntactic tree kernels for the automatic judgment of the syntactic correctness of the questions. The questions are ranked by considering both their importance in the context of the given body of texts and syntactic correctness. To the best of our knowledge, no previous study has accomplished this task in our setting. A series of experiments demonstrate that the proposed topic-to-question generation approach can significantly outperform the state-of-the-art results.