
Showing papers on "Latent Dirichlet allocation published in 2015"


Book ChapterDOI
01 Jan 2015
TL;DR: This chapter presents a comprehensive survey of neighborhood-based methods for the item recommendation problem, and the main benefits of such methods, as well as their principal characteristics, are described.
Abstract: Among collaborative recommendation approaches, methods based on nearest-neighbors still enjoy a huge amount of popularity, due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations. This chapter presents a comprehensive survey of neighborhood-based methods for the item recommendation problem. In particular, the main benefits of such methods, as well as their principal characteristics, are described. Furthermore, this document addresses the essential decisions that are required while implementing a neighborhood-based recommender system, and gives practical information on how to make such decisions. Finally, the problems of sparsity and limited coverage, often observed in large commercial recommender systems, are discussed, and a few solutions to overcome these problems are presented.

701 citations


Posted Content
TL;DR: This work observes that the Paragraph Vector method performs significantly better than other methods, proposes a simple improvement to enhance embedding quality, and shows that, much like word embeddings, vector operations on Paragraph Vectors can yield meaningful semantic results.
Abstract: Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can yield meaningful semantic results.
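
The comparison described above can be reproduced at small scale with gensim's Doc2Vec, an implementation of Paragraph Vectors. The sketch below is illustrative only; the toy corpus and parameters are assumptions, not the paper's settings.

```python
# A minimal Doc2Vec (Paragraph Vectors) sketch: learn document vectors and
# rank documents by similarity, as in the document similarity benchmarks above.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "machine learning methods for text analysis",
    "deep learning for natural language processing",
    "the history of the roman empire",
]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Rank the other documents by cosine similarity to document 0.
print(model.dv.most_similar(0))
```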

325 citations


Proceedings ArticleDOI
09 Aug 2015
TL;DR: A novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) is presented which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data.
Abstract: We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors known as word embeddings (WE) from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be represented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR using benchmarking CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by the combination of the WE-based approach and a unigram language model. We also report on significant improvements in ad-hoc IR tasks of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA).
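
A common instance of the compositional step in contribution (2) is to build a document vector by averaging its word vectors. The sketch below illustrates that idea only; the random embedding matrix is a stand-in, not vectors from the BWESG model.

```python
# Compose document embeddings from word embeddings by averaging, then compare
# a query and a document by cosine similarity in the shared vector space.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"bank": 0, "loan": 1, "river": 2}
embeddings = rng.normal(size=(len(vocab), 100))  # placeholder word vectors

def doc_embedding(tokens, vocab, embeddings):
    """Average the vectors of in-vocabulary tokens."""
    vecs = [embeddings[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(embeddings.shape[1])

query = doc_embedding(["bank", "loan"], vocab, embeddings)
doc = doc_embedding(["river", "bank"], vocab, embeddings)
print(query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc)))
```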

303 citations


Journal ArticleDOI
TL;DR: This article extends two Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

276 citations


Journal ArticleDOI
TL;DR: Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors, suggesting that NMF may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
Abstract: Highlights: We evaluate the coherence and generality of topic descriptors found by LDA and NMF. Six new and existing corpora were specifically compiled for this evaluation. A new coherence measure using word2vec-modeled term vector similarity is proposed. NMF regularly produces more coherent topics, where term weighting is influential. NMF may be more suitable for topic modeling of niche or non-mainstream corpora. In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, it might be argued that adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. Nevertheless, a number of studies have proposed measures for analyzing such coherence, where these have been largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) being somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that this may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.
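
A coherence measure in the spirit of the word2vec-based one proposed here can be sketched as the mean pairwise cosine similarity of a topic's top terms in an embedding space. The random stand-in vectors below would in practice be pre-trained word2vec vectors.

```python
# Topic coherence as mean pairwise similarity of top terms in a word2vec space.
from itertools import combinations

import numpy as np
from gensim.models import KeyedVectors

# Stand-in vectors; in practice, load pre-trained word2vec KeyedVectors.
terms = ["match", "team", "season", "player", "league"]
wv = KeyedVectors(vector_size=20)
wv.add_vectors(terms, np.random.default_rng(0).normal(size=(len(terms), 20)))

def topic_coherence(top_terms, wv):
    """Average cosine similarity over all pairs of in-vocabulary top terms."""
    in_vocab = [t for t in top_terms if t in wv]
    pairs = list(combinations(in_vocab, 2))
    if not pairs:
        return float("nan")
    return float(np.mean([wv.similarity(a, b) for a, b in pairs]))

print(topic_coherence(terms, wv))
```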

247 citations


Journal ArticleDOI
TL;DR: Recent literature in the search for trends in business intelligence applications for the banking industry is analyzed, showing that credit in banking is clearly the main application trend, particularly predicting risk and thus supporting credit approval or denial.
Abstract: Highlights: A recent review on the application of business intelligence to the banking domain. Coverage of the last twelve years of scientific literature on those subjects. Usage of text mining and latent Dirichlet allocation to analyze articles. New insights and future research trends which may benefit the banking business. This paper analyzes recent literature in the search for trends in business intelligence applications for the banking industry. Searches were performed in relevant journals, resulting in 219 articles published between 2002 and 2013. To analyze such a large number of manuscripts, text mining techniques were used in pursuit of relevant terms in both the business intelligence and banking domains. Moreover, latent Dirichlet allocation modeling was used to group articles into several relevant topics. The analysis was conducted using a dictionary of terms belonging to both the banking and business intelligence domains. This procedure allowed for the identification of relationships between terms and the topics grouping the articles, enabling hypotheses regarding research directions to emerge. To confirm such hypotheses, relevant articles were collected and scrutinized, allowing the text mining procedure to be validated. The results show that credit in banking is clearly the main application trend, particularly predicting risk and thus supporting credit approval or denial. There is also a relevant interest in bankruptcy and fraud prediction. Customer retention seems to be associated, although weakly, with targeting, justifying bank offers to reduce churn. In addition, a large number of articles focused more on business intelligence techniques and their applications, using the banking industry merely for evaluation, and thus without clearly establishing benefits for the banking business. By identifying these current research topics, this study also highlights opportunities for future research.
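
The LDA step in this kind of literature analysis can be sketched in a few lines with scikit-learn; the abstracts and topic count below are illustrative assumptions.

```python
# Fit LDA over article abstracts and group articles by their dominant topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "credit risk scoring models for loan approval",
    "fraud detection in banking transactions",
    "customer churn prediction and retention offers",
]
X = CountVectorizer(stop_words="english").fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)
print(doc_topics.argmax(axis=1))  # dominant topic per article
```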

244 citations


Journal ArticleDOI
TL;DR: Different models, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc., are discussed.
Abstract: Topic models provide a convenient way to analyze large amounts of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first discusses methods of topic modeling, four of which are considered under this category: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category, topic evolution models, covers models that treat time as an important factor. Within this category, different models are discussed, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.
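
As an illustration of the first surveyed method, LSA amounts to a truncated SVD of a term-document matrix; a minimal scikit-learn sketch with a toy corpus:

```python
# Latent semantic analysis: truncated SVD of a TF-IDF matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "topics evolve over time in scientific literature",
    "dynamic topic models track topic evolution",
    "cats and dogs are common household pets",
]
X = TfidfVectorizer().fit_transform(docs)
doc_vecs = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(doc_vecs)  # documents in the latent "topic" space
```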

243 citations


Journal ArticleDOI
TL;DR: This paper presents Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs, proposes a similarity metric to measure the similarity of life styles between users, and calculates users' impact in terms of life styles with a friend-matching graph.
Abstract: Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate to reflect a user's preferences on friend selection in real life. In this paper, we present Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs. By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity. Inspired by text mining, we model a user's daily life as life documents, from which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm. We further propose a similarity metric to measure the similarity of life styles between users, and calculate users' impact in terms of life styles with a friend-matching graph. Upon receiving a request, Friendbook returns a list of people with the highest recommendation scores to the query user. Finally, Friendbook integrates a feedback mechanism to further improve the recommendation accuracy. We have implemented Friendbook on Android-based smartphones, and evaluated its performance in both small-scale experiments and large-scale simulations. The results show that the recommendations accurately reflect the preferences of users in choosing friends.
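
The life-style extraction and comparison pipeline described here can be sketched with gensim: LDA over toy "life documents" followed by cosine similarity between the resulting topic distributions. The activity vocabulary is a hypothetical stand-in for the sensor-derived one.

```python
# LDA over "life documents", then cosine similarity between users' life styles.
from gensim import corpora, models
from gensim.matutils import cossim

life_docs = [
    ["gym", "run", "salad", "gym", "sleep"],
    ["run", "gym", "salad", "bike", "sleep"],
    ["bar", "concert", "taxi", "sleep", "bar"],
]
dictionary = corpora.Dictionary(life_docs)
bows = [dictionary.doc2bow(d) for d in life_docs]
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, random_state=0)

styles = [lda[b] for b in bows]  # per-user life-style distributions
print(cossim(styles[0], styles[1]))  # similar life styles
print(cossim(styles[0], styles[2]))  # dissimilar life styles
```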

241 citations


Journal ArticleDOI
TL;DR: A heuristic approach based on analysis of the variation of statistical perplexity during topic modelling is proposed to estimate the most appropriate number of topics, and the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector.
Abstract: Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide number of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach. Based on analysis of the variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method on three markedly different types of ground-truth datasets: Salmonella next generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with the number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics.
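
A sketch of the RPC heuristic: fit LDA over a grid of topic counts, compute held-out perplexity, and take the rate of change between consecutive counts. The toy count matrices below stand in for a real document-term matrix.

```python
# Rate of perplexity change (RPC) across candidate numbers of topics.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(80, 200))  # toy document-term counts
X_test = rng.integers(0, 5, size=(20, 200))

topic_counts = [5, 10, 20, 40]
perplexities = []
for k in topic_counts:
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    perplexities.append(lda.perplexity(X_test))

# RPC_i = |P_i - P_{i-1}| / (T_i - T_{i-1}); the heuristic inspects how this
# rate changes across topic counts to pick an appropriate number of topics.
p, t = np.array(perplexities), np.array(topic_counts, dtype=float)
rpc = np.abs(np.diff(p)) / np.diff(t)
print(dict(zip(topic_counts[1:], rpc)))
```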

230 citations


Journal ArticleDOI
TL;DR: A semantic allocation level (SAL) multifeature fusion strategy based on PTM, namely, SAL-PTM (SAL-pLSA and SAL-LDA), for HSR imagery is proposed, and the experimental results confirmed that SAL-PTM is superior to the single-feature methods and CAT-PTM in the scene classification of HSR imagery.
Abstract: Scene classification has been proved to be an effective method for high spatial resolution (HSR) remote sensing image semantic interpretation. The probabilistic topic model (PTM) has been successfully applied to natural scenes by utilizing a single feature (e.g., the spectral feature); however, it is inadequate for HSR images due to the complex structure of the land-cover classes. Although several studies have investigated techniques that combine multiple features, the different features are usually quantized after simple concatenation (CAT-PTM). Unfortunately, due to the inadequate fusion capacity of k-means clustering, the words of the visual dictionary obtained by CAT-PTM are highly correlated. In this paper, a semantic allocation level (SAL) multifeature fusion strategy based on PTM, namely, SAL-PTM (SAL-pLSA and SAL-LDA), for HSR imagery is proposed. In SAL-PTM: (1) the complementary spectral, texture, and scale-invariant-feature-transform features are effectively combined; (2) the three features are extracted and quantized separately by k-means clustering, which can provide appropriate low-level feature descriptions for the semantic representations; and (3) the latent semantic allocations of the three features are captured separately by PTM, which follows the core idea of PTM-based scene classification. The probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) models were compared to test the effect of different PTMs for HSR imagery. A U.S. Geological Survey data set and the UC Merced data set were utilized to evaluate SAL-PTM in comparison with the conventional methods. The experimental results confirmed that SAL-PTM is superior to the single-feature methods and CAT-PTM in the scene classification of HSR imagery.
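
The SAL quantization step, quantizing each feature type separately rather than after concatenation, can be sketched as one k-means dictionary per feature. The arrays below are random stand-ins for spectral, texture, and SIFT descriptors.

```python
# Build a separate k-means visual dictionary per feature type.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = {
    "spectral": rng.normal(size=(500, 4)),
    "texture": rng.normal(size=(500, 8)),
    "sift": rng.normal(size=(500, 128)),
}

# Separate dictionaries avoid the highly correlated words produced by
# quantizing a simple concatenation (CAT) of heterogeneous features.
dictionaries = {
    name: KMeans(n_clusters=32, n_init=10, random_state=0).fit(feats)
    for name, feats in features.items()
}
words = {name: km.predict(features[name]) for name, km in dictionaries.items()}
print({name: w[:5] for name, w in words.items()})
```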

217 citations


Journal ArticleDOI
TL;DR: A novel multiscale and hierarchical framework is introduced, which describes the classification of TLS point clouds of cluttered urban scenes, and novel features of point clusters are constructed by employing the latent Dirichlet allocation (LDA).
Abstract: The effective extraction of shape features is an important requirement for the accurate and efficient classification of terrestrial laser scanning (TLS) point clouds. However, the challenge of how to obtain robust and discriminative features from noisy and varying density TLS point clouds remains. This paper introduces a novel multiscale and hierarchical framework, which describes the classification of TLS point clouds of cluttered urban scenes. In this framework, we propose multiscale and hierarchical point clusters (MHPCs). In MHPCs, point clouds are first resampled into different scales. Then, the resampled data set of each scale is aggregated into several hierarchical point clusters, where the point cloud of all scales in each level is termed a point-cluster set. This representation not only accounts for the multiscale properties of point clouds but also well captures their hierarchical structures. Based on the MHPCs, novel features of point clusters are constructed by employing the latent Dirichlet allocation (LDA). An LDA model is trained according to a training set. The LDA model then extracts a set of latent topics, i.e., a feature of topics, for a point cluster. Finally, to apply the introduced features for point-cluster classification, we train an AdaBoost classifier in each point-cluster set and obtain the corresponding classifiers to separate the TLS point clouds with varying point density and data missing into semantic regions. Compared with other methods, our features achieve the best classification results for buildings, trees, people, and cars from TLS point clouds, particularly for small and moving objects, such as people and cars.
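
The feature/classifier combination described above, per-cluster word histograms turned into LDA topic proportions and fed to AdaBoost, can be sketched as follows; the count matrix and labels are random stand-ins for real TLS point-cluster data.

```python
# LDA topic proportions as features for an AdaBoost point-cluster classifier.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
counts = rng.integers(0, 10, size=(200, 50))  # point clusters x local "words"
labels = rng.integers(0, 4, size=200)         # toy building/tree/person/car labels

topics = LatentDirichletAllocation(n_components=10, random_state=0)
features = topics.fit_transform(counts)       # latent-topic feature per cluster

clf = AdaBoostClassifier(random_state=0).fit(features, labels)
print(clf.score(features, labels))
```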

Book ChapterDOI
01 Jan 2015
TL;DR: In this article, the authors provide an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic analysis method today, and illustrate how to employ LDA on a textual data set.
Abstract: Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories. This chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Next, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, we discuss the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.
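
In the spirit of the tutorial introduction described here, a minimal end-to-end LDA run over a toy "software repository" corpus might look like the following; the corpus and parameter choices are illustrative assumptions.

```python
# Fit LDA on commit messages and print the top words of each topic.
from gensim import corpora, models

commits = [
    "fix null pointer exception in parser",
    "refactor parser error handling",
    "add login page to web ui",
    "update ui styles for login form",
]
texts = [c.split() for c in commits]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])
```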

Proceedings ArticleDOI
07 Jun 2015
TL;DR: In this article, an interleaved text/image deep learning system is presented to extract and mine the semantic interactions of radiology images and reports from a national research hospital's picture archiving and communication system.
Abstract: Despite tremendous progress in computer vision, effective learning on very large-scale (> 100K patients) medical image databases has been vastly hindered. We present an interleaved text/image deep learning system to extract and mine the semantic interactions of radiology images and reports from a national research hospital's picture archiving and communication system. Instead of using full 3D medical volumes, we focus on a collection of representative ∼216K 2D key images/slices (selected by clinicians for diagnostic reference) with text-driven scalar and vector labels. Our system interleaves between unsupervised learning (e.g., latent Dirichlet allocation, recurrent neural net language models) on document- and sentence-level texts to generate semantic labels and supervised learning via deep convolutional neural networks (CNNs) to map from images to label spaces. Disease-related key words can be predicted for radiology images in a retrieval manner. We have demonstrated promising quantitative and qualitative results. The large-scale datasets of extracted key images and their categorization, embedded vector labels and sentence descriptions can be harnessed to alleviate the deep learning “data-hungry” obstacle in the medical domain.

Journal ArticleDOI
TL;DR: The overall design of the system provides satisfactory performance in identifying ADR-related posts for post-marketing drug surveillance and points out a potentially fruitful direction for building other early warning systems that need to filter big data from social media networks.

Proceedings ArticleDOI
05 Jan 2015
TL;DR: A novel video descriptor, referred to as Histogram of Oriented Tracklets, for recognizing abnormal situations in crowded scenes is presented, which quantizes orientation and magnitude in a 2-dimensional histogram that encodes the motion patterns expected in each cuboid.
Abstract: This paper presents a novel video descriptor, referred to as Histogram of Oriented Tracklets, for recognizing abnormal situations in crowded scenes. Unlike standard approaches that use optical flow, which estimates motion vectors only from two successive frames, we built our descriptor over long-range motion trajectories, which are called tracklets in the literature. Following the standard procedure, we divided video sequences into spatio-temporal cuboids within which we collected statistics on the tracklets passing through them. In particular, we quantized orientation and magnitude in a 2-dimensional histogram which encodes the motion patterns expected in each cuboid. We classify frames as normal and abnormal by using Latent Dirichlet Allocation and Support Vector Machines. We evaluated the effectiveness of the proposed descriptors on three datasets: UCSD, Violence in Crowds and UMN. The experiments demonstrated (i) very promising results in abnormality detection, (ii) setting a new state-of-the-art on two of them, and (iii) outperforming former descriptors based on optical flow, dense trajectories and the social force model.
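
The core of the descriptor, a joint 2-D histogram over tracklet orientation and magnitude within one spatio-temporal cuboid, can be sketched with NumPy; the tracklet displacements below are random stand-ins.

```python
# HOT-style descriptor: 2-D histogram of tracklet orientation and magnitude.
import numpy as np

rng = np.random.default_rng(0)
dx, dy = rng.normal(size=500), rng.normal(size=500)  # tracklet displacements

orientation = np.arctan2(dy, dx)  # in [-pi, pi]
magnitude = np.hypot(dx, dy)

hist, _, _ = np.histogram2d(
    orientation, magnitude,
    bins=[8, 4],
    range=[[-np.pi, np.pi], [0, magnitude.max()]],
)
descriptor = (hist / hist.sum()).ravel()  # normalized per-cuboid descriptor
print(descriptor.shape)                   # (32,) feature vector
```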

Proceedings Article
25 Jan 2015
TL;DR: Experiments conducted on three real datasets show that both learning more effective representation and learning from relational data are beneficial steps to take to advance the state of the art.
Abstract: Tag recommendation has become one of the most important ways of organizing and indexing online resources like articles, movies, and music. Since tagging information is usually very sparse, effective learning of the content representation for these resources is crucial to accurate tag recommendation. Recently, models proposed for tag recommendation, such as collaborative topic regression and its variants, have demonstrated promising accuracy. However, a limitation of these models is that, by using topic models like latent Dirichlet allocation as the key component, the learned representation may not be compact and effective enough. Moreover, since relational data exist as an auxiliary data source in many applications, it is desirable to incorporate such data into tag recommendation models. In this paper, we start with a deep learning model called stacked denoising autoencoder (SDAE) in an attempt to learn more effective content representation. We propose a probabilistic formulation for SDAE and then extend it to a relational SDAE (RSDAE) model. RSDAE jointly performs deep representation learning and relational learning in a principled way under a probabilistic framework. Experiments conducted on three real datasets show that both learning more effective representation and learning from relational data are beneficial steps to take to advance the state of the art.

01 Jan 2015
TL;DR: This chapter proposes that conversational modeling has the potential to radically alter the understanding and practice of citizenship and discusses two experimental platforms that take different approaches to this problem.
Abstract: If we think of the smart city as a reading environment, we can use it to change what it means to be a citizen, to improve how public topics are addressed, and to democratize how decisions are made. The starting point is text, supplemented with the various other kinds of data that can be gathered through digital means. In this chapter, we discuss two experimental platforms that take different approaches. First is the Data Stories project, where we have been sequencing text from various dynamic sources through a thematic clustering algorithm (Latent Dirichlet Allocation), feeding those thematic clusters into a narrative generator, then putting those results into a storyboarding system. Using the output, we can examine patterns emerging from a variety of text streams, such as Twitter, Facebook, news feeds, and so on. More importantly, however, we can allow people to manipulate the parameters, so that using the same text stream can produce multiple simultaneous valid outputs, depending on the perspective that the reader wishes to take on the feed. Providing a method for encouraging this kind of interpretive or hermeneutic inquiry is a promising strategy for supporting civil discourse. Our second project, Conversational Modeling, is building on previous research to investigate the various ways in which discussions, which occur sequentially through time, can be profitably modeled as 3-D objects of various kinds. These models can subsequently be used for recollection, communication, and analysis, but they may also have a generative potential. As a means of dealing with the structure and substance of discussions in civil society, we propose that conversational modeling has the potential to radically alter our understanding and practice of citizenship.

Book ChapterDOI
01 Jan 2015
TL;DR: This chapter gives an introduction to music recommender systems research, highlighting the distinctive characteristics of music, as compared to other kinds of media, and pointing to the most important challenges faced by music recommendation research.
Abstract: This chapter gives an introduction to music recommender systems research. We highlight the distinctive characteristics of music, as compared to other kinds of media. We then provide a literature survey of content-based music recommendation, contextual music recommendation, hybrid methods, and sequential music recommendation, followed by an overview of evaluation strategies and commonly used data sets. We conclude by pointing to the most important challenges faced by music recommendation research.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: Experiments using a state-of-the-art LVCSR system showed adaptation could yield perplexity reductions of 8% relative over the baseline RNNLM and small but consistent word error rate reductions.
Abstract: Recurrent neural network language models (RNNLMs) have recently become increasingly popular for many applications including speech recognition. In previous research RNNLMs have normally been trained on well-matched in-domain data. The adaptation of RNNLMs remains an open research area to be explored. In this paper, genre and topic based RNNLM adaptation techniques are investigated for a multi-genre broadcast transcription task. A number of techniques including Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation and Hierarchical Dirichlet Processes are used to extract show-level topic information. These were then used as additional input to the RNNLM during training, which can facilitate unsupervised test time adaptation. Experiments using a state-of-the-art LVCSR system trained on 1000 hours of speech and more than 1 billion words of text showed adaptation could yield perplexity reductions of 8% relative over the baseline RNNLM and small but consistent word error rate reductions.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: A Markov Random Field regularized Latent Dirichlet Allocation model, which defines a MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label, and can accommodate the subtlety that whether two words are similar depends on which topic they appear in.
Abstract: This paper studies how to incorporate external word correlation knowledge to improve the coherence of topic modeling. Existing topic models assume words are generated independently and lack the mechanism to utilize the rich similarity relationships among words to learn coherent topics. To solve this problem, we build a Markov Random Field (MRF) regularized Latent Dirichlet Allocation (LDA) model, which defines a MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label. Under our model, the topic assignment of each word is not independent, but rather affected by the topic labels of its correlated words. Similar words have a better chance to be put into the same topic due to the regularization of the MRF; hence the coherence of topics can be boosted. In addition, our model can accommodate the subtlety that whether two words are similar depends on which topic they appear in, which allows words with multiple senses to be put into different topics properly. We derive a variational inference method to infer the posterior probabilities and learn model parameters, and present techniques to deal with the hard-to-compute partition function in the MRF. Experiments on two datasets demonstrate the effectiveness of our model.

Journal ArticleDOI
TL;DR: This paper introduces an alternative semi-probabilistic approach, called additive regularization of topic models (ARTM), which regularizes an ill-posed problem of stochastic matrix factorization by maximizing a weighted sum of the log-likelihood and additional criteria.
Abstract: Probabilistic topic modeling of text collections has been recently developed mainly within the framework of graphical models and Bayesian inference. In this paper we introduce an alternative semi-probabilistic approach, which we call additive regularization of topic models (ARTM). Instead of building a purely probabilistic generative model of text we regularize an ill-posed problem of stochastic matrix factorization by maximizing a weighted sum of the log-likelihood and additional criteria. This approach enables us to combine probabilistic assumptions with linguistic and problem-specific requirements in a single multi-objective topic model. In the theoretical part of the work we derive the regularized EM-algorithm and provide a pool of regularizers, which can be applied together in any combination. We show that many models previously developed within Bayesian framework can be inferred easier within ARTM and in some cases generalized. In the experimental part we show that a combination of sparsing, smoothing, and decorrelation improves several quality measures at once with almost no loss of the likelihood.
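
Schematically, and in notation that may differ from the paper's own, the ARTM objective replaces pure likelihood maximization with a regularized criterion over the stochastic word-topic matrix Φ and topic-document matrix Θ:

```latex
\max_{\Phi,\Theta}\;
  \sum_{d \in D} \sum_{w \in d} n_{dw}
    \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
  \;+\; \sum_{i} \tau_i\, R_i(\Phi,\Theta)
\quad \text{s.t.} \quad
  \phi_{wt} \ge 0,\ \sum_{w} \phi_{wt} = 1,\qquad
  \theta_{td} \ge 0,\ \sum_{t} \theta_{td} = 1,
```

where each R_i is a regularizer (e.g., sparsing, smoothing, decorrelation) with weight τ_i.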

Proceedings ArticleDOI
16 Sep 2015
TL;DR: A bag-of-words product-of-experts model and a recurrent neural network are used as models of reviews to improve collaborative filtering performance, although the greater modeling power of the recurrent network appears to undermine its ability to act as a regularizer of the product representations.
Abstract: Recent work has shown that collaborative filter-based recommender systems can be improved by incorporating side information, such as natural language reviews, as a way of regularizing the derived product representations. Motivated by the success of this approach, we introduce two different models of reviews and study their effect on collaborative filtering performance. While the previous state-of-the-art approach is based on a latent Dirichlet allocation (LDA) model of reviews, the models we explore are neural network based: a bag-of-words product-of-experts model and a recurrent neural network. We demonstrate that the increased flexibility offered by the product-of-experts model allowed it to achieve state-of-the-art performance on the Amazon review dataset, outperforming the LDA-based approach. However, interestingly, the greater modeling power offered by the recurrent neural network appears to undermine the model's ability to act as a regularizer of the product representations.

Proceedings ArticleDOI
01 Jun 2015
TL;DR: An unsupervised topic model for short texts that performs soft clustering over distributed representations of words using Gaussian mixture models whose components capture the notion of latent topics and which outperforms LDA on short texts through both subjective and objective evaluation.
Abstract: We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs) whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) need aggregation of short messages to avoid data sparsity in short documents, our framework works on large amounts of raw short texts (billions of words). In contrast with other topic modeling frameworks that use word co-occurrence statistics, our framework uses a vector space model that overcomes the issue of sparse word co-occurrence patterns. We demonstrate that our framework outperforms LDA on short texts through both subjective and objective evaluation. We also show the utility of our framework in learning topics and classifying short texts on Twitter data for English, Spanish, French, Portuguese and Russian.
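
The modelling idea, mixture components over word vectors playing the role of latent topics, can be sketched with scikit-learn; the embedding matrix is a random stand-in for trained word embeddings.

```python
# Gaussian mixture over word vectors; components act as latent topics.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(1000, 50))  # placeholder word embeddings

gmm = GaussianMixture(n_components=20, covariance_type="diag",
                      random_state=0).fit(word_vectors)

# Soft topic memberships per word; a short text's topic distribution can be
# obtained by averaging the posteriors of its words.
posteriors = gmm.predict_proba(word_vectors)
print(posteriors[0].round(2))
```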

Proceedings ArticleDOI
01 Jun 2015
TL;DR: A novel framework for argument tagging based on topic modeling is proposed and it is shown that using Non-Negative Matrix Factorization instead of Latent Dirichlet Allocation achieves better results for argument classification, close to the results of a supervised classifier.
Abstract: Argumentation mining and stance classification were recently introduced as interesting tasks in text mining. In this paper, a novel framework for argument tagging based on topic modeling is proposed. Unlike other machine learning approaches for argument tagging, which often require a large set of labeled data, the proposed model is minimally supervised and merely a one-to-one mapping between the pre-defined argument set and the extracted topics is required. These extracted arguments are subsequently exploited for stance classification. Additionally, a manually annotated corpus for stance classification and argument tagging of online news comments is introduced and made available. Experiments on our collected corpus demonstrate the benefits of using topic modeling for argument tagging. We show that using Non-Negative Matrix Factorization instead of Latent Dirichlet Allocation achieves better results for argument classification, close to the results of a supervised classifier. Furthermore, the statistical model that leverages automatically-extracted arguments as features for stance classification shows promising results.
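
The minimally supervised step described above can be sketched as NMF over TF-IDF vectors, followed by a manual one-to-one mapping from the extracted topics to the pre-defined argument set; the comments below are illustrative.

```python
# NMF topics over comments; each comment's dominant topic is then mapped
# (manually, one-to-one) to a pre-defined argument label.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "the new tax hurts small businesses",
    "small businesses cannot afford this tax",
    "the policy protects the environment for future generations",
]
X = TfidfVectorizer(stop_words="english").fit_transform(comments)

W = NMF(n_components=2, random_state=0).fit_transform(X)
print(W.argmax(axis=1))  # dominant topic per comment, to be mapped to arguments
```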

Proceedings ArticleDOI
13 Oct 2015
TL;DR: A novel framework, namely multi-query expansions, is proposed to retrieve semantically robust landmarks in two steps, along with a novel technique to generate a robust yet compact pattern set from the multi-query photos.
Abstract: Given a query photo issued by a user (q-user), landmark retrieval is to return a set of photos with their landmarks similar to those of the query, while the existing studies on landmark retrieval focus on exploiting geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users may convey different geometry information depending on the viewpoints and/or angles, and may subsequently yield very different results. In fact, dealing with the landmarks with shapes caused by the photography of q-users is often nontrivial and has never been studied. Motivated by this, in this paper we propose a novel framework, namely multi-query expansions, to retrieve semantically robust landmarks in two steps. Firstly, we identify the top-k photos regarding the latent topics of a query landmark to construct a multi-query set so as to remedy its possible shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Secondly, we propose a novel technique to generate a robust yet compact pattern set from the multi-query photos. To ensure redundancy-free results and enhance efficiency, we adopt existing minimum-description-length-principle based pattern mining techniques to remove similar query photos from the (k+1) selected query photos. Then, a landmark retrieval rule is developed to calculate the ranking scores between the mined pattern set and each photo in the database, which are ranked to serve as the final ranking list of landmark retrieval. Extensive experiments are conducted on real-world landmark datasets, validating the significantly higher accuracy of our approach.

Journal ArticleDOI
TL;DR: Experiments show that the explicit topic model, which incorporates pre-existing knowledge, outperforms traditional feature selection methods and other existing methods by a large margin and the identification task can be completed better.
Abstract: The essential work of feature-specific opinion mining is centered on the product features. Previous related research work has often taken into account explicit features but ignored implicit features. However, implicit feature identification, which can help us better understand the reviews, is an essential aspect of feature-specific opinion mining. This paper is mainly centered on implicit feature identification in Chinese product reviews. We propose that, based on the explicit synonymous feature group and the sentences which contain explicit features, several Support Vector Machine (SVM) classifiers can be established to classify the non-explicit sentences. Nevertheless, instead of simply using traditional feature selection methods, we believe an explicit topic model in which each topic is pre-defined could perform better. In this paper, we first extend a popular topic modeling method, called Latent Dirichlet Allocation (LDA), to construct an explicit topic model. Then some types of prior knowledge, such as must-links, cannot-links and relevance-based prior knowledge, are extracted and incorporated into the explicit topic model automatically. Experiments show that the explicit topic model, which incorporates pre-existing knowledge, outperforms traditional feature selection methods and other existing methods by a large margin, and the identification task can be completed better.

Journal ArticleDOI
TL;DR: Light is shed on the theory that underlies text mining methods and guidance is provided for researchers who seek to apply these methods.
Abstract: The amount of textual data that is available for researchers and businesses to analyze is increasing at a dramatic rate. This reality has led IS researchers to investigate various text mining techniques. This essay examines four text mining methods that are frequently used in order to identify their characteristics and limitations. The four methods that we examine are (1) latent semantic analysis, (2) probabilistic latent semantic analysis, (3) latent Dirichlet allocation, and (4) correlated topic model. We review these four methods and compare them with topic detection and spam filtering to reveal their peculiarity. Our paper sheds light on the theory that underlies text mining methods and provides guidance for researchers who seek to apply these methods.

Proceedings Article
21 Feb 2015
TL;DR: A novel bilevel optimization formulation is given to identify the optimal poisoning attack on latent Dirichlet allocation (LDA), and an efficient solution (up to local optima) is presented using a descent method and implicit functions.
Abstract: Latent Dirichlet allocation (LDA) is an increasingly popular tool for data analysis in many domains. If LDA output affects decision making (especially when money is involved), there is an incentive for attackers to compromise it. We ask the question: how can an attacker minimally poison the corpus so that LDA produces topics that the attacker wants the LDA user to see? Answering this question is important to characterize such attacks, and to develop defenses in the future. We give a novel bilevel optimization formulation to identify the optimal poisoning attack. We present an efficient solution (up to local optima) using a descent method and implicit functions. We demonstrate poisoning attacks on LDA with extensive experiments, and discuss possible defenses.
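
Schematically, and in illustrative notation rather than the paper's own, the bilevel formulation pits an outer attacker objective against the inner LDA fit on the poisoned corpus:

```latex
\min_{\delta}\;
  \bigl\| \hat{\Phi}(D+\delta) - \Phi^{*} \bigr\|^{2}
  \;+\; \lambda \,\|\delta\|_{1}
\quad \text{s.t.} \quad
  \hat{\Phi}(D+\delta) \in
  \arg\max_{\Phi}\; \mathcal{L}_{\mathrm{LDA}}(\Phi;\, D+\delta),
```

where D is the corpus, δ the attacker's perturbation, Φ* the attacker's target topics, and Φ̂ the topics the LDA learner actually fits.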

Journal ArticleDOI
TL;DR: This work shows how a common text-mining method (latent Dirichlet allocation, or topic modeling) and statistical tests familiar to ecologists can be used to investigate trends and identify potential research gaps in the scientific literature, increasing scientists' capacity for research synthesis.
Abstract: Keeping track of conceptual and methodological developments is a critical skill for research scientists, but this task is increasingly difficult due to the high rate of academic publication. As a crisis discipline, conservation science is particularly in need of tools that facilitate rapid yet insightful synthesis. We show how a common text-mining method (latent Dirichlet allocation, or topic modeling) and statistical tests familiar to ecologists (cluster analysis, regression, and network analysis) can be used to investigate trends and identify potential research gaps in the scientific literature. We tested these methods on the literature on ecological surrogates and indicators. Analysis of topic popularity within this corpus showed a strong emphasis on monitoring and management of fragmented ecosystems, while analysis of research gaps suggested a greater role for genetic surrogates and indicators. Our results show that automated text analysis methods need to be used with care, but can provide information that is complementary to that given by systematic reviews and meta-analyses, increasing scientists' capacity for research synthesis.

Journal ArticleDOI
TL;DR: The proposed topic-to-question generation approach can significantly outperform the state-of-the-art results, and the use of syntactic tree kernels is proposed for the automatic judgment of the syntactic correctness of the questions.
Abstract: This paper is concerned with automatic generation of all possible questions from a topic of interest. Specifically, we consider that each topic is associated with a body of texts containing useful information about the topic. Then, questions are generated by exploiting the named entity information and the predicate argument structures of the sentences present in the body of texts. The importance of the generated questions is measured using Latent Dirichlet Allocation by identifying the subtopics which are closely related to the original topic in the given body of texts and applying the Extended String Subsequence Kernel to calculate their similarity with the questions. We also propose the use of syntactic tree kernels for the automatic judgment of the syntactic correctness of the questions. The questions are ranked by considering both their importance in the context of the given body of texts and syntactic correctness. To the best of our knowledge, no previous study has accomplished this task in our setting. A series of experiments demonstrate that the proposed topic-to-question generation approach can significantly outperform the state-of-the-art results.