
Showing papers on "Probabilistic latent semantic analysis published in 2008"


Proceedings ArticleDOI
23 Jun 2008
TL;DR: A discriminatively trained, multiscale, deformable part model for object detection, which achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge and outperforms the best results in the 2007 challenge in ten out of twenty categories.
Abstract: This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
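For intuition, here is a minimal toy sketch (NumPy, not the authors' implementation) of the semi-convex training loop described above: latent placements are fixed for the positive examples, after which the objective is convex in the weights and can be minimised by sub-gradient descent while negatives keep their inner max. The flat per-placement feature representation and all names are illustrative assumptions.

```python
import numpy as np

def train_latent_svm(pos, neg, dim, outer_iters=10, inner_iters=200, C=1.0, lr=1e-3):
    """Toy latent-SVM sketch. pos, neg: lists of examples, each an array of shape
    (n_latent_choices, dim) holding one feature vector per latent placement."""
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Step 1: fix the latent choice of each positive to its best-scoring placement.
        pos_feats = [x[np.argmax(x @ w)] for x in pos]
        # Step 2: with positive latent values fixed the problem is convex in w;
        # minimise 0.5||w||^2 + C * hinge losses by sub-gradient descent.
        for _ in range(inner_iters):
            grad = w.copy()
            for f in pos_feats:                      # hinge on positives (fixed latent)
                if f @ w < 1:
                    grad -= C * f
            for x in neg:                            # negatives keep the inner max
                f = x[np.argmax(x @ w)]
                if f @ w > -1:
                    grad += C * f
            w -= lr * grad
    return w
```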

2,893 citations


Journal ArticleDOI
TL;DR: A novel unsupervised learning method for human action categories that can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
Abstract: We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Thanks to the use of probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
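Several entries on this page fit pLSA by EM to a count matrix (documents over words, or videos over spatial-temporal words). As a point of reference, a minimal generic NumPy sketch of that EM procedure (a toy, not any of the authors' code):

```python
import numpy as np

def plsa(counts, n_topics, n_iters=100, seed=0):
    """Minimal pLSA via EM on a (n_docs, n_words) count matrix.
    Returns P(topic|doc) and P(word|topic)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z | d, w) for every document-word pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (docs, topics, words)
        post = joint / joint.sum(1, keepdims=True).clip(1e-12)
        # M-step: re-estimate both conditionals from expected counts.
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z
```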

1,440 citations


Journal ArticleDOI
TL;DR: This work introduces a novel vocabulary using dense color SIFT descriptors and investigates the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM).
Abstract: We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual word representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
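A toy sketch of the pipeline described above: bag-of-visual-words counts reduced to pLSA topic vectors and fed to a multiway SVM. It reuses the plsa() sketch shown after the action-recognition entry; the random counts and labels below are stand-ins, not real image features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(60, 300)).astype(float)  # 60 images, 300 visual words
labels = rng.integers(0, 4, size=60)                      # 4 scene categories
p_topic_given_img, _ = plsa(counts, n_topics=25)          # topic distribution per image
clf = SVC(kernel="linear").fit(p_topic_given_img, labels) # multiway classifier on topics
print(clf.predict(p_topic_given_img[:5]))
```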

778 citations


Proceedings ArticleDOI
24 Aug 2008
TL;DR: This work addresses the problem of joint modeling of text and citations in the topic modeling framework with two different models, Pairwise-Link-LDA and Link-PLSA-LDA, the latter combining the LDA and PLSA models into a single graphical model.
Abstract: In this work, we address the problem of joint modeling of text and citations in the topic modeling framework. We present two different models called the Pairwise-Link-LDA and the Link-PLSA-LDA models. The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Block Stochastic Models [1] and allows modeling arbitrary link structure. However, the model is computationally expensive, since it involves modeling the presence or absence of a citation (link) between every pair of documents. The second model solves this problem by assuming that the link structure is a bipartite graph. As the name indicates, Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model. Our experiments on a subset of Citeseer data show that both these models are able to predict unseen data better than the baseline model of Erosheva and Lafferty [8], by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets on the link prediction task show that the Link-PLSA-LDA model performs the best on the citation prediction task, while also remaining highly scalable. In addition, we also present some interesting visualizations generated by each of the models.

456 citations


Journal ArticleDOI
TL;DR: It is shown that PLSI and NMF (with the I-divergence objective function) optimize the same objective function, although PLSI and NMF are different algorithms, as verified by experiments.
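For reference, the NMF objective at issue is the I-divergence (generalized KL divergence) between the count matrix X and its factorization WH, while PLSI maximises the log-likelihood sum_ij X_ij log sum_k P(w_i|z_k) P(z_k|d_j). A small NumPy sketch of the former (an illustration, not the paper's code):

```python
import numpy as np

def i_divergence(X, W, H, eps=1e-12):
    """Generalized KL (I-)divergence D(X || WH), the NMF objective referenced above."""
    WH = W @ H
    return float(np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH))
```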

311 citations


Journal ArticleDOI
TL;DR: Novel methods for automatic object detection in high-resolution images by combining spectral information with structural information exploited by using image segmentation are presented.
Abstract: The object-based analysis of remotely sensed imagery provides valuable spatial and structural information that is complementary to pixel-based spectral information in classification. In this paper, we present novel methods for automatic object detection in high-resolution images by combining spectral information with structural information exploited by using image segmentation. The proposed segmentation algorithm uses morphological operations applied to individual spectral bands using structuring elements in increasing sizes. These operations produce a set of connected components forming a hierarchy of segments for each band. A generic algorithm is designed to select meaningful segments that maximize a measure consisting of spectral homogeneity and neighborhood connectivity. Given the observation that different structures appear more clearly at different scales in different spectral bands, we describe a new algorithm for unsupervised grouping of candidate segments belonging to multiple hierarchical segmentations to find coherent sets of segments that correspond to actual objects. The segments are modeled by using their spectral and textural content, and the grouping problem is solved by using the probabilistic latent semantic analysis algorithm that builds object models by learning the object-conditional probability distributions. The automatic labeling of a segment is done by computing the similarity of its feature distribution to the distribution of the learned object models using the Kullback-Leibler divergence. The performances of the unsupervised segmentation and object detection algorithms are evaluated qualitatively and quantitatively using three different data sets with comparative experiments, and the results show that the proposed methods are able to automatically detect, group, and label segments belonging to the same object classes.
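The labeling step described above amounts to assigning a segment the object class whose learned feature distribution is closest in Kullback-Leibler divergence; a minimal sketch with hypothetical class names and made-up distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete probability vectors."""
    p = p + eps; q = q + eps
    return float(np.sum(p * np.log(p / q)))

def label_segment(segment_dist, object_models):
    """Pick the class whose learned distribution minimises KL to the segment's."""
    return min(object_models, key=lambda c: kl_divergence(segment_dist, object_models[c]))

models = {"building": np.array([0.7, 0.2, 0.1]), "road": np.array([0.2, 0.3, 0.5])}
print(label_segment(np.array([0.6, 0.3, 0.1]), models))   # -> "building"
```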

236 citations


Proceedings ArticleDOI
15 Oct 2008
TL;DR: In this article, the authors present an LDA-based static technique for bug localization based on the latent Dirichlet allocation (LDA) model, which has significant advantages over both LSI and probabilistic LSI.
Abstract: In bug localization, a developer uses information about a bug to locate the portion of the source code to modify to correct the bug. Developers expend considerable effort performing this task. Some recent static techniques for automatic bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI); however, latent Dirichlet allocation (LDA), a modular and extensible IR model, has significant advantages over both LSI and probabilistic LSI (pLSI). In this paper we present an LDA-based static technique for automating bug localization. We describe the implementation of our technique and three case studies that measure its effectiveness. For two of the case studies we directly compare our results to those from similar studies performed using LSI. The results demonstrate that our LDA-based technique performs at least as well as the LSI-based techniques for all bugs and performs better, often significantly so, than the LSI-based techniques for most bugs.
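A hedged sketch of this style of IR-based bug localization, using scikit-learn's LDA implementation rather than the authors' system: the source files and bug report below are hypothetical stand-ins, and files are ranked by cosine similarity of their topic distributions to the query's.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: one "document" per source file, plus a bug report as the query.
source_files = {"parser.c": "token stream parse error recovery lookahead grammar",
                "render.c": "draw buffer swap frame texture shader pipeline"}
bug_report = "crash when parse error occurs during recovery"

vec = CountVectorizer()
X = vec.fit_transform(list(source_files.values()))
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)
file_topics = lda.transform(X)                              # P(topic | file)
query_topics = lda.transform(vec.transform([bug_report]))   # P(topic | bug report)

sims = (file_topics @ query_topics.T).ravel() / (
    np.linalg.norm(file_topics, axis=1) * np.linalg.norm(query_topics) + 1e-12)
for name, s in sorted(zip(source_files, sims), key=lambda t: -t[1]):
    print(name, round(float(s), 3))
```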

232 citations


Proceedings ArticleDOI
20 Jul 2008
TL;DR: A novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model, called Topic-bridged PLSA, or TPLSA.
Abstract: In many Web applications, such as blog classification and newsgroup classification, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time consuming, while there may be plenty of labeled data in a related but different domain. Traditional text classification approaches are not able to cope well with learning across different domains. In this paper, we propose a novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model. We call this new model Topic-bridged PLSA, or TPLSA. By exploiting the common topics between two domains, we transfer knowledge across different domains through a topic-bridge to help the text classification in the target domain. A unique advantage of our method is its ability to maximally mine knowledge that can be transferred between domains, resulting in superior performance when compared to other state-of-the-art text classification approaches. Experimental evaluation on different kinds of datasets shows that our proposed algorithm can improve the performance of cross-domain text classification significantly.

208 citations


Journal ArticleDOI
TL;DR: The authors formulate a metatheoretical framework for latent variable modeling and argue that the difference between observed and latent variables is purely epistemic in nature: we treat a variable as observed when the inference from data structure to variable structure can be made with certainty and as latent when this inference is prone to error.
Abstract: This paper formulates a metatheoretical framework for latent variable modeling. It does so by spelling out the difference between observed and latent variables. This difference is argued to be purely epistemic in nature: We treat a variable as observed when the inference from data structure to variable structure can be made with certainty and as latent when this inference is prone to error. This difference in epistemic accessibility is argued to be directly related to the data-generating process, i.e., the process that produces the concrete data patterns on which statistical analyses are executed. For a variable to count as observed through a set of data patterns, the relation between variable structure and data structure should be (a) deterministic, (b) causally isolated, and (c) of equivalent cardinality. When any of these requirements is violated, (part of) the variable structure should be considered latent. It is argued that, on these criteria, observed variables are rare to nonexistent in psychology;...

179 citations


Proceedings ArticleDOI
26 Oct 2008
TL;DR: This paper proposes a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling, which models the document space as a submanifold embedded in the ambient space and directly performs the topic modeling on this document manifold in question.
Abstract: Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing, which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document on the hidden topics independently and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation (LDA) is proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both of these two methods discover the hidden topics in the Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. Therefore, it is more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs the topic modeling on this document manifold in question. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides better representation in the sense of semantic structure.
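One common way to encode the "document manifold" assumption described above is a k-nearest-neighbour graph over documents with a Laplacian smoothness penalty on their topic distributions. The sketch below illustrates that idea only; it is not the exact LapPLSI objective or implementation, and the function and parameter names are assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def manifold_penalty(doc_topic, doc_term, n_neighbors=5):
    """Graph-Laplacian smoothness of topic distributions over a k-NN document graph.
    doc_topic: (n_docs, n_topics) topic distributions; doc_term: (n_docs, n_terms) features."""
    W = kneighbors_graph(doc_term, n_neighbors, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                             # symmetrise the adjacency
    W = W.toarray()
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian
    # tr(Theta^T L Theta) = 0.5 * sum_jl W_jl ||theta_j - theta_l||^2
    return float(np.trace(doc_topic.T @ L @ doc_topic))
```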

173 citations



Journal ArticleDOI
TL;DR: This paper presents a novel approach for sports video semantic event detection based on analysis and alignment of Webcast text and broadcast video, and employs a conditional random field model (CRFM) to align text event and video event.
Abstract: Sports video semantic event detection is essential for sports video summarization and retrieval. Extensive research efforts have been devoted to this area in recent years. However, the existing sports video event detection approaches heavily rely on either video content itself, which faces the difficulty of high-level semantic information extraction from video content using computer vision and image processing techniques, or manually generated video ontology, which is domain specific and difficult to be automatically aligned with the video content. In this paper, we present a novel approach for sports video semantic event detection based on analysis and alignment of Webcast text and broadcast video. Webcast text is a text broadcast channel for sports games which is co-produced with the broadcast video and is easily obtained from the Web. We first analyze Webcast text to cluster and detect text events in an unsupervised way using probabilistic latent semantic analysis (pLSA). Based on the detected text event and video structure analysis, we employ a conditional random field model (CRFM) to align text event and video event by detecting event moment and event boundary in the video. Incorporation of Webcast text into sports video analysis significantly facilitates sports video semantic event detection. We conducted experiments on 33 hours of soccer and basketball games for Webcast analysis, broadcast video analysis and text/video semantic alignment. The results are encouraging when compared with the manually labeled ground truth.

Proceedings ArticleDOI
05 Jul 2008
TL;DR: A range of approaches is presented for embedding data in a non-Euclidean latent space for the Gaussian Process latent variable model, which allows transitions between motion styles to be learned even though such transitions are not present in the data.
Abstract: In dimensionality reduction approaches, the data are typically embedded in a Euclidean latent space. However, for some data sets this is inappropriate. For example, in human motion data we expect latent spaces that are cylindrical or toroidal, which are poorly captured with a Euclidean space. In this paper, we present a range of approaches for embedding data in a non-Euclidean latent space. Our focus is the Gaussian Process latent variable model. In the context of human motion modeling this allows us to (a) learn models with interpretable latent directions enabling, for example, style/content separation, and (b) generalise beyond the data set enabling us to learn transitions between motion styles even though such transitions are not present in the data.

Proceedings ArticleDOI
07 Jul 2008
TL;DR: Two ways of improving image classification based on bag-of-words representation are proposed and new techniques to eliminate useless words are proposed, one based on geometric properties of the keypoints, the other on the use of probabilistic Latent Semantic Analysis (pLSA).
Abstract: In this paper, we propose two ways of improving image classification based on bag-of-words representation [25]. Two shortcomings of this representation are the loss of the spatial information of visual words and the presence of noisy visual words due to the coarseness of the vocabulary building process. On the one hand, we propose a new representation of images that goes further in the analogy with textual data: visual sentences, that allows us to "read" visual words in a certain order, as in the case of text. We can therefore consider simple spatial relations between words. We also present a new image classification scheme that exploits these relations. It is based on the use of language models, a very popular tool from speech and text analysis communities. On the other hand, we propose new techniques to eliminate useless words, one based on geometric properties of the keypoints, the other on the use of probabilistic Latent Semantic Analysis (pLSA). Experiments show that our techniques can significantly improve image classification, compared to a classical Support Vector Machine-based classification.

Proceedings Article
01 Jan 2008
TL;DR: A new model, Link-PLSA-LDA, is proposed that jointly discovers topics and estimates the topic-specific influence of blogs, and can be used to provide a user with highly influential blog postings on the topic of the user’s interest.
Abstract: In this work, we address the twin problems of unsupervised topic discovery and estimation of topic specific influence of blogs. We propose a new model that can be used to provide a user with highly influential blog postings on the topic of the user’s interest. We adopt the framework of an unsupervised model called Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003), known for its effectiveness in topic discovery. An extension of this model, which we call Link-LDA (Erosheva, Fienberg, & Lafferty 2004), defines a generative model for hyperlinks and thereby models topic specific influence of documents, the problem of our interest. However, this model does not exploit the topical relationship between the documents on either side of a hyperlink, i.e., the notion that documents tend to link to other documents on the same topic. We propose a new model, called Link-PLSA-LDA, that combines PLSA (Hofmann 1999) and LDA (Blei, Ng, & Jordan 2003) into a single framework, and explicitly models the topical relationship between the linking and the linked document. The output of the new model on blog data reveals very interesting visualizations of topics and influential blogs on each topic. We also perform quantitative evaluation of the model using log-likelihood of unseen data and on the task of link prediction. Both experiments show that the new model performs better, suggesting its superiority over Link-LDA in modeling topics and topic specific influence of blogs. Introduction: Proliferation of blogs in the recent past has posed several new, interesting challenges to researchers in the information retrieval and data mining community. In particular, there is an increasing need for automatic techniques to help the users quickly access blogs that are not only informative and popular, but also relevant to the user’s topics of interest. Significant progress has been made in the recent past, towards this objective. For example, Java et al (Java et al. 2006) studied the performance of various algorithms such as PageRank, HITS and in-degree, on modeling influence of blogs. Kale et al (Kale et al. 2006) exploited the polarity (agreement/disagreement) of the hyperlinks and applied a trust propagation algorithm to model the propagation of influence between blogs. The above mentioned papers address modeling influence in general, but it is also important to model influence of blogs with respect to the topic of the user’s interest. This problem has been addressed by the work of Haveliwala (Haveliwala 2002) in the context of key-word search. In this paper, PageRanks of documents are pre-computed for a certain number of topics. At query time, for each document matching the query, its PageRanks for various topics are combined based on the similarity of the query to each topic, to obtain a topic-sensitive PageRank. The author shows that the new PageRank results in superior performance to the traditional PageRank on key-word search. The topics used in the algorithm are, however, obtained from an external repository. Ideally, it would be very useful to mine these topics automatically as well. The problem of automatic topic mining from blogs has been addressed by Glance et al (Natalie S. Glance & Tomokiyo 2006), where the authors used a combination of NLP techniques, clustering and heuristics to mine topics and trends from blogs.
However, this work does not address modeling the influence of blog postings with respect to the topics discovered. In our work, we aim at addressing both these problems simultaneously, i.e., topic discovery as well as modeling topic specific influence of blogs, in a completely unsupervised fashion. Towards this objective, we employ the probabilistic framework of latent topic models such as the Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003), and propose a new model in this framework. The rest of the paper is organized as follows. In section , we discuss some of the past work done on joint models of topics and influence in the framework of latent topic models. We describe our new model in section . In section , we report the results of our experiments on blog data. We conclude the discussion in section with a few remarks on directions for future work. Note that in the rest of the paper, we use the terms ‘citation’ and ‘hyperlink’ interchangeably. Likewise, note that the term ‘citing’ is synonymous to ‘linking’ and so is ‘cited’ to ‘linked’. The reader is also recommended to refer to table 1 for some frequent notation used in this paper.
Notation (table 1):
M: total number of documents
M←: number of cited documents
M→: number of citing documents
V: vocabulary size
K: number of topics
N←: total number of words in the cited set
d: a citing document
d′: a cited document
∆(p): a simplex of dimension (p − 1)
c(d, d′): citation from d to d′
Dir(·|α): Dirichlet distribution with parameter α
Mult(·|β): Multinomial distribution with parameter β
Ld: number of hyperlinks in document d
Nd: number of words in document d
βkw: probability of word w w.r.t. topic k
Ωkd′: probability of a hyperlink to document d′ w.r.t. topic k
πk: probability of topic k in the cited document set

Journal ArticleDOI
TL;DR: The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples and compared to well-established methods such as maximum likelihood estimation with incomplete data and multiple imputations using a saturated log-linear model.
Abstract: We propose using latent class analysis as an alternative to log-linear analysis for the multiple imputation of incomplete categorical data. Similar to log-linear models, latent class models can be used to describe complex association structures between the variables used in the imputation model. However, unlike log-linear models, latent class models can be used to build large imputation models containing more than a few categorical variables. To obtain imputations reflecting uncertainty about the unknown model parameters, we use a nonparametric bootstrap procedure as an alternative to the more common full Bayesian approach. The proposed multiple imputation method, which is implemented in Latent GOLD software for latent class analysis, is illustrated with two examples. In a simulated data example, we compare the new method to well-established methods such as maximum likelihood estimation with incomplete data and multiple imputation using a saturated log-linear model. This example shows that the proposed me...

Journal ArticleDOI
TL;DR: In this article, the authors propose probabilistic latent semantic analysis (pLSA) for non-negative decomposition and the elucidation of interpretable component spectra and abundance maps.
Abstract: Imaging mass spectrometry (IMS) is a promising technology which allows for detailed analysis of spatial distributions of (bio)molecules in organic samples. In many current applications, IMS relies heavily on (semi)automated exploratory data analysis procedures to decompose the data into characteristic component spectra and corresponding abundance maps, visualizing spectral and spatial structure. The most commonly used techniques are principal component analysis (PCA) and independent component analysis (ICA). Both methods operate in an unsupervised manner. However, their decomposition estimates usually feature negative counts and are not amenable to direct physical interpretation. We propose probabilistic latent semantic analysis (pLSA) for non-negative decomposition and the elucidation of interpretable component spectra and abundance maps. We compare this algorithm to PCA, ICA, and non-negative PARAFAC (parallel factors analysis) and show on simulated and real-world data that pLSA and non-negative PARAFAC are superior to PCA or ICA in terms of complementarity of the resulting components and reconstruction accuracy. We further combine pLSA decomposition with a statistical complexity estimation scheme based on the Akaike information criterion (AIC) to automatically estimate the number of components present in a tissue sample data set and show that this results in sensible complexity estimates.
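The complexity-estimation step mentioned above relies on the Akaike information criterion. A small sketch; the pLSA parameter count used here is a common convention and may differ from the exact count in the paper.

```python
def plsa_aic(log_likelihood, n_docs, n_words, n_topics):
    """AIC = 2k - 2 ln L; lower values indicate a better complexity/fit trade-off."""
    # Free parameters of a pLSA model with P(z|d) and P(w|z):
    # (K - 1) per document plus (W - 1) per topic (assumed convention).
    n_params = n_docs * (n_topics - 1) + n_topics * (n_words - 1)
    return 2 * n_params - 2 * log_likelihood
```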

Proceedings ArticleDOI
25 Oct 2008
TL;DR: A conditional loglinear model is presented for string-to-string transduction that employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data, and it is demonstrated that latent variables can dramatically improve results, even when trained on small data sets.
Abstract: String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional loglinear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38--92%.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: A semantic scene segmentation model is proposed to decompose a wide-area scene into regions where behaviours share similar characteristics and are represented as classes of video events bearing similar features, in order to infer global behaviour patterns.
Abstract: We present a novel framework for inferring global behaviour patterns through modelling behaviour correlations in a wide-area scene and detecting any anomaly in behaviours occurring both locally and globally. Specifically, we propose a semantic scene segmentation model to decompose a wide-area scene into regions where behaviours share similar characteristics and are represented as classes of video events bearing similar features. To model behavioural correlations globally, we investigate both a probabilistic Latent Semantic Analysis (pLSA) model and a two-stage hierarchical pLSA model for global behaviour inference and anomaly detection. The proposed framework is validated by experiments using complex crowded outdoor scenes.

Proceedings ArticleDOI
07 Jul 2008
TL;DR: A novel statistical group analysis is presented that highlights relevant patterns of photo-to-group sharing practices in Flickr groups, and a novel topic-based representation model for groups is proposed, computed from aggregated group tags.
Abstract: There is an explosion of community-generated multimedia content available online. In particular, Flickr constitutes a 200-million photo sharing system where users participate following a variety of social motivations and themes. Flickr groups are increasingly used to facilitate the explicit definition of communities sharing common interests, which translates into large amounts of content (e.g. pictures and associated tags) about specific subjects. However, to our knowledge, an in-depth analysis of user behavior in Flickr groups remains open, as does the existence of effective tools to find relevant groups. Using a sample of about 7 million user-photos and about 51000 Flickr groups, we present a novel statistical group analysis that highlights relevant patterns of photo-to-group sharing practices. Furthermore, we propose a novel topic-based representation model for groups, computed from aggregated group tags. Groups are represented as multinomial distributions over semantically meaningful latent topics learned via unsupervised probabilistic topic modeling. We show this representation to be useful for automatically discovering groups of groups and topic expert-groups, for designing new group-search strategies, and for obtaining new insights of the semantic structure of Flickr groups.

Journal ArticleDOI
TL;DR: A novel Bayesian PLSA framework is presented and an incremental PLSA algorithm is constructed to accomplish the parameter estimation as well as the hyperparameter updating, which is capable of performing dynamic document indexing and modeling.
Abstract: Due to the vast growth of data collections, the statistical document modeling has become increasingly important in language processing areas. Probabilistic latent semantic analysis (PLSA) is a popular approach whereby the semantics and statistics can be effectively captured for modeling. However, PLSA is highly sensitive to task domain, which is continuously changing in real-world documents. In this paper, a novel Bayesian PLSA framework is presented. We focus on exploiting the incremental learning algorithm for solving the updating problem of new domain articles. This algorithm is developed to improve document modeling by incrementally extracting up-to-date latent semantic information to match the changing domains at run time. By adequately representing the priors of PLSA parameters using Dirichlet densities, the posterior densities belong to the same distribution so that a reproducible prior/posterior mechanism is activated for incremental learning from constantly accumulated documents. An incremental PLSA algorithm is constructed to accomplish the parameter estimation as well as the hyperparameter updating. Compared to standard PLSA using maximum likelihood estimate, the proposed approach is capable of performing dynamic document indexing and modeling. We also present the maximum a posteriori PLSA for corrective training. Experiments on information retrieval and document categorization demonstrate the superiority of using Bayesian PLSA methods.

Proceedings Article
01 Jan 2008
TL;DR: In this cross-lingual extension of ESA, the cross-language links of Wikipedia are used in order to map the ESA vectors between different languages, thus allowing retrieval across languages.
Abstract: We have participated in the monolingual and bilingual CLEF Ad-Hoc Retrieval Tasks, using a novel extension of the by now well-known Explicit Semantic Analysis (ESA) approach. We call this extension Cross-Language Explicit Semantic Analysis (CL-ESA) as it allows ESA to be applied in a cross-lingual information retrieval setting. In essence, ESA represents documents as vectors in the space of Wikipedia articles, using the tfidf measure to capture how “important” a Wikipedia article is for a specific word. The interesting property of ESA is that arbitrary documents can be represented as a vector with respect to the Wikipedia article space. ESA thus replaces the standard BOW model for retrieval. In our cross-lingual extension of ESA, the cross-language links of Wikipedia are used in order to map the ESA vectors between different languages, thus allowing retrieval across languages. Our results are far behind those of other systems on the monolingual and ad-hoc retrieval tasks, but our motivation was to find out the potential of the CL-ESA approach using a first and unoptimized implementation thereof.

Journal ArticleDOI
TL;DR: An interesting application of SVD to text documents is described, where words of similar meaning get mapped to similar low dimensional locations by taking the top k singular values/vectors.
Abstract: We now describe an interesting application of SVD to text documents. Suppose we represent documents as a bag of words, so Xij is the number of times word j occurs in document i, for j = 1 : W and i = 1 : D, where W is the number of words and D is the number of documents. To find a document that contains a given word, we can use standard search procedures, but these can get confused by synonymy (different words with the same meaning) and polysemy (same word with different meanings). An alternative approach is to assume that X was generated by some low-dimensional latent representation X̂, where K is the number of latent dimensions. If we compare documents in the latent space, we should get improved retrieval performance, because words of similar meaning get mapped to similar low-dimensional locations. We can compute a low-dimensional representation of X by computing the SVD and then taking the top K singular values/vectors.
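A minimal NumPy sketch of the construction just described: truncate the SVD of a small toy count matrix and compare documents by cosine similarity in the latent space (the counts are made up for illustration).

```python
import numpy as np

def lsa_embed(X, K):
    """Low-dimensional document coordinates from the top-K singular values/vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :K] * s[:K]

X = np.array([[2, 0, 1, 0],            # toy counts: 3 documents, 4 words
              [1, 1, 0, 0],
              [0, 0, 3, 2]], dtype=float)
docs_k = lsa_embed(X, K=2)
norms = np.linalg.norm(docs_k, axis=1, keepdims=True)
print((docs_k / norms) @ (docs_k / norms).T)   # cosine similarities in latent space
```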

Proceedings ArticleDOI
24 Aug 2008
TL;DR: A visualization method based on a topic model for discrete data such as documents is proposed; a visualization is obtained by fitting the model to a given set of documents using the EM algorithm, resulting in documents with similar topics being embedded close together.
Abstract: We propose a visualization method based on a topic model for discrete data such as documents. Unlike conventional visualization methods based on pairwise distances such as multi-dimensional scaling, we consider a mapping from the visualization space into the space of documents as a generative process of documents. In the model, both documents and topics are assumed to have latent coordinates in a two- or three-dimensional Euclidean space, or visualization space. The topic proportions of a document are determined by the distances between the document and the topics in the visualization space, and each word is drawn from one of the topics according to its topic proportions. A visualization, i.e. latent coordinates of documents, can be obtained by fitting the model to a given set of documents using the EM algorithm, resulting in documents with similar topics being embedded close together. We demonstrate the effectiveness of the proposed model by visualizing document and movie data sets, and quantitatively compare it with conventional visualization methods.
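The abstract's statement that topic proportions are determined by distances in the visualization space can be illustrated as below, under the assumption of a softmax over negative squared Euclidean distances; the exact parameterisation in the paper may differ, and this sketch omits the EM fit entirely.

```python
import numpy as np

def topic_proportions(doc_coord, topic_coords):
    """Topic proportions of one document from latent 2-D/3-D coordinates:
    a softmax over negative squared distances to each topic (assumed form)."""
    d2 = np.sum((topic_coords - doc_coord) ** 2, axis=1)
    w = np.exp(-0.5 * d2)
    return w / w.sum()

# A document close to topic 0 gets most of its probability mass there.
print(topic_proportions(np.array([0.1, 0.0]),
                        np.array([[0.0, 0.0], [2.0, 2.0], [-3.0, 1.0]])))
```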

Proceedings ArticleDOI
22 Apr 2008
TL;DR: Inspired by the success of the partitioning approach used in database design, a novel clustering semantic algorithm is used to eliminate irrelevant services with respect to a query, and Probabilistic Latent Semantic Analysis (PLSA) is utilized to capture the semantics hidden behind the words in a query and the descriptions in the services, so that service matching can be carried out at the concept level.
Abstract: Efficiently finding Web services on the Web is a challenging issue in service-oriented computing. Currently, UDDI is a standard for publishing and discovery of Web services, and UDDI registries also provide keyword searches for Web services. However, the search functionality is very simple and fails to account for relationships between Web services. Firstly, users are overwhelmed by the huge number of irrelevant returned services. Secondly, the intentions of users and the semantics in Web services are ignored. Inspired by the success of the partitioning approach used in database design, we used a novel clustering semantic algorithm to eliminate irrelevant services with respect to a query. Then we utilized Probabilistic Latent Semantic Analysis (PLSA), a machine learning method, to capture the semantics hidden behind the words in a query, and the descriptions in the services, so that service matching can be carried out at the concept level. This paper reports on a preliminary experimental evaluation that shows improvements in recall and precision.

Journal Article
TL;DR: The results show that the use of learning materials as training data for the grading model outperforms the k-NN-based grading methods and the division of the learning materials in the training data is crucial.
Abstract: Automatic Essay Assessor (AEA) is a system that utilizes information retrieval techniques such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) for automatic essay grading. The system uses learning materials and relatively few teacher-graded essays for calibrating the scoring mechanism before grading. We performed a series of experiments using LSA, PLSA and LDA for document comparisons in AEA. In addition to comparing the methods on a theoretical level, we compared the applicability of LSA, PLSA, and LDA to essay grading with empirical data. The results show that the use of learning materials as training data for the grading model outperforms the k-NN-based grading methods. In addition to this, we found that using LSA yielded slightly more accurate grading than PLSA and LDA. We also found that the division of the learning materials in the training data is crucial. It is better to divide learning materials into sentences than paragraphs.

Journal Article
TL;DR: This paper examines the robustness of several recommendation algorithms that use different model-based techniques: user clustering, feature reduction, and association rules, and considers techniques based on k-means and probabilistic latent semantic analysis (pLSA) that compare the profile of an active user to aggregate user clusters, rather than the original profiles.
Abstract: The open nature of collaborative recommender systems allows attackers who inject biased profile data to have a significant impact on the recommendations produced. Standard memory-based collaborative filtering algorithms, such as k-nearest neighbor, are quite vulnerable to profile injection attacks. Previous work has shown that some model-based techniques are more robust than standard k-nn. Model abstraction can inhibit certain aspects of an attack, providing an algorithmic approach to minimizing attack effectiveness. In this paper, we examine the robustness of several recommendation algorithms that use different model-based techniques: user clustering, feature reduction, and association rules. In particular, we consider techniques based on k-means and probabilistic latent semantic analysis (pLSA) that compare the profile of an active user to aggregate user clusters, rather than the original profiles. We then consider a recommendation algorithm that uses principal component analysis (PCA) to calculate the similarity between user profiles based on reduced dimensions. Finally, we consider a recommendation algorithm based on the data mining technique of association rule mining using the Apriori algorithm. Our results show that all techniques offer large improvements in stability and robustness compared to standard k-nearest neighbor. In particular, the Apriori algorithm performs extremely well against low-knowledge attacks, but at a cost of reduced coverage, and the PCA algorithm performs extremely well against focused attacks. Furthermore, our results show that all techniques can achieve comparable recommendation accuracy to standard k-nn.

Proceedings ArticleDOI
23 Oct 2008
TL;DR: An incremental recommendation algorithm based on Probabilistic Latent Semantic Analysis (PLSA) that can consider not only the users' long-term and short-term interests, but also users' negative and positive feedback is proposed.
Abstract: With the fast development of web 2.0, user-centric publishing and knowledge management platforms, such as Wiki, Blogs, and Q & A systems attract a large number of users. Given the availability of the huge amount of meaningful user generated content, incremental model based recommendation techniques can be employed to improve users' experience using automatic recommendations. In this paper, we propose an incremental recommendation algorithm based on Probabilistic Latent Semantic Analysis (PLSA). The proposed algorithm can consider not only the users' long-term and short-term interests, but also users' negative and positive feedback. We compare the proposed method with several baseline methods using a real-world Question & Answer website called Wenda. Experiments demonstrate both the effectiveness and the efficiency of the proposed methods.

Journal ArticleDOI
TL;DR: The extent to which the low-dimensional semantic spaces built in this paper respect traditional catalogue organization by artist and genre, and how well they generalize to unseen tracks, are investigated.
Abstract: In this paper we describe how to build a variety of information retrieval models for music collections based on social tags. We discuss the particular nature of social tags for music and apply latent semantic dimension reduction methods to co-occurrence counts of words in tags given to individual tracks. We evaluate the performance of various latent semantic models in relation to both previous work and a simple full-rank vector space model based on tags. We investigate the extent to which our low-dimensional semantic spaces respect traditional catalogue organization by artist and genre, and how well they generalize to unseen tracks, and we illustrate some of the concepts expressed by the learned dimensions.

Proceedings ArticleDOI
16 Aug 2008
TL;DR: This work proposes a solution to the challenge of the CoNLL 2008 shared task that uses a generative history-based latent variable model to predict the most likely derivation of a synchronous dependency parser for both syntactic and semantic dependencies.
Abstract: We propose a solution to the challenge of the CoNLL 2008 shared task that uses a generative history-based latent variable model to predict the most likely derivation of a synchronous dependency parser for both syntactic and semantic dependencies. The submitted model yields 79.1% macro-average F1 for the joint task, 86.9% LAS for syntactic dependencies, and 71.0% F1 for semantic dependencies. A larger model trained after the deadline achieves 80.5% macro-average F1, 87.6% syntactic LAS, and 73.1% semantic F1.