
Showing papers on "Probabilistic latent semantic analysis published in 2004"


Proceedings Article
01 Dec 2004
TL;DR: An alternative two-layer model based on exponential family distributions and the semantics of undirected models is proposed, which performs well on document retrieval tasks and provides an elegant solution to searching with keywords.
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these "exponential family harmoniums" is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
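As an illustration of the two-layer undirected idea, here is a minimal numpy sketch of a small binary harmonium (an RBM) trained on binarised bag-of-words vectors with one step of contrastive divergence (CD-1); the layer sizes, sigmoid units, random toy data and learning rate are illustrative assumptions, not the exponential-family choices made in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Toy binarised bag-of-words data: 100 documents over a 50-word vocabulary.
    V = (rng.random((100, 50)) < 0.2).astype(float)

    n_visible, n_hidden, lr = 50, 8, 0.05
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)   # visible (word) biases
    b_h = np.zeros(n_hidden)    # hidden (topic) biases

    for epoch in range(20):
        for v0 in V:
            # Positive phase: posterior over hidden units given the data (one pass).
            p_h0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            # Negative phase: one reconstruction step (CD-1).
            p_v1 = sigmoid(h0 @ W.T + b_v)
            v1 = (rng.random(n_visible) < p_v1).astype(float)
            p_h1 = sigmoid(v1 @ W + b_h)
            # Contrastive-divergence parameter update.
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            b_v += lr * (v0 - v1)
            b_h += lr * (p_h0 - p_h1)

    # Because the model is undirected, the hidden posterior factorises and is
    # computed in one pass; it can serve as a latent semantic code for retrieval.
    doc_codes = sigmoid(V @ W + b_h)
    print(doc_codes.shape)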

520 citations


Patent
14 Apr 2004
TL;DR: In this paper, a trainable semantic vector (TSV) is constructed to represent the significance of the information relative to each of the predetermined categories, and various types of manipulation and analysis such as searching, classification, and clustering can subsequently be performed on a semantic level.
Abstract: An apparatus and method are disclosed for producing a semantic representation of information in a semantic space. The information is first represented in a table that stores values which indicate a relationship with predetermined categories. The categories correspond to dimensions in the semantic space. The significance of the information with respect to the predetermined categories is then determined. A trainable semantic vector (TSV) is constructed to provide a semantic representation of the information. The TSV has dimensions equal to the number of predetermined categories and represents the significance of the information relative to each of the predetermined categories. Various types of manipulation and analysis, such as searching, classification, and clustering, can subsequently be performed on a semantic level.

326 citations


Proceedings ArticleDOI
10 Oct 2004
TL;DR: A new way of modeling multi-modal co-occurrences is proposed, constraining the definition of the latent space to ensure its consistency in semantic terms (words), while retaining the ability to jointly model visual information.
Abstract: We address the problem of unsupervised image auto-annotation with probabilistic latent space models. Unlike most previous works, which build latent space representations assuming equal relevance for the text and visual modalities, we propose a new way of modeling multi-modal co-occurrences, constraining the definition of the latent space to ensure its consistency in semantic terms (words), while retaining the ability to jointly model visual information. The concept is implemented by a linked pair of Probabilistic Latent Semantic Analysis (PLSA) models. On a 16000-image collection, we show with extensive experiments that our approach significantly outperforms previous joint models.

258 citations


Journal ArticleDOI
TL;DR: Hierarchical latent class models are proposed as a framework in which the local dependence problem can be addressed in a principled manner, and a search-based algorithm for learning such models from data is developed.
Abstract: Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often untrue. In this paper we propose hierarchical latent class models as a framework where the local dependence problem can be addressed in a principled manner. We develop a search-based algorithm for learning hierarchical latent class models from data. The algorithm is evaluated using both synthetic and real-world data.

235 citations


Proceedings ArticleDOI
22 Aug 2004
TL;DR: A unified framework for the discovery and analysis of Web navigational patterns based on Probabilistic Latent Semantic Analysis is developed and the flexibility of this framework is shown in characterizing various relationships among users and Web objects.
Abstract: The primary goal of Web usage mining is the discovery of patterns in the navigational behavior of Web users. Standard approaches, such as clustering of user sessions and discovering association rules or frequent navigational paths, do not generally provide the ability to automatically characterize or quantify the unobservable factors that lead to common navigational patterns. It is, therefore, necessary to develop techniques that can automatically discover hidden semantic relationships among users as well as between users and Web objects. Probabilistic Latent Semantic Analysis (PLSA) is particularly useful in this context, since it can uncover latent semantic associations among users and pages based on the co-occurrence patterns of these pages in user sessions. In this paper, we develop a unified framework for the discovery and analysis of Web navigational patterns based on PLSA. We show the flexibility of this framework in characterizing various relationships among users and Web objects. Since these relationships are measured in terms of probabilities, we are able to use probabilistic inference to perform a variety of analysis tasks such as user segmentation, page classification, as well as predictive tasks such as collaborative recommendations. We demonstrate the effectiveness of our approach through experiments performed on real-world data sets.
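The core machinery behind this framework is the PLSA EM iteration on a co-occurrence matrix; the following numpy sketch runs it on a random session-by-page count matrix, where the matrix, the number of latent factors and the iteration count are illustrative placeholders rather than the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)
    N = rng.integers(0, 5, size=(30, 20)).astype(float)   # session-by-page counts
    n_z = 4                                                # latent factors

    # Random initialisation of p(z|session) and p(page|z).
    p_z_s = rng.random((30, n_z)); p_z_s /= p_z_s.sum(1, keepdims=True)
    p_p_z = rng.random((n_z, 20)); p_p_z /= p_p_z.sum(1, keepdims=True)

    for _ in range(100):
        # E-step: posterior p(z | session, page) for every cell.
        joint = p_z_s[:, :, None] * p_p_z[None, :, :]          # (session, z, page)
        post = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate the multinomials from expected counts.
        expected = N[:, None, :] * post
        p_z_s = expected.sum(2); p_z_s /= p_z_s.sum(1, keepdims=True)
        p_p_z = expected.sum(0); p_p_z /= p_p_z.sum(1, keepdims=True)

    # p(z|session) supports user segmentation; p(page|z) ranks pages per hidden task.
    print(p_z_s.argmax(axis=1))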

165 citations


Journal ArticleDOI
TL;DR: It is suggested that the use of a high dimensional dynamic viewer with an effective projection pursuit routine and user control, coupled with the exquisite abilities of the human visual system, can often succeed in discovering multiple revealing views that are missed by current computational algorithms.
Abstract: Most techniques for relating textual information rely on intellectually created links such as author-chosen keywords and titles, authority indexing terms, or bibliographic citations. Similarity of the semantic content of whole documents, rather than just titles, abstracts, or overlap of keywords, offers an attractive alternative. Latent semantic analysis provides an effective dimension reduction method for the purpose that reflects synonymy and the sense of arbitrary word combinations. However, latent semantic analysis correlations with human text-to-text similarity judgments are often empirically highest at ≈300 dimensions. Thus, two- or three-dimensional visualizations are severely limited in what they can show, and the first and/or second automatically discovered principal component, or any three such for that matter, rarely capture all of the relations that might be of interest. It is our conjecture that linguistic meaning is intrinsically and irreducibly very high dimensional. Thus, some method to explore a high dimensional similarity space is needed. But the 2.7 × 10^7 projections and infinite rotations of, for example, a 300-dimensional pattern are impossible to examine. We suggest, however, that the use of a high dimensional dynamic viewer with an effective projection pursuit routine and user control, coupled with the exquisite abilities of the human visual system to extract information about objects and from moving patterns, can often succeed in discovering multiple revealing views that are missed by current computational algorithms. We show some examples of the use of latent semantic analysis to support such visualizations and offer views on future needs.
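For readers who want to see the pipeline the article builds on, here is a small sklearn sketch: a truncated SVD gives the LSA document space (the article argues that roughly 300 dimensions works best on real corpora; the toy corpus below only supports a handful), and any 2-D view is just one of many possible projections of that space, which is what a projection-pursuit viewer would search over.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["human interface computer",
            "survey user computer system response time",
            "graph minors trees",
            "graph minors survey"]                    # toy corpus, placeholder only
    X = TfidfVectorizer().fit_transform(docs)

    k = 3                                             # ~300 on a realistic collection
    D = TruncatedSVD(n_components=k, random_state=0).fit_transform(X)

    # One random orthonormal 2-D projection of the k-dimensional LSA space.
    rng = np.random.default_rng(0)
    P, _ = np.linalg.qr(rng.standard_normal((k, 2)))
    print(D @ P)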

146 citations


Proceedings ArticleDOI
07 Jul 2004
TL;DR: Methods for analysis of principal components in discrete data are shown to be interpretable as a discrete version of ICA, and a hierarchical version yielding components at different levels of detail is developed.
Abstract: Methods for analysis of principal components in discrete data have existed for some time under various names such as grade of membership modelling, probabilistic latent semantic analysis, and genotype inference with admixture. In this paper we explore a number of extensions to the common theory, and present some application of these methods to some common statistical tasks. We show that these methods can be interpreted as a discrete version of ICA. We develop a hierarchical version yielding components at different levels of detail, and additional techniques for Gibbs sampling. We compare the algorithms on a text prediction task using support vector machines, and to information retrieval.

138 citations


22 Jul 2004
TL;DR: A domain-general framework for learning abstract relational knowledge is presented; it goes beyond previous category-learning models in psychology and is applied in two specific domains: learning the structure of kinship systems and learning causal theories.
Abstract: We present a framework for learning abstract relational knowledge, with the aim of explaining how people acquire intuitive theories of physical, biological, or social systems. Our algorithm infers a generative relational model with latent classes, simultaneously determining the kinds of entities that exist in a domain, the number of these latent classes, and the relations between classes that are possible or likely. This model goes beyond previous category-learning models in psychology, which consider the attributes associated with individual categories but not the relationships that can exist between categories. We apply this domain-general framework in two specific domains: learning the structure of kinship systems and learning causal theories.

105 citations


Journal ArticleDOI
TL;DR: Semantic distance as derived from WordNet appears distinct from other measures of word pair relatedness and is psychologically functional.
Abstract: WordNet, an electronic dictionary (or lexical database), is a valuable resource for computational and cognitive scientists. Recent work on the computing of semantic distances among nodes (synsets) in WordNet has made it possible to build a large database of semantic distances for use in selecting word pairs for psychological research. The database now contains nearly 50,000 pairs of words that have values for semantic distance, associative strength, and similarity based on co-occurrence. Semantic distance was found to correlate weakly with these other measures but to correlate more strongly with another measure of semantic relatedness, featural similarity. Hierarchical clustering analysis suggested that the knowledge structure underlying semantic distance is similar in gross form to that underlying featural similarity. In experiments in which semantic similarity ratings were used, human participants were able to discriminate semantic distance. Thus, semantic distance as derived from WordNet appears distinct from other measures of word pair relatedness and is psychologically functional. This database may be downloaded from www.psychonomic.org/archive/.
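A path-based WordNet measure of the general kind stored in this database can be computed with NLTK; the use of path_similarity below (and its conversion back to an edge-count distance) is an assumption of this sketch and not necessarily the exact measure used to build the database.

    from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

    def min_path_distance(word1, word2):
        # Smallest path distance between any pair of senses of the two words.
        best = None
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                sim = s1.path_similarity(s2)    # 1 / (1 + shortest path), or None
                if sim is not None:
                    dist = 1.0 / sim - 1.0      # back to an edge-count distance
                    best = dist if best is None else min(best, dist)
        return best

    print(min_path_distance("car", "automobile"))   # 0.0: the words share a synset
    print(min_path_distance("car", "dog"))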

72 citations


Proceedings ArticleDOI
21 Jul 2004
TL;DR: This paper applied Feature Latent Semantic Analysis (FLSA) to dialogue act classification with excellent results on three corpora: CallHome Spanish, MapTask, and their own corpus of tutoring dialogues.
Abstract: We discuss Feature Latent Semantic Analysis (FLSA), an extension to Latent Semantic Analysis (LSA). LSA is a statistical method that is ordinarily trained on words only; FLSA adds to LSA the richness of the many other linguistic features that a corpus may be labeled with. We applied FLSA to dialogue act classification with excellent results. We report results on three corpora: CallHome Spanish, MapTask, and our own corpus of tutoring dialogues.
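The FLSA idea of augmenting the word-by-document matrix with rows for additional linguistic features can be sketched as follows; the corpus, the hypothetical per-utterance speaker feature and the dimensionality are assumptions for illustration only.

    import numpy as np
    from scipy.sparse import vstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    utterances = ["hello how are you", "fine thanks and you",
                  "where is the library", "turn left at the bridge"]
    speakers = ["caller", "agent", "caller", "agent"]   # hypothetical extra feature

    vec = CountVectorizer()
    W = vec.fit_transform(utterances).T                 # words x documents

    # One extra row per feature value, marking the documents it occurs in.
    feat_values = sorted(set(speakers))
    F = csr_matrix(np.array([[1.0 if s == f else 0.0 for s in speakers]
                             for f in feat_values]))    # features x documents

    X = vstack([W, F])                                  # (words + features) x documents
    doc_space = TruncatedSVD(n_components=2, random_state=0).fit_transform(X.T)
    print(doc_space)                                    # documents in the FLSA space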

62 citations


Proceedings ArticleDOI
27 Jun 2004
TL;DR: This work addresses content-based image retrieval (CBIR), focusing on a hidden semantic concept discovery methodology for effective semantics-intensive image retrieval.
Abstract: This work addresses content-based image retrieval (CBIR), focusing on developing a hidden semantic concept discovery methodology for effective semantics-intensive image retrieval. In our approach, each image in the database is segmented into regions associated with homogeneous color, texture, and shape features. By exploiting regional statistical information in each image and employing a vector quantization method, a uniform and sparse region-based representation is achieved. With this representation, a probabilistic model based on a statistical-hidden-class assumption about the image database is obtained, to which the expectation-maximization (EM) technique is applied to analyze the semantic concepts hidden in the database. An elaborated retrieval algorithm is designed to support the probabilistic model. The semantic similarity is measured by integrating the posterior probabilities of the transformed query image, as well as a constructed negative example, with respect to the discovered semantic concepts. The proposed approach has a solid statistical foundation, and experimental evaluations on a database of 10,000 general-purpose images demonstrate its effectiveness.
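The vector-quantisation step can be illustrated with k-means: region feature vectors from all images are mapped to a shared codebook, so each image becomes a sparse histogram over region types on which the hidden-concept EM model then operates; the random features, feature dimensionality and codebook size below are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical colour/texture/shape features for the regions of three images.
    regions_per_image = [rng.standard_normal((5, 8)),
                         rng.standard_normal((3, 8)),
                         rng.standard_normal((7, 8))]

    # Shared codebook built from all regions in the database.
    codebook = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
        np.vstack(regions_per_image))

    def image_histogram(regions):
        # Sparse, uniform region-based representation of one image.
        labels = codebook.predict(regions)
        return np.bincount(labels, minlength=codebook.n_clusters)

    for regions in regions_per_image:
        print(image_histogram(regions))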

Proceedings ArticleDOI
15 Oct 2004
TL;DR: This paper presents a novel approach for automatically clustering FACS codes into meaningful categories, and shows that the newly derived codes constitute a competitive alternative to both basic emotion categories and FACS codes.
Abstract: For supervised training of automatic facial expression recognition systems, adequate ground truth labels that describe relevant facial expression categories are necessary. One possibility is to label facial expressions with emotion categories. Another approach is to label facial expressions independently of any interpretation attempts. This can be achieved via the Facial Action Coding System (FACS). In this paper we present a novel approach that automatically clusters FACS codes into meaningful categories. Our approach exploits the fact that FACS codes can be seen as documents containing terms, the action units (AUs) present in the codes, and so text modeling methods that capture co-occurrence information in low-dimensional spaces can be used. The FACS-code-derived descriptions are computed by Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). We show that, as a high-level description of facial actions, the newly derived codes constitute a competitive alternative to both basic emotion categories and FACS codes. We have used them to train different types of artificial neural networks.

Proceedings ArticleDOI
01 Nov 2004
TL;DR: This paper proposes supervised LSI (SLSI), which selects the most discriminative basis vectors using the training data iteratively and projects the documents into a reduced-dimensional space for better classification.
Abstract: Latent semantic indexing (LSI) is a successful technology in information retrieval (IR) which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. However, LSI is not optimal for document categorization tasks because it aims to find the most representative features for document representation rather than the most discriminative ones. In this paper, we propose supervised LSI (SLSI) which selects the most discriminative basis vectors using the training data iteratively. The extracted vectors are then used to project the documents into a reduced dimensional space for better classification. Experimental evaluations show that the SLSI approach leads to dramatic dimension reduction while achieving good classification results.
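The underlying idea can be sketched by computing more LSI dimensions than needed and keeping those that best separate the training classes; the simple one-pass Fisher-style score below is a stand-in for the paper's iterative selection procedure, and the corpus and cut-off are toy assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["stock market prices fall", "shares and bonds rally",
            "team wins the match", "player scores a goal",
            "market rally lifts shares", "coach praises the team"]
    labels = np.array([0, 0, 1, 1, 0, 1])

    X = TfidfVectorizer().fit_transform(docs)
    Z = TruncatedSVD(n_components=4, random_state=0).fit_transform(X)  # candidate LSI dims

    def fisher_score(col, y):
        # Between-class separation over within-class spread for one dimension.
        m0, m1 = col[y == 0].mean(), col[y == 1].mean()
        v0, v1 = col[y == 0].var() + 1e-12, col[y == 1].var() + 1e-12
        return (m0 - m1) ** 2 / (v0 + v1)

    scores = np.array([fisher_score(Z[:, j], labels) for j in range(Z.shape[1])])
    keep = np.argsort(scores)[::-1][:2]        # retain the most discriminative dimensions
    Z_reduced = Z[:, keep]                     # representation used for classification
    print("selected dimensions:", keep)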

01 Jan 2004
TL;DR: A unified framework based on Probabilistic Latent Semantic Analysis is proposed to create models of Web users, taking into account both the navigational usage data and the Web site content information, and which can more accurately capture users’ access patterns and generate more effective recommendations.
Abstract: Web usage mining techniques, such as clustering of user sessions, are often used to identify Web user access patterns. However, to understand the factors that lead to common navigational patterns, it is necessary to develop techniques that can automatically characterize users’ navigational tasks and intentions. Such a characterization must be based both on the common usage patterns, as well as on common semantic information associated with the visited Web resources. The integration of semantic content and usage patterns allows the system to make inferences based on the underlying reasons for which a user may or may not be interested in particular items. In this paper, we propose a unified framework based on Probabilistic Latent Semantic Analysis to create models of Web users, taking into account both the navigational usage data and the Web site content information. Our joint probabilistic model is based on a set of discovered latent factors that “explain” the underlying relationships among pageviews in terms of their common usage and their semantic relationships. Based on the discovered user models, we propose algorithms for characterizing Web user segments and to provide dynamic and personalized recommendations based on these segments. Our experiments, performed on real usage data, show that this approach can more accurately capture users’ access patterns and generate more effective recommendations, when compared to more traditional methods based on clustering.

Journal ArticleDOI
TL;DR: A violation of the naive Bayes model is interpreted as an indication of the presence of latent variables, and it is shown how latent variables can be detected.

Journal ArticleDOI
TL;DR: This work proposes to apply a feature ordering method based on support vector machines in order to select LSI features that are suited for classification, and suggests that the method improves classification performance with a considerably more compact representation.
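One way to realise such an ordering, sketched under the assumption that the ranking criterion is the magnitude of linear-SVM weights on LSI features, is shown below; the corpus, the number of LSI dimensions and the cut-off are placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.svm import LinearSVC

    docs = ["interest rates rise again", "central bank cuts rates",
            "striker scores twice", "late goal wins the cup",
            "bank shares climb", "fans cheer the goal"]
    y = np.array([0, 0, 1, 1, 0, 1])

    Z = TruncatedSVD(n_components=4, random_state=0).fit_transform(
        TfidfVectorizer().fit_transform(docs))          # LSI features

    svm = LinearSVC(C=1.0).fit(Z, y)
    order = np.argsort(np.abs(svm.coef_[0]))[::-1]      # LSI features ranked by SVM weight
    Z_compact = Z[:, order[:2]]                          # keep the top-ranked features
    print("feature order:", order)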

Proceedings ArticleDOI
25 Jul 2004
TL;DR: A new way to implement the LSI methodology, based on polynomial filtering, is discussed, which does not rely on any matrix decomposition and therefore its computational cost and storage requirements are low relative to traditional implementations of LSI.
Abstract: Latent Semantic Indexing (LSI) is a well established and effective framework for conceptual information retrieval. In traditional implementations of LSI the semantic structure of the collection is projected into the k-dimensional space derived from a rank-k approximation of the original term-by-document matrix. This paper discusses a new way to implement the LSI methodology, based on polynomial filtering. The new framework does not rely on any matrix decomposition and therefore its computational cost and storage requirements are low relative to traditional implementations of LSI. Additionally, it can be used as an effective information filtering technique when updating LSI models based on user feedback.
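A very crude sketch of the polynomial-filtering idea is given below: applying a low-degree polynomial in A·Aᵀ to the query vector emphasises the dominant singular directions without ever factorising the term-by-document matrix A. The plain-power filter used here is only a toy stand-in; the paper's method uses carefully designed polynomials approximating a step function on the spectrum.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((200, 50))                    # term-by-document matrix (toy)
    q = np.zeros(200); q[[3, 17, 42]] = 1.0      # query as a term-indicator vector

    def filtered_scores(A, q, degree=3):
        # Score documents with A^T phi(A A^T) q, where phi(t) = (t / t_max)^degree.
        # Powers boost directions with large singular values, mimicking a rank-k
        # projection without computing any decomposition of A.
        t_max = np.linalg.norm(A, 2) ** 2        # largest eigenvalue of A A^T
        v = q.copy()
        for _ in range(degree):
            v = A @ (A.T @ v) / t_max            # one application of (A A^T) / t_max
        return A.T @ v                           # similarity of each document to the query

    print(np.argsort(filtered_scores(A, q))[::-1][:5])   # top-5 documents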

Journal ArticleDOI
TL;DR: The notion of lexical triggers is extended to the cross-lingual problem, permitting the construction of sharper language models for a target-language document by drawing statistics from related documents in a resource-rich language.
Abstract: In-domain texts for estimating statistical language models are not easily found for most languages of the world. We present two techniques to take advantage of in-domain text resources in other languages. First, we extend the notion of lexical triggers, which have been used monolingually for language model adaptation, to the cross-lingual problem, permitting the construction of sharper language models for a target-language document by drawing statistics from related documents in a resource-rich language. Next, we show that cross-lingual latent semantic analysis is similarly capable of extracting useful statistics for language modeling. Neither technique requires explicit translation capabilities between the two languages! We demonstrate significant reductions in both perplexity and word error rate on a Mandarin speech recognition task by using these techniques.

01 Jan 2004
TL;DR: In this article, a nonparametric Bayesian treatment for analyzing records containing occurrences of items is described, which retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records.
Abstract: This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: The paper introduces the use of confidence scores to weight words in the history, a weighting of the prior topic distribution, and a way of calculating perplexity that accounts for recognition errors in the model context.
Abstract: This paper describes experiments with a PLSA-based language model for conversational telephone speech. This model uses a long-range history and exploits topic information in the test text to adjust probabilities of test words. The PLSA-based model was found to lower test set perplexity over a traditional word+class-based 4-gram by 13% (optimistic estimate using a reference transcript as history) or by 6% (realistic estimate using recognised transcript as history). Moreover, this paper introduces a use of confidence scores to weight words in the history, a weight of the prior topic distribution and a way of calculating perplexity that accounts for recognition errors in the model context.
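Schematically, such a model adjusts word probabilities by interpolating a baseline n-gram estimate with a topic-conditioned unigram whose topic mixture is inferred from a confidence-weighted history; everything in the sketch below (vocabulary, parameters, interpolation weight) is an invented placeholder rather than the paper's trained model.

    import numpy as np

    # Hypothetical trained PLSA parameters: p(w|z) over a tiny vocabulary, 2 topics.
    vocab = {"flight": 0, "hotel": 1, "goal": 2, "match": 3}
    p_w_z = np.array([[0.45, 0.45, 0.05, 0.05],    # topic 0: travel
                      [0.05, 0.05, 0.45, 0.45]])   # topic 1: sports
    p_z_prior = np.array([0.5, 0.5])

    def topic_posterior(history, confidences, prior_weight=1.0):
        # p(z|history), with each history word weighted by its recognition confidence.
        log_p = prior_weight * np.log(p_z_prior)
        for w, c in zip(history, confidences):
            log_p += c * np.log(p_w_z[:, vocab[w]])
        p = np.exp(log_p - log_p.max())
        return p / p.sum()

    def interpolated_prob(word, p_ngram, history, confidences, lam=0.8):
        # Interpolate the baseline n-gram probability with the PLSA topic mixture.
        p_z = topic_posterior(history, confidences)
        return lam * p_ngram + (1.0 - lam) * float(p_z @ p_w_z[:, vocab[word]])

    history, conf = ["flight", "hotel", "goal"], [0.9, 0.8, 0.2]   # "goal" poorly recognised
    print(interpolated_prob("hotel", p_ngram=0.01, history=history, confidences=conf))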

Patent
23 Nov 2004
TL;DR: One aspect of the invention is a method for efficiently and incrementally adding new terms to an already trained probabilistic latent semantic analysis (PLSA) model.
Abstract: One aspect of the invention is that of efficiently and incrementally adding new terms to an already trained probabilistic latent semantic analysis (PLSA) model.
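One simple way to fold new terms into an already trained PLSA model is to freeze the document-topic mixtures p(z|d) and run EM only over the word distributions, so the extended p(w|z) re-normalises to give the new terms probability mass; the numpy sketch below illustrates that general idea and is not the procedure claimed in the patent.

    import numpy as np

    rng = np.random.default_rng(0)

    # Already trained model: 3 topics over an old vocabulary of 6 terms, 10 documents.
    p_w_z = rng.dirichlet(np.ones(6), size=3)        # p(w|z), rows sum to 1
    p_z_d = rng.dirichlet(np.ones(3), size=10)       # p(z|d), held fixed below

    # New counts include two previously unseen terms (the last two columns).
    counts = rng.integers(0, 4, size=(10, 8)).astype(float)

    # Extend p(w|z) with a little initial mass for the new terms, then re-normalise.
    p_w_z = np.hstack([p_w_z, np.full((3, 2), 1e-3)])
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(50):
        # E-step: p(z|d,w) using the frozen document-topic mixtures.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # (d, z, w)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: update only p(w|z); p(z|d) stays fixed.
        expected = (counts[:, None, :] * post).sum(axis=0)
        p_w_z = expected / expected.sum(axis=1, keepdims=True)

    print(p_w_z[:, -2:])   # probability mass now assigned to the two new terms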

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Independent component analysis applied to word context data gives distinct features which reflect syntactic and semantic categories, and these can be obtained without any human supervision or tagged corpora that would provide predetermined morphological, syntactic or semantic information.
Abstract: Our aim is to find syntactic and semantic relationships of words based on the analysis of corpora. We propose the application of independent component analysis, which seems to have clear advantages over two classic methods: latent semantic analysis and self-organizing maps. Latent semantic analysis is a simple method for automatic generation of concepts that are useful, e.g., in encoding documents for information retrieval purposes. However, these concepts cannot easily be interpreted by humans. Self-organizing maps can be used to generate an explicit diagram which characterizes the relationships between words. The resulting map reflects syntactic categories in the overall organization and semantic categories in the local level. The self-organizing map does not, however, provide any explicit distinct categories for the words. Independent component analysis applied on word context data gives distinct features which reflect syntactic and semantic categories. Thus, independent component analysis gives features or categories that are both explicit and can easily be interpreted by humans. This result can be obtained without any human supervision or tagged corpora that would have some predetermined morphological, syntactic or semantic information.
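The pipeline can be sketched with scikit-learn: build a word-by-context count matrix and run FastICA, then inspect the words that load most strongly on each independent component as candidate syntactic/semantic categories; the toy sentences and component count are illustrative assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import FastICA

    sentences = ["the cat sat on the mat", "a dog sat on a rug",
                 "she reads a long book", "he reads the short letter",
                 "the dog chased the cat"]

    # Word-by-context data: here simply a word-by-sentence count matrix.
    vec = CountVectorizer()
    X = vec.fit_transform(sentences).T.toarray().astype(float)   # words x contexts
    words = vec.get_feature_names_out()

    S = FastICA(n_components=3, random_state=0, max_iter=1000).fit_transform(X)

    for k in range(S.shape[1]):
        top = np.argsort(np.abs(S[:, k]))[::-1][:4]
        print(f"component {k}:", [words[i] for i in top])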

Proceedings Article
01 Dec 2004
TL;DR: Unlike previous methods, the latent distribution is estimated non-parametrically, which enables the modelling of data generated by an underlying low-dimensional, multimodal distribution and makes the method suitable for special data types, for example binary or count data.
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples.

Journal Article
TL;DR: A comparison of two information retrieval techniques: latent semantic indexing and concept indexing.
Abstract: Comparison of Information Retrieval Techniques: Latent Semantic Indexing and Concept Indexing.

Journal ArticleDOI
TL;DR: In this paper, the authors prove a theorem that relates the effective dimension of a hierarchical latent class (HLC) model to the effective dimensions of a number of latent class models, which makes it computationally feasible to compute the effective dimensions of large HLC models.
Abstract: Hierarchical latent class (HLC) models are tree-structured Bayesian networks where leaf nodes are observed while internal nodes are latent. There are no theoretically well justified model selection criteria for HLC models in particular and Bayesian networks with latent nodes in general. Nonetheless, empirical studies suggest that the BIC score is a reasonable criterion to use in practice for learning HLC models. Empirical studies also suggest that sometimes model selection can be improved if standard model dimension is replaced with effective model dimension in the penalty term of the BIC score. Effective dimensions are difficult to compute. In this paper, we prove a theorem that relates the effective dimension of an HLC model to the effective dimensions of a number of latent class models. The theorem makes it computationally feasible to compute the effective dimensions of large HLC models. The theorem can also be used to compute the effective dimensions of general tree models.

01 Jan 2004
TL;DR: The paper develops an algorithm that hierarchically groups together words that conceptually belong together, leading to a hierarchical extension of Bayesian probabilistic latent semantic indexing.
Abstract: An ontology is a specification of a conceptualization, a shared understanding of some domain of interest. The paper develops an algorithm that hierarchically groups together words which conceptually belong together. We assume conceptual similarity if words often appear in the same context. This leads to a hierarchical extension of Bayesian probabilistic latent semantic indexing. We derive the update formulae of the algorithm and present some preliminary examples of derived classes.

1 Introduction
An ontology is a specification of a conceptualization [Gru93], a shared understanding of some domain of interest. Sowa [Sow03] distinguishes different types of ontologies: a formal ontology is specified by a collection of names for concept and relation types organized in a partial ordering by the type-subtype relation, which can be evaluated by automatic logical reasoning. A prototype-based ontology distinguishes subtypes by a comparison with a typical member or prototype for each subtype. A new entity is assigned to the category whose prototype has the smallest semantic distance. Large ontologies often use a mixture of definitional methods: formal axioms and definitions are used for the terms in mathematics, physics and engineering. Real world entities, like plants, animals, or common household items, cannot be assigned to given categories with certainty and have to be categorized based on similarities. Research on supervised text categorization has shown [Seb02] that we can distinguish categories of entities by evaluating similarity measures of texts describing these entities. In news categorization we are able to assign 800 different categories with a "reliability" (F-value) of up to 80-90%. Hence if we have a text describing a prototype of an ontology, a text describing a new entity can be identified with this reliability and the entity can be assigned to the category of this ontology. For a newly evolving domain often no concept hierarchies exist. We either have to generate concepts by hand, which is costly and error-prone, or we can use machine learning methods to fulfill this task. In this paper we describe a text clustering approach which is not only able to extract new concepts from a text collection in an automatic fashion, but also generates a hierarchy of concepts. In contrast to other approaches the objects of clustering are words, not documents. This allows different parts of a document to belong to different concepts. An ontology concept is no longer represented by a single prototype, but by a probabilistic model, which inherently defines the similarity of a word or phrase to a cluster by some probability. The approach uses a probabilistic model relating words to documents and concepts. It extends probabilistic latent semantic analysis, which was developed by Hofmann [Hof01]. It employs Bayesian prior distributions to regularize the solution and avoid overfitting (cf. [dB01]). In addition this allows us to introduce prior information, like existing concepts and their relations, into the process.

2 Probabilistic Latent Semantic Analysis
Search engines, as known from the web, match the words of a query to the words of web documents. This approach misses documents which contain synonyms or slightly different formulations. Latent Semantic Analysis (LSA) [DDF90] is an approach to automatic indexing and information retrieval that attempts to overcome the deficiencies of term matching. It starts with the bag-of-words representation of documents, which describes a document with the counts of words in it, and applies a dimension-reducing linear projection. This mapping is determined by the Singular Value Decomposition, generating for each document a representation in terms of relatively few "latent semantic factors". If a group of words or terms often occurs together, it is represented in one or more of these latent factors. Hence it is possible to determine the similarity of documents in this space even if they have no words in common. As, however, the model implicitly assumes some Gaussian noise, the underlying statistical model is unsuitable for count data [Hof01]. At the same time Cheeseman [CSK88] proposed Autoclass, a generative mixture model for the attributes of an entity which respects the statistical properties of count data. He assumes that each entity belongs to an underlying class z = i_z and that the attributes (words) are independent and follow a class-conditional distribution p(w | z = i_z). The algorithm estimates these p(w | z = i_z) from data as well as a probability distribution of classes p(z | d_id) for each document, which defines a soft clustering for the documents. It was extended to the Bayesian framework by [CS96]. Probabilistic Latent Semantic Analysis (PLSA) [Hof01] transfers this approach to the single words of a document. It assumes the existence of a latent variable z with n_z different values and specifies the following generative model for the words w in a document d_id, id = 1, ..., n_d:

p(w | d_id) = Σ_{i_z = 1}^{n_z} p(w | z = i_z) p(z = i_z | d_id)

Therefore the probability that a word occurs in document d_id is a mixture distribution of the p(w | z = i_z). The n_w(id) different words of document id are generated independently. In contrast to Autoclass and LSA, a document may belong to several clusters. The probabilities p(w | z = i_z) and p(z = i_z | d_id) may be estimated by the EM algorithm in the maximum likelihood framework. Although this model is a vast simplification, it leads to meaningful clusters, as words that often occur together have a high probability p(w | z = i_z) with respect to some factor. Hofmann has shown experimentally [Hof01] that PLSA achieves a higher perplexity reduction than LSA. Freitas and Barnard [dB01] extended the model to the Bayesian framework and to the combination of features from other modalities like images. A similar Information Bottleneck model has been proposed by [ST01].

3 Hierarchical Latent Semantic Analysis
Many agglomerative and divisive clustering algorithms developed in statistics have been used to cluster words and terms in a hierarchical manner [DMK02]. These procedures, however, often arrive at suboptimal clusterings as they merge or divide clusters in a greedy fashion. Model-based hierarchical document clustering was proposed by several authors, e.g. for Autoclass [CS96], latent semantic analysis [TCPH01], and in the Bayesian framework [SV00]. But for the sake of ontology learning we want to cluster words and phrases, not documents. In this paper we present a novel Bayesian approach modeling word occurrence by probabilistic hierarchical latent semantic classes. We do not assume a fixed number of classes and a fixed hierarchy, but allow the algorithm to select an appropriate topology. A latent class i_za at the lowest level is assumed to correspond to a specific context: it is described by a probability distribution p(w | z = i_za) of words, which gives the highest probability mass to words which often occur together in that context. If i_zb is a different class at the lowest level, there are some words which simultaneously occur often in both contexts. These words are collected in a higher class i_zh with the associated word distribution p(w | z = i_zh). The distribution of words in context a is given by the mixture λ_a p(w | z = i_za) + (1 − λ_a) p(w | z = i_zh), and in context b by λ_b p(w | z = i_zb) + (1 − λ_b) p(w | z = i_zh), where λ_a and λ_b are some real-valued mixture coefficients in (0, 1). The latent class i_zh is called a parent of z_a and z_b. We extend this approach by first allowing several levels of parents, such that a higher-level context may be a subconcept of some other context. In addition we consider multiple parents for one concept. We only require that the resulting graph has the structure of a directed acyclic graph (DAG). By pruning away less important parents we may arrive at a proper tree-structured hierarchy. The following section describes the model in more detail.

4 Hierarchical Model with Multiple Parents
We assume that there is a latent variable Z with possible values 1, ..., n_z, which determines the lower-level latent class. Furthermore there are n_u ≥ n_z random latent indicator variables U_1, ..., U_nu. Every i_u < n_u has a non-empty set of one or more parents Pa(i_u) ⊆ {max(i_u + 1, n_z), ..., n_u}. The ancestors of i_u are collected in a set An(i_u) = Pa(i_u) ∪ An(Pa(i_u)). n_u is the root element, contained in the ancestors of all other i_u, and itself has no parent, i.e. Pa(n_u) = ∅, An(n_u) = ∅. A path of length k from i_u1 to i_uk > i_u1 is a sequence (i_u1 < i_u2 < ... < i_uk) such that i_u(j+1) ∈ Pa(i_uj). Let Pth(i_u1, i_uk, k) be the set of all such paths. Note that there may be paths of different lengths between i_u1 and i_uk. The words of the corpus are generated sequentially in the following way:
1. d_id(m) is the document containing the m-th word W_m.
2. Select lower-level class: a value i_z ∈ {1, ..., n_z} for the latent variable Z_m is randomly selected according to the distribution p(Z_m = i_z | d_id). We set i_u = i_z and the random binary indicator variable U_m,iu is set to 1.
3. Decide if word or parent: a value for the binary random indicator variable S_m,iu with values s_m,iu ∈ {0, 1} is i.i.d. selected according to p(S_m,iu | U_m,iu).
4. Select word: if s_m,iu = 1 then a word W_m ~ p(W_m | u_iu) is i.i.d. randomly selected and stored. We set m = m + 1 and generate the next word according to step 1.
5. Select parent: if s_m,iu = 0, a value for the "parent" latent variable V_m,iu with values v_m,iu ∈ Pa(i_u) is i.i.d. randomly selected according to the distribution p_Pa(i_v | i_u) := p_Pa(V_m,iu = i_v | S_m,iu = 0).
6. Go to a higher level: we set i_u = i_v and the random binary indicator variable U_m,iu is set to 1. We continue with step 4.
As p(u_nu = 1) = 1.0 the procedure always stops for each word with i_u = n_u. This completely defines the probabilistic model for the generation of words. The Bayesian approach [GCSR95] has advantages over the use of the likelihood principle. First it allows regularisation. Sinc
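The generative procedure of Section 4 can be made concrete with a small sampler over a hand-specified two-level DAG of latent classes; the class structure, vocabularies, emission probabilities and document mixture below are toy values invented only to make the control flow explicit, not values estimated by the algorithm described in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two lower-level classes (0, 1) share a single parent class (2), the root.
    parents = {0: [2], 1: [2], 2: []}
    p_word_given_class = {0: {"stock": 0.5, "market": 0.5},   # context a
                          1: {"goal": 0.5, "match": 0.5},     # context b
                          2: {"today": 0.5, "report": 0.5}}   # shared parent vocabulary
    p_emit_here = {0: 0.7, 1: 0.7, 2: 1.0}  # P(S=1): emit here rather than climb to a parent
    p_class_given_doc = [0.6, 0.4, 0.0]     # p(z|d) over the lower-level classes only

    def sample_word():
        iu = rng.choice(3, p=p_class_given_doc)        # step 2: pick a lower-level class
        while True:
            if rng.random() < p_emit_here[iu]:         # steps 3-4: emit a word here ...
                words = list(p_word_given_class[iu])
                probs = list(p_word_given_class[iu].values())
                return rng.choice(words, p=probs)
            iu = rng.choice(parents[iu])               # steps 5-6: ... or climb to a parent

    print([sample_word() for _ in range(10)])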

01 Jan 2004
TL;DR: A replacement of LSI (Latent Semantic Indexing) with a projection matrix created from the WordNet hierarchy is presented and compared with LSI.
Abstract: In the area of information retrieval, the dimension of document vectors plays an important role. Firstly, with higher dimensions index structures suffer the "curse of dimensionality" and their efficiency rapidly decreases. Secondly, we may not use exact words when looking for a document, and thus we miss some relevant documents. LSI (Latent Semantic Indexing) is a numerical method which discovers latent semantics in documents by creating concepts from existing terms. However, it is hard to compute LSI. In this article, we offer a replacement of LSI with a projection matrix created from the WordNet hierarchy and compare it with LSI.

Proceedings Article
01 Jan 2004
TL;DR: Preliminary experiments are conducted to support the use of SVD, the numerical method behind latent semantic indexing, as a possible data mining method and lattice size reduction tool.
Abstract: Latent semantic indexing (LSI) is an application of a numerical method called singular value decomposition (SVD), which discovers latent semantics in documents by creating concepts from existing terms. The application area is not limited to text retrieval; many applications such as image compression are known. We propose the usage of SVD as a possible data mining method and lattice size reduction tool. In this paper we offer preliminary experiments to support the usability of the proposed method.

Proceedings ArticleDOI
19 May 2004
TL;DR: This paper proposes an automatic semantic relationship discovery approach for constructing the semantic link network; it adopts data mining algorithms to discover the semantic relationships between keyword sets, and then uses deductive and analogical reasoning to enrich the semantic relationships.
Abstract: An important obstacle to the success of the Semantic Web is that the establishment of the semantic relationship is labor-intensive. This paper proposes an automatic semantic relationship discovering approach for constructing the semantic link network. The basic premise of this work is that the semantics of a web page can be reflected by a set of keywords, and the semantic relationship between two web pages can be determined by the semantic relationship between their keyword sets. The approach adopts the data mining algorithms to discover the semantic relationships between keyword sets, and then uses deductive and analogical reasoning to enrich the semantic relationships. The proposed algorithms have been implemented. Experiment shows that the approach is feasible.

Proceedings ArticleDOI
15 Dec 2004
TL;DR: The composite directed MRF model has a potentially exponential number of loops and becomes a context-sensitive grammar; nevertheless its parameters can be estimated in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models.
Abstract: We present a directed Markov random field (MRF) model that combines n-gram models, probabilistic context free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. Even though the composite directed MRF model potentially has an exponential number of loops and becomes a context sensitive grammar, we are nevertheless able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models. We generalize various smoothing techniques to alleviate the sparseness of n-gram counts in cases where there are hidden variables. We also derive an analogous algorithm to calculate the probability of an initial subsequence of a sentence generated by the composite language model. Our experimental results on the Wall Street Journal corpus show that we obtain significant reductions in perplexity compared to the state-of-the-art baseline trigram model with Good-Turing and Kneser-Ney smoothings.