
Showing papers on "Latent semantic analysis" published in 2002


Posted Content
TL;DR: This article presents PMI-IR, a simple unsupervised learning algorithm for recognizing synonyms that uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words, based on statistical data acquired by querying a Web search engine.
Abstract: This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
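As a rough illustration of how a PMI-IR style synonym score can be computed, here is a minimal Python sketch. The `hits` callback and the toy corpus are hypothetical stand-ins for real search-engine hit counts, and the query form shown is the simplest of the scoring variants rather than the paper's exact one.

```python
from math import log

def pmi_ir_score(problem: str, choice: str, hits) -> float:
    """Simplest PMI-IR style score: rank candidate synonyms by
    p(problem AND choice) / p(choice), estimated from hit counts.
    `hits` maps a query string to a document count (a real system
    would query a Web search engine; here it is caller-supplied)."""
    joint = hits(f"{problem} {choice}")
    marginal = hits(choice)
    if joint == 0 or marginal == 0:
        return float("-inf")
    # p(problem) is constant across the choices of one test item,
    # so it can be dropped when ranking.
    return log(joint / marginal)

# Toy stand-in for a search-engine index, for illustration only.
_corpus = [
    "every automobile is a car",
    "the automobile industry builds every car and truck",
    "the car sped down the road",
    "the road was icy",
    "a truck hauled the freight",
]

def _toy_hits(query: str) -> int:
    terms = query.split()
    return sum(all(t in doc.split() for t in terms) for doc in _corpus)

choices = ["automobile", "truck", "road", "freight"]
print(max(choices, key=lambda c: pmi_ir_score("car", c, _toy_hits)))  # -> automobile
```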

1,303 citations


Proceedings ArticleDOI
20 Apr 2002
TL;DR: The new Cognitive Walkthrough for the Web (CWW) is superior to the original Cognitive Walkthrough for evaluating how well websites support users' navigation and information search tasks.
Abstract: This paper proposes a transformation of the Cognitive Walkthrough (CW), a theory-based usability inspection method that has proven useful in designing applications that support use by exploration. The new Cognitive Walkthrough for the Web (CWW) is superior for evaluating how well websites support users' navigation and information search tasks. The CWW uses Latent Semantic Analysis to objectively estimate the degree of semantic similarity (information scent) between representative user goal statements (100-200 words) and heading/link texts on each web page. Using an actual website, the paper shows how the CWW identifies three types of problems in web page designs. Three experiments test CWW predictions of users' success rates in accomplishing goals, verifying the value of CWW for identifying these usability problems.
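The core CWW computation is a cosine similarity, in an LSA space, between a user goal statement and each heading/link text. The sketch below illustrates that idea in Python; it substitutes a small TF-IDF plus truncated-SVD space built from the link texts themselves (scikit-learn) for the large pre-trained LSA space that CWW actually relies on, so it should be read as an illustration of the scoring step only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def information_scent(goal: str, link_texts: list[str], dims: int = 2) -> dict[str, float]:
    """Estimate information scent as cosine similarity in a reduced space.
    CWW proper uses a large pre-trained LSA space; a tiny space built from
    the link texts themselves stands in here, purely for illustration."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(link_texts + [goal])
    svd = TruncatedSVD(n_components=min(dims, X.shape[1] - 1))
    Z = svd.fit_transform(X)
    sims = cosine_similarity(Z[-1:], Z[:-1])[0]   # goal vs. each link text
    return dict(zip(link_texts, sims))

links = ["checking and savings accounts", "river cruises and boat tours",
         "mortgage interest rates", "hiking trails along the river bank"]
print(information_scent("open a new savings account and compare interest rates", links))
```

Links whose scent is low, or pages where several links compete for the same goal, are the kinds of design problems the method is meant to surface.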

264 citations


Journal ArticleDOI
TL;DR: In this article, the effects of temporal and semantic proximity on output order in delayed and continuous-distractor free recall of random word lists were investigated using Latent Semantic Analysis (LSA).

226 citations


Book ChapterDOI
01 Jan 2002
TL;DR: This paper illustrates that the large-scale structure of this representation has statistical properties that correspond well with those of semantic networks produced by humans, and trace this to the fidelity with which it reproduces the natural statistics of language.
Abstract: A probabilistic approach to semantic representation. Thomas L. Griffiths & Mark Steyvers ({gruffydd,msteyver}@psych.stanford.edu), Department of Psychology, Stanford University, Stanford, CA 94305-2130 USA. Abstract: Semantic networks produced from human data have statistical properties that cannot be easily captured by spatial representations. We explore a probabilistic approach to semantic representation that explicitly models the probability with which words occur in different contexts, and hence captures the probabilistic relationships between words. We show that this representation has statistical properties consistent with the large-scale structure of semantic networks constructed by humans, and trace the origins of these properties. Contemporary accounts of semantic representation suggest that we should consider words to be either points in a high-dimensional space (e.g. Landauer & Dumais, 1997), or interconnected nodes in a semantic network (e.g. Collins & Loftus, 1975). Both of these ways of representing semantic information provide important insights, but also have shortcomings. Spatial approaches illustrate the importance of dimensionality reduction and employ simple algorithms, but are limited by Euclidean geometry. Semantic networks are less constrained, but their graphical structure lacks a clear interpretation. In this paper, we view the function of associative semantic memory to be efficient prediction of the concepts likely to occur in a given context. We take a probabilistic approach to this problem, modeling documents as expressing information related to a small number of topics (cf. Blei, Ng, & Jordan, 2002). The topics of a language can then be learned from the words that occur in different documents. We illustrate that the large-scale structure of this representation has statistical properties that correspond well with those of semantic networks produced by humans, and trace this to the fidelity with which it reproduces the natural statistics of language. Approaches to semantic representation. Spatial approaches. Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) is a procedure for finding a high-dimensional spatial representation for words. LSA uses singular value decomposition to factorize a word-document co-occurrence matrix. An approximation to the original matrix can be obtained by choosing to use fewer singular values than its rank. One component of this approximation is a matrix that gives each word a location in a high-dimensional space. Distances in this space are predictive in many tasks that require the use of semantic information. Performance is best for approximations that use fewer singular values than the rank of the matrix, illustrating that reducing the dimensionality of the representation can reduce the effects of statistical noise and increase efficiency. While the methods behind LSA were novel in scale and subject, the suggestion that similarity relates to distance in psychological space has a long history (Shepard, 1957). Critics have argued that human similarity judgments do not satisfy the properties of Euclidean distances, such as symmetry or the triangle inequality. Tversky and Hutchinson (1986) pointed out that Euclidean geometry places strong constraints on the number of points to which a particular point can be the nearest neighbor, and that many sets of stimuli violate these constraints. The number of nearest neighbors in similarity judgments has an analogue in semantic representation.
Nelson, McEvoy and Schreiber (1999) had people perform a word association task in which they named an associated word in response to a set of target words. Steyvers and Tenenbaum (submitted) noted that the number of unique words produced for each target follows a power law distribution: if k is the number of words, P(k) ∝ k^(−γ). For reasons similar to those of Tversky and Hutchinson, it is difficult to produce a power law distribution by thresholding cosine or distance in Euclidean space. This is shown in Figure 1. Power law distributions appear linear in log-log coordinates. LSA produces curved log-log plots, more consistent with an exponential distribution. Semantic networks. Semantic networks were proposed by Collins and Quillian (1969) as a means of storing semantic knowledge. The original networks were inheritance hierarchies, but Collins and Loftus (1975) generalized the notion to cover arbitrary graphical structures. The interpretation of this graphical structure is vague, being based on connecting nodes that “activate” one another. Steyvers and Tenenbaum (submitted) constructed a semantic network from the word association norms of Nelson et al.
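The argument about nearest-neighbor counts can be checked numerically: build a graph by thresholding cosine similarity among vectors in a spatial representation and inspect the degree distribution in log-log coordinates. The sketch below does this with random vectors standing in for an LSA space; the dimensionality, threshold, and vocabulary size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random "semantic" vectors standing in for an LSA space (illustration only).
n_words, dims = 2000, 50
V = rng.standard_normal((n_words, dims))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Connect two words when their cosine similarity exceeds a threshold.
sims = V @ V.T
np.fill_diagonal(sims, -1.0)
degrees = (sims > 0.4).sum(axis=1)

# A power law P(k) ∝ k^(-gamma) would be a straight line in log-log
# coordinates; the spatial model instead gives a curved, roughly
# exponential profile, which is the point made in the abstract.
ks, counts = np.unique(degrees[degrees > 0], return_counts=True)
log_k, log_p = np.log(ks), np.log(counts / counts.sum())
slope, intercept = np.polyfit(log_k, log_p, 1)
print(f"log-log slope ~ {slope:.2f} (check the curvature, not just the fit)")
```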

222 citations


Patent
Jerome R. Bellegarda1
12 Sep 2002
TL;DR: In this paper, a method and system for dynamic language modeling of a document are described and a vector representation of the current document in a latent semantic analysis (LSA) space is determined.
Abstract: A method and system for dynamic language modeling of a document are described. In one embodiment, a number of local probabilities of a current document are computed and a vector representation of the current document in a latent semantic analysis (LSA) space is determined. In addition, a number of global probabilities based upon the vector representation of the current document in the LSA space are computed. Further, the local probabilities and the global probabilities are combined to produce the language model.
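A minimal sketch of the kind of local/global combination the abstract describes: turn cosine similarities between word vectors and the current document's LSA vector into a global distribution, then blend it with the local n-gram distribution. The softmax temperature and the log-linear interpolation rule are illustrative assumptions, not the patent's exact formulas.

```python
import numpy as np

def lsa_word_distribution(word_vecs: np.ndarray, doc_vec: np.ndarray,
                          temperature: float = 0.1) -> np.ndarray:
    """Global probabilities: cosine similarity between each word vector
    and the running document's LSA vector, pushed through a softmax.
    The temperature is an illustrative free parameter."""
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    d = doc_vec / np.linalg.norm(doc_vec)
    sims = w @ d
    z = np.exp((sims - sims.max()) / temperature)
    return z / z.sum()

def combine_local_global(p_ngram: np.ndarray, p_lsa: np.ndarray,
                         lam: float = 0.5) -> np.ndarray:
    """Blend the local n-gram distribution with the global LSA-based
    distribution over the same vocabulary. A log-linear (geometric)
    interpolation with renormalisation is one plausible rule; the
    patent's exact combination may differ."""
    mixed = (p_ngram ** lam) * (p_lsa ** (1.0 - lam))
    return mixed / mixed.sum()
```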

187 citations


Patent
Colin Leonard Bird1
21 Jun 2002
TL;DR: In this paper, a method of generating metadata is provided, including providing (401) a plurality of source texts (100) and processing the plurality of source texts (100) to extract primary metadata in the form of sets of words (104, 106).
Abstract: A method of generating metadata is provided including providing (401) a plurality of source texts (100), processing the plurality of source texts (100) to extract primary metadata in the form of a plurality of sets of words (104, 106), and comparing (407) each of the source texts (100) with each of the sets of words (104, 106). The method includes using a clustering program to extract the sets of words (104, 106) from the source texts (100). The step of comparing is carried out by Latent Semantic Analysis to compare the similarity of meaning of each source text (100) with each set of words (104, 106) obtained by the clustering program. The comparison obtains a measure of the extent to which each source text (100) is representative of a set of words (104, 106).

81 citations


Proceedings ArticleDOI
11 Aug 2002
TL;DR: The view presented in this paper is that the fundamental vocabulary of the system is the images in the database, and that each round of relevance feedback is a document whose words are images and which expresses the semantic intent of the user over that query.
Abstract: This paper proposes a novel view of the information generated by relevance feedback. The latent semantic analysis is adapted to this view to extract useful inter-query information. The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback is a document whose words are the images. A relevance feedback document contains the intra-query information which expresses the semantic intent of the user over that query. The inter-query information then takes the form of a collection of documents which can be subjected to latent semantic analysis. An algorithm to query the latent semantic index is presented and evaluated against real data sets.
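The representational trick here is that images play the role of words and relevance-feedback sessions play the role of documents, after which standard LSI machinery applies. A small numpy sketch of that setup, with toy data and an arbitrary number of latent dimensions:

```python
import numpy as np

# Rows are images (the "vocabulary"); columns are past relevance-feedback
# sessions (the "documents"). A cell is 1 when the image was marked
# relevant in that session. Toy data, illustration only.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

k = 2                                       # latent dimensions (illustrative)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def fold_in(feedback: np.ndarray) -> np.ndarray:
    """Project a feedback vector (one entry per image) into the latent
    space via the standard LSI fold-in q_k = q^T U_k S_k^{-1}."""
    return feedback @ Uk / sk

# Each image's coordinates = fold-in of its indicator vector (rows of Uk / sk).
image_coords = Uk / sk

# A new query in which only image 0 was marked relevant: images nearby in
# the latent space are candidates to return, exploiting inter-query history.
q = fold_in(np.array([1.0, 0.0, 0.0, 0.0, 0.0]))
sims = (image_coords @ q) / (np.linalg.norm(image_coords, axis=1) * np.linalg.norm(q) + 1e-12)
print(np.argsort(-sims))                    # most similar images first
```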

63 citations


Journal ArticleDOI
TL;DR: The similarity between human judgments and LSA indicates that LSA will be useful in accounting for reading strategies in a Web-based version of SERT, and indicates that students who were not complying with SERT tended to paraphrase the text sentences, whereas students who were compliant with SERT tended to explain the sentences.
Abstract: We tested a computer-based procedure for assessing reader strategies that was based on verbal protocols that utilized latent semantic analysis (LSA). Students were given self-explanation-reading training (SERT), which teaches strategies that facilitate self-explanation during reading, such as elaboration based on world knowledge and bridging between text sentences. During a computerized version of SERT practice, students read texts and typed self-explanations into a computer after each sentence. The use of SERT strategies during this practice was assessed by determining the extent to which students used the information in the current sentence versus the prior text or world knowledge in their self-explanations. This assessment was made on the basis of human judgments and LSA. Both human judgments and LSA were remarkably similar and indicated that students who were not complying with SERT tended to paraphrase the text sentences, whereas students who were compliant with SERT tended to explain the sentences in terms of what they knew about the world and of information provided in the prior text context. The similarity between human judgments and LSA indicates that LSA will be useful in accounting for reading strategies in a Web-based version of SERT.

52 citations


Journal ArticleDOI
TL;DR: This paper introduces latent semantic analysis (LSA), a machine learning method for representing the meaning of words, sentences, and texts in a high-dimensional semantic space built from reading a very large amount of text.

38 citations


Book ChapterDOI
TL;DR: Two different approaches for incorporating background knowledge into nearest-neighbor text classification are described, one of which redescribes examples using Latent Semantic Indexing on the background knowledge, assessing document similarities in this redescribed space.
Abstract: This paper describes two different approaches for incorporating background knowledge into nearest-neighbor text classification. Our first approach uses background text to assess the similarity between training and test documents rather than assessing their similarity directly. The second method redescribes examples using Latent Semantic Indexing on the background knowledge, assessing document similarities in this redescribed space. Our experimental results show that both approaches can improve the performance of nearest-neighbor text classification. These methods are especially useful when labeling text is a labor-intensive job and when there is a large amount of information available about a specific problem on the World Wide Web.
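A minimal sketch of the second approach, under the assumption that plain TF-IDF plus truncated SVD (scikit-learn) is an acceptable stand-in for the paper's LSI setup: the latent space is learned from unlabeled background text only, and labeled training and test documents are redescribed in it before k-NN classification. The parameter values are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

def lsi_knn(background: list[str], train_texts: list[str], train_labels,
            test_texts: list[str], k: int = 3, dims: int = 100):
    """Redescribe examples via LSI built on background knowledge, then
    classify with k nearest neighbours (cosine distance). k and dims are
    illustrative choices, not the paper's."""
    vec = TfidfVectorizer()
    vec.fit(background)                          # vocabulary from background only
    svd = TruncatedSVD(n_components=min(dims, len(background) - 1))
    svd.fit(vec.transform(background))           # latent space from background only

    def to_latent(docs):
        return svd.transform(vec.transform(docs))

    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(to_latent(train_texts), train_labels)
    return clf.predict(to_latent(test_texts))
```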

35 citations


Journal ArticleDOI
01 Jan 2002
TL;DR: This work relies on LSA to represent the student model in a tutoring system and designs tutoring strategies to automatically detect lexeme misunderstandings and to select, among the various examples of a domain, the one to which it is best to expose the student.
Abstract: Latent semantic analysis (LSA) is a tool for extracting semantic information from texts as well as a model of language learning based on exposure to texts. We rely on LSA to represent the student model in a tutoring system. Domain examples and student productions are represented in a high-dimensional semantic space, automatically built from a statistical analysis of the co-occurrences of their lexemes. We also designed tutoring strategies to automatically detect lexeme misunderstandings and to select among the various examples of a domain the one which is best to expose the student to. Two systems are presented: the first one successively presents texts to be read by the student, selecting the next one according to the comprehension of the prior ones by the student. The second plays a board game (kalah) with the student in such a way that the next configuration of the board is supposed to be the most appropriate with respect to the semantic structure of the domain and the student's previous moves.

Book ChapterDOI
01 Jan 2002
TL;DR: A study designed to address the matter of corpus selection by systematically testing the kinds of texts included in the corpus is discussed; notably, a significant fraction of the text found in a physics textbook may exemplify incorrect thinking.
Abstract: The Right Stuff: Do You Need to Sanitize Your Corpus When Using Latent Semantic Analysis? Brent A. Olde (baolde@memphis.edu), Department of Psychology, 202 Psychology Building, University of Memphis, Memphis, TN 38152 USA; Donald R. Franceschetti (dfrncsch@memphis.edu), Department of Physics, University of Memphis, Campus Box 523390, Memphis, TN 38152 USA; Ashish Karnavat (akarnavat@chiinc.com), CHI Systems, Inc., 716 N. Bethlehem Pike, Suite 300, Lower Gwynedd, PA 19002 USA; Arthur C. Graesser (a-graesser@memphis.edu), Department of Psychology, 202 Psychology Building, University of Memphis, Memphis, TN 38152 USA; and the Tutoring Research Group. Abstract: Student responses to conceptual physics questions were analyzed with latent semantic analysis (LSA), using different text corpora. Expert evaluations of student answers to questions were correlated with LSA metrics of the similarity between student responses and ideal answers. We compared the adequacy of several text corpora in LSA performance evaluation, including the inclusion of written incorrect reasoning and tangentially relevant historical information. The results revealed that there is no benefit in meticulously eliminating the wrong or irrelevant information that normally accompanies a textbook. Results are also reported on the impact of corpus size and the addition of information that is not topic relevant. Introduction. AutoTutor is an intelligent tutoring agent that interacts with a student using natural language dialogue (Graesser, Person, Harter, & TRG, in press; Graesser, VanLehn, Rose, Jordan, & Harter, 2001). The tutor’s interactions are not limited to single-word answers or formulaic yes/no decision trees. AutoTutor attempts to tackle the problem of understanding lengthy discourse contributions of the student, which are often ungrammatical and vague. AutoTutor responds to the student with discourse moves that are pedagogically appropriate. It is this cooperative, constructive, one-on-one dialogue that is believed to produce learning gains (Graesser, Person, & Magliano, 1995). One major component in the comprehension mechanism is the knowledge representation provided by Latent Semantic Analysis (LSA). LSA is a statistical, corpus-based natural language understanding technique that computes similarity comparisons between a set of terms and texts (Kintsch, 1998; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). The present study focuses on the domain of conceptual physics. It should be noted that most modern physics texts (such as Hewitt, 1998) devote considerable space to the historical evolution of physical concepts, the cultural context of physics, and its social impact. Some authors also devote appreciable space to discussing discarded theories and chains of reasoning that lead to incorrect conclusions. Thus, a significant fraction of the text found in a physics text may exemplify incorrect thinking. The Tutoring Research Group at the University of Memphis has been concerned with the best strategy for selecting a corpus of texts when constructing an LSA space. A naive approach would be to gather a number of physics texts, and combine them into one corpus. However, there are some important, unexplored issues that must be addressed about this approach. What should be done about the text that was written to illustrate incorrect reasoning? Does the inclusion of historical information or peripherally related information strengthen or dilute the accuracy with which physics concepts are represented in the LSA space? 
In short, how much special preparation of the corpus is needed, if it is to provide a reliable representation of the physics that students are expected to learn? In this paper, we provide a brief overview of LSA and how it is used in our tutoring system. Then we discuss a study designed to address the matter of corpus selection by systematically testing the kind of texts

Proceedings ArticleDOI
11 Jun 2002
TL;DR: This paper presents an infrastructure allowing the use of latent semantic analysis and open hypermedia concepts in the automatic identification of relationships among web pages, and presents an extensible latent semantic indexing service and an open linkbase service.
Abstract: The more the web grows, the harder it is for users to find the information they need. As a result, it is even more difficult to identify when documents are related. To find out that two or more documents are in fact related, users have to navigate through the documents and carry out an analysis of their content. This paper presents an infrastructure allowing the use of latent semantic analysis and open hypermedia concepts in the automatic identification of relationships among web pages. Latent Semantic Analysis has been proposed by the information retrieval community as an attempt to automatically organize text objects into a semantic structure appropriate for matching. In open hypermedia systems, links are managed and stored in a special database, a linkbase, which allows the addition of hypermedia functionality to a document without changing the original structure and format of the document. We first present two complementary link-related efforts: an extensible latent semantic indexing service and an open linkbase service. Leveraging those efforts, we present an infrastructure that identifies latent semantic links within web repositories and makes them available in an open linkbase. To demonstrate by example the utility of our open infrastructure, we built an application presenting a directory of semantic links extracted from web sites.

Journal ArticleDOI
TL;DR: Comparison of traditional methods of scoring the Logical Memory test of the Wechsler Memory Scale-III with a new method based on Latent Semantic Analysis (LSA) showed that LSA was at least as valid and sensitive as traditional measures, suggesting that it may serve as an improved measure of prose recall.
Abstract: The aim of this study was to compare traditional methods of scoring the Logical Memory test of the Wechsler Memory Scale-III with a new method based on Latent Semantic Analysis (LSA). LSA represents texts as vectors in a high-dimensional semantic space and the similarity of any two texts is measured by the cosine of the angle between their respective vectors. The Logical Memory test was administered to a sample of 72 elderly individuals, 14 of whom were classified as cognitively impaired by the Mini-Mental State Examination (MMSE). The results showed that LSA was at least as valid and sensitive as traditional measures. Partial correlations between prose recall measures and measures of cognitive function indicated that LSA explained all the relationship between Logical Memory and general cognitive function. This suggests that LSA may serve as an improved measure of prose recall.

Proceedings Article
01 Jan 2002
TL;DR: A principled approach to incorporating prior knowledge into low rank approximation techniques is described, and its application to PCA-based approximations of several data sets is demonstrated.
Abstract: Low rank approximation techniques are widespread in pattern recognition research; they include Latent Semantic Analysis (LSA), Probabilistic LSA, Principal Components Analysis (PCA), the Generative Aspect Model, and many forms of bibliometric analysis. All make use of a low-dimensional manifold onto which data are projected. Such techniques are generally "unsupervised," which allows them to model data in the absence of labels or categories. With many practical problems, however, some prior knowledge is available in the form of context. In this paper, I describe a principled approach to incorporating such information, and demonstrate its application to PCA-based approximations of several data sets.

Book ChapterDOI
11 Dec 2002
TL;DR: Two novel approaches are proposed to extract important sentences from a document to create its summary by combining the ideas of latent semantic analysis and text relationship maps to interpret conceptual structures of a document.
Abstract: In this paper, two novel approaches are proposed to extract important sentences from a document to create its summary. The first is a corpus-based approach using feature analysis. It brings up three new ideas: 1) to employ ranked position to emphasize the significance of sentence position, 2) to reshape word unit to achieve higher accuracy of keyword importance, and 3) to train a score function by the genetic algorithm for obtaining a suitable combination of feature weights. The second approach combines the ideas of latent semantic analysis and text relationship maps to interpret conceptual structures of a document. Both approaches are applied to Chinese text summarization. The two approaches were evaluated by using a data corpus composed of 100 articles about politics from New Taiwan Weekly, and when the compression ratio was 30%, average recalls of 52.0% and 45.6% were achieved respectively.
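For the LSA side of the second approach, the generic extraction recipe is: build a term-by-sentence matrix, take its SVD, and pick, for each leading latent topic, the sentence that loads most heavily on it. The sketch below shows that generic recipe (Gong & Liu style) rather than the paper's exact combination with text relationship maps.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_extract(sentences: list[str], n_pick: int) -> list[str]:
    """Generic LSA-based sentence extraction, shown as a stand-in for the
    paper's LSA + text-relationship-map combination: SVD the
    term-by-sentence matrix and, for each leading latent topic, keep the
    sentence that loads most heavily on it."""
    X = CountVectorizer().fit_transform(sentences).T.toarray()   # terms x sentences
    U, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    chosen, seen = [], set()
    for topic in Vt:                      # each row of Vt = one topic over sentences
        idx = int(np.argmax(np.abs(topic)))
        if idx not in seen:
            seen.add(idx)
            chosen.append(idx)
        if len(chosen) == n_pick:
            break
    return [sentences[i] for i in sorted(chosen)]   # preserve document order
```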

Book ChapterDOI
19 Sep 2002
TL;DR: This paper uses a technique called Random Indexing to accumulate context vectors for Swedish, French and Italian, and uses the context vectors to perform monolingual query expansion in the CLEF 2002 experiments.
Abstract: Vector space techniques can be used for extracting semantically similar words from the co-occurrence statistics of words in large text data collections. We have used a technique called Random Indexing to accumulate context vectors for Swedish, French and Italian. We have then used the context vectors to perform automatic query expansion. In this paper, we report on our CLEF 2002 experiments on Swedish, French and Italian monolingual query expansion.
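Random Indexing avoids building and decomposing a full co-occurrence matrix: each context gets a sparse ternary random index vector, and a word's context vector is the sum of the index vectors of the contexts it occurs in. A document-based sketch, with illustrative dimensionality and sparsity settings and a simple cosine-based expansion step:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, NONZERO = 1000, 8   # dimensionality and sparsity of index vectors (illustrative)

def index_vector() -> np.ndarray:
    """A sparse ternary random index vector: a few +1s and -1s, rest zeros."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def context_vectors(docs: list[list[str]]) -> dict[str, np.ndarray]:
    """Accumulate one context vector per word: the sum of the index vectors
    of the documents it occurs in (document-based Random Indexing; a
    window-based variant would accumulate neighbouring words' index vectors)."""
    vectors: dict[str, np.ndarray] = {}
    for doc in docs:
        ivec = index_vector()
        for w in set(doc):
            vectors.setdefault(w, np.zeros(DIM))
            vectors[w] += ivec
    return vectors

def expand_query(query: list[str], vectors: dict[str, np.ndarray], n: int = 3) -> list[str]:
    """Query expansion: add the words whose context vectors are most
    similar (cosine) to the summed context vector of the query terms."""
    q = sum((vectors[w] for w in query if w in vectors), np.zeros(DIM))

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    ranked = sorted((w for w in vectors if w not in query),
                    key=lambda w: cos(vectors[w], q), reverse=True)
    return ranked[:n]
```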

Book ChapterDOI
09 Sep 2002
TL;DR: This work presents a variety of tools for visualising word-meanings in vector spaces and graph models, derived from co-occurrence information and local syntactic analysis, which suggest new solutions to standard problems such as automatic management of lexical resources.
Abstract: Many ways of dealing with large collections of linguistic information involve the general principle of mapping words, larger terms and documents into some sort of abstract space. Considerable effort has been devoted to applying such techniques for practical tasks such as information retrieval and word-sense disambiguation. However, the inherent structure of these spaces is often less well understood. Visualisation tools can help to uncover the relationships between meanings in this space, giving a clearer picture of the natural structure of linguistic information. We present a variety of tools for visualising word-meanings in vector spaces and graph models, derived from co-occurrence information and local syntactic analysis. Our techniques suggest new solutions to standard problems such as automatic management of lexical resources, which perform well under evaluation. The tools presented in this paper are all available for public use on our website.

01 Jan 2002
TL;DR: This work empirically demonstrates that LSI uses up to fifth order term co-occurrence, and proves mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI.
Abstract: Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI’s use of higher orders of co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI. 1. INTRODUCTION. The use of co-occurrence information in textual data has led to improvements in performance when applied to a variety of applications in information retrieval, computational linguistics and textual data mining. Furthermore, many researchers in these fields have developed techniques that explicitly employ second and third order term co-occurrence. Examples include applications such as literature search [14], word sense disambiguation [12], ranking of relevant documents [15], and word selection [8]. Other authors have developed algorithms that implicitly rely on the use of term co-occurrence for applications such as search and retrieval [5], trend detection [14], and stemming [17]. In what follows we refer to various degrees of term transitivity as orders of co-occurrence – first order if two terms co-occur, second order if two terms are linked only by a third, etc. An example of second order co-occurrence follows. Assume that a collection has one document that contains the terms
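The orders of co-occurrence can be read directly off powers of the term-term matrix T = AAᵀ: first-order pairs are the nonzero off-diagonal entries of T, second-order pairs are nonzero in T² but zero in T, and so on. The toy example below also forms the truncated term-term matrix U_k S_k² U_kᵀ that the paper's connectivity-path result is about; the data and k are arbitrary.

```python
import numpy as np

# Toy term-document matrix (terms x documents); illustration only.
A = np.array([
    [1, 0, 0],   # t0 occurs in d0
    [1, 1, 0],   # t1 occurs in d0, d1
    [0, 1, 1],   # t2 occurs in d1, d2
    [0, 0, 1],   # t3 occurs in d2
], dtype=float)

T = A @ A.T                                   # term-term co-occurrence matrix
eye = np.eye(len(T), dtype=bool)
first = (T > 0) & ~eye                        # first-order pairs (direct co-occurrence)
second = ((T @ T) > 0) & ~first & ~eye        # pairs linked only through a third term
print(first.astype(int))
print(second.astype(int))                     # e.g. t0-t2 co-occur only via t1

# Truncated term-term matrix as computed from LSI's rank-k SVD; the paper's
# result is that every nonzero entry of T_k corresponds to some connectivity
# path (a chain of co-occurrences) between the two terms.
k = 2
U, s, _ = np.linalg.svd(A, full_matrices=False)
T_k = U[:, :k] @ np.diag(s[:k] ** 2) @ U[:, :k].T
print(np.round(T_k, 3))
```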

Proceedings ArticleDOI
13 May 2002
TL;DR: An unsupervised topic model decomposition is built which makes it possible to infer topic-related word distributions from very short adaptation texts; the resulting distribution is then used to constrain the estimation of a minimum divergence trigram language model.
Abstract: This work presents a language model adaptation method combining the latent semantic analysis framework with the minimum discrimination information estimation criterion. In particular, an unsupervised topic model decomposition is built which allows topic-related word distributions to be inferred from very short adaptation texts. The resulting word distribution is then used to constrain the estimation of a minimum divergence trigram language model. With respect to previous work, implementation details are discussed that make such an approach effective for a large-scale application. Experimental results are provided for a digital library indexing task, i.e., the speech transcription of five historical documentary films. By adapting a trigram language model from very terse content descriptions, i.e., at most ten words, available for each film, a relative word error rate reduction of 3.2% was achieved.
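When the MDI constraints are the topic unigram marginals, the adaptation has a well-known closed form, unigram rescaling: each background n-gram distribution is multiplied by (p_topic(w)/p_background(w))^β and renormalised. The sketch below shows that textbook form; the paper's exact estimator and smoothing details may differ, and β is a free parameter here.

```python
import numpy as np

def mdi_unigram_rescale(p_background: np.ndarray,        # shape (histories, vocab)
                        p_topic_unigram: np.ndarray,      # shape (vocab,)
                        p_background_unigram: np.ndarray, # shape (vocab,)
                        beta: float = 1.0) -> np.ndarray:
    """Unigram-rescaling form of MDI adaptation: scale each background
    n-gram distribution p_B(.|h) by (p_topic(w)/p_B(w))^beta, then
    renormalise per history. Shown as the textbook closed form; not
    necessarily the paper's exact estimator."""
    scale = (p_topic_unigram / np.maximum(p_background_unigram, 1e-12)) ** beta
    adapted = p_background * scale                 # broadcasts over histories (rows)
    return adapted / adapted.sum(axis=-1, keepdims=True)
```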

01 Jan 2002
TL;DR: A method of information extraction from large corpora of texts to estimate the affective valence associated with any term, combining latent semantic analysis (LSA) with the determination of the emotional content of a text based on the words that compose it.
Abstract: The aim of this research is to develop a method of information extraction from large corpora of texts to estimate the affective valence associated with any term. Our approach combines two techniques: latent semantic analysis (LSA) and the determination of the emotional content of a text based on the words that compose it. A preliminary study designed to evaluate this approach has been conducted on a corpus of several thousand articles published in a Belgian newspaper. A first analysis showed that, by combining LSA and a dictionary of 3000 words, it is possible to approximate efficiently the affective valence of words on the basis of the words that are associated with them in the semantic space. A second analysis applied the technique to firm names. We conclude by proposing some improvements to the technique.
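One plausible reading of the combination described here: given an LSA space and a rated seed lexicon (the 3000-word dictionary), estimate the valence of any other word from the valences of its nearest semantic neighbours. The weighting scheme and neighbourhood size in the sketch are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def estimate_valence(word_vec: np.ndarray, seed_vecs: np.ndarray,
                     seed_valences: np.ndarray, top_n: int = 30) -> float:
    """Estimate a word's affective valence from an LSA space and a rated
    seed lexicon: take the most similar seed words (cosine) and average
    their valences, weighted by similarity. top_n and the weighting are
    illustrative choices."""
    w = word_vec / (np.linalg.norm(word_vec) + 1e-12)
    S = seed_vecs / (np.linalg.norm(seed_vecs, axis=1, keepdims=True) + 1e-12)
    sims = S @ w
    top = np.argsort(-sims)[:top_n]
    weights = np.clip(sims[top], 0.0, None)        # ignore negatively related seeds
    if weights.sum() == 0:
        return float(seed_valences.mean())
    return float(np.average(seed_valences[top], weights=weights))
```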

Book ChapterDOI
02 Jun 2002
TL;DR: Apex, a two-loop system which provides texts to be learned through summarization, uses LSA (Latent Semantic Analysis), a tool devised for the semantic comparison of texts.
Abstract: Current systems aiming at engaging students in Self-Regulated Learning processes are often prompt-based and domain-dependent. Such metacognitive prompts are either difficult for novices to interpret or ignored by experts. Although domain-dependence per se cannot be considered a drawback, it is often due to a rigid structure which prevents moving to another domain. We detail here Apex, a two-loop system which provides texts to be learned through summarization. In the first loop, called Reading, the student formulates a query and is provided with texts related to this query. Then the student judges whether each text presented could be summarized. In the second loop, called Writing, the student writes out a summary of the texts, then gets an assessment from the system. In order to automatically perform various comprehension-centered tasks (i.e., selecting texts that match queries and assessing summaries), our system uses LSA (Latent Semantic Analysis), a tool devised for the semantic comparison of texts.

Proceedings Article
01 Jan 2002
TL;DR: A statistical scheme that dynamically determines the unit scope in the generalization stage is presented; the combination of all the techniques leads to a 14% perplexity reduction on a subset of the Wall Street Journal corpus, compared with the trigram model.
Abstract: We describe an extension to the use of Latent Semantic Analysis (LSA) for language modeling. This technique makes it easier to exploit long distance relationships in natural language for which the traditional n-gram is unsuited. However, with the growth of length, the semantic representation of the history may be contaminated by irrelevant information, increasing the uncertainty in predicting the next word. To address this problem, we propose a multilevel framework dividing the history into three levels corresponding to document, paragraph and sentence. To combine the three levels of information with the n-gram, a Softmax network is used. We further present a statistical scheme that dynamically determines the unit scope in the generalization stage. The combination of all the techniques leads to a 14% perplexity reduction on a subset of Wall Street Journal, compared with the trigram model.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space yields a comparable representation to that provided by a novel probabilistic approach.
Abstract: This paper proposes a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space. Preliminary experimental results demonstrate that this yields a comparable representation to that provided by a novel probabilistic approach which reconsiders the entire indexing problem of text documents and works directly in the original high-dimensional vector-space representation of text. The employed projection index is derived here from the a priori constraints on the problem. The principal advantage of this approach is computational efficiency, which is obtained by exploiting Latent Semantic Indexing as a preprocessing stage. Simulation results on subsets of the 20-Newsgroups text corpus in various settings are provided.

Journal ArticleDOI
TL;DR: It is proposed that the combination of reading time and the semantics of documents accessed by users reflect their tacit knowledge, and the technique of Latent Interest Analysis (LIA) is introduced to model information needs based on tacit knowledge.
Abstract: Online resources of engineering design information are a critical resource for practicing engineers. These online resources often contain references and content associated with technical memos, journal articles and “white papers” of prior engineering projects. However, filtering this stream of information to find the right information appropriate to an engineering issue and the engineer is a time-consuming task. The focus of this research lies in ascertaining tacit knowledge to model the information needs of the users of an engineering information system. It is proposed that the combination of reading time and the semantics of documents accessed by users reflect their tacit knowledge. By combining the computational text analysis tool of Latent Semantic Analysis with analyses of on-line user transaction logs, we introduce the technique of Latent Interest Analysis (LIA) to model information needs based on tacit knowledge. Information needs are modeled by a vector equation consisting of a linear combination of the user’s queries and prior documents downloaded, scaled by the reading time of each document to measure the degree of relevance. A validation study of the LIA model revealed a higher correlation between predicted and actual information needs for our model in comparison to models lacking scaling by reading time and a representation of the semantics of prior accessed documents. The technique was incorporated into a digital library to recommend engineering education materials to users.
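The vector equation described in the abstract, a linear combination of query vectors and reading-time-weighted document vectors, is easy to state concretely. In the sketch below the equal weighting of the query and document parts and the normalisation are illustrative assumptions.

```python
import numpy as np

def latent_interest(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                    reading_times: np.ndarray) -> np.ndarray:
    """Model a user's information need as a linear combination of the LSA
    vectors of their queries and of previously downloaded documents, the
    latter scaled by reading time as a proxy for relevance. Equal
    weighting of the two parts is an illustrative assumption."""
    q = query_vecs.sum(axis=0)
    d = (reading_times[:, None] * doc_vecs).sum(axis=0)
    need = q / (np.linalg.norm(q) + 1e-12) + d / (np.linalg.norm(d) + 1e-12)
    return need / (np.linalg.norm(need) + 1e-12)

def recommend(need: np.ndarray, candidate_vecs: np.ndarray, n: int = 5) -> np.ndarray:
    """Rank candidate documents (e.g. engineering education materials)
    by cosine similarity to the modelled information need."""
    C = candidate_vecs / (np.linalg.norm(candidate_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(C @ need))[:n]
```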

Book ChapterDOI
01 Jan 2002
TL;DR: This work casts the passages of a large and representative text corpus as a system of simultaneous linear equations in which passage meaning equals the sum of word meanings, and produces a high-dimensional vector representing the average contribution to passage meanings of every word.
Abstract: Latent Semantic Analysis (LSA) treats language learning and representation as a problem in mathematical induction. It casts the passages of a large and representative text corpus as a system of simultaneous linear equations in which passage meaning equals the sum of word meanings. LSA simulates human language understanding with surprising fidelity. Successes to date disprove the poverty of the stimulus argument for lexical meaning and recast the problem of syntax learning, but leave much room for improvement. Semantic atoms are not only single words; idioms need lexicalization. Syntax surely matters; LSA ignores word order. LSA’s knowledge resembles intuition; people also use language for logic. Relations to other input matter. LSA represents perceptual phenomena vicariously, e.g. color relations. Demonstrations that people think in other modes, or that LSA does not exhaust linguistic meaning, do not question LSA’s validity, but call for more modeling, testing, and integration.

Journal ArticleDOI
TL;DR: The goal is to use cluster analysis to group word senses objectively on the basis of their co-occurrence with other words, and the results of classifying two senses of the word BANK indicate high classification accuracy for primary word senses, but poor classification accuracy for secondary word senses.
Abstract: Recently, statistical models for the identification of word senses in English text have been suggested, such as Latent Semantic Analysis (LSA), which is based on dimensionality reduction. While this approach has yielded promising results, it makes many assumptions about the underlying semantic structure. In this paper, the goal is to use cluster analysis to group word senses objectively on the basis of their co-occurrence with other words. This method does not make any a priori assumptions about the group to which a case might be assigned: it is an arbitrary classification into a specified number of groups, with cases assigned on the basis of their metric distance from one another in a high-dimensional space. The results of classifying two senses of the word BANK indicate high classification accuracy for primary word senses, but poor classification accuracy for secondary word senses. A role for using cluster analysis to determine highly discriminating items in text is discussed.
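A minimal sketch of the clustering idea, assuming bag-of-words context vectors and k-means (scikit-learn) as reasonable stand-ins for the paper's exact feature set and clustering procedure: each occurrence of the ambiguous word contributes one context vector, and the cluster assignment approximates its sense.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def cluster_senses(contexts: list[str], n_senses: int = 2, seed: int = 0):
    """Group occurrences of an ambiguous word (e.g. BANK) with no prior
    labels: each context becomes a bag-of-words vector and k-means assigns
    it to one of n_senses clusters. Vectoriser and k-means settings are
    illustrative, not the paper's."""
    X = CountVectorizer(stop_words="english").fit_transform(contexts)
    km = KMeans(n_clusters=n_senses, n_init=10, random_state=seed)
    return km.fit_predict(X)

contexts = [
    "deposited the cheque at the bank before the loan interest was due",
    "the bank raised its mortgage and savings interest rates",
    "fished from the muddy bank of the river at dawn",
    "the canoe drifted toward the grassy bank of the stream",
]
# Cluster labels are arbitrary; ideally the two financial and two river
# contexts end up in separate clusters.
print(cluster_senses(contexts))
```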

Proceedings ArticleDOI
07 Jan 2002
TL;DR: This work will provide participants with a basic understanding of LSA theory and methods and demonstrate applications of its use in support of collaboration and learning.
Abstract: Latent Semantic Analysis (LSA) is a powerful tool for use in the support of learning in collaborative environments. We will provide participants with a basic understanding of LSA theory and methods and demonstrate applications of its use in support of collaboration and learning.

Proceedings ArticleDOI
24 Aug 2002
TL;DR: A comparative evaluation of two data-driven models used in translation selection for English-Korean machine translation, using k-nearest neighbor (k-NN) learning to select an appropriate translation for unseen instances in the dictionary.
Abstract: We present a comparative evaluation of two data-driven models used in translation selection for English-Korean machine translation. Latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA) are applied in particular to implement the data-driven models. These models are able to represent complex semantic structures of given contexts, like text passages. Grammatical relationships stored in dictionaries are essentially utilized in translation selection. We have used k-nearest neighbor (k-NN) learning to select an appropriate translation for unseen instances in the dictionary. The distance between instances in k-NN is computed from the similarity measured by LSA and PLSA. For the experiments, we used TREC data (AP news in 1988) for constructing the latent semantic spaces of the two models and the Wall Street Journal corpus for evaluating the translation accuracy of each model. PLSA selected relatively more accurate translations than LSA in the experiment, irrespective of the value of k and the types of grammatical relationship.

Proceedings ArticleDOI
08 Nov 2002
TL;DR: It is demonstrated empirically that under a broad range of circumstances LSA performs poorly, and a two-stage algorithm based on LSA that performs significantly better is described.
Abstract: A common problem faced when gathering information from the web is the use of different names to refer to the same entity. For example, the city in India referred to as Bombay in some documents may be referred to as Mumbai in others because its name officially changed from the former to the latter in 1995. Multiplicity of names can cause relevant documents to be missed by search engines. Our goal is to develop an automated system that discovers additional names for an entity given just one of its names. Latent semantic analysis (LSA) is generally thought to be well-suited for this task (Berry & Fierro 1996). We demonstrate empirically that under a broad range of circumstances LSA performs poorly, and describe a two-stage algorithm based on LSA that performs significantly better.