
Showing papers on "Latent semantic analysis published in 2003"


Journal ArticleDOI
TL;DR: This article introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words, based on two different statistical measures of word association.
Abstract: The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This article introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words.

1,651 citations
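A minimal sketch of the LSA-based variant of this recipe (the paper's SO-LSA idea), assuming nothing beyond a toy corpus: build a small latent term space, then score a word by its summed similarity to the positive paradigm words minus its summed similarity to the negative ones. The corpus, paradigm lists, and dimensionality below are illustrative stand-ins, not the paper's data.

```python
# Toy illustration of orientation-from-association in an LSA term space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the honest clerk gave superb and correct advice",
    "a corrupt official gave hateful and wrong advice",
    "the superb report was correct and honest",
    "the hateful review was wrong and corrupt",
]
positive = ["honest", "correct", "superb"]
negative = ["corrupt", "wrong", "hateful"]

vec = CountVectorizer()
X = vec.fit_transform(corpus)              # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
term_vectors = svd.components_.T           # terms x latent dimensions
vocab = vec.vocabulary_                    # word -> row index

def orientation(word):
    w = term_vectors[vocab[word]].reshape(1, -1)
    pos = sum(cosine_similarity(w, term_vectors[vocab[p]].reshape(1, -1))[0, 0]
              for p in positive)
    neg = sum(cosine_similarity(w, term_vectors[vocab[n]].reshape(1, -1))[0, 0]
              for n in negative)
    return pos - neg                        # > 0 suggests praise, < 0 criticism

print(orientation("advice"))
```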


Proceedings ArticleDOI
02 Nov 2003
TL;DR: This paper applies and compares two simple latent space models commonly used in text analysis, namely Latent Semantic Analysis (LSA) and Probabilistic LSA (PLSA), and finds that, on an 8000-image dataset, a classic LSA model defined on keywords and a very basic image representation performed as well as much more complex, state-of-the-art methods.
Abstract: Image auto-annotation, i.e., the association of words to whole images, has attracted considerable attention. In particular, unsupervised, probabilistic latent variable models of text and image features have shown encouraging results, but their performance with respect to other approaches remains unknown. In this paper, we apply and compare two simple latent space models commonly used in text analysis, namely Latent Semantic Analysis (LSA) and Probabilistic LSA (PLSA). Annotation strategies for each model are discussed. Remarkably, we found that, on an 8000-image dataset, a classic LSA model defined on keywords and a very basic image representation performed as well as much more complex, state-of-the-art methods. Furthermore, non-probabilistic methods (LSA and direct image matching) outperformed PLSA on the same dataset.

289 citations
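One way to read the LSA annotation strategy described above is as keyword propagation in a joint keyword/visual-term space: training "documents" mix annotation keywords with quantized visual tokens, and a test image (visual tokens only) inherits keywords from its nearest training images in the latent space. The sketch below is an assumption-laden illustration with made-up "visterm" tokens, not the paper's exact pipeline.

```python
# Toy keyword propagation through a latent space shared by keywords and visterms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

train_docs = [
    "sky water boat v12 v37 v88",      # annotation keywords + visual "visterms"
    "sky mountain snow v12 v53 v90",
    "grass tiger tree v07 v41 v88",
]
keywords = {"sky", "water", "boat", "mountain", "snow", "grass", "tiger", "tree"}

vec = CountVectorizer()
X = vec.fit_transform(train_docs)
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)                      # training images in latent space

test_doc = "v12 v37 v90"                      # unannotated image: visterms only
z = svd.transform(vec.transform([test_doc]))  # fold the test image in
nearest = cosine_similarity(z, Z)[0].argmax()
predicted = [w for w in train_docs[nearest].split() if w in keywords]
print(predicted)
```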


Proceedings ArticleDOI
12 Jul 2003
TL;DR: A construction-inspecific model of multiword expression decomposability based on latent semantic analysis is presented, and evidence is furnished for the calculated similarities being correlated with the semantic relational content of WordNet.
Abstract: This paper presents a construction-inspecific model of multiword expression decomposability based on latent semantic analysis. We use latent semantic analysis to determine the similarity between a multiword expression and its constituent words, and claim that higher similarities indicate greater decomposability. We test the model over English noun-noun compounds and verb-particles, and evaluate its correlation with similarities and hyponymy values in WordNet. Based on mean hyponymy over partitions of data ranked on similarity, we furnish evidence for the calculated similarities being correlated with the semantic relational content of WordNet.

239 citations
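A minimal sketch of the decomposability test described above, under toy assumptions: the multiword expression is treated as a single token, an LSA term space is built, and the expression's cosine similarity to each constituent word is read as evidence that the constituent contributes its simplex meaning. The corpus and the compound are invented examples.

```python
# Compare a compound's latent vector with those of its constituent words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "paid the bill with a credit_card at the shop",
    "the bank issued a new credit_card with a low limit",
    "credit from the bank helped pay the bill",
    "the card was declined at the shop",
]
vec = CountVectorizer(token_pattern=r"[a-z_]+")   # keep credit_card as one token
X = vec.fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
T = svd.components_.T
idx = vec.vocabulary_

def sim(a, b):
    return cosine_similarity(T[idx[a]].reshape(1, -1),
                             T[idx[b]].reshape(1, -1))[0, 0]

# Higher constituent similarity -> the compound is judged more decomposable.
print("credit_card ~ credit:", sim("credit_card", "credit"))
print("credit_card ~ card:  ", sim("credit_card", "card"))
```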


Journal ArticleDOI
TL;DR: Three experiments were conducted to examine whether spatial iconicity affects semantic-relatedness judgments and showed that this effect did not occur when the words were presented horizontally, thus ruling out that the iconicity effect is due to the order in which the words are read.
Abstract: Three experiments were conducted to examine whether spatial iconicity affects semantic-relatedness judgments. Subjects made speeded decisions with regard to whether members of a simultaneously presented word pair were semantically related. In Experiment 1, the words were presented one above the other. In the experimental pair, the words denoted parts of larger objects (e.g., ATTIC-BASEMENT). The words were either in an iconic relation with their referents (e.g., ATTIC presented above BASEMENT) or in a reverse-iconic relation (BASEMENT above ATTIC). The reverse-iconic condition yielded significantly slower semantic-relatedness judgments than did the iconic condition. Experiments 2 and 3 showed that this effect did not occur when the words were presented horizontally, thus ruling out that the iconicity effect is due to the order in which the words are read. Two alternative explanations for this finding are discussed.

216 citations


Journal ArticleDOI
TL;DR: The authors used Latent Semantic Analysis (LSA) to estimate the semantic similarity between readers' think-aloud protocols to focal sentences and sentences in the stories that provided direct causal antecedents to the focal sentences.
Abstract: The viability of assessing reading strategies is studied based on think-aloud protocols combined with Latent Semantic Analysis (LSA). Readers in two studies thought aloud after reading specific focal sentences embedded in two stories. LSA was used to estimate the semantic similarity between readers' think-aloud protocols to the focal sentences and sentences in the stories that provided direct causal antecedents to the focal sentences. Study 1 demonstrated that according to human- and LSA-based assessments of the protocols, the responses of less-skilled readers semantically overlapped more with the focal sentences than with the causal antecedent sentences, whereas the responses of skilled readers overlapped with these sentences equally. In addition, the extent that the semantic overlap with causal antecedents was greater than the overlap with the focal sentences predicted performance on comprehension test questions and the Nelson-Denny test of reading skill. Study 2 replicated these findings and also demon...

174 citations
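The core LSA comparison above can be illustrated compactly: fold a reader's think-aloud protocol into a latent space built from the story, then compare its cosine to the focal sentence with its cosine to the causal-antecedent sentence. The story and protocol below are invented examples, not the studies' materials.

```python
# Cosine of a think-aloud protocol to the focal vs. causal-antecedent sentence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

story_sentences = [
    "the storm knocked the power lines down during the night",   # causal antecedent
    "the family lit candles and waited in the dark",
    "in the morning the food in the fridge had spoiled",         # focal sentence
    "they drove to town to buy ice and fresh groceries",
]
focal, antecedent = story_sentences[2], story_sentences[0]
protocol = "the power was out all night so nothing stayed cold"

vec = TfidfVectorizer()
X = vec.fit_transform(story_sentences)
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

def fold(text):
    return svd.transform(vec.transform([text]))

sim_focal = cosine_similarity(fold(protocol), fold(focal))[0, 0]
sim_antecedent = cosine_similarity(fold(protocol), fold(antecedent))[0, 0]
# Skilled readers are expected to overlap with the antecedent roughly as much
# as with the focal sentence; less-skilled readers mostly echo the focal sentence.
print(sim_focal, sim_antecedent)
```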


Proceedings ArticleDOI
27 May 2003
TL;DR: An unsupervised algorithm for placing unknown words into a taxonomy is described and its accuracy is evaluated on a large and varied sample of words; automatic filtering using the class-labelling algorithm is shown to give a fourfold improvement in accuracy.
Abstract: This paper describes an unsupervised algorithm for placing unknown words into a taxonomy and evaluates its accuracy on a large and varied sample of words. The algorithm works by first using a large corpus to find semantic neighbors of the unknown word, which we accomplish by combining latent semantic analysis with part-of-speech information. We then place the unknown word in the part of the taxonomy where these neighbors are most concentrated, using a class-labelling algorithm developed especially for this task. This method is used to reconstruct parts of the existing WordNet database, obtaining results for common nouns, proper nouns and verbs. We evaluate the contribution made by part-of-speech tagging and show that automatic filtering using the class-labelling algorithm gives a fourfold improvement in accuracy.

135 citations
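A hedged sketch of the two-step idea above: (1) find latent-space neighbours of the unknown word, (2) assign it the taxonomy class in which those neighbours are most concentrated. A tiny hand-made class map stands in for WordNet, and a simple majority vote stands in for the paper's class-labelling algorithm; the corpus is invented.

```python
# Place an unknown word by majority class of its nearest labelled neighbours.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat and the dog chased the hamster around the garden",
    "a hamster is a small pet kept in a cage",
    "the violin and the cello played with the oboe in the hall",
    "the oboe is a wind instrument used in the orchestra",
]
class_of = {"cat": "animal", "dog": "animal", "pet": "animal",
            "violin": "instrument", "cello": "instrument", "oboe": "instrument"}
unknown = "hamster"

vec = CountVectorizer()
X = vec.fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
T = svd.components_.T
idx = vec.vocabulary_

u = T[idx[unknown]].reshape(1, -1)
sims = {w: cosine_similarity(u, T[idx[w]].reshape(1, -1))[0, 0] for w in class_of}
top = sorted(sims, key=sims.get, reverse=True)[:3]       # most similar labelled words
print(Counter(class_of[w] for w in top).most_common(1))  # predicted class
```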


Patent
14 May 2003
TL;DR: In this paper, a method and apparatus for generating more natural-sounding speech is presented: word prominence and latent semantic analysis are used to determine whether information in the current sentence is new or previously given, and a word prominence is assigned to each word in the current sentence in accordance with that determination.
Abstract: A method and apparatus is provided for generating speech that sounds more natural. In one embodiment, word prominence and latent semantic analysis are used to generate more natural sounding speech. A method for generating speech that sounds more natural may comprise generating synthesized speech having certain word prominence characteristics and applying a semantically-driven word prominence assignment model to specify word prominence consistent with the way humans assign word prominence. A speech representative of a current sentence is generated. The determination is made whether information in the current sentence is new or previously given in accordance with a semantic relationship between the current sentence and a number of preceding sentences. A word prominence is assigned to a word in the current sentence in accordance with the information determination.

125 citations


Proceedings ArticleDOI
31 May 2003
TL;DR: Syntactically Enhanced LSA (SELSA) is presented, an approach which generalizes LSA by considering a word along with its syntactic neighborhood, given by the part-of-speech tag of its preceding word, as the unit of knowledge representation; it provides better discrimination of syntactic-semantic knowledge representation than LSA.
Abstract: Latent semantic analysis (LSA) has been used in several intelligent tutoring systems (ITSs) for assessing students' learning by evaluating their answers to questions in the tutoring domain. It is based on word-document co-occurrence statistics in the training corpus and a dimensionality reduction technique. However, it doesn't consider the word-order or syntactic information, which can improve the knowledge representation and therefore lead to better performance of an ITS. We present here an approach called Syntactically Enhanced LSA (SELSA) which generalizes LSA by considering a word along with its syntactic neighborhood, given by the part-of-speech tag of its preceding word, as a unit of knowledge representation. The experimental results on the AutoTutor task of evaluating students' answers to basic computer science questions by SELSA, and its comparison with LSA, are presented in terms of several cognitive measures. SELSA is able to correctly evaluate a few more answers than LSA, but has a lower correlation with human evaluators than LSA does. It also provides better discrimination of syntactic-semantic knowledge representation than LSA.

95 citations
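A rough sketch of the SELSA representation described above: the unit of analysis is a word prefixed with the part-of-speech tag of the word preceding it, and LSA is then run over these units instead of plain words. The sentences below are hand-tagged toy examples (no tagger is invoked) and the tag set is illustrative.

```python
# Build (previous-POS-tag, word) units and run a small SVD over them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

tagged_docs = [
    [("the", "DT"), ("cpu", "NN"), ("executes", "VBZ"), ("instructions", "NNS")],
    [("the", "DT"), ("ram", "NN"), ("stores", "VBZ"), ("data", "NNS")],
    [("programs", "NNS"), ("reside", "VBP"), ("in", "IN"), ("memory", "NN")],
]

def selsa_units(tagged):
    units, prev_tag = [], "BOS"            # sentence-initial pseudo-tag
    for word, tag in tagged:
        units.append(f"{prev_tag}_{word}")
        prev_tag = tag
    return " ".join(units)

docs = [selsa_units(d) for d in tagged_docs]
vec = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X = vec.fit_transform(docs)                # documents x (prev-tag, word) units
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(sorted(vec.vocabulary_)[:5])         # e.g. units like 'DT_cpu'
print(Z.shape)
```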


Patent
04 Jun 2003
TL;DR: In this paper, an automatic speech recognition and semantic categorization system is used to convert unstructured voice input into structured data that can then be used to access one or more databases to retrieve associated supplemental data.
Abstract: Unstructured voice information from an incoming caller is processed by automatic speech recognition and semantic categorization system to convert the information into structured data that may then be used to access one or more databases to retrieve associated supplemental data. The structured data and associated supplemental data are then made available through a presentation system that provides information to the call center agent and, optionally, to the incoming caller. The system thus allows a call center information processing system to handle unstructured voice input for use by the live agent in handling the incoming call and for storage and retrieval at a later time. The semantic analysis system may be implemented by a global parser or by an information retrieval technique, such as latent semantic analysis. Co-occurrence of keywords may be used to associate prior calls with an incoming call to assist in understanding the purpose of the incoming call.

95 citations


Journal ArticleDOI
TL;DR: This article examines the application of LSA to automated essay scoring, compares LSA methods to earlier statistical methods for assessing essay quality, and critically reviews contemporary essay-scoring systems built on LSA.
Abstract: Latent semantic analysis (LSA) is an automated, statistical technique for comparing the semantic similarity of words or documents. In this article, I examine the application of LSA to automated essay scoring. I compare LSA methods to earlier statistical methods for assessing essay quality, and critically review contemporary essay-scoring systems built on LSA, including the Intelligent Essay Assessor, Summary Street, State the Essence, Apex, and Select-a-Kibitzer. Finally, I discuss current avenues of research, including LSA's application to computer-measured readability assessment and to automatic summarization of student essays.

92 citations
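One common LSA essay-scoring recipe (not necessarily the exact method of any system named above) can be sketched briefly: project a bank of pre-scored essays and the new essay into a latent space, then assign the new essay a score derived from its most similar scored essays. The essays and scores below are toy placeholders.

```python
# Nearest-neighbour essay scoring in a small LSA space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

scored_essays = [
    ("photosynthesis converts light energy into chemical energy in plants", 5),
    ("plants use sunlight water and carbon dioxide to make sugar and oxygen", 4),
    ("plants are green and grow in the ground", 2),
    ("i like plants because they look nice", 1),
]
new_essay = "using sunlight plants turn carbon dioxide and water into sugar"

texts = [t for t, _ in scored_essays]
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)
z = svd.transform(vec.transform([new_essay]))

sims = cosine_similarity(z, Z)[0]
k = 2                                       # average the scores of the k nearest essays
nearest = np.argsort(-sims)[:k]
print(np.mean([scored_essays[i][1] for i in nearest]))
```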


Book ChapterDOI
22 Sep 2003
TL;DR: This paper introduces an approach for semantically enhanced collaborative filtering in which structured semantic knowledge about items, extracted automatically from the Web based on domain-specific reference ontologies, is used in conjunction with user-item mappings to create a combined similarity measure and generate predictions.
Abstract: Item-based Collaborative Filtering (CF) algorithms have been designed to deal with the scalability problems associated with traditional user-based CF approaches without sacrificing recommendation or prediction accuracy. Item-based algorithms avoid the bottleneck in computing user-user correlations by first considering the relationships among items and performing similarity computations in a reduced space. Because the computation of item similarities is independent of the methods used for generating predictions, multiple knowledge sources, including structured semantic information about items, can be brought to bear in determining similarities among items. The integration of semantic similarities for items with rating- or usage-based similarities allows the system to make inferences based on the underlying reasons for which a user may or may not be interested in a particular item. Furthermore, in cases where little or no rating (or usage) information is available (such as in the case of newly added items, or in very sparse data sets), the system can still use the semantic similarities to provide reasonable recommendations for users. In this paper, we introduce an approach for semantically enhanced collaborative filtering in which structured semantic knowledge about items, extracted automatically from the Web based on domain-specific reference ontologies, is used in conjunction with user-item mappings to create a combined similarity measure and generate predictions. Our experimental results demonstrate that the integrated approach yields significant advantages both in terms of improving accuracy, as well as in dealing with very sparse data sets or new items.
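A hedged sketch of the combined-similarity idea above: item-item similarity is a linear blend of a rating-based similarity and a semantic similarity derived from item attributes, and predictions are the usual item-based weighted average. The ratings, attributes, and blend weight alpha are invented values, not the paper's data or exact formulation.

```python
# Item-based prediction with a blended semantic + rating-based similarity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, cols = items; 0 means "not rated" (treated as missing at prediction time)
R = np.array([[5, 4, 0, 1],
              [4, 0, 5, 2],
              [1, 2, 1, 5]], dtype=float)
# rows = items, cols = binary semantic attributes (e.g. ontology-derived features)
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

rating_sim = cosine_similarity(R.T)        # item-item similarity from co-ratings
semantic_sim = cosine_similarity(A)        # item-item similarity from attributes
alpha = 0.5                                # blend weight (a tunable assumption)
combined = alpha * semantic_sim + (1 - alpha) * rating_sim

def predict(user, item):
    rated = np.nonzero(R[user])[0]
    rated = rated[rated != item]
    w = combined[item, rated]
    return float(np.dot(w, R[user, rated]) / (np.abs(w).sum() + 1e-9))

print(predict(user=0, item=2))             # predict user 0's rating for item 2
```

Because the semantic part of the blend does not depend on ratings, the same code still produces a prediction when an item has few or no ratings, which is the sparse-data/new-item case the abstract highlights.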

Journal Article
TL;DR: In this article, a comparison of the performance of a number of text categorization methods on two different data sets is presented, with results reported using the Mean Reciprocal Rank, a commonly used evaluation measure for question answering tasks, as the measure of overall performance.
Abstract: In this paper we present a comprehensive comparison of the performance of a number of text categorization methods in two different data sets. In particular, we evaluate the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of the Vector and LSA models. We report the results obtained using the Mean Reciprocal Rank as a measure of overall performance, a commonly used evaluation measure for question answering tasks. We argue that this evaluation measure is also very well suited for text categorization tasks. Our results show that overall, SVMs and k-NN LSA perform better than the other methods, in a statistically significant way.
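The Mean Reciprocal Rank measure used above has a simple form: for each test document, take the reciprocal of the rank at which the correct category first appears in the system's ranked output, then average over documents. The rankings below are made-up examples.

```python
# Mean Reciprocal Rank over a list of ranked category predictions.
def mean_reciprocal_rank(ranked_lists, gold_labels):
    rr = []
    for ranking, gold in zip(ranked_lists, gold_labels):
        rank = ranking.index(gold) + 1 if gold in ranking else None
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

rankings = [["sports", "politics", "tech"],     # gold at rank 1 -> RR 1.0
            ["tech", "sports", "politics"],     # gold at rank 2 -> RR 0.5
            ["politics", "tech", "sports"]]     # gold at rank 3 -> RR 1/3
print(mean_reciprocal_rank(rankings, ["sports", "sports", "sports"]))  # ~0.611
```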

Journal ArticleDOI
TL;DR: Two issues are discussed that researchers must attend to when evaluating the utility of LSA for predicting psychological phenomena, and LSA indices of similarity should be derived from theoretical analysis of the processes involved in understanding two conflicting accounts of a historical event.
Abstract: Latent semantic analysis (LSA) is a computational model of human knowledge representation that approximates semantic relatedness judgments. Two issues are discussed that researchers must attend to when evaluating the utility of LSA for predicting psychological phenomena. First, the role of semantic relatedness in the psychological process of interest must be understood. LSA indices of similarity should then be derived from this theoretical understanding. Second, the knowledge base (semantic space) from which similarity indices are generated must contain 'knowledge' that is appropriate to the task at hand. Proposed solutions are illustrated with data from an experiment in which LSA-based indices were generated from theoretical analysis of the processes involved in understanding two conflicting accounts of a historical event. These indices predict the complexity of subsequent student reasoning about the event, as well as hand-coded predictions generated from think-aloud protocols collected when students were reading the accounts of the event.

Book ChapterDOI
08 Oct 2003
TL;DR: A comprehensive comparison of the performance of a number of text categorization methods in two different data sets is presented, in particular, the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of theVector and LSA models.
Abstract: In this paper we present a comprehensive comparison of the performance of a number of text categorization methods in two different data sets. In particular, we evaluate the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of the Vector and LSA models.

Book ChapterDOI
Eric D. Brill1
16 Feb 2003
TL;DR: Recent work in a number of areas, including grammar checker development, automatic question answering, and language modeling, achieves state-of-the-art accuracy using very simple methods, suggesting that the field of NLP might benefit by concentrating less on technology development and more on data acquisition.
Abstract: We can still create computer programs displaying only the most rudimentary natural language processing capabilities. One of the greatest barriers to advanced natural language processing is our inability to overcome the linguistic knowledge acquisition bottleneck. In this paper, we describe recent work in a number of areas, including grammar checker development, automatic question answering, and language modeling, where state of the art accuracy is achieved using very simple methods whose power comes entirely from the plethora of text currently available to these systems, as opposed to deep linguistic analysis or the application of state of the art machine learning techniques. This suggests that the field of NLP might benefit by concentrating less on technology development and more on data acquisition.

Journal ArticleDOI
Susan T. Dumais1
TL;DR: This paper summarizes three lines of research that are motivated by the practical problem of helping users find information from external data sources, most notably computers, in the belief that solutions to practical information access problems can shed light on human knowledge representation and reasoning.

Book ChapterDOI
14 Sep 2003
TL;DR: A model called HALe is proposed which automatically derives dimensional representations of words in a high-dimensional context space from an email corpus; these representations are used to discover a network of people based on a seed contextual description.
Abstract: This paper is about finding explicit and implicit connections between people by mining semantic associations from their email communications. Following from a socio-cognitive stance, we propose a model called HALe which automatically derives dimensional representations of words in a high dimensional context space from an email corpus. These dimensional representations are used to discover a network of people based on a seed contextual description. Such a network represents useful connections between people not easily achievable by 'normal' retrieval means. Implicit connections are "lifted" by applying latent semantic analysis to the high dimensional context space. The discovery techniques are applied to a substantial corpus of real-life email utterance drawn from a small-to-medium size information technology organization. The techniques are computationally tractable, and evidence is presented that suggests appropriate explicit connections are being brought to light, as well as interesting, and perhaps serendipitous implicit connections. The ultimate goal of such techniques is to bring to light context-sensitive, ephemeral, and often hidden relationships between people, and between people and information, which pervade the enterprise.

Journal ArticleDOI
TL;DR: A new model (EMMA: the environmental model of analogy) which relies on co-occurrence information provided by LSA (Latent Semantic Analysis) to ground the relations between the symbolic elements aligned in analogy, and demonstrates that the environmental approach to semantics embodied in LSA can produce appropriate patterns of analogical retrieval, but that this semantic measure is not sufficient to model analogical mapping.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: By examining the number of times term t is identified for a search on term t' (precision) using differing ranges of dimensions, it is found that lower ranked dimensions identify related terms and higher-ranked dimensions discriminate between the synonyms.
Abstract: We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword. This method is demonstrated empirically by duplicating all documents containing a term t, and inserting new documents in the database that replace t with t'. By examining the number of times term t is identified for a search on term t' (precision) using differing ranges of dimensions, we find that lower ranked dimensions identify related terms and higher-ranked dimensions discriminate between the synonyms.
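The probing idea above can be sketched as follows: documents containing a term t are duplicated with t replaced by an artificial synonym t2, and retrieval of the original t-documents for a query on t2 is inspected while restricting the latent representation to different bands of singular dimensions. The corpus and the band choices below are toy values, not the paper's experimental setup.

```python
# Query a tiny LSI model using only selected bands of singular dimensions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the engine of the car needs oil",
    "the car engine overheated on the road",
    "fresh bread and cheese at the market",
    "the market sells fruit and bread",
]
# duplicate the engine-documents with "engine" -> "motor" (the synthetic synonym)
docs += [d.replace("engine", "motor") for d in docs if "engine" in d]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().T          # terms x documents
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def doc_scores(query_term, dims):
    """Rank documents for a one-word query using only the given dimension band."""
    q = np.zeros(X.shape[0])
    q[vec.vocabulary_[query_term]] = 1.0
    q_hat = (U[:, dims].T @ q) / s[dims]          # fold the query into the band
    D = Vt[dims, :]                               # document coordinates in the band
    return D.T @ q_hat                            # one score per document

low, high = [0, 1], [2, 3]                        # two dimension bands to compare
print(np.argsort(-doc_scores("motor", low))[:3])  # ranking from the lowest dimensions
print(np.argsort(-doc_scores("motor", high))[:3]) # ranking from the next dimensions
```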

Book ChapterDOI
01 Jun 2003
TL;DR: Using a new type of memory compaction mechanism for data mining in vitro, DNA-based semantic retrieval compares favorably with statistically-based Latent Semantic Analysis (LSA), one of the best performers for semantic associative-based retrieval on text corpora.
Abstract: Associative memories based on DNA-affinity have been proposed. Here, the efficiency, reliability, and semantic capability for associative retrieval of three models of a DNA-based memory are quantified and compared to current conventional methods. In affinity-based memories [1], retrievals and deletions under stringent conditions occur reliably (98%) within very short times (100 milliseconds), regardless of the degree of stringency of the recall or the number of simultaneous queries in the input. In a more sophisticated type of DNA-based memory B, proposed and experimentally verified by Chen et al. [2] with three genomes, the sensitivity of the discrimination ability remains unchanged when used on a library of 18 plasmids in the range of 1-4kbps and does appear to grow exponentially with the number of library strands used, even under simultaneous multiple queries in the same input. Finally, using a new type of memory compaction mechanism for data mining in vitro, DNA-based semantic retrieval compares favorably with statistically-based Latent Semantic Analysis (LSA), one of the best performers for semantic associative-based retrieval on text corpora.

01 Jan 2003
TL;DR: The results show that while linguistic processing has a substantial influence on LSA performance, the traditional factors are even more important; the study therefore could not show that linguistic pre-processing substantially improves text categorisation.
Abstract: The paper presents on-going work towards deeper understanding of the factors influencing the performance of Latent Semantic Analysis (LSA). Unlike previous attempts that concentrate on problems such as matrix element weighting, space dimensionality selection, similarity measure etc., we primarily study the impact of another, often neglected, but fundamental element of LSA (and of any text processing technique): the definition of “word”. For this purpose, a balanced corpus of Bulgarian newspaper texts was carefully created to allow for in-depth observations of LSA performance, and a series of experiments was performed in order to understand and compare (with respect to the task of text categorisation) six possible inputs with different levels of linguistic quality, including: graphemic form as met in the text, stem, lemma, phrase, lemma&phrase and part-of-speech annotation. In addition to LSA, we made comparisons to the standard vector-space model, without any dimensionality reduction. The results show that while the linguistic processing has a substantial influence on the LSA performance, the traditional factors are even more important, and therefore we did not prove that the linguistic pre-processing substantially improves text categorisation.

Proceedings ArticleDOI
27 May 2003
TL;DR: These experiments in applying Latent Semantic Analysis (LSA) to dialogue act classification employ both LSA proper and LSA augmented in two ways, and report results on DIAG, the authors' own corpus of tutoring dialogues, and on the CallHome Spanish corpus.
Abstract: This paper presents our experiments in applying Latent Semantic Analysis (LSA) to dialogue act classification. We employ both LSA proper and LSA augmented in two ways. We report results on DIAG, our own corpus of tutoring dialogues, and on the CallHome Spanish corpus. Our work has the theoretical goal of assessing whether LSA, an approach based only on raw text, can be improved by using additional features of the text.
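A rough sketch of using plain LSA for dialogue act classification (the paper also evaluates augmented variants not shown here): project utterances into a latent space, represent each dialogue act by the centroid of its training utterances, and label a new utterance by the nearest centroid. The labels and utterances below are invented.

```python
# Nearest-centroid dialogue act classification in a small LSA space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

train = [
    ("what does the indicator light mean", "question"),
    ("why is the battery not charging", "question"),
    ("the light turns red when the circuit is open", "explanation"),
    ("the battery charges only when the switch is closed", "explanation"),
    ("ok i see thanks", "acknowledgement"),
    ("right that makes sense", "acknowledgement"),
]
texts, labels = zip(*train)
vec = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(vec.fit_transform(texts))

acts = sorted(set(labels))
centroids = np.vstack([Z[[i for i, l in enumerate(labels) if l == a]].mean(axis=0)
                       for a in acts])

utterance = "how do i know the circuit is closed"
z = svd.transform(vec.transform([utterance]))
print(acts[int(cosine_similarity(z, centroids).argmax())])
```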

01 Jan 2003
TL;DR: It is shown that, for both humans and the model, metaphors take longer to process than literal meanings, and that an inductive context can shorten the processing time.
Abstract: This paper presents a computational model of referential metaphor comprehension. This model is designed on top of Latent Semantic Analysis (LSA), a model of the representation of word and text meanings. Comprehending a referential metaphor consists in scanning the semantic neighbors of the metaphor in order to find words that are also semantically related to the context. The depth of that search is compared to the time it takes for humans to process a metaphor. In particular, we are interested in two independent variables: the nature of the reference (either a literal meaning or a figurative meaning) and the nature of the context (inductive or not inductive). We show that, for both humans and model, first, metaphors take longer to process than the literal meanings and second, an inductive context can shorten the processing time.


Posted Content
TL;DR: In this paper, the authors introduce a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words, based on pointwise mutual information (PMI) and latent semantic analysis (LSA).
Abstract: The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This paper introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words.

Journal ArticleDOI
TL;DR: The concept of data-driven semantic inference is introduced, which in principle allows for any word constructs in command/query formulation, so that it is no longer necessary for users to memorize the exact syntax of every command.
Abstract: Spoken interaction tasks are typically approached using a formal grammar as language model. While ensuring good system performance, this imposes a rigid framework on users, by implicitly forcing them to conform to a pre-defined interaction structure. This paper introduces the concept of data-driven semantic inference, which in principle allows for any word constructs in command/query formulation. Each unconstrained word string is automatically mapped onto the intended action through a semantic classification against the set of supported actions. As a result, it is no longer necessary for users to memorize the exact syntax of every command. The underlying (latent semantic analysis) framework relies on co-occurrences between words and commands, as observed in a training corpus. A suitable extension can also handle commands that are ambiguous at the word level. The behavior of semantic inference is characterized using a desktop user interface control task involving 113 different actions. Under realistic usage conditions, this approach exhibits a 2 to 5% classification error rate. Various training scenarios of increasing scope are considered to assess the influence of coverage on performance. Sufficient semantic knowledge about the task domain is found to be captured at a level of coverage as low as 70%. This illustrates the good generalization properties of semantic inference.
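A hedged sketch of the mapping step described above: each supported action is represented by the training phrasings observed for it, a latent space captures word/command co-occurrences, and an unconstrained user request is classified to the closest action. The three actions and their phrasings below are invented stand-ins for the paper's 113-action desktop task.

```python
# Map an unconstrained request onto the nearest supported action in LSA space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

training = {
    "empty_trash":     ["empty the trash", "throw away everything in the trash"],
    "new_folder":      ["make a new folder", "create an empty folder here"],
    "take_screenshot": ["take a screenshot", "capture the screen as an image"],
}
actions, docs = zip(*[(a, " ".join(ph)) for a, ph in training.items()])

vec = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(vec.fit_transform(docs))    # one pseudo-document per action

request = "please get rid of what is in the trash"   # unconstrained wording
z = svd.transform(vec.transform([request]))
print(actions[int(cosine_similarity(z, Z).argmax())])
```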

Proceedings Article
09 Aug 2003
TL;DR: A new LSA algorithm significantly improves the precision of AutoTutor's natural language understanding and can be applied to other natural language understanding applications.
Abstract: The intelligent tutoring system AutoTutor uses latent semantic analysis to evaluate student answers to the tutor's questions. By comparing a student's answer to a set of expected answers, the system determines how much information is covered and how to continue the tutorial. Despite the success of LSA in tutoring conversations, the system sometimes has difficulties determining at an early stage whether or not an expectation is covered. A new LSA algorithm significantly improves the precision of AutoTutor's natural language understanding and can be applied to other natural language understanding applications.

Journal ArticleDOI
TL;DR: The effectiveness of a domain-specific latent semantic analysis (LSA) in assessing reading strategies was examined, and the science LSA space correlated highly with human judgments, and more highly than did the general reading space.
Abstract: The effectiveness of a domain-specific latent semantic analysis (LSA) in assessing reading strategies was examined. Students were given self-explanation reading training (SERT) and asked to think aloud after each sentence in a science text. Novice and expert human raters and two LSA spaces (general reading, science) rated the similarity of each think-aloud protocol to benchmarks representing three different reading strategies (minimal, local, and global). The science LSA space correlated highly with human judgments, and more highly than did the general reading space. Also, cosines from the science LSA spaces can distinguish between different levels of semantic similarity, but may have trouble in distinguishing local processing protocols. Thus, a domain-specific LSA space is advantageous regardless of the size of the space. The results are discussed in the context of applying the science LSA to a computer-based version of SERT that gives online feedback based on LSA cosines.

Proceedings ArticleDOI
30 Nov 2003
TL;DR: The underlying framework is latent semantic analysis, in which each e-mail is classified against two semantic anchors; experiments show that this approach is competitive with the state of the art in e-mail classification, and potentially advantageous in real-world applications with high junk-to-legitimate ratios.
Abstract: The explosion in unsolicited mass electronic mail (junk e-mail) over the past decade has sparked interest in automatic filtering solutions. Traditional techniques tend to rely on header analysis, keyword/keyphrase matching and analogous rule-based predicates, and/or some probabilistic model of text generation. This paper aims instead at deciding whether or not the latent subject matter is consistent with the user's interests. The underlying framework is latent semantic analysis: each e-mail is automatically classified against two semantic anchors, one for legitimate and one for junk messages. Experiments show that this approach is competitive with the state-of-the-art in e-mail classification, and potentially advantageous in real-world applications with high junk-to-legitimate ratios. The resulting technology was successfully released in August 2002 as part of the e-mail client bundled with the MacOS 10.2 operating system.
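An illustrative sketch of the two-anchor scheme described above, under toy assumptions: one semantic anchor is built from legitimate training messages and one from junk, and a new message is assigned to whichever anchor it is closer to in the latent space. The messages below are invented examples, not the deployed system's data.

```python
# Classify a message against a "legitimate" anchor and a "junk" anchor.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

legit = ["meeting moved to three pm tomorrow",
         "draft of the report attached for review",
         "lunch on friday with the project team"]
junk = ["win a free prize claim your reward now",
        "cheap loans approved instantly no credit check",
        "limited offer buy now and win big"]

vec = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(vec.fit_transform(legit + junk))
anchor_legit = Z[:len(legit)].mean(axis=0, keepdims=True)
anchor_junk = Z[len(legit):].mean(axis=0, keepdims=True)

def classify(message):
    z = svd.transform(vec.transform([message]))
    s_legit = cosine_similarity(z, anchor_legit)[0, 0]
    s_junk = cosine_similarity(z, anchor_junk)[0, 0]
    return "junk" if s_junk > s_legit else "legitimate"

print(classify("claim your free reward now"))
print(classify("the report for the meeting is attached"))
```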

Proceedings ArticleDOI
27 May 2003
TL;DR: An investigation into the use of LSA in language modeling for conversational speech recognition finds that previously proposed methods of combining an LSA-based unigram model with an N-gram model yield much smaller reductions in perplexity on speech transcriptions than has been reported on written text.
Abstract: Latent semantic analysis (LSA), first exploited in indexing documents for information retrieval, has since been used by several researchers to demonstrate impressive reductions in the perplexity of statistical language models on text corpora such as the Wall Street Journal. In this paper we present an investigation into the use of LSA in language modeling for conversational speech recognition. We find that previously proposed methods of combining an LSA-based unigram model with an N-gram model yield much smaller reductions in perplexity on speech transcriptions than has been reported on written text. We next present a family of exponential models in which LSA similarity is a feature of a word-history pair. The maximum entropy model in this family yields a greater reduction in perplexity, and statistically significant improvements in recognition accuracy over a trigram model on the Switchboard corpus. We conclude with a comparison of this LSA-featured model with a previously proposed topic-dependent maximum entropy model.