
Showing papers on "Latent semantic analysis published in 2009"


Journal ArticleDOI
TL;DR: In this paper, a deep graphical model of the word-count vectors obtained from a large set of documents is proposed; the values of the deepest layer are used as compact codes, so that semantically similar documents receive similar codes and can be retrieved efficiently.

1,266 citations


Journal ArticleDOI
TL;DR: The thesis of the paper is that hyperdimensional representation has much to offer to students of cognitive science, theoretical neuroscience, computer science and engineering, and mathematics.
Abstract: The 1990s saw the emergence of cognitive models that depend on very high dimensionality and randomness. They include Holographic Reduced Representations, Spatter Code, Semantic Vectors, Latent Semantic Analysis, Context-Dependent Thinning, and Vector-Symbolic Architecture. They represent things in high-dimensional vectors that are manipulated by operations that produce new high-dimensional vectors in the style of traditional computing, in what is called here hyperdimensional computing on account of the very high dimensionality. The paper presents the main ideas behind these models, written as a tutorial essay in hopes of making the ideas accessible and even provocative. A sketch of how we have arrived at these models, with references and pointers to further reading, is given at the end. The thesis of the paper is that hyperdimensional representation has much to offer to students of cognitive science, theoretical neuroscience, computer science and engineering, and mathematics.

761 citations


Journal ArticleDOI
TL;DR: A review of the available methodologies for derivation of semantic relatedness from free text, as well as their evaluation in a variety of biomedical and other applications, can be found in this article.

200 citations


Journal ArticleDOI
TL;DR: This work evaluates a simple metric of pointwise mutual information and demonstrates that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models.
Abstract: Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model’s cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a simple tool for building simple and scalable models from large corpora quickly and efficiently.
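
As a rough illustration of the kind of metric being evaluated, the sketch below computes pointwise mutual information between word pairs from windowed co-occurrence counts; the window size, tokenization, and toy corpus are illustrative assumptions, not the paper's setup.

```python
# A sketch of pointwise mutual information (PMI) between word pairs,
# estimated from co-occurrence counts within a fixed context window.
import math
from collections import Counter

def pmi_scores(sentences, window=5):
    """sentences: iterable of token lists. Returns {(w1, w2): PMI}."""
    word_counts = Counter()
    pair_counts = Counter()
    total_pairs = 0
    for tokens in sentences:
        word_counts.update(tokens)
        for i in range(len(tokens)):
            for other in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((tokens[i], other)))] += 1
                total_pairs += 1
    total_words = sum(word_counts.values())
    scores = {}
    for (w1, w2), c in pair_counts.items():
        p_xy = c / total_pairs
        p_x = word_counts[w1] / total_words
        p_y = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# toy usage
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
print(pmi_scores(corpus).get(("cat", "sat")))
```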

153 citations


Journal ArticleDOI
TL;DR: In this article, non-negative matrix factorization (NMF) is used to select sentences for automatic generic document summarization; its non-negative constraints are more similar to the human cognition process.
Abstract: In existing unsupervised methods, Latent Semantic Analysis (LSA) is used for sentence selection. However, the obtained results are less meaningful, because singular vectors are used as the bases for sentence selection from given documents, and singular vector components can have negative values. We propose a new unsupervised method using Non-negative Matrix Factorization (NMF) to select sentences for automatic generic document summarization. The proposed method uses non-negative constraints, which are more similar to the human cognition process. As a result, the method selects more meaningful sentences for generic document summarization than those selected using LSA.
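
A minimal sketch of NMF-based sentence selection in the spirit of the method described; the scoring rule (take the top sentence of each strong topic) and the library choices are assumptions, not the authors' exact algorithm.

```python
# Sketch: NMF-based sentence selection for generic summarization.
# Build a term-by-sentence matrix, factor it into non-negative
# term-topic (W) and topic-sentence (H) matrices, and pick the
# top sentence for the strongest topics.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

def nmf_summary(sentences, n_topics=2, n_sentences=2):
    A = CountVectorizer().fit_transform(sentences).T   # terms x sentences
    model = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    W = model.fit_transform(A)           # terms x topics
    H = model.components_                # topics x sentences
    topic_weight = H.sum(axis=1)         # rough topic importance
    chosen = set()
    for t in np.argsort(-topic_weight):
        chosen.add(int(np.argmax(H[t])))
        if len(chosen) >= n_sentences:
            break
    return [sentences[i] for i in sorted(chosen)]

docs = ["the cat sat on the mat", "dogs chase cats in the yard",
        "stock markets fell sharply today", "investors worried about the markets"]
print(nmf_summary(docs))
```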

141 citations


Proceedings ArticleDOI
31 Mar 2009
TL;DR: A new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis, which allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning.
Abstract: This paper presents a new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis. By comparing the density of semantic vector clusters this method allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning. Possible applications of this method are then illustrated in tracing the semantic change of 'dog', 'do', and 'deer' in early English and examining and comparing phonaesthemes.
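
One hedged sketch of the cluster-density idea: represent each occurrence of a target word by the centroid of its context terms' LSA vectors and compare the average pairwise cosine across corpora. The tokenization, window size, and dimensionality here are illustrative; the paper's statistical inference over densities is not reproduced.

```python
# Sketch: compare the semantic density of a word's contexts.
# Each occurrence of the target word is represented by the centroid of the
# LSA vectors of its context words; a tighter cluster (higher mean pairwise
# cosine) suggests a narrower meaning in that period's corpus.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def context_density(documents, target, window=5, k=50):
    vec = TfidfVectorizer()
    X = vec.fit_transform(documents)                  # docs x terms
    k = min(k, X.shape[1] - 1)
    term_vectors = TruncatedSVD(n_components=k).fit(X).components_.T  # terms x k
    vocab = vec.vocabulary_
    occurrence_vecs = []
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            rows = [term_vectors[vocab[c]] for c in ctx if c in vocab]
            if rows:
                occurrence_vecs.append(np.mean(rows, axis=0))
    if len(occurrence_vecs) < 2:
        return None
    sims = cosine_similarity(np.vstack(occurrence_vecs))
    return sims[np.triu_indices_from(sims, k=1)].mean()
```

Densities computed for two time-sliced corpora could then be compared to judge whether the target word's meaning has narrowed or broadened.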

131 citations


Proceedings Article
11 Jul 2009
TL;DR: This work presents a novel method that leverages data from 'Wikispeedia', an online game played on Wikipedia, that significantly outperforms Latent Semantic Analysis in a psychometric evaluation of the quality of learned semantic distances.
Abstract: Computing the semantic distance between realworld concepts is crucial for many intelligent applications. We present a novel method that leverages data from 'Wikispeedia', an online game played on Wikipedia; players have to reach an article from another, unrelated article, only by clicking links in the articles encountered. In order to automatically infer semantic distances between everyday concepts, our method effectively extracts the common sense displayed by humans during play, and is thus more desirable, from a cognitive point of view, than purely corpus-based methods. We show that our method significantly outperforms Latent Semantic Analysis in a psychometric evaluation of the quality of learned semantic distances.

98 citations


Proceedings Article
11 Jul 2009
TL;DR: This paper compares the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both, and contributes to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.
Abstract: The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many approaches aim at a concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, latent topics derived from the data itself - as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA) - to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model, for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in the context of a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.

92 citations


Journal ArticleDOI
TL;DR: An ensemble approach that integrates LSA and n-gram co-occurrence is proposed; it achieves high accuracy and improves performance quite substantially compared with current techniques.
Abstract: Summary writing is an important part of many English Language Examinations. As grading students' summary writings is a very time-consuming task, computer-assisted assessment will help teachers carry out the grading more effectively. Several techniques such as latent semantic analysis (LSA), n-gram co-occurrence and BLEU have been proposed to support automatic evaluation of summaries. However, their performance is not satisfactory for assessing summary writings. To improve the performance, this paper proposes an ensemble approach that integrates LSA and n-gram co-occurrence. As a result, the proposed ensemble approach is able to achieve high accuracy and improve the performance quite substantially compared with current techniques. A summary assessment system based on the proposed approach has also been developed.
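
A sketch of how such an ensemble could combine the two signals; the equal weighting, the bigram choice, and the TF-IDF-plus-SVD stand-in for LSA are assumptions rather than the paper's tuned configuration.

```python
# Sketch: an ensemble score for a student summary, mixing an LSA-style
# cosine with the reference summary and a bigram-overlap score.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_overlap(candidate, reference, n=2):
    grams = lambda s: {tuple(s.split()[i:i + n]) for i in range(len(s.split()) - n + 1)}
    c, r = grams(candidate), grams(reference)
    return len(c & r) / max(len(r), 1)

def ensemble_score(candidate, reference, background_docs, w=0.5):
    docs = background_docs + [candidate, reference]
    X = TfidfVectorizer().fit_transform(docs)
    k = min(100, X.shape[1] - 1, X.shape[0] - 1)
    Z = TruncatedSVD(n_components=k).fit_transform(X)   # latent semantic space
    lsa_sim = cosine_similarity(Z[-2:-1], Z[-1:])[0, 0]
    return w * lsa_sim + (1 - w) * ngram_overlap(candidate, reference)
```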

80 citations


Journal ArticleDOI
TL;DR: These findings show that language encodes geographical information that language users in turn may use in their understanding of language and the world.

76 citations


Journal ArticleDOI
TL;DR: A novel approach to automatic image annotation which combines global, regional, and contextual features through an extended cross-media relevance model that describes the global features as a distribution vector of visual topics and models the textual context as a multinomial distribution.

Journal ArticleDOI
TL;DR: The results suggest that semantic and syntactic evaluations are the primary components of paraphrase quality, and that computationally light systems such as latent semantic analysis (semantics) and minimal edit distances (syntax) present promising approaches to simulating human evaluations of paraphrases.
Abstract: Two sentences are paraphrases if their meanings are equivalent but their words and syntax are different. Paraphrasing can be used to aid comprehension, stimulate prior knowledge, and assist in writing-skills development. As such, paraphrasing is a feature of fields as diverse as discourse psychology, composition, and computer science. Although automated paraphrase assessment is both commonplace and useful, research has centered solely on artificial, edited paraphrases and has used only binary dimensions (i.e., is or is not a paraphrase). In this study, we use an extensive database (N=1,998) of natural paraphrases generated by high school students that have been assessed along 10 dimensions (e.g., semantic completeness, lexical similarity, syntactical similarity). This study investigates the components of paraphrase quality emerging from these dimensions and examines whether computational approaches can simulate those human evaluations. The results suggest that semantic and syntactic evaluations are the primary components of paraphrase quality, and that computationally light systems such as latent semantic analysis (semantics) and minimal edit distances (syntax) present promising approaches to simulating human evaluations of paraphrases.
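
The sketch below computes the two lightweight signals the study highlights: a vector-space cosine for semantics (TF-IDF here as a stand-in for LSA) and a word-level minimal edit distance for syntax; how they are combined into a single quality score is left open.

```python
# Sketch: "computationally light" paraphrase signals -
# a semantic cosine and a word-level minimal edit distance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def word_edit_distance(a, b):
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))                 # rolling DP row
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[-1]

def paraphrase_signals(sent1, sent2):
    X = TfidfVectorizer().fit_transform([sent1, sent2])
    return {"semantic": cosine_similarity(X[0], X[1])[0, 0],
            "syntactic_distance": word_edit_distance(sent1, sent2)}

print(paraphrase_signals("the dog chased the cat", "the cat was chased by the dog"))
```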

Book ChapterDOI
18 Mar 2009
TL;DR: In this article, the authors address the question: how can the distance between different semantic spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance.
Abstract: Semantic Space models, which provide a numerical representation of words' meaning extracted from a corpus of documents, have been formalized in terms of Hermitian operators over real valued Hilbert spaces by Bruza et al. [1]. The collapse of a word into a particular meaning has been investigated applying the notion of quantum collapse of superpositional states [2]. While the semantic association between words in a Semantic Space can be computed by means of the Minkowski distance [3] or the cosine of the angle between the vector representation of each pair of words, a new procedure is needed in order to establish relations between two or more Semantic Spaces. We address the question: how can the distance between different Semantic Spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance. Such a distance needs to take into account the difference in the dimensions between subspaces. The availability of a distance for comparing different Semantic Subspaces would enable a deeper understanding of the geometry of Semantic Spaces, which would possibly translate into better effectiveness in Information Retrieval tasks.
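
A sketch of the subspace view: represent each Semantic Space by an orthonormal basis and compute a distance from the principal angles between the subspaces. The chordal distance used here is one standard choice and does not include the chapter's correction for differing subspace dimensions.

```python
# Sketch: a distance between two semantic spaces represented as subspaces,
# via principal angles between their orthonormal bases.
import numpy as np
from scipy.linalg import orth, subspace_angles

def semantic_subspace_distance(space_a, space_b):
    """space_a, space_b: (n_terms x k) matrices whose columns span each space."""
    Qa, Qb = orth(space_a), orth(space_b)
    angles = subspace_angles(Qa, Qb)        # principal angles in radians
    return np.sqrt(np.sum(np.sin(angles) ** 2))

# toy usage: two random 3-dimensional subspaces of R^100
rng = np.random.default_rng(0)
print(semantic_subspace_distance(rng.normal(size=(100, 3)),
                                 rng.normal(size=(100, 3))))
```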

Proceedings ArticleDOI
06 Aug 2009
TL;DR: The proposed novel model of semantic similarity using the semantic relations that exist among words outperforms all existing web-based semantic similarity measures, achieving a Pearson correlation coefficient of 0.867 on the Miller-Charles dataset.
Abstract: Semantic similarity is a central concept that extends across numerous fields such as artificial intelligence, natural language processing, cognitive science and psychology. Accurate measurement of semantic similarity between words is essential for various tasks such as document clustering, information retrieval, and synonym extraction. We propose a novel model of semantic similarity using the semantic relations that exist among words. Given two words, first, we represent the semantic relations that hold between those words using automatically extracted lexical pattern clusters. Next, the semantic similarity between the two words is computed using a Mahalanobis distance measure. We compare the proposed similarity measure against previously proposed semantic similarity measures on the Miller-Charles benchmark dataset and the WordSimilarity-353 collection. The proposed method outperforms all existing web-based semantic similarity measures, achieving a Pearson correlation coefficient of 0.867 on the Miller-Charles dataset.
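
A sketch of the distance computation only: each word pair is a feature vector over (here random) pattern clusters, and comparisons use a Mahalanobis distance under the empirical covariance. The extraction of lexical pattern clusters is assumed to have been done elsewhere.

```python
# Sketch: Mahalanobis distance between two word pairs, each represented
# by a feature vector of lexical-pattern-cluster frequencies. The feature
# matrix here is random, standing in for automatically extracted clusters.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
pair_features = rng.poisson(2.0, size=(200, 10)).astype(float)  # 200 pairs, 10 clusters

VI = np.linalg.pinv(np.cov(pair_features, rowvar=False))  # inverse covariance
d = mahalanobis(pair_features[0], pair_features[1], VI)
print(f"Mahalanobis distance between pair 0 and pair 1: {d:.3f}")
```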

Journal ArticleDOI
TL;DR: It is shown that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language, and that the words whose contributions to the overall information are larger are the ones most closely associated with the main subjects and topics of the text.
Abstract: Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information are larger are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information.

Book ChapterDOI
18 Apr 2009
TL;DR: A formal model for searching entities, as well as a complete Entity Ranking system, is proposed, with examples of its application to the enterprise context; the system is experimentally evaluated on the Expert Search task to show how it can be adapted to different scenarios.
Abstract: Entity Ranking has recently become an important search task in Information Retrieval. The goal is not to find documents matching query terms but, instead, to find entities. In this paper we propose a formal model to search entities as well as a complete Entity Ranking system, providing examples of its application to the enterprise context. We experimentally evaluate our system on the Expert Search task in order to show how it can be adapted to different scenarios. The results show that by combining simple IR techniques we obtain a 53% improvement over our baseline.

Journal ArticleDOI
TL;DR: Two types of multi-criteria probabilistic latent semantic analysis algorithms extended from the single-rating version are proposed, inspired by the Bayesian network and linear regression.
Abstract: Recently, some recommender system researchers have begun engaging multi-criteria ratings that model possible attributes of the item in order to generate improved recommendations. However, the statistical machine learning methods successful in the single-rating recommender system have not been investigated in the context of multi-criteria ratings. In this paper, we propose two types of multi-criteria probabilistic latent semantic analysis algorithms extended from the single-rating version. First, the mixture of multi-variate Gaussian distributions is assumed to be the underlying distribution of the multi-criteria ratings of each user. Second, we further assume the mixture of linear Gaussian regression models as the underlying distribution of the multi-criteria ratings of each user, inspired by the Bayesian network and linear regression. The experiment results on the Yahoo!Movies ratings data set show that the full multi-variate Gaussian model and the linear Gaussian regression model achieve a stable performance gain over other tested methods.
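
A sketch of just the distributional core of the first model, fitting a mixture of multivariate Gaussians to multi-criteria rating vectors; the surrounding pLSA user/latent-class machinery and the linear Gaussian regression variant are not reproduced, and the toy data are illustrative.

```python
# Sketch: mixture of multivariate Gaussians over multi-criteria ratings
# (here 4 criteria per rating), the distributional assumption behind the
# first proposed model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ratings = np.clip(rng.normal(loc=[3.5, 3.0, 4.0, 3.8],
                             scale=0.8, size=(500, 4)), 1, 5)  # toy ratings

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(ratings)
# posterior responsibility of each latent class for a new rating vector
print(gmm.predict_proba([[4.0, 3.5, 4.5, 4.0]]))
```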

Proceedings ArticleDOI
06 Dec 2009
TL;DR: Experimental results demonstrate that the newly developed semantic-based model enhances the clustering quality of sets of documents substantially.
Abstract: Most text mining techniques are based on word and/or phrase analysis of the text. The statistical analysis of a term (word or phrase) frequency captures the importance of the term within a document. However, to achieve a more accurate analysis, the underlying mining technique should indicate terms that capture the semantics of the text, from which the importance of a term in a sentence and in the document can be derived. Incorporating semantic features from the WordNet lexical database is one of many approaches that have been tried to improve the accuracy of text clustering techniques. A new semantic-based model that analyzes documents based on their meaning is introduced. The proposed model analyzes terms and their corresponding synonyms and/or hypernyms on the sentence and document levels. In this model, if two documents contain different words and these words are semantically related, the proposed model can measure the semantic-based similarity between the two documents. The similarity between documents relies on a new semantic-based similarity measure which is applied to the matching concepts between documents. Experiments using the proposed semantic-based model in text clustering are conducted. Experimental results demonstrate that the newly developed semantic-based model enhances the clustering quality of sets of documents substantially.
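
A sketch of generic WordNet-based matching between two short documents using path similarity over synsets; this illustrates the synonym/hypernym idea but is not the paper's specific similarity measure. It assumes the NLTK WordNet data has been downloaded.

```python
# Sketch: WordNet-based similarity between two short documents, matching
# each term in one document to its best path-similarity synset match in
# the other.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def term_similarity(w1, w2):
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim and sim > best:
                best = sim
    return best

def doc_similarity(doc1, doc2):
    t1, t2 = doc1.lower().split(), doc2.lower().split()
    scores = [max(term_similarity(a, b) for b in t2) for a in t1]
    return sum(scores) / len(scores)

print(doc_similarity("car engine repair", "automobile motor maintenance"))
```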

Journal ArticleDOI
TL;DR: The MBPNN is proposed to accelerate the training speed of BPNN and improve the categorization accuracy, and the application of LSA for the system can lead to dramatic dimensionality reduction while achieving good classification results.
Abstract: This paper proposes a new text categorization model based on the combination of a modified back propagation neural network (MBPNN) and latent semantic analysis (LSA). The traditional back propagation neural network (BPNN) has a slow training speed and easily becomes trapped in a local minimum, which leads to poor performance and efficiency. In this paper, we propose the MBPNN to accelerate the training speed of the BPNN and improve the categorization accuracy. LSA overcomes the problems of word-based representations by using statistically derived conceptual indices instead of individual words. It constructs a conceptual vector space in which each term or document is represented as a vector. It not only greatly reduces the dimension but also discovers the important associative relationships between terms. We test our categorization model on the 20 Newsgroups and Reuters-21578 corpora; experimental results show that the MBPNN is much faster than the traditional BPNN and also enhances its performance. In addition, the application of LSA in our system leads to a dramatic dimensionality reduction while achieving good classification results.
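
A sketch of the LSA-plus-neural-network pipeline, with scikit-learn's standard MLP standing in for the paper's modified back-propagation network; the chosen dimensionality and network size are illustrative.

```python
# Sketch: LSA dimensionality reduction feeding a neural text classifier,
# evaluated on 20 Newsgroups (one of the corpora used in the paper).
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

model = make_pipeline(
    TfidfVectorizer(max_features=20000),
    TruncatedSVD(n_components=300),           # LSA: project into a concept space
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=50),
)
model.fit(train.data, train.target)
print("accuracy:", model.score(test.data, test.target))
```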

01 Jan 2009
TL;DR: Forster et al. as mentioned in this paper used Latent Semantic Analysis (LSA) to measure semantic distance between words and found that LSA methods produced a better model of the underlying semantic originality of responses than traditional measures.
Abstract: Creativity Evaluation through Latent Semantic Analysis. Eve A. Forster and Kevin N. Dunbar, University of Toronto Scarborough, Department of Psychology. The Uses of Objects Task is a widely used creativity test. The test is usually scored by humans, which introduces subjectivity and individual variance into creativity scores. Here, we present a new computational method for scoring creativity: Latent Semantic Analysis (LSA), a tool used to measure semantic distance between words. 33 participants provided creative uses for 20 separate objects. We compared both human judges and LSA scores and found that LSA methods produced a better model of the underlying semantic originality of responses than traditional measures. Keywords: latent semantic analysis; creativity; natural language processing

Proceedings ArticleDOI
16 Sep 2009
TL;DR: The development of the summarizer is described; it is based on Iterative Residual Rescaling (IRR), which creates the latent semantic space of the documents under consideration and makes it possible to control the influence of major and minor topics in that space.
Abstract: This paper deals with our recent research in text summarization. The field has moved from multi-document summarization to update summarization. When producing an update summary of a set of topic-related documents, the summarizer assumes prior knowledge of the reader determined by a set of older documents on the same topic. The update summarizer thus must solve a novelty vs. redundancy problem. We describe the development of our summarizer, which is based on Iterative Residual Rescaling (IRR) to create the latent semantic space of the set of documents under consideration. IRR generalizes Singular Value Decomposition (SVD) and makes it possible to control the influence of major and minor topics in the latent space. Our sentence-extractive summarization method computes the redundancy, novelty and significance of each topic. These values are finally used in the sentence selection process. The sentence selection component prevents inner summary redundancy. The results of our participation in the TAC evaluation seem to be promising.

Book ChapterDOI
24 Jun 2009
TL;DR: This work generalizes ESA in order to clearly show the degrees of freedom it provides and proposes some variants of ESA along different dimensions, testing their impact on performance on a cross-lingual mate retrieval task on two datasets (JRC-ACQUIS and Multext).
Abstract: Explicit Semantic Analysis (ESA) has been recently proposed as an approach to computing semantic relatedness between words (and indirectly also between texts) and has thus a natural application in information retrieval, showing the potential to alleviate the vocabulary mismatch problem inherent in standard Bag-of-Word models. The ESA model has been also recently extended to cross-lingual retrieval settings, which can be considered as an extreme case of the vocabulary mismatch problem. The ESA approach actually represents a class of approaches and allows for various instantiations. As our first contribution, we generalize ESA in order to clearly show the degrees of freedom it provides. Second, we propose some variants of ESA along different dimensions, testing their impact on performance on a cross-lingual mate retrieval task on two datasets (JRC-ACQUIS and Multext). Our results are interesting as a systematic investigation has been missing so far and the variations between different basic design choices are significant. We also show that the settings adopted in the original ESA implementation are reasonably good, which to our knowledge has not been demonstrated so far, but can still be significantly improved by tuning the right parameters (yielding a relative improvement on a cross-lingual mate retrieval task of between 62% (Multext) and 237% (JRC-ACQUIS) with respect to the original ESA model).
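
A sketch of the basic ESA mapping (before any of the proposed variants): a text becomes a vector of similarities to concept documents, and relatedness is the cosine between two such vectors. Three toy concept texts stand in for Wikipedia articles.

```python
# Sketch: the core of Explicit Semantic Analysis. Each input text is mapped
# to a vector of similarities against a set of "concept" documents, and
# relatedness is the cosine between those concept vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concept_docs = {                      # stand-ins for Wikipedia articles
    "Bank (finance)": "bank money loan deposit interest account",
    "River": "river water bank shore stream flood",
    "Computer": "computer software program memory processor",
}

vectorizer = TfidfVectorizer()
C = vectorizer.fit_transform(concept_docs.values())   # concepts x terms

def esa_vector(text):
    return cosine_similarity(vectorizer.transform([text]), C)  # 1 x concepts

def esa_relatedness(text1, text2):
    return cosine_similarity(esa_vector(text1), esa_vector(text2))[0, 0]

print(esa_relatedness("money deposit", "loan interest"))
```

In the cross-lingual setting, the concept space is built from interlanguage-linked Wikipedia articles so that texts in different languages map into the same concept vector space.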

Journal ArticleDOI
TL;DR: An unsupervised method based on a combination of hidden Markov models and latent semantic analysis is introduced; it allows the topics of interest to be defined freely, without the need for data annotation, and can identify short segments.

Journal ArticleDOI
TL;DR: A new approach to automatic discovery of implicit rhetorical information from texts based on evolutionary computation methods is proposed; it uses previously obtained training information that involves semantic and structural criteria.
Abstract: In this paper, we propose a new approach to automatic discovery of implicit rhetorical information from texts based on evolutionary computation methods. In order to guide the search for rhetorical connections in natural-language texts, the model uses previously obtained training information that involves semantic and structural criteria. The main features of the model and the newly designed operators and evaluation functions are discussed, and the different experiments assessing the robustness and accuracy of the approach are described. Experimental results show the promise of evolutionary methods for rhetorical role discovery.

Journal ArticleDOI
TL;DR: Results showed significantly higher reliability of LSA as a computerized assessment tool for expository text when it used a best-dimension algorithm rather than a standard LSA algorithm.
Abstract: In this study, we compared four expert graders with latent semantic analysis (LSA) to assess short summaries of an expository text. As is well known, there are technical difficulties for LSA to establish a good semantic representation when analyzing short texts. In order to improve the reliability of LSA relative to human graders, we analyzed three new algorithms by two holistic methods used in previous research (Leon, Olmos, Escudero, Canas, & Salmeron, 2006). The three new algorithms were (1) the semantic common network algorithm, an adaptation of an algorithm proposed by W. Kintsch (2001, 2002) with respect to LSA as a dynamic model of semantic representation; (2) a best-dimension reduction measure of the latent semantic space, selecting those dimensions that best contribute to improving the LSA assessment of summaries (Hu, Cai, Wiemer-Hastings, Graesser, & McNamara, 2007); and (3) the Euclidean distance measure, used by Rehder et al. (1998), which incorporates at the same time vector length and the cosine measures. A total of 192 Spanish middle-grade students and 6 experts took part in this study. They read an expository text and produced a short summary. Results showed significantly higher reliability of LSA as a computerized assessment tool for expository text when it used a best-dimension algorithm rather than a standard LSA algorithm. The semantic common network algorithm also showed promising results.

Proceedings ArticleDOI
07 Nov 2009
TL;DR: Evaluation results show that the new Web-based method presented here improves accuracy and is more robust for measuring semantic similarity between words, and that it gives a higher correlation value than some existing methods.
Abstract: Semantic similarity measures play an important role in the extraction of semantic relations and are widely used in Natural Language Processing (NLP) and Information Retrieval (IR). This paper presents a new Web-based method for measuring the semantic similarity between words. Unlike other methods based on taxonomies or Internet search engines, our method uses snippets from Wikipedia to calculate the semantic similarity between words by means of cosine similarity and TF-IDF. A stemming algorithm and stop-word removal are used to preprocess the snippets from Wikipedia. We set different thresholds to evaluate our results in order to decrease the interference from noise and redundancy. Our method was empirically evaluated using the Rubenstein-Goodenough benchmark dataset. It gives a higher correlation value (0.615) than some existing methods. Evaluation results show that our method improves accuracy and is more robust for measuring semantic similarity between words.
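
A sketch of the snippet-based pipeline: stemming, stop-word removal, TF-IDF and cosine similarity over pre-fetched snippets, with a similarity threshold to suppress noise. Snippet retrieval from Wikipedia is assumed to happen elsewhere, and the threshold value is illustrative.

```python
# Sketch: similarity between two words from their (pre-fetched) Wikipedia
# snippets, via stemming, stop-word removal, TF-IDF and cosine similarity.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def preprocess(snippets):
    return " ".join(stemmer.stem(tok) for s in snippets for tok in s.lower().split())

def snippet_similarity(snippets_word1, snippets_word2, threshold=0.1):
    docs = [preprocess(snippets_word1), preprocess(snippets_word2)]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sim = cosine_similarity(X[0], X[1])[0, 0]
    return sim if sim >= threshold else 0.0   # threshold suppresses noise

print(snippet_similarity(["a coin is metal money"], ["currency such as coins"]))
```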

Proceedings ArticleDOI
08 Jul 2009
TL;DR: A parallel LSA implementation on the GPU is presented, using the NVIDIA Compute Unified Device Architecture (CUDA) and CUBLAS, and compared against a CPU implementation built on an optimized Basic Linear Algebra Subprograms library, in order to speed up large-scale LSA processes.
Abstract: Latent Semantic Analysis (LSA) can be used to reduce the dimensions of large term-document datasets using Singular Value Decomposition. However, with the ever expanding size of data sets, current implementations are not fast enough to quickly and easily compute the results on a standard PC. The Graphics Processing Unit (GPU) can solve some highly parallel problems much faster than the traditional sequential processor (CPU). Thus, a deployable system using a GPU to speed up large-scale LSA processes would be a much more effective choice (in terms of cost/performance ratio) than using a computer cluster. In this paper, we present a parallel LSA implementation on the GPU, using the NVIDIA Compute Unified Device Architecture (CUDA) and the CUDA Basic Linear Algebra Subprograms library (CUBLAS). The performance of this implementation is compared to a traditional LSA implementation on the CPU using an optimized Basic Linear Algebra Subprograms library. For large matrices that have dimensions divisible by 16, the GPU algorithm ran five to six times faster than the CPU version.
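
A sketch of the GPU-side LSA core, using CuPy as a convenient stand-in for hand-written CUDA/CUBLAS code; the matrix sizes are illustrative and chosen divisible by 16, matching the paper's favorable case.

```python
# Sketch: truncated SVD of a term-document matrix on the GPU via CuPy,
# which dispatches to the same NVIDIA GPU libraries under the hood.
import cupy as cp

def lsa_gpu(term_doc_matrix, k=100):
    A = cp.asarray(term_doc_matrix, dtype=cp.float32)   # copy to the GPU
    U, S, Vt = cp.linalg.svd(A, full_matrices=False)    # GPU-accelerated SVD
    # keep the k largest singular triplets: the latent semantic space
    return U[:, :k], S[:k], Vt[:k, :]

# toy usage with a random 1024 x 512 matrix (dimensions divisible by 16)
U, S, Vt = lsa_gpu(cp.random.rand(1024, 512), k=50)
print(S[:5])
```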

Proceedings ArticleDOI
31 May 2009
TL;DR: A new method of computing term specificity is proposed, based on modeling the rate of learning of word meaning in Latent Semantic Analysis (LSA), and it is demonstrated that it shows excellent performance compared to existing methods on a broad range of tests.
Abstract: The idea that some words carry more semantic content than others, has led to the notion of term specificity, or informativeness. Computational estimation of this quantity is important for various applications such as information retrieval. We propose a new method of computing term specificity, based on modeling the rate of learning of word meaning in Latent Semantic Analysis (LSA). We analyze the performance of this method both qualitatively and quantitatively and demonstrate that it shows excellent performance compared to existing methods on a broad range of tests. We also demonstrate how it can be used to improve existing applications in information retrieval and summarization.

Proceedings ArticleDOI
05 Oct 2009
TL;DR: This study uses techniques based on Latent Semantic Analysis to discover the underlying semantic relations between words in privacy policies and designs a system for analyzing privacy policies, called Hermes.
Abstract: E-commerce privacy policies tend to contain many ambiguities in language that protect companies more than the customers. The types of ambiguity found are currently divided into four patterns: mitigation (downplaying frequency), enhancement (emphasizing nonessential qualities), obfuscation (hedging claims and obscuring causality), and omission (removing agents). A number of phrases have been identified as creating ambiguities within these four categories. When a customer accepts the terms and conditions of a privacy policy, words and phrases (from the category of mitigation) such as "occasionally" or "from time to time" actually give the e-commerce vendor permission to send as many spam email offers as they deem necessary. Our study uses techniques based on Latent Semantic Analysis to discover the underlying semantic relations between words in privacy policies. Additional potential ambiguities and other word relations are found automatically. Words are clustered according to their topic in privacy policies using principal directions. This provides us with a ranking of the most significant words from each clustered topic as well as a ranking of the privacy policy topics. We also extract a signature that forms the basis of a typical privacy policy. These results lead to the design of a system used to analyze privacy policies, called Hermes. Given an arbitrary privacy policy, our system provides a list of the potential ambiguities along with a score that represents the similarity to a typical privacy policy.
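
A sketch of the topic-clustering step only: an LSA-style decomposition of a toy policy corpus, term clustering, and a ranking of significant words per cluster; the ambiguity detection and policy-signature scoring of Hermes are not reproduced.

```python
# Sketch: cluster privacy-policy vocabulary by topic using an LSA-style
# decomposition, then list the highest-weight words per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

policies = [                                  # toy stand-ins for real policies
    "we may occasionally share your information with trusted partners",
    "from time to time we send offers and may disclose data as required",
    "we collect email addresses and may use them for marketing purposes",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(policies)                                   # policies x terms
term_vectors = TruncatedSVD(n_components=2).fit(X).components_.T  # terms x 2

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(term_vectors)
terms = np.array(vec.get_feature_names_out())
for c in range(2):
    weights = np.linalg.norm(term_vectors[labels == c], axis=1)
    top = terms[labels == c][np.argsort(-weights)][:5]
    print(f"cluster {c}: {', '.join(top)}")
```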

Patent
13 Jan 2009
TL;DR: In this paper, a weighted morpheme-by-document matrix is generated based at least in part on the number of word form instances within each document of the corpus and based on a weighting function.
Abstract: A technique for information retrieval includes parsing a corpus to identify a number of wordform instances within each document of the corpus. A weighted morpheme-by-document matrix is generated based at least in part on the number of wordform instances within each document of the corpus and based at least in part on a weighting function. The weighted morpheme-by-document matrix separately enumerates instances of stems and affixes. Additionally or alternatively, a term-by-term alignment matrix may be generated based at least in part on the number of wordform instances within each document of the corpus. At least one lower rank approximation matrix is generated by factorizing the weighted morpheme-by-document matrix and/or the term-by-term alignment matrix.
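
A hedged sketch of the general pipeline the patent describes, using whole wordforms rather than separately enumerated stems and affixes, a log-entropy weighting as one plausible weighting function, and truncated SVD for the lower-rank approximation.

```python
# Sketch: count wordform instances per document, apply a log-entropy
# weighting, and factorize the weighted matrix into a lower-rank
# approximation (whole wordforms stand in for the patent's morphemes).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

def log_entropy_weight(counts):
    """counts: dense terms x documents array of raw frequencies."""
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    n_docs = counts.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = 1 + np.nansum(p * np.log(p), axis=1, keepdims=True) / np.log(n_docs)
    return np.log(counts + 1) * entropy          # local weight x global weight

corpus = ["walking walked walks", "the walker walked home", "home is where we walk"]
counts = CountVectorizer().fit_transform(corpus).toarray().T   # terms x docs
weighted = log_entropy_weight(counts.astype(float))
approx = TruncatedSVD(n_components=2).fit_transform(weighted)  # lower-rank term space
print(approx.shape)
```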