
Showing papers on "Latent semantic analysis" published in 2014


Journal ArticleDOI
TL;DR: This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Abstract: Distributional semantic models derive computational representations of word meaning from the patterns of co-occurrence of words in text. Such models have been a success story of computational linguistics, being able to provide reliable estimates of semantic relatedness for the many semantic tasks requiring them. However, distributional models extract meaning information exclusively from text, which is an extremely impoverished basis compared to the rich perceptual sources that ground human semantic knowledge. We address the lack of perceptual grounding of distributional models by exploiting computer vision techniques that automatically identify discrete "visual words" in images, so that the distributional representation of a word can be extended to also encompass its co-occurrence with the visual words of images it is associated with. We propose a flexible architecture to integrate text- and image-based distributional information, and we show in a set of empirical tests that our integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
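As a rough illustration of the integration idea (not the paper's exact architecture), the sketch below concatenates L2-normalized text-based and visual-word-based co-occurrence vectors with a mixing weight; the function names, dimensions, and 50/50 default weighting are assumptions.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(text_vec, visual_vec, alpha=0.5):
    """Concatenate L2-normalized text and visual co-occurrence vectors;
    alpha sets the relative weight of the textual channel."""
    return np.concatenate([alpha * normalize(text_vec),
                           (1 - alpha) * normalize(visual_vec)])

# Toy counts: a word's co-occurrence with 4 context words and 3 "visual words".
text_vec = np.array([3.0, 0.0, 1.0, 2.0])
visual_vec = np.array([5.0, 1.0, 0.0])
print(fuse(text_vec, visual_vec))
```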

900 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: A new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents is proposed.
Abstract: In this paper, we propose a new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents. In order to capture the rich contextual structures in a query or a document, we start with each word within a temporal context window in a word sequence to directly capture contextual features at the word n-gram level. Next, the salient word n-gram features in the word sequence are discovered by the model and are then aggregated to form a sentence-level feature vector. Finally, a non-linear transformation is applied to extract high-level semantic information to generate a continuous vector representation for the full text string. The proposed convolutional latent semantic model (CLSM) is trained on clickthrough data and is evaluated on a Web document ranking task using a large-scale, real-world data set. Results show that the proposed model effectively captures salient semantic information in queries and documents for the task while significantly outperforming previous state-of-the-art semantic models.
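The pipeline the abstract walks through (n-gram convolution, max pooling, non-linear projection) can be sketched in a few lines of numpy; the dimensions and random weights below are placeholders, whereas the real CLSM uses letter-trigram word hashing and is trained on clickthrough data.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_dim, conv_dim, sem_dim, n = 500, 300, 128, 3  # n = context window size

W_conv = rng.standard_normal((conv_dim, n * vocab_dim)) * 0.01
W_sem = rng.standard_normal((sem_dim, conv_dim)) * 0.01

def clsm_vector(word_vectors):
    """Map a sequence of word vectors to a single semantic vector."""
    # Slide a window of n words and convolve each window.
    windows = [np.concatenate(word_vectors[i:i + n])
               for i in range(len(word_vectors) - n + 1)]
    conv = np.tanh(np.stack(windows) @ W_conv.T)  # local n-gram features
    pooled = conv.max(axis=0)                     # max pooling over positions
    return np.tanh(W_sem @ pooled)                # sentence-level semantic vector

sentence = [rng.random(vocab_dim) for _ in range(7)]  # 7 toy word vectors
print(clsm_vector(sentence).shape)  # (128,)
```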

723 citations


Book ChapterDOI
06 Sep 2014
TL;DR: Latent category learning (LCL), an unsupervised learning method that requires only image-level class labels, is proposed; it uses probabilistic Latent Semantic Analysis to learn latent categories, which can represent objects, object parts or backgrounds.
Abstract: Localizing objects in cluttered backgrounds is a challenging task in weakly supervised localization. Due to large object variations in cluttered images, objects are highly ambiguous with respect to their backgrounds. However, backgrounds contain useful latent information, e.g., the sky for aeroplanes. If we can learn this latent information, object-background ambiguity can be reduced to suppress the background. In this paper, we propose latent category learning (LCL), an unsupervised learning method that requires only image-level class labels. Firstly, inspired by latent semantic discovery, we use probabilistic Latent Semantic Analysis (pLSA) to learn the latent categories, which can represent objects, object parts or backgrounds. Secondly, to determine which category contains the target object, we propose a category selection method that evaluates each category's discrimination. We evaluate the method on the PASCAL VOC 2007 database and the ILSVRC 2013 detection challenge. On VOC 2007, the proposed method yields an annotation accuracy of 48%, outperforming previous results by 10%. More importantly, we achieve a detection average precision of 30.9%, which improves on previous results by 8% and is competitive with the supervised deformable part model (DPM) 5.0 baseline of 33.7%. On ILSVRC 2013 detection, the method yields a precision of 6.0%, which is also competitive with DPM 5.0.
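Since pLSA is the core tool in the first step, a compact EM sketch may help; the word-by-image counts, the number of latent categories, and the iteration budget below are toy assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(6, 8)).astype(float)  # toy word-by-image count matrix
K = 2                                              # number of latent categories

p_z = np.full(K, 1.0 / K)
p_w_z = rng.dirichlet(np.ones(6), size=K)  # P(word | z), shape (K, 6)
p_d_z = rng.dirichlet(np.ones(8), size=K)  # P(image | z), shape (K, 8)

for _ in range(50):
    # E-step: responsibilities P(z | w, d), shape (K, 6, 8).
    joint = p_z[:, None, None] * p_w_z[:, :, None] * p_d_z[:, None, :]
    resp = joint / joint.sum(axis=0, keepdims=True)
    # M-step: re-estimate all distributions from expected counts.
    weighted = resp * N[None, :, :]
    p_w_z = weighted.sum(axis=2); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = weighted.sum(axis=1); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()

print(p_w_z.round(2))  # each row: one latent category's word distribution
```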

229 citations


Journal ArticleDOI
TL;DR: A computational text analysis technique for measuring the moral loading of concepts as they are used in a corpus, using latent semantic analysis to compute the semantic similarity between concepts and moral keywords taken from the “Moral Foundations Dictionary”.
Abstract: In this paper we present a computational text analysis technique for measuring the moral loading of concepts as they are used in a corpus. This method is especially useful for the study of online corpora as it allows for the rapid analysis of moral rhetoric in texts such as blogs and tweets as events unfold. We use latent semantic analysis to compute the semantic similarity between concepts and moral keywords taken from the “Moral Foundations Dictionary”. This measure of semantic similarity represents the loading of these concepts on the five moral dimensions identified by moral foundations theory. We demonstrate the efficacy of this method using three different concepts and corpora.
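The measurement itself is simple to sketch: the moral loading of a concept is its average cosine similarity, in the LSA space, to the keywords of a moral dimension. The vectors and abbreviated keyword list below are placeholders for a real LSA space and the Moral Foundations Dictionary.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def moral_loading(concept, dimension_keywords, vectors):
    """Mean cosine similarity between a concept and a moral dimension's keywords."""
    sims = [cosine(vectors[concept], vectors[w])
            for w in dimension_keywords if w in vectors]
    return sum(sims) / len(sims)

# Toy 3-d "LSA" vectors purely for illustration.
vectors = {"abortion": np.array([0.2, 0.9, 0.1]),
           "harm": np.array([0.1, 0.8, 0.3]),
           "suffer": np.array([0.3, 0.7, 0.2])}
print(moral_loading("abortion", ["harm", "suffer"], vectors))
```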

87 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.
Abstract: In this paper, genetic algorithm oriented latent semantic features (GALSF) are proposed to obtain better representation of documents in text classification. The proposed approach consists of feature selection and feature transformation stages. The first stage is carried out using the state-of-the-art filter-based methods. The second stage employs latent semantic indexing (LSI) empowered by genetic algorithm such that a better projection is attained using appropriate singular vectors, which are not limited to the ones corresponding to the largest singular values, unlike standard LSI approach. In this way, the singular vectors with small singular values may also be used for projection whereas the vectors with large singular values may be eliminated as well to obtain better discrimination. Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.
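The key departure from standard LSI is that any subset of singular vectors may be kept, not just the top-k. A minimal sketch of that idea follows, encoding a candidate subset as a binary mask over the rows of V^T; the fitness evaluation the genetic algorithm would apply to each mask is omitted, and all dimensions are toy assumptions.

```python
import numpy as np

def project(X, Vt, mask):
    """Project documents onto the singular vectors selected by a 0/1 mask."""
    return X @ Vt[mask.astype(bool)].T

X = np.random.default_rng(1).random((20, 50))  # toy document-term features
_, _, Vt = np.linalg.svd(X, full_matrices=False)

top_k = np.zeros(Vt.shape[0]); top_k[:5] = 1  # standard LSI: top-5 vectors
ga_candidate = np.zeros(Vt.shape[0])
ga_candidate[[0, 2, 7, 11, 19]] = 1           # a GA-discovered subset (illustrative)

print(project(X, Vt, top_k).shape, project(X, Vt, ga_candidate).shape)
```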

83 citations


Journal ArticleDOI
TL;DR: A number of potential applications of LSA are discussed to show how it can be used in empirical Operations Management research, specifically in areas that can benefit from analyzing large volumes of unstructured textual data.
Abstract: In this article, we introduce the use of Latent Semantic Analysis (LSA) as a technique for uncovering the intellectual structure of a discipline. LSA is an emerging quantitative method for content analysis that combines rigorous statistical techniques and scholarly judgment as it proceeds to extract and decipher key latent factors. We provide a stepwise explanation and illustration for implementing LSA. To demonstrate LSA's ability to uncover the intellectual structure of a discipline, we present a study of the field of Operations Management. We also discuss a number of potential applications of LSA to show how it can be used in empirical Operations Management research, specifically in areas that can benefit from analyzing large volumes of unstructured textual data.
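As a rough illustration of such a stepwise LSA pipeline (not the article's exact procedure), the sketch below builds a term-document matrix, reduces it with truncated SVD, and inspects the term loadings of each latent factor; the corpus and component count are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["inventory control policies in supply chains",
          "queueing models for service operations",
          "supply chain coordination under uncertainty"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(corpus)                 # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)               # documents in latent-factor space

# Inspect which terms load on each latent factor.
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[-3:][::-1]
    print(f"factor {i}:", [terms[j] for j in top])
```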

80 citations


Book ChapterDOI
01 Jan 2014
TL;DR: This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain.
Abstract: Text is a very important type of data within the biomedical domain. For example, patient records contain large amounts of text which has been entered in a non-standardized format, consequently posing many challenges to the processing of such data. For the clinical doctor, the written text of the medical findings, rather than images or multimedia data, is still the basis for decision making. However, the steadily increasing volumes of unstructured information call for machine learning approaches to data mining, i.e. text mining. This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain. Finally, we provide some open problems and future challenges, particularly from the clinical domain, that we expect to stimulate future research.

73 citations


Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: These findings support the utility of a natural language processing technique, Latent Semantic Analysis, in examining the contribution of coherence to thought disorder and its relationship with daily functioning.

70 citations


Book ChapterDOI
18 Oct 2014
TL;DR: The experimental results show that in element-level matching, word embeddings could achieve better performance than previous methods.
Abstract: Ontology matching is one of the most important tasks in achieving the goal of the semantic web. To fulfill this task, element-level matching is an indispensable step for obtaining the fundamental alignment. In the element-level matching process, previous work generally utilizes WordNet to compute the semantic similarities among elements, but WordNet is limited by its coverage. In this paper, we introduce word embeddings to the field of ontology matching. We verified the superiority of word embeddings and presented a hybrid method to incorporate word embeddings into the computation of the semantic similarities among elements. We performed the experiments on the OAEI benchmark, conference track and real-world ontologies. The experimental results show that in element-level matching, word embeddings achieve better performance than previous methods.
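A hedged sketch of what element-level matching with embeddings can look like: the similarity of two ontology labels is taken as the average best cosine match between their tokens. The tiny embedding table and the aggregation rule are illustrative assumptions, not the paper's hybrid method.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def label_similarity(label_a, label_b, emb):
    """Average, over tokens of label_a, of the best cosine match in label_b."""
    scores = []
    for ta in label_a.lower().split():
        if ta not in emb:
            continue
        best = max((cosine(emb[ta], emb[tb])
                    for tb in label_b.lower().split() if tb in emb),
                   default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

# Toy 2-d embeddings standing in for any pretrained model.
emb = {"conference": np.array([0.9, 0.1]), "meeting": np.array([0.8, 0.3]),
       "paper": np.array([0.1, 0.9]), "article": np.array([0.2, 0.8])}
print(label_similarity("conference paper", "meeting article", emb))
```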

43 citations


Journal ArticleDOI
TL;DR: This paper examined how well corpus-based methods predict the amplitude of the N400 component of the event-related potential (ERP), an online measure of lexical processing in brain electrical activity.

42 citations


Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: Assessing the psycholinguistic characteristics of words produced spontaneously by SD patients during an autobiographical memory interview revealed changes in the lexical-semantic landscape related to semantic diversity: the highly frequent and abstract words most prevalent in the patients' speech were also the most semantically diverse.

Journal ArticleDOI
TL;DR: The results show that CESA is a valid solution for sentiment analysis and that similar approaches for model building from the continuous flow of posts could be exploited in other scenarios.
Abstract: With the rapid growth of data generated by social web applications, new paradigms in the generation of knowledge are emerging. This paper introduces Crowd Explicit Sentiment Analysis (CESA) as an approach for sentiment analysis in social media environments. Similar to Explicit Semantic Analysis, microblog posts are indexed by a predefined collection of documents. In CESA, these documents are built up from common emotional expressions in social streams. In this way, texts are projected onto feelings or emotions. This process is performed within a Latent Semantic Analysis framework. A few simple regular expressions (e.g. “I feel X”, where X is a term representing an emotion or feeling) are used to mine the enormous flow of microblog posts and generate a textual representation of an emotional state with a clear polarity value (e.g. angry, happy, sad, confident, etc.). New posts can then be indexed by these feelings according to their distance to the corresponding textual representation. The approach is suitable in many scenarios dealing with social media publications and can be implemented in other languages with little effort. In particular, we have evaluated the system on polarity classification with both English and Spanish data sets. The results show that CESA is a valid solution for sentiment analysis and that similar approaches for model building from the continuous flow of posts could be exploited in other scenarios.
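The harvesting step lends itself to a short sketch. Below is a hedged illustration of how patterns such as “I feel X” might pull emotion-bearing posts from a stream; the feeling vocabulary, pattern, and function names are illustrative assumptions, not the paper's implementation.

```python
import re
from collections import defaultdict

FEELINGS = {"angry", "happy", "sad", "confident"}
PATTERN = re.compile(r"\bI feel (\w+)", re.IGNORECASE)

def harvest(posts):
    """Group posts by the feeling they explicitly express."""
    docs = defaultdict(list)
    for post in posts:
        m = PATTERN.search(post)
        if m and m.group(1).lower() in FEELINGS:
            docs[m.group(1).lower()].append(post)
    return docs

stream = ["I feel happy about the release!", "traffic again... I feel angry",
          "lunch was fine"]
print({k: len(v) for k, v in harvest(stream).items()})
```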

Journal ArticleDOI
TL;DR: This paper presents a page classification application in a banking workflow that represents administrative document images by merging visual and textual descriptions and uses an n-gram model of the page stream allowing a finer-grained classification of pages.
Abstract: In this paper, we present a page classification application in a banking workflow. The proposed architecture represents administrative document images by merging visual and textual descriptions. The visual description is based on a hierarchical representation of the pixel intensity distribution. The textual description uses latent semantic analysis to represent document content as a mixture of topics. Several off-the-shelf classifiers and different strategies for combining visual and textual cues have been evaluated. A final step uses an n-gram model of the page stream allowing a finer-grained classification of pages. The proposed method has been tested in a real large-scale environment and we report results on a dataset of 70,000 pages.

Journal ArticleDOI
TL;DR: This article examined two indices of semantic similarity (i.e., latent semantic similarity [LSS], language style matching [LSM]) to determine their respective roles in initial, unstructured learning.
Abstract: In the present study, we examined two indices of semantic similarity (i.e., latent semantic similarity [LSS], language style matching [LSM]) to determine their respective roles in initial, unstruct...

Proceedings Article
01 May 2014
TL;DR: A collection of freely available Latent Semantic Analysis models built on the entire English Wikipedia and the TASA corpus is introduced, showing that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures, and comparable to the state of the art.
Abstract: This paper introduces a collection of freely available Latent Semantic Analysis models built on the entire English Wikipedia and the TASA corpus. The models differ not only in their source, Wikipedia versus TASA, but also in the linguistic items they focus on: all words, content-words, nouns-verbs, and main concepts. Generating such models from large datasets (e.g. Wikipedia) that can provide large coverage of the actual vocabulary in use is computationally challenging, which is the reason why large LSA models are rarely available. Our experiments show that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures, and comparable to the state of the art.
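The word-to-word evaluation protocol is standard and easy to sketch: compute model cosine similarities for rated word pairs and correlate them with the human ratings. The pairs, ratings, and vectors below are toy stand-ins for the benchmark data.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy 2-d "LSA" vectors and invented human ratings, for illustration only.
vectors = {"car": np.array([0.9, 0.2]), "automobile": np.array([0.85, 0.25]),
           "coast": np.array([0.2, 0.8]), "shore": np.array([0.25, 0.75])}
pairs = [("car", "automobile", 3.92), ("coast", "shore", 3.70),
         ("car", "coast", 0.40)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgment: {rho:.2f}")
```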

Proceedings ArticleDOI
01 Oct 2014
TL;DR: Results on the task of suggesting word translations in context for 3 language pairs reveal the utility of the proposed contextualized models of crosslingual semantic similarity.
Abstract: We propose the first probabilistic approach to modeling cross-lingual semantic similarity (CLSS) in context which requires only comparable data. The approach relies on an idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair independent latent semantic concepts (e.g., cross-lingual topics obtained by a multilingual topic model). These latent cross-lingual concepts are induced from a comparable corpus without any additional lexical resources. Word meaning is represented as a probability distribution over the latent concepts, and a change in meaning is represented as a change in the distribution over these latent concepts. We present new models that modulate the isolated out-of-context word representations with contextual knowledge. Results on the task of suggesting word translations in context for 3 language pairs reveal the utility of the proposed contextualized models of cross-lingual semantic similarity.
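As one hedged reading of the representation idea, the sketch below encodes word meaning as a probability distribution over shared latent concepts and scores similarity via Jensen-Shannon distance; the distributions are invented, and the paper's actual models and similarity computation may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# P(concept | word) over 4 shared latent cross-lingual concepts (toy values).
p_en_bank = np.array([0.70, 0.20, 0.05, 0.05])    # English "bank"
p_es_banco = np.array([0.65, 0.25, 0.05, 0.05])   # Spanish "banco"
p_es_orilla = np.array([0.05, 0.05, 0.60, 0.30])  # Spanish "orilla" (river bank)

def clss(p, q):
    """Higher = more similar (1 minus Jensen-Shannon distance)."""
    return 1.0 - jensenshannon(p, q)

print(clss(p_en_bank, p_es_banco))   # high: likely translation in this context
print(clss(p_en_bank, p_es_orilla))  # lower: different dominant concepts
```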

Book ChapterDOI
31 Mar 2014
TL;DR: The Contact Representation of Word Networks (Crown) problem as discussed by the authors is a geometric representation problem where a set B of axis-aligned rectangles (boxes) with fixed dimensions and a graph with vertex set B are given.
Abstract: We study a geometric representation problem, where we are given a set B of axis-aligned rectangles (boxes) with fixed dimensions and a graph with vertex set B. The task is to place the rectangles without overlap such that two rectangles touch if the graph contains an edge between them. We call this problem Contact Representation of Word Networks (Crown). It formalizes the geometric problem behind drawing word clouds in which semantically related words are close to each other. Here, we represent words by rectangles and semantic relationships by edges.

Journal ArticleDOI
TL;DR: This research presents a new benchmark dataset for evaluating Short Text Semantic Similarity measurement algorithms and the methodology used for its creation, STSS-131, designed to meet requirements drawing on a range of resources from traditional grammar to cognitive neuroscience.
Abstract: This research presents a new benchmark dataset for evaluating Short Text Semantic Similarity (STSS) measurement algorithms and the methodology used for its creation. The power of the dataset is evaluated by using it to compare two established algorithms, STASIS and Latent Semantic Analysis. This dataset focuses on measures for use in Conversational Agents; other potential applications include email processing and data mining of social networks. Such applications involve integrating the STSS algorithm in a complex system, but STSS algorithms must be evaluated in their own right and compared with others for their effectiveness before systems integration. Semantic similarity is an artifact of human perception; therefore its evaluation is inherently empirical and requires benchmark datasets derived from human similarity ratings. The new dataset of 64 sentence pairs, STSS-131, has been designed to meet these requirements drawing on a range of resources from traditional grammar to cognitive neuroscience. The human ratings are obtained from a set of trials using new and improved experimental methods, with validated measures and statistics. The results illustrate the increased challenge and the potential longevity of the STSS-131 dataset as the Gold Standard for future STSS algorithm evaluation.

Journal ArticleDOI
TL;DR: The authors examined whether pronouns in news media occurred in evaluative contexts reflecting psychological biases, measuring the context of pronouns by computerized semantic analysis, and found that the majority of pronoun contexts reflected bias.
Abstract: This paper examines whether pronouns in news media occurred in evaluative contexts reflecting psychological biases. Contexts of pronouns were measured by computerized semantic analysis. Results sho ...

Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: This work employs computational language approaches to assess time-varying semantic and sequential properties of prose recall at various retrieval intervals in patients with schizophrenia, unaffected siblings and healthy unrelated control participants.

Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: This work designs a graphical representation for the discourse of patients with disorganized speech and of healthy participants, and describes the properties of a context-dependent neural model, based on matrix associative memories, that performs goal-oriented linguistic behavior.

Journal ArticleDOI
TL;DR: This study presents a new approach for transforming the latent representation derived from a Latent Semantic Analysis (LSA) space into one where dimensions have nonlatent meanings, supporting the conclusion that the nonlatent coordinates generated using this methodology preserve the semantic relationships within the original LSA space.
Abstract: This study presents a new approach for transforming the latent representation derived from a Latent Semantic Analysis (LSA) space into one where dimensions have nonlatent meanings. These meanings are based on lexical descriptors, which are selected by the LSA user. The authors present three analyses that provide examples of the utility of this methodology. The first analysis demonstrates how document terms can be projected into meaningful new dimensions. The second demonstrates how to use the modified space to perform multidimensional document labeling to obtain a high and substantive reliability between LSA experts. Finally, the internal validity of the method is assessed by comparing an original semantic space with a modified space. The results show high consistency between the two spaces, supporting the conclusion that the nonlatent coordinates generated using this methodology preserve the semantic relationships within the original LSA space.
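A minimal sketch of the transformation idea: re-express each document in coordinates given by user-chosen lexical descriptors, by taking its cosine similarity to each descriptor's vector in the LSA space. All vectors here are random placeholders for a real space.

```python
import numpy as np

def to_nonlatent(doc_vecs, descriptor_vecs):
    """Rows: documents; columns: one interpretable dimension per descriptor."""
    D = np.stack(descriptor_vecs)
    # Cosine similarity of every document against every descriptor.
    doc_n = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    des_n = D / np.linalg.norm(D, axis=1, keepdims=True)
    return doc_n @ des_n.T

rng = np.random.default_rng(2)
docs = rng.random((4, 10))                        # 4 documents in a 10-d LSA space
descriptors = [rng.random(10) for _ in range(3)]  # e.g. "economy", "health", "law"
print(to_nonlatent(docs, descriptors).shape)      # (4, 3): nonlatent coordinates
```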

Journal ArticleDOI
TL;DR: A multi-strategy approach for semantically guided extraction, indexing and search of educational metadata is described; it combines machine learning, concept analysis, and corpus-based natural language processing techniques.
Abstract: Secondary-school teachers are in constant need of finding relevant digital resources to support specific didactic goals. Unfortunately, generic search engines do not allow them to identify learning objects among semi-structured candidate educational resources, much less retrieve them by teaching goals. This article describes a multi-strategy approach for semantically guided extraction, indexing and search of educational metadata; it combines machine learning, concept analysis, and corpus-based natural language processing techniques. The overall model was validated by comparing extracted metadata against standard search methods and heuristic-based techniques for Classification Accuracy and Metadata Quality (as evaluated by actual teachers), yielding promising results and showing that this semantically guided metadata extraction can effectively enhance access and use of educational digital material.

Proceedings ArticleDOI
28 May 2014
TL;DR: In this paper, the authors employed latent semantic analysis (LSA) as the term-document representation to handle intelligent plagiarism, applying it in the Heuristic Retrieval (HR) and Detailed Analysis (DA) components.
Abstract: Plagiarism detection is an important task, since the number of plagiarism cases is increasing and plagiarism techniques are becoming harder to detect: there is not only literal plagiarism but also intelligent plagiarism. To handle intelligent plagiarism, we employed latent semantic analysis (LSA) as the term-document representation. The LSA was used in the Heuristic Retrieval (HR) component and the Detailed Analysis (DA) component. We conducted several experiments to compare the token type, the text segmentation and the threshold value. The test data were prepared manually from the available Indonesian paper corpus. Experimental results showed that the LSA outperformed the VSM (Vector Space Model), especially in test cases with intelligent plagiarism.
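A hedged sketch of the heuristic-retrieval step: embed the corpus and the suspicious document in an LSA space and keep sources whose cosine similarity exceeds a threshold. The corpus, dimensionality, and threshold below are illustrative; the paper compares such values experimentally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

sources = ["latent semantic analysis maps terms and documents to a topic space",
           "support vector machines separate classes with a maximal margin",
           "convolutional networks learn hierarchical image features"]
suspicious = "documents and terms are mapped into a latent topic space"

vec = TfidfVectorizer().fit(sources + [suspicious])
X = vec.transform(sources)

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
sims = cosine_similarity(svd.transform(vec.transform([suspicious])),
                         svd.transform(X))[0]

THRESHOLD = 0.5  # assumed cut-off; tuned experimentally in the paper
candidates = [i for i, s in enumerate(sims) if s >= THRESHOLD]
print(candidates)  # indices of candidate sources passed to detailed analysis
```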

Journal ArticleDOI
TL;DR: The algorithm called FSFP (Free Source selection Free Priori probability distribution) is proposed, which can transfer knowledge from long texts to short ones; experimental results on large data sets show the effectiveness of the new algorithm.
Abstract: Transfer learning studies how to identify useful knowledge and skills from previous tasks and apply them to new tasks or domains. At present, research on transfer learning mostly focuses on the field of long texts. However, transferring from long texts to short ones requires that the source data be given, along with the prior probability distribution of the data. To solve these problems, we propose an algorithm called FSFP (Free Source selection Free Priori probability distribution), which can transfer knowledge from long texts to short ones. Latent semantic analysis is used to extract key words as seed characteristic sets, which are semantically related to the long texts from the target domain. Then the graph structure of online information is built. With the help of improved Laplacian Eigenmaps, the feature representations of high-dimensional data are mapped to a low-dimensional space. Lastly, the target data are classified under the constraint of minimizing the mutual information between the instance and the feature representation. Experimental results on large data sets show the effectiveness of the new algorithm.

Journal ArticleDOI
TL;DR: The superior performance of the GOLD models suggests that a single acquisition and storage mechanism, namely co-occurrence, can account for associative and conceptual relationships between words and is more psychologically plausible than models using singular value decomposition (SVD).
Abstract: The GOLD model (Graph Of Language Distribution) is a network model constructed from co-occurrence in a large corpus of natural language. It may be used to explore what information is present in a graph-structured model of language, and what information may be extracted through theoretically driven algorithms as well as standard graph analysis methods. The present study employs GOLD to examine two types of relationship between words: semantic similarity and associative relatedness. Semantic similarity refers to the degree of overlap in meaning between words, while associative relatedness refers to the degree to which two words occur in the same schematic context. A graph-structured model of language constructed from co-occurrence should easily capture associative relatedness, because this type of relationship is thought to be present directly in lexical co-occurrence. However, it is hypothesized that semantic similarity may be extracted from the intersection of the sets of first-order connections, because two words that are semantically similar may occupy similar thematic or syntactic roles across contexts and thus would co-occur lexically with the same set of nodes. Two versions of the GOLD model that differed in their co-occurrence window, bigGOLD at the paragraph level and smallGOLD at the adjacent-word level, were directly compared to the performance of a well-established distributional model, Latent Semantic Analysis (LSA). The superior performance of the GOLD models (big and small) suggests that a single acquisition and storage mechanism, namely co-occurrence, can account for associative and conceptual relationships between words and is more psychologically plausible than models using singular value decomposition (SVD).
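The hypothesized mechanism can be sketched directly: associative relatedness as a direct edge in the co-occurrence graph, semantic similarity as overlap (here Jaccard) between first-order neighbor sets. The graph contents below are toy assumptions.

```python
# Toy co-occurrence graph: word -> set of first-order neighbors.
cooccur = {
    "doctor": {"hospital", "patient", "nurse", "treats"},
    "physician": {"hospital", "patient", "treats", "clinic"},
    "banana": {"yellow", "fruit", "peel"},
}

def associative(w1, w2):
    """Directly linked in the co-occurrence graph?"""
    return w2 in cooccur.get(w1, set()) or w1 in cooccur.get(w2, set())

def semantic_similarity(w1, w2):
    """Jaccard overlap of first-order neighbor sets."""
    a, b = cooccur.get(w1, set()), cooccur.get(w2, set())
    return len(a & b) / len(a | b) if a | b else 0.0

print(associative("doctor", "hospital"))           # True: co-occur directly
print(semantic_similarity("doctor", "physician"))  # high: shared neighbors
print(semantic_similarity("doctor", "banana"))     # 0.0: disjoint contexts
```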

Journal ArticleDOI
TL;DR: A Random Forest method is proposed for sentiment classification of movie reviews, together with an LSA-based filtering mechanism to reduce the size of the review summary.
Abstract: A framework is designed for sentiment classification and feature-based summarization in a mobile environment. Posting online reviews has become an increasingly popular way for people to share their opinions about a specific product or service with other users, and it has become common practice for web technologies to provide the venues and facilities for people to publish their reviews. Sentiment classification and feature-based summarization are essential steps for the classification and summarization of movie reviews. The system uses a Random Forest method for sentiment classification of movie reviews. Identifying movie features and opinion words is important for feature-based summarization: the system identifies movie features using Latent Semantic Analysis (LSA) combined with a frequency-based approach, and identifies opinion words using part-of-speech (POS) tagging. The LSA result is extended to an LSA-based filtering mechanism that reduces the size of the review summary. The system design focuses on sentiment classification accuracy and system response time.

Journal ArticleDOI
TL;DR: A good qualitative account of word similarities may be obtained by adjusting the cosine between word vectors from latent semantic analysis for vector lengths in a manner analogous to the quantum geometric model of similarity.
Abstract: A good qualitative account of word similarities may be obtained by adjusting the cosine between word vectors from latent semantic analysis for vector lengths in a manner analogous to the quantum geometric model of similarity.

Journal ArticleDOI
TL;DR: A comparison between Cosine Similarity and the k-Nearest Neighbors algorithm within the Latent Semantic Analysis method for automatically scoring Arabic essays is presented; using Cosine Similarity with Latent Semantic Analysis led to better results.
Abstract: In this paper, a comparison between Cosine Similarity and the k-Nearest Neighbors algorithm within the Latent Semantic Analysis method for automatically scoring Arabic essays is presented. The approach also improves Latent Semantic Analysis by preprocessing the entered text: unifying the form of letters, removing formatting, replacing synonyms, stemming, and deleting stop words. The results showed that using Cosine Similarity with Latent Semantic Analysis led to better results than using k-Nearest Neighbors with Latent Semantic Analysis.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: An automated movie recommendation system based on movie similarity: given a target movie selected by the user, the goal of the system is to provide a list of those movies that are most similar to the target one, without knowing any user preferences.
Abstract: Recommendation systems have become successful at suggesting content that is likely to be of interest to the user; however, their performance greatly suffers when little information about the user's preferences is given. In this paper we propose an automated movie recommendation system based on movie similarity: given a target movie selected by the user, the goal of the system is to provide a list of those movies that are most similar to the target one, without knowing any user preferences. The topic models Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) have been applied and extensively compared on a movie database of two hundred thousand plots. Experiments are an important part of the paper: we examined the topic models' behaviour using standard metrics and user evaluations, and we conducted performance assessments with 30 users to compare our approach with a commercial system. The outcome was that the performance of LSA was superior to that of LDA in supporting the selection of similar plots. Even though our system does not outperform commercial systems, it does not rely on human effort, so it can be ported to any domain where natural language descriptions exist. Since it is independent of the number of user ratings, it is able to suggest famous movies as well as old or little-known movies that are still strongly related to the content of the video the user has watched.
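A rough sketch of the comparison described above, using scikit-learn as a stand-in for the authors' tooling: represent plots with LSA (truncated SVD over tf-idf) and LDA (over raw counts), then rank plots by cosine similarity to a target. The plots and dimensionalities are toy assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

plots = ["a detective hunts a serial killer in a rainy city",
         "a rookie cop chases a murderer through the night",
         "two robots fall in love on a distant planet"]

# LSA pipeline.
X_tfidf = TfidfVectorizer().fit_transform(plots)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)

# LDA pipeline (operates on raw counts).
X_counts = CountVectorizer().fit_transform(plots)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X_counts)

target = 0  # the movie the user selected
print("LSA ranking:", cosine_similarity(lsa[target:target + 1], lsa)[0].argsort()[::-1])
print("LDA ranking:", cosine_similarity(lda[target:target + 1], lda)[0].argsort()[::-1])
```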