
Showing papers on "Latent semantic analysis" published in 2014


Journal ArticleDOI
TL;DR: This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
Abstract: Distributional semantic models derive computational representations of word meaning from the patterns of co-occurrence of words in text. Such models have been a success story of computational linguistics, being able to provide reliable estimates of semantic relatedness for the many semantic tasks requiring them. However, distributional models extract meaning information exclusively from text, which is an extremely impoverished basis compared to the rich perceptual sources that ground human semantic knowledge. We address the lack of perceptual grounding of distributional models by exploiting computer vision techniques that automatically identify discrete "visual words" in images, so that the distributional representation of a word can be extended to also encompass its co-occurrence with the visual words of images it is associated with. We propose a flexible architecture to integrate text- and image-based distributional information, and we show in a set of empirical tests that our integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
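As a rough illustration of the integration idea (not the paper's exact architecture), the sketch below concatenates L2-normalized text-based and visual-word-based co-occurrence vectors with a mixing weight; the function names, dimensions, and 50/50 default weighting are assumptions.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(text_vec, visual_vec, alpha=0.5):
    """Concatenate L2-normalized text and visual co-occurrence vectors;
    alpha sets the relative weight of the textual channel."""
    return np.concatenate([alpha * normalize(text_vec),
                           (1 - alpha) * normalize(visual_vec)])

# Toy counts: a word's co-occurrence with 4 context words and 3 "visual words".
text_vec = np.array([3.0, 0.0, 1.0, 2.0])
visual_vec = np.array([5.0, 1.0, 0.0])
print(fuse(text_vec, visual_vec))
```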

900 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: A new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents is proposed.
Abstract: In this paper, we propose a new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents. In order to capture the rich contextual structures in a query or a document, we start with each word within a temporal context window in a word sequence to directly capture contextual features at the word n-gram level. Next, the salient word n-gram features in the word sequence are discovered by the model and are then aggregated to form a sentence-level feature vector. Finally, a non-linear transformation is applied to extract high-level semantic information to generate a continuous vector representation for the full text string. The proposed convolutional latent semantic model (CLSM) is trained on clickthrough data and is evaluated on a Web document ranking task using a large-scale, real-world data set. Results show that the proposed model effectively captures salient semantic information in queries and documents for the task while significantly outperforming previous state-of-the-art semantic models.
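The pipeline the abstract walks through (n-gram convolution, max pooling, non-linear projection) can be sketched in a few lines of numpy; the dimensions and random weights below are placeholders, whereas the real CLSM uses letter-trigram word hashing and is trained on clickthrough data.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_dim, conv_dim, sem_dim, n = 500, 300, 128, 3  # n = context window size

W_conv = rng.standard_normal((conv_dim, n * vocab_dim)) * 0.01
W_sem = rng.standard_normal((sem_dim, conv_dim)) * 0.01

def clsm_vector(word_vectors):
    """Map a sequence of word vectors to a single semantic vector."""
    # Slide a window of n words and convolve each window.
    windows = [np.concatenate(word_vectors[i:i + n])
               for i in range(len(word_vectors) - n + 1)]
    conv = np.tanh(np.stack(windows) @ W_conv.T)  # local n-gram features
    pooled = conv.max(axis=0)                     # max pooling over positions
    return np.tanh(W_sem @ pooled)                # sentence-level semantic vector

sentence = [rng.random(vocab_dim) for _ in range(7)]  # 7 toy word vectors
print(clsm_vector(sentence).shape)  # (128,)
```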

723 citations


Book ChapterDOI
06 Sep 2014
TL;DR: Latent category learning (LCL), an unsupervised learning method that requires only image-level class labels, is proposed; it uses probabilistic Latent Semantic Analysis to learn latent categories, which can represent objects, object parts or backgrounds.
Abstract: Localizing objects in cluttered backgrounds is a challenging task in weakly supervised localization. Due to large object variations in cluttered images, objects are highly ambiguous with respect to their backgrounds. However, backgrounds contain useful latent information, e.g., the sky for aeroplanes. If we can learn this latent information, object-background ambiguity can be reduced to suppress the background. In this paper, we propose latent category learning (LCL), an unsupervised learning method that requires only image-level class labels. Firstly, inspired by latent semantic discovery, we use probabilistic Latent Semantic Analysis (pLSA) to learn the latent categories, which can represent objects, object parts or backgrounds. Secondly, to determine which category contains the target object, we propose a category selection method that evaluates each category's discrimination. We evaluate the method on the PASCAL VOC 2007 database and the ILSVRC 2013 detection challenge. On VOC 2007, the proposed method yields an annotation accuracy of 48%, outperforming previous results by 10%. More importantly, we achieve a detection average precision of 30.9%, which improves on previous results by 8% and is competitive with the supervised deformable part model (DPM) 5.0 baseline of 33.7%. On ILSVRC 2013 detection, the method yields a precision of 6.0%, which is also competitive with DPM 5.0.
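Since pLSA is the core tool in the first step, a compact EM sketch may help; the word-by-image counts, the number of latent categories, and the iteration budget below are toy assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(6, 8)).astype(float)  # toy word-by-image count matrix
K = 2                                              # number of latent categories

p_z = np.full(K, 1.0 / K)
p_w_z = rng.dirichlet(np.ones(6), size=K)  # P(word | z), shape (K, 6)
p_d_z = rng.dirichlet(np.ones(8), size=K)  # P(image | z), shape (K, 8)

for _ in range(50):
    # E-step: responsibilities P(z | w, d), shape (K, 6, 8).
    joint = p_z[:, None, None] * p_w_z[:, :, None] * p_d_z[:, None, :]
    resp = joint / joint.sum(axis=0, keepdims=True)
    # M-step: re-estimate all distributions from expected counts.
    weighted = resp * N[None, :, :]
    p_w_z = weighted.sum(axis=2); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = weighted.sum(axis=1); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()

print(p_w_z.round(2))  # each row: one latent category's word distribution
```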

229 citations


Journal ArticleDOI
TL;DR: A computational text analysis technique for measuring the moral loading of concepts as they are used in a corpus, using latent semantic analysis to compute the semantic similarity between concepts and moral keywords taken from the “Moral Foundations Dictionary”.
Abstract: In this paper we present a computational text analysis technique for measuring the moral loading of concepts as they are used in a corpus. This method is especially useful for the study of online corpora as it allows for the rapid analysis of moral rhetoric in texts such as blogs and tweets as events unfold. We use latent semantic analysis to compute the semantic similarity between concepts and moral keywords taken from the “Moral Foundations Dictionary”. This measure of semantic similarity represents the loading of these concepts on the five moral dimensions identified by moral foundations theory. We demonstrate the efficacy of this method using three different concepts and corpora.
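The measurement itself is simple to sketch: the moral loading of a concept is its average cosine similarity, in the LSA space, to the keywords of a moral dimension. The vectors and abbreviated keyword list below are placeholders for a real LSA space and the Moral Foundations Dictionary.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def moral_loading(concept, dimension_keywords, vectors):
    """Mean cosine similarity between a concept and a moral dimension's keywords."""
    sims = [cosine(vectors[concept], vectors[w])
            for w in dimension_keywords if w in vectors]
    return sum(sims) / len(sims)

# Toy 3-d "LSA" vectors purely for illustration.
vectors = {"abortion": np.array([0.2, 0.9, 0.1]),
           "harm": np.array([0.1, 0.8, 0.3]),
           "suffer": np.array([0.3, 0.7, 0.2])}
print(moral_loading("abortion", ["harm", "suffer"], vectors))
```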

87 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.
Abstract: In this paper, genetic algorithm oriented latent semantic features (GALSF) are proposed to obtain better representation of documents in text classification. The proposed approach consists of feature selection and feature transformation stages. The first stage is carried out using the state-of-the-art filter-based methods. The second stage employs latent semantic indexing (LSI) empowered by genetic algorithm such that a better projection is attained using appropriate singular vectors, which are not limited to the ones corresponding to the largest singular values, unlike standard LSI approach. In this way, the singular vectors with small singular values may also be used for projection whereas the vectors with large singular values may be eliminated as well to obtain better discrimination. Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.
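The key departure from standard LSI is that any subset of singular vectors may be kept, not just the top-k. A minimal sketch of that idea follows, encoding a candidate subset as a binary mask over the rows of V^T; the fitness evaluation the genetic algorithm would apply to each mask is omitted, and all dimensions are toy assumptions.

```python
import numpy as np

def project(X, Vt, mask):
    """Project documents onto the singular vectors selected by a 0/1 mask."""
    return X @ Vt[mask.astype(bool)].T

X = np.random.default_rng(1).random((20, 50))  # toy document-term features
_, _, Vt = np.linalg.svd(X, full_matrices=False)

top_k = np.zeros(Vt.shape[0]); top_k[:5] = 1  # standard LSI: top-5 vectors
ga_candidate = np.zeros(Vt.shape[0])
ga_candidate[[0, 2, 7, 11, 19]] = 1           # a GA-discovered subset (illustrative)

print(project(X, Vt, top_k).shape, project(X, Vt, ga_candidate).shape)
```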

83 citations


Journal ArticleDOI
TL;DR: A number of potential applications of LSA are discussed to show how it can be used in empirical Operations Management research, specifically in areas that can benefit from analyzing large volumes of unstructured textual data.
Abstract: In this article, we introduce the use of Latent Semantic Analysis (LSA) as a technique for uncovering the intellectual structure of a discipline. LSA is an emerging quantitative method for content analysis that combines rigorous statistical techniques and scholarly judgment as it proceeds to extract and decipher key latent factors. We provide a stepwise explanation and illustration for implementing LSA. To demonstrate LSA's ability to uncover the intellectual structure of a discipline, we present a study of the field of Operations Management. We also discuss a number of potential applications of LSA to show how it can be used in empirical Operations Management research, specifically in areas that can benefit from analyzing large volumes of unstructured textual data.
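As a rough illustration of such a stepwise LSA pipeline (not the article's exact procedure), the sketch below builds a term-document matrix, reduces it with truncated SVD, and inspects the term loadings of each latent factor; the corpus and component count are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["inventory control policies in supply chains",
          "queueing models for service operations",
          "supply chain coordination under uncertainty"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(corpus)                 # term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)               # documents in latent-factor space

# Inspect which terms load on each latent factor.
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[-3:][::-1]
    print(f"factor {i}:", [terms[j] for j in top])
```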

80 citations


Book ChapterDOI
01 Jan 2014
TL;DR: This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain.
Abstract: Text is a very important type of data within the biomedical domain. For example, patient records contain large amounts of text which has been entered in a non-standardized format, consequently posing many challenges to the processing of such data. For the clinical doctor, the written text of the medical findings, rather than images or multimedia data, is still the basis for decision making. However, the steadily increasing volumes of unstructured information call for machine learning approaches to data mining, i.e. text mining. This paper provides a short, concise overview of some selected text mining methods, focusing on statistical methods, i.e. Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain. Finally, we provide some open problems and future challenges, particularly from the clinical domain, that we expect to stimulate future research.

73 citations


Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: These findings support the utility of a natural language processing technique, Latent Semantic Analysis, in examining the contribution of coherence to thought disorder and its relationship with daily functioning.

70 citations


Book ChapterDOI
18 Oct 2014
TL;DR: The experimental results show that in element-level matching, word embeddings could achieve better performance than previous methods.
Abstract: Ontology matching is one of the most important tasks in achieving the goal of the semantic web. To fulfill this task, element-level matching is an indispensable step for obtaining the fundamental alignment. In the element-level matching process, previous work generally utilizes WordNet to compute the semantic similarities among elements, but WordNet is limited by its coverage. In this paper, we introduce word embeddings to the field of ontology matching. We verified the superiority of word embeddings and presented a hybrid method to incorporate word embeddings into the computation of the semantic similarities among elements. We performed the experiments on the OAEI benchmark, conference track and real-world ontologies. The experimental results show that in element-level matching, word embeddings achieve better performance than previous methods.
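A hedged sketch of what element-level matching with embeddings can look like: the similarity of two ontology labels is taken as the average best cosine match between their tokens. The tiny embedding table and the aggregation rule are illustrative assumptions, not the paper's hybrid method.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def label_similarity(label_a, label_b, emb):
    """Average, over tokens of label_a, of the best cosine match in label_b."""
    scores = []
    for ta in label_a.lower().split():
        if ta not in emb:
            continue
        best = max((cosine(emb[ta], emb[tb])
                    for tb in label_b.lower().split() if tb in emb),
                   default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

# Toy 2-d embeddings standing in for any pretrained model.
emb = {"conference": np.array([0.9, 0.1]), "meeting": np.array([0.8, 0.3]),
       "paper": np.array([0.1, 0.9]), "article": np.array([0.2, 0.8])}
print(label_similarity("conference paper", "meeting article", emb))
```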

43 citations


Journal ArticleDOI
TL;DR: This paper examined how well corpus-based methods predict the amplitude of the N400 component of the event-related potential (ERP), an online measure of lexical processing in brain electrical activity.

42 citations


Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: Assessing the psycholinguistic characteristics of words produced spontaneously by SD patients during an autobiographical memory interview revealed changes in the lexical-semantic landscape related to semantic diversity: the highly frequent and abstract words most prevalent in the patients' speech were also the most semantically diverse.

Journal ArticleDOI
TL;DR: The results show that CESA is a valid solution for sentiment analysis and that similar approaches for model building from the continuous flow of posts could be exploited in other scenarios.
Abstract: With the rapid growth of data generated by social web applications, new paradigms in the generation of knowledge are emerging. This paper introduces Crowd Explicit Sentiment Analysis (CESA) as an approach for sentiment analysis in social media environments. Similar to Explicit Semantic Analysis, microblog posts are indexed by a predefined collection of documents. In CESA, these documents are built up from common emotional expressions in social streams. In this way, texts are projected onto feelings or emotions. This process is performed within a Latent Semantic Analysis framework. A few simple regular expressions (e.g. “I feel X”, where X is a term representing an emotion or feeling) are used to mine the enormous flow of microblog posts and generate a textual representation of an emotional state with a clear polarity value (e.g. angry, happy, sad, confident, etc.). New posts can then be indexed by these feelings according to their distance to the corresponding textual representation. The approach is suitable in many scenarios dealing with social media publications and can be implemented in other languages with little effort. In particular, we have evaluated the system on polarity classification with both English and Spanish data sets. The results show that CESA is a valid solution for sentiment analysis and that similar approaches for model building from the continuous flow of posts could be exploited in other scenarios.
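The harvesting step lends itself to a short sketch. Below is a hedged illustration of how patterns such as “I feel X” might pull emotion-bearing posts from a stream; the feeling vocabulary, pattern, and function names are illustrative assumptions, not the paper's implementation.

```python
import re
from collections import defaultdict

FEELINGS = {"angry", "happy", "sad", "confident"}
PATTERN = re.compile(r"\bI feel (\w+)", re.IGNORECASE)

def harvest(posts):
    """Group posts by the feeling they explicitly express."""
    docs = defaultdict(list)
    for post in posts:
        m = PATTERN.search(post)
        if m and m.group(1).lower() in FEELINGS:
            docs[m.group(1).lower()].append(post)
    return docs

stream = ["I feel happy about the release!", "traffic again... I feel angry",
          "lunch was fine"]
print({k: len(v) for k, v in harvest(stream).items()})
```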

Journal ArticleDOI
TL;DR: This paper presents a page classification application in a banking workflow that represents administrative document images by merging visual and textual descriptions and uses an n-gram model of the page stream allowing a finer-grained classification of pages.
Abstract: In this paper, we present a page classification application in a banking workflow. The proposed architecture represents administrative document images by merging visual and textual descriptions. The visual description is based on a hierarchical representation of the pixel intensity distribution. The textual description uses latent semantic analysis to represent document content as a mixture of topics. Several off-the-shelf classifiers and different strategies for combining visual and textual cues have been evaluated. A final step uses an n-gram model of the page stream allowing a finer-grained classification of pages. The proposed method has been tested in a real large-scale environment and we report results on a dataset of 70,000 pages.

Journal ArticleDOI
TL;DR: This article examined two indices of semantic similarity (i.e., latent semantic similarity [LSS], language style matching [LSM]) to determine their respective roles in initial, unstructured learning.
Abstract: In the present study, we examined two indices of semantic similarity (i.e., latent semantic similarity [LSS], language style matching [LSM]) to determine their respective roles in initial, unstruct...

Proceedings Article
01 May 2014
TL;DR: A collection of freely available Latent Semantic Analysis models built on the entire English Wikipedia and the TASA corpus is introduced, showing that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures, and comparable to the state of the art.
Abstract: This paper introduces a collection of freely available Latent Semantic Analysis models built on the entire English Wikipedia and the TASA corpus. The models differ not only in their source, Wikipedia versus TASA, but also in the linguistic items they focus on: all words, content-words, nouns-verbs, and main concepts. Generating such models from large datasets (e.g. Wikipedia) that can provide large coverage of the actual vocabulary in use is computationally challenging, which is the reason why large LSA models are rarely available. Our experiments show that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures, and comparable to the state of the art.
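The word-to-word evaluation protocol is standard and easy to sketch: compute model cosine similarities for rated word pairs and correlate them with the human ratings. The pairs, ratings, and vectors below are toy stand-ins for the benchmark data.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy 2-d "LSA" vectors and invented human ratings, for illustration only.
vectors = {"car": np.array([0.9, 0.2]), "automobile": np.array([0.85, 0.25]),
           "coast": np.array([0.2, 0.8]), "shore": np.array([0.25, 0.75])}
pairs = [("car", "automobile", 3.92), ("coast", "shore", 3.70),
         ("car", "coast", 0.40)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgment: {rho:.2f}")
```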

Proceedings ArticleDOI
01 Oct 2014
TL;DR: Results on the task of suggesting word translations in context for 3 language pairs reveal the utility of the proposed contextualized models of crosslingual semantic similarity.
Abstract: We propose the first probabilistic approach to modeling cross-lingual semantic similarity (CLSS) in context which requires only comparable data. The approach relies on an idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair independent latent semantic concepts (e.g., cross-lingual topics obtained by a multilingual topic model). These latent cross-lingual concepts are induced from a comparable corpus without any additional lexical resources. Word meaning is represented as a probability distribution over the latent concepts, and a change in meaning is represented as a change in the distribution over these latent concepts. We present new models that modulate the isolated out-of-context word representations with contextual knowledge. Results on the task of suggesting word translations in context for 3 language pairs reveal the utility of the proposed contextualized models of cross-lingual semantic similarity.
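As one hedged reading of the representation idea, the sketch below encodes word meaning as a probability distribution over shared latent concepts and scores similarity via Jensen-Shannon distance; the distributions are invented, and the paper's actual models and similarity computation may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# P(concept | word) over 4 shared latent cross-lingual concepts (toy values).
p_en_bank = np.array([0.70, 0.20, 0.05, 0.05])    # English "bank"
p_es_banco = np.array([0.65, 0.25, 0.05, 0.05])   # Spanish "banco"
p_es_orilla = np.array([0.05, 0.05, 0.60, 0.30])  # Spanish "orilla" (river bank)

def clss(p, q):
    """Higher = more similar (1 minus Jensen-Shannon distance)."""
    return 1.0 - jensenshannon(p, q)

print(clss(p_en_bank, p_es_banco))   # high: likely translation in this context
print(clss(p_en_bank, p_es_orilla))  # lower: different dominant concepts
```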

Book ChapterDOI
31 Mar 2014
TL;DR: The Contact Representation of Word Networks (Crown) problem as discussed by the authors is a geometric representation problem where a set B of axis-aligned rectangles (boxes) with fixed dimensions and a graph with vertex set B are given.
Abstract: We study a geometric representation problem, where we are given a set B of axis-aligned rectangles (boxes) with fixed dimensions and a graph with vertex set B. The task is to place the rectangles without overlap such that two rectangles touch if the graph contains an edge between them. We call this problem Contact Representation of Word Networks (Crown). It formalizes the geometric problem behind drawing word clouds in which semantically related words are close to each other. Here, we represent words by rectangles and semantic relationships by edges.

Journal ArticleDOI
TL;DR: This research presents a new benchmark dataset for evaluating Short Text Semantic Similarity measurement algorithms and the methodology used for its creation, STSS-131, designed to meet requirements drawing on a range of resources from traditional grammar to cognitive neuroscience.
Abstract: This research presents a new benchmark dataset for evaluating Short Text Semantic Similarity (STSS) measurement algorithms and the methodology used for its creation. The power of the dataset is evaluated by using it to compare two established algorithms, STASIS and Latent Semantic Analysis. This dataset focuses on measures for use in Conversational Agents; other potential applications include email processing and data mining of social networks. Such applications involve integrating the STSS algorithm in a complex system, but STSS algorithms must be evaluated in their own right and compared with others for their effectiveness before systems integration. Semantic similarity is an artifact of human perception; therefore its evaluation is inherently empirical and requires benchmark datasets derived from human similarity ratings. The new dataset of 64 sentence pairs, STSS-131, has been designed to meet these requirements drawing on a range of resources from traditional grammar to cognitive neuroscience. The human ratings are obtained from a set of trials using new and improved experimental methods, with validated measures and statistics. The results illustrate the increased challenge and the potential longevity of the STSS-131 dataset as the Gold Standard for future STSS algorithm evaluation.

Journal ArticleDOI
TL;DR: The authors examined whether pronouns in news media occurred in evaluative contexts reflecting psychological biases, measuring the context of pronouns by computerized semantic analysis, and found that the majority of pronoun contexts reflected bias.
Abstract: This paper examines whether pronouns in news media occurred in evaluative contexts reflecting psychological biases. Contexts of pronouns were measured by computerized semantic analysis. Results sho ...

Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: This work employs computational language approaches to assess time-varying semantic and sequential properties of prose recall at various retrieval intervals in patients with schizophrenia, unaffected siblings and healthy unrelated control participants.

Journal ArticleDOI
01 Jun 2014-Cortex
TL;DR: This work designs a graphical representation for the discourse of patients with disorganized speech and of healthy participants, and describes the properties of a context-dependent neural model, based on matrix associative memories, that performs goal-oriented linguistic behavior.

Journal ArticleDOI
TL;DR: This study presents a new approach for transforming the latent representation derived from a Latent Semantic Analysis (LSA) space into one where dimensions have nonlatent meanings, supporting the conclusion that the nonlatent coordinates generated using this methodology preserve the semantic relationships within the original LSA space.
Abstract: This study presents a new approach for transforming the latent representation derived from a Latent Semantic Analysis (LSA) space into one where dimensions have nonlatent meanings. These meanings are based on lexical descriptors, which are selected by the LSA user. The authors present three analyses that provide examples of the utility of this methodology. The first analysis demonstrates how document terms can be projected into meaningful new dimensions. The second demonstrates how to use the modified space to perform multidimensional document labeling to obtain a high and substantive reliability between LSA experts. Finally, the internal validity of the method is assessed by comparing an original semantic space with a modified space. The results show high consistency between the two spaces, supporting the conclusion that the nonlatent coordinates generated using this methodology preserve the semantic relationships within the original LSA space.
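A minimal sketch of the transformation idea: re-express each document in coordinates given by user-chosen lexical descriptors, by taking its cosine similarity to each descriptor's vector in the LSA space. All vectors here are random placeholders for a real space.

```python
import numpy as np

def to_nonlatent(doc_vecs, descriptor_vecs):
    """Rows: documents; columns: one interpretable dimension per descriptor."""
    D = np.stack(descriptor_vecs)
    # Cosine similarity of every document against every descriptor.
    doc_n = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    des_n = D / np.linalg.norm(D, axis=1, keepdims=True)
    return doc_n @ des_n.T

rng = np.random.default_rng(2)
docs = rng.random((4, 10))                        # 4 documents in a 10-d LSA space
descriptors = [rng.random(10) for _ in range(3)]  # e.g. "economy", "health", "law"
print(to_nonlatent(docs, descriptors).shape)      # (4, 3): nonlatent coordinates
```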

Journal ArticleDOI
TL;DR: A multi-strategy approach for semantically guided extraction, indexing and search of educational metadata is described; it combines machine learning, concept analysis, and corpus-based natural language processing techniques.
Abstract: Secondary-school teachers are in constant need of finding relevant digital resources to support specific didactic goals. Unfortunately, generic search engines do not allow them to identify learning objects among semi-structured candidate educational resources, much less retrieve them by teaching goals. This article describes a multi-strategy approach for semantically guided extraction, indexing and search of educational metadata; it combines machine learning, concept analysis, and corpus-based natural language processing techniques. The overall model was validated by comparing extracted metadata against standard search methods and heuristic-based techniques for Classification Accuracy and Metadata Quality (as evaluated by actual teachers), yielding promising results and showing that this semantically guided metadata extraction can effectively enhance access and use of educational digital material.

Proceedings ArticleDOI
28 May 2014
TL;DR: In this paper, the authors employed latent semantic analysis (LSA) as the term-document representation to handle intelligent plagiarism, applying it in the Heuristic Retrieval (HR) and Detailed Analysis (DA) components.
Abstract: Plagiarism detection is an important task, since the number of plagiarism cases is increasing and plagiarism techniques are becoming harder to detect: there is not only literal plagiarism but also intelligent plagiarism. To handle intelligent plagiarism, we employed latent semantic analysis (LSA) as the term-document representation. The LSA was used in the Heuristic Retrieval (HR) component and the Detailed Analysis (DA) component. We conducted several experiments to compare the token type, the text segmentation and the threshold value. The test data were prepared manually from the available Indonesian paper corpus. Experimental results showed that the LSA outperformed the VSM (Vector Space Model), especially in test cases with intelligent plagiarism.
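A hedged sketch of the heuristic-retrieval step: embed the corpus and the suspicious document in an LSA space and keep sources whose cosine similarity exceeds a threshold. The corpus, dimensionality, and threshold below are illustrative; the paper compares such values experimentally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

sources = ["latent semantic analysis maps terms and documents to a topic space",
           "support vector machines separate classes with a maximal margin",
           "convolutional networks learn hierarchical image features"]
suspicious = "documents and terms are mapped into a latent topic space"

vec = TfidfVectorizer().fit(sources + [suspicious])
X = vec.transform(sources)

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
sims = cosine_similarity(svd.transform(vec.transform([suspicious])),
                         svd.transform(X))[0]

THRESHOLD = 0.5  # assumed cut-off; tuned experimentally in the paper
candidates = [i for i, s in enumerate(sims) if s >= THRESHOLD]
print(candidates)  # indices of candidate sources passed to detailed analysis
```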

Journal ArticleDOI
TL;DR: The algorithm called FSFP (Free Source selection Free Priori probability distribution) is proposed, which can transfer knowledge from long texts to short ones; experimental results on large data sets show the effectiveness of the new algorithm.
Abstract: Transfer learning studies how to identify useful knowledge and skills from previous tasks and apply them to new tasks or domains. At present, research on transfer learning mostly focuses on the field of long texts. However, transferring from long texts to short ones requires that the source data be given, along with the prior probability distribution of the data. To solve these problems, we propose an algorithm called FSFP (Free Source selection Free Priori probability distribution), which can transfer knowledge from long texts to short ones. Latent semantic analysis is used to extract key words as seed characteristic sets, which are semantically related to the long texts from the target domain. Then the graph structure of online information is built. With the help of improved Laplacian Eigenmaps, the feature representations of high-dimensional data are mapped to a low-dimensional space. Lastly, the target data are classified under the constraint of minimizing the mutual information between the instance and the feature representation. Experimental results on large data sets show the effectiveness of the new algorithm.

Journal ArticleDOI
TL;DR: The superior performance of the GOLD models suggests that a single acquisition and storage mechanism, namely co-occurrence, can account for associative and conceptual relationships between words and is more psychologically plausible than models using singular value decomposition (SVD).
Abstract: The GOLD model (Graph Of Language Distribution) is a network model constructed from co-occurrence in a large corpus of natural language. It may be used to explore what information is present in a graph-structured model of language, and what information may be extracted through theoretically driven algorithms as well as standard graph analysis methods. The present study employs GOLD to examine two types of relationship between words: semantic similarity and associative relatedness. Semantic similarity refers to the degree of overlap in meaning between words, while associative relatedness refers to the degree to which two words occur in the same schematic context. A graph-structured model of language constructed from co-occurrence should easily capture associative relatedness, because this type of relationship is thought to be present directly in lexical co-occurrence. However, it is hypothesized that semantic similarity may be extracted from the intersection of the sets of first-order connections, because two words that are semantically similar may occupy similar thematic or syntactic roles across contexts and thus would co-occur lexically with the same set of nodes. Two versions of the GOLD model that differed in their co-occurrence window, bigGOLD at the paragraph level and smallGOLD at the adjacent-word level, were directly compared to the performance of a well-established distributional model, Latent Semantic Analysis (LSA). The superior performance of the GOLD models (big and small) suggests that a single acquisition and storage mechanism, namely co-occurrence, can account for associative and conceptual relationships between words and is more psychologically plausible than models using singular value decomposition (SVD).
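The hypothesized mechanism can be sketched directly: associative relatedness as a direct edge in the co-occurrence graph, semantic similarity as overlap (here Jaccard) between first-order neighbor sets. The graph contents below are toy assumptions.

```python
# Toy co-occurrence graph: word -> set of first-order neighbors.
cooccur = {
    "doctor": {"hospital", "patient", "nurse", "treats"},
    "physician": {"hospital", "patient", "treats", "clinic"},
    "banana": {"yellow", "fruit", "peel"},
}

def associative(w1, w2):
    """Directly linked in the co-occurrence graph?"""
    return w2 in cooccur.get(w1, set()) or w1 in cooccur.get(w2, set())

def semantic_similarity(w1, w2):
    """Jaccard overlap of first-order neighbor sets."""
    a, b = cooccur.get(w1, set()), cooccur.get(w2, set())
    return len(a & b) / len(a | b) if a | b else 0.0

print(associative("doctor", "hospital"))           # True: co-occur directly
print(semantic_similarity("doctor", "physician"))  # high: shared neighbors
print(semantic_similarity("doctor", "banana"))     # 0.0: disjoint contexts
```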

Journal ArticleDOI
TL;DR: A Random Forest method is proposed for sentiment classification of movie reviews, together with an LSA-based filtering mechanism to reduce the size of the review summary.
Abstract: A framework is designed for sentiment classification and feature-based summarization in a mobile environment. Posting online reviews has become an increasingly popular way for people to share their opinions about a specific product or service with other users, and it has become common practice for web technologies to provide the venues and facilities for people to publish their reviews. Sentiment classification and feature-based summarization are essential steps for the classification and summarization of movie reviews. The system uses a Random Forest method for sentiment classification of movie reviews. Identifying movie features and opinion words is important for feature-based summarization: the system identifies movie features using Latent Semantic Analysis (LSA) combined with a frequency-based approach, and identifies opinion words using part-of-speech (POS) tagging. The LSA result is extended to an LSA-based filtering mechanism that reduces the size of the review summary. The system design focuses on sentiment classification accuracy and system response time.

Journal ArticleDOI
TL;DR: A good qualitative account of word similarities may be obtained by adjusting the cosine between word vectors from latent semantic analysis for vector lengths in a manner analogous to the quantum geometric model of similarity.
Abstract: A good qualitative account of word similarities may be obtained by adjusting the cosine between word vectors from latent semantic analysis for vector lengths in a manner analogous to the quantum geometric model of similarity.

Journal ArticleDOI
TL;DR: A comparison between Cosine Similarity and the k-Nearest Neighbors algorithm within the Latent Semantic Analysis method for automatically scoring Arabic essays is presented; using Cosine Similarity with Latent Semantic Analysis led to better results.
Abstract: In this paper, a comparison between Cosine Similarity and the k-Nearest Neighbors algorithm within the Latent Semantic Analysis method for automatically scoring Arabic essays is presented. The approach also improves Latent Semantic Analysis by preprocessing the entered text: unifying the form of letters, removing formatting, replacing synonyms, stemming, and deleting stop words. The results showed that using Cosine Similarity with Latent Semantic Analysis led to better results than using k-Nearest Neighbors with Latent Semantic Analysis.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: An automated movie recommendation system based on movie similarity: given a target movie selected by the user, the goal of the system is to provide a list of those movies that are most similar to the target one, without knowing any user preferences.
Abstract: Recommendation systems have become successful at suggesting content that is likely to be of interest to the user; however, their performance greatly suffers when little information about the user's preferences is given. In this paper we propose an automated movie recommendation system based on movie similarity: given a target movie selected by the user, the goal of the system is to provide a list of those movies that are most similar to the target one, without knowing any user preferences. The topic models Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) have been applied and extensively compared on a movie database of two hundred thousand plots. Experiments are an important part of the paper: we examined the topic models' behaviour using standard metrics and user evaluations, and we conducted performance assessments with 30 users to compare our approach with a commercial system. The outcome was that the performance of LSA was superior to that of LDA in supporting the selection of similar plots. Even though our system does not outperform commercial systems, it does not rely on human effort, so it can be ported to any domain where natural language descriptions exist. Since it is independent of the number of user ratings, it is able to suggest famous movies as well as old or little-known movies that are still strongly related to the content of the video the user has watched.
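A rough sketch of the comparison described above, using scikit-learn as a stand-in for the authors' tooling: represent plots with LSA (truncated SVD over tf-idf) and LDA (over raw counts), then rank plots by cosine similarity to a target. The plots and dimensionalities are toy assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

plots = ["a detective hunts a serial killer in a rainy city",
         "a rookie cop chases a murderer through the night",
         "two robots fall in love on a distant planet"]

# LSA pipeline.
X_tfidf = TfidfVectorizer().fit_transform(plots)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)

# LDA pipeline (operates on raw counts).
X_counts = CountVectorizer().fit_transform(plots)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X_counts)

target = 0  # the movie the user selected
print("LSA ranking:", cosine_similarity(lsa[target:target + 1], lsa)[0].argsort()[::-1])
print("LDA ranking:", cosine_similarity(lda[target:target + 1], lda)[0].argsort()[::-1])
```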