
Showing papers on "Latent semantic analysis" published in 2019


Journal ArticleDOI
TL;DR: This article identifies common misconceptions that arise from incomplete descriptions, outdated arguments, and unclear distinctions between theory and implementation in models of semantic representation, and clarifies and amends these points to provide a theoretical basis for future research and discussion on vector models of semantic representation.
Abstract: Models that represent meaning as high-dimensional numerical vectors, such as latent semantic analysis (LSA), hyperspace analogue to language (HAL), bound encoding of the aggregate language environment (BEAGLE), topic models, global vectors (GloVe), and word2vec, have been introduced as extremely powerful machine-learning proxies for human semantic representations and have seen an explosive rise in popularity over the past two decades. However, despite their considerable advancements and spread in the cognitive sciences, one can observe problems associated with the adequate presentation and understanding of some of their features. Indeed, when these models are examined from a cognitive perspective, a number of unfounded arguments tend to appear in the psychological literature. In this article, we review the most common of these arguments and discuss (a) what exactly these models represent at the implementational level and their plausibility as a cognitive theory, (b) how they deal with various aspects of meaning such as polysemy or compositionality, and (c) how they relate to the debate on embodied and grounded cognition. We identify common misconceptions that arise as a result of incomplete descriptions, outdated arguments, and unclear distinctions between theory and implementation of the models. We clarify and amend these points to provide a theoretical basis for future research and discussions on vector models of semantic representation.

132 citations


Journal ArticleDOI
TL;DR: An integrated framework that bridges the gap between lexicon-based and machine learning approaches is proposed to achieve better accuracy and scalability, together with a novel genetic algorithm (GA)-based feature reduction technique that solves the scalability issue arising as the feature set grows.
Abstract: Due to the rapid development of Internet technologies and social media, sentiment analysis has become an important opinion mining technique. Recent research work has described the effectiveness of different sentiment classification techniques ranging from simple rule-based and lexicon-based approaches to more complex machine learning algorithms. While lexicon-based approaches have suffered from the lack of dictionaries and labeled data, machine learning approaches have fallen short in terms of accuracy. This paper proposes an integrated framework which bridges the gap between lexicon-based and machine learning approaches to achieve better accuracy and scalability. To solve the scalability issue that arises as the feature set grows, a novel genetic algorithm (GA)-based feature reduction technique is proposed. By using this hybrid approach, we are able to reduce the feature-set size by up to 42% without compromising the accuracy. The comparison of our feature reduction technique with the more widely used principal component analysis (PCA) and latent semantic analysis (LSA) based feature reduction techniques has shown up to 15.4% increased accuracy over PCA and up to 40.2% increased accuracy over LSA. Furthermore, we also evaluate our sentiment analysis framework on other metrics including precision, recall, F-measure, and feature size. In order to demonstrate the efficacy of GA-based designs, we also propose the novel cross-disciplinary area of geopolitics as a case study application for our sentiment analysis framework. The experimental results show that the framework accurately measures public sentiment and views regarding various topics such as terrorism, global conflicts, and social issues. We envisage the applicability of our proposed work in various areas including security and surveillance, law-and-order, and public administration.
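To make the LSA-based feature-reduction baseline concrete, here is a minimal sketch of TF-IDF features reduced with truncated SVD (the usual LSA formulation) feeding a sentiment classifier, written with scikit-learn. The toy documents, labels, component count, and classifier choice are illustrative assumptions, not the authors' GA-based method or experimental setup; a PCA baseline would look the same except that the sparse TF-IDF matrix must first be densified.

```python
# A minimal sketch (assumed setup, not the paper's): TF-IDF -> truncated SVD (LSA)
# -> classifier, i.e. the kind of LSA-based feature reduction the GA technique is
# compared against.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

docs = [
    "the policy sparked outrage and protests online",
    "citizens praised the new peace initiative",
    "analysts condemned the escalating conflict",
    "the summit was welcomed as a hopeful step",
]
labels = [0, 1, 0, 1]  # 0 = negative, 1 = positive (toy placeholder labels)

lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # LSA-style reduction of the feature set
    LogisticRegression(),
)
lsa_clf.fit(docs, labels)
print(lsa_clf.predict(["people welcomed the initiative"]))
```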

91 citations


Journal ArticleDOI
TL;DR: This article proposes incorporating different categories of linguistic features into distributed word representations in order to simultaneously learn writing-style representations from unlabeled texts for AA, allowing topical, lexical, syntactic, and character-level feature vectors of each document to be extracted as stylometric features.
Abstract: Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.

85 citations


Journal ArticleDOI
TL;DR: This study examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency; a companion study of essay ratings found that global semantic similarity as reported by word2vec was an important predictor of coherence ratings.
Abstract: This article introduces the second version of the Tool for the Automatic Analysis of Cohesion (TAACO 2.0). Like its predecessor, TAACO 2.0 is a freely available text analysis tool that works on the Windows, Mac, and Linux operating systems; is housed on a user's hard drive; is easy to use; and allows for batch processing of text files. TAACO 2.0 includes all the original indices reported for TAACO 1.0, but it adds a number of new indices related to local and global cohesion at the semantic level, reported by latent semantic analysis, latent Dirichlet allocation, and word2vec. The tool also includes a source overlap feature, which calculates lexical and semantic overlap between a source and a response text (i.e., cohesion between the two texts based on measures of text relatedness). In the first study in this article, we examined the effects that cohesion features, prompt, essay elaboration, and enhanced cohesion had on expert ratings of text coherence, finding that global semantic similarity as reported by word2vec was an important predictor of coherence ratings. A second study was conducted to examine the source and response indices. In this study we examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency. The results indicated that the percentage of keywords found in both the source and response and the similarity between the source document and the response, as reported by word2vec, were significant predictors of speaking quality. Combined, these findings help validate the new indices reported for TAACO 2.0.
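As a rough illustration of the kind of word2vec-based source-response similarity index described above (not TAACO's actual code), the sketch below trains a tiny word2vec model with gensim and compares averaged word vectors of a source and a response; the toy corpus, texts, and vector size are assumptions.

```python
# Hedged sketch of a word2vec-style source/response semantic-overlap score;
# the corpus, texts, and hyper-parameters are placeholders, not TAACO 2.0 internals.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "the lecture explained how glaciers shape valleys".split(),
    "students summarized the talk about glaciers and valleys".split(),
    "erosion by ice carves deep valleys over long periods".split(),
]
model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50, seed=1)

def doc_vector(tokens, kv):
    # Average the vectors of in-vocabulary tokens.
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0)

source = "the lecture explained how glaciers shape valleys".split()
response = "the talk described glaciers shaping deep valleys".split()
s, r = doc_vector(source, model.wv), doc_vector(response, model.wv)
cosine = float(np.dot(s, r) / (np.linalg.norm(s) * np.linalg.norm(r)))
print(f"source-response semantic similarity: {cosine:.3f}")
```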

66 citations



Journal ArticleDOI
TL;DR: This method uses Word2vec to construct a context sentence vector and sense-definition vectors, then gives each word sense a score using the cosine similarity between those sentence vectors; it outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
Abstract: Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive way is to select the sense whose definition has the highest similarity to the context, with sense definitions provided by a large lexical database of English, WordNet. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting overlapping words between the context and sense definitions, which must match exactly. Similarity should instead be computed based on how words are related rather than on exact overlap, by representing the context and sense definitions in a vector space model and analyzing the distributional semantic relationships among them using latent semantic analysis (LSA). However, as the corpus of text becomes more massive, LSA consumes much more memory and does not scale well to training on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or continuous bag-of-words (CBOW) model. Word2vec also captures semantic and syntactic word similarities from a huge corpus of text more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense-definition vectors, then gives each word sense a score using the cosine similarity between those sentence vectors. The sense definitions are also expanded with sense relations retrieved from WordNet. If the score is not higher than a specific threshold, it is combined with the probability of that sense's distribution learned from a large sense-tagged corpus, SEMCOR. The senses with the highest scores are taken as the possible answers. Our method's result (50.9%, or 48.7% without the probability of sense distribution) is higher than the baselines (i.e., original, simplified, adapted and LSA Lesk) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
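The core scoring step, cosine similarity between an averaged context vector and averaged sense-definition vectors, can be sketched as below. This is a simplified reading of the method: it uses publicly available GloVe vectors via gensim's downloader instead of the authors' word2vec model, and it omits the gloss expansion with WordNet relations and the SEMCOR-based back-off.

```python
# Hedged sketch: rank WordNet senses of a word by cosine similarity between the
# averaged context vector and the averaged sense-definition vector. Model choice
# and tokenization are assumptions; the paper's gloss expansion and sense-frequency
# back-off are omitted.
import numpy as np
import nltk
import gensim.downloader as api
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
kv = api.load("glove-wiki-gigaword-50")  # small pretrained vectors (downloaded once)

def avg_vec(tokens):
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else None

def disambiguate(word, context):
    ctx = avg_vec(context.lower().split())
    best = None
    for sense in wn.synsets(word):
        gloss = avg_vec(sense.definition().lower().split())
        if gloss is None:
            continue
        score = float(np.dot(ctx, gloss) / (np.linalg.norm(ctx) * np.linalg.norm(gloss)))
        if best is None or score > best[1]:
            best = (sense, score)
    return best

sense, score = disambiguate("bank", "he deposited cash at the bank before noon")
print(sense.name(), "|", sense.definition(), "|", round(score, 3))
```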

44 citations


Journal ArticleDOI
TL;DR: In this paper, Latent Semantic Analysis (LSA) is used to extract knowledge from a corpus of 503 abstracts of academic papers published in various journals and conference proceedings.
Abstract: In recent years, Industry 4.0 has received immense attention from the academic community, practitioners and governments across nations, resulting in explosive growth in the publication of articles and thereby making it imperative to reveal and discern the core research areas and research themes of the extant Industry 4.0 literature. The purpose of this paper is to discuss research dynamics and to propose a taxonomy of the Industry 4.0 research landscape along with future research directions. A data-driven text mining approach, Latent Semantic Analysis (LSA), is used to review and extract knowledge from a corpus of 503 abstracts of academic papers published in various journals and conference proceedings. The adopted technique extracts several latent factors that characterise the emerging pattern of research. A cross-loading analysis of high-loaded papers is performed to identify the semantic links between research areas and themes. The LSA results uncover 13 principal research areas and 100 research themes. The study identifies “smart factory” and “new business model” as dominant research areas. A taxonomy is developed which contains five topical areas of the Industry 4.0 field. The data set is based on a systematic article-refining process which includes keyword searches in selected electronic databases and is limited to articles in English only; there is therefore a possibility that related work published in other databases or in languages other than English is not captured in the data set. To the best of the authors’ knowledge, this study is the first of its kind to use the LSA technique to reveal research trends in the Industry 4.0 domain. This review will help scholars and practitioners understand the diversity of, and draw a roadmap for, Industry 4.0 research. The taxonomy and the outlined future research agenda could help practitioners and academicians position their research work.
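The LSA step described here, deriving latent factors from a term-document matrix of abstracts and reading off the top-loading terms, can be sketched as follows; the toy abstracts and the number of factors are placeholders for the study's corpus of 503 abstracts and its 13 extracted research areas.

```python
# Hedged sketch of LSA-based theme extraction from abstracts (illustrative data,
# not the study's corpus or factor count).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

abstracts = [
    "smart factory automation with cyber physical production systems",
    "new business model innovation driven by digital platforms",
    "cyber physical systems and automation in the smart factory",
    "servitization and business model change in manufacturing firms",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)          # term-by-document style matrix

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = vec.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[::-1][:4]        # highest-loading terms for this latent factor
    print(f"latent factor {i}:", ", ".join(terms[j] for j in top))
```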

41 citations


Journal ArticleDOI
TL;DR: To automate and expedite the analysis tasks, this study deployed natural language processing (NLP) and two commonly used unsupervised learning methods for text classification, namely latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).

39 citations


Proceedings ArticleDOI
25 Apr 2019
TL;DR: This paper examines three different text feature extraction approaches for classifying short sentences and phrases into categories with a neural network, in order to find out which method is best at capturing text features and allows the classifier to achieve the highest accuracy.
Abstract: In this paper, we examine the results of applying three different text feature extraction approaches while classifying short sentences and phrases into categories with a neural network, in order to find out which method is best at capturing text features and allows the classifier to achieve the highest accuracy. The examined feature extraction methods include a plain Term Frequency Inverse Document Frequency (TF-IDF) approach and its two modifications obtained by applying different dimensionality reduction techniques: Latent Semantic Analysis (LSA) and Linear Discriminant Analysis (LDA). The results show that the TF-IDF feature extraction approach outperforms the other methods, allowing the classifier to achieve the highest accuracy when working with larger datasets. Furthermore, the results show that TF-IDF in combination with LSA allows the classifier to achieve similar accuracy while working with smaller datasets.
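A minimal version of this comparison, plain TF-IDF features versus TF-IDF reduced with LSA, each feeding the same small neural classifier, might look like the following; the toy phrases, labels, and layer sizes are assumptions, not the paper's datasets or network.

```python
# Hedged sketch: TF-IDF vs. TF-IDF + LSA (truncated SVD) features into the same
# neural classifier; placeholder data and hyper-parameters.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

phrases = ["reset my password", "invoice is overdue", "cannot log in", "question about billing"]
labels = ["account", "billing", "account", "billing"]

tfidf_clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)

for name, clf in [("tf-idf", tfidf_clf), ("tf-idf + lsa", lsa_clf)]:
    clf.fit(phrases, labels)
    print(name, clf.predict(["forgot my password", "billing question"]))
```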

38 citations


Journal ArticleDOI
TL;DR: This research develops a patent mining approach based on the novelty detection statistical technique to identify unusual patents that may provide fresh ideas for potential opportunities; the approach is applied in the telehealth industry, and the findings can help telehealth firms formulate their technology strategies.

38 citations


Journal ArticleDOI
TL;DR: This study demonstrates that a CNN model with pre-trained medical concept vectors can accurately identify target report-pairs with overlapping body sites and potentially accelerate the retrieval process for imaging diagnosis quality measurement.
Abstract: Imaging examinations, such as ultrasonography, magnetic resonance imaging and computed tomography scans, play key roles in healthcare settings. To assess and improve the quality of imaging diagnosis, we need to manually find and compare the pre-existing reports of imaging and pathology examinations which contain overlapping exam body sites from electronic medical records (EMRs). The process of retrieving those reports is time-consuming. In this paper, we propose a convolutional neural network (CNN) based method which can better utilize the semantic information contained in report texts to accelerate the retrieval process. We included 16,354 imaging and pathology report-pairs from 1926 patients who were admitted to Shanghai Tongren Hospital and had ultrasonic examinations between 1st May 2017 and 31st July 2017. We adapted the CNN model to calculate the similarities among the report-pairs to identify target report-pairs with overlapping body sites, and compared the performance with six other conventional models, including keyword mapping, latent semantic analysis (LSA), latent Dirichlet allocation (LDA), Doc2Vec, Siamese long short-term memory (LSTM) and a model based on named entity recognition (NER). We also utilized a graph embedding method to enhance the word representation by capturing semantic relation information from medical ontologies. Additionally, we used the LIME algorithm to identify which features (or words) are decisive for the prediction results and to improve the model interpretability. Experimental results showed that our CNN model gained significant improvement compared to all other conventional models on area under the receiver operating characteristic curve (AUROC), precision, recall and F1-score in our test dataset. The AUROC of our CNN models gained approximately 3–7% improvement. The AUROC of the CNN model with graph-embedding and ontology-based medical concept vectors was 0.8% higher than that of the model with randomly initialized vectors and 1.5% higher than that of the one with pre-trained word vectors. Our study demonstrates that a CNN model with pre-trained medical concept vectors can accurately identify target report-pairs with overlapping body sites and potentially accelerate the retrieval process for imaging diagnosis quality measurement.

Book ChapterDOI
20 Oct 2019
TL;DR: This study focuses on text data and considers coding as a process of identifying words or phrases and categorizing them into codes to facilitate data analysis, and proposes adding a semantic component to the nCoder.
Abstract: Coding is a process of assigning meaning to a given piece of evidence. Evidence may be found in a variety of data types, including documents, research interviews, posts from social media, conversations from learning platforms, or any source of data that may provide insights for the questions under qualitative study. In this study, we focus on text data and consider coding as a process of identifying words or phrases and categorizing them into codes to facilitate data analysis. There are a number of different approaches to generating qualitative codes, such as grounded coding, a priori coding, or using both in an iterative process. However, both qualitative and quantitative analysts face the same coding problem: when the data size is large, manual coding becomes impractical. nCoder is a tool that helps researchers to discover and code key concepts in text data with minimal human judgement. Once reliability and validity are established, nCoder automatically applies the coding scheme to the dataset. However, for concepts that occur infrequently, even with acceptable reliability, the classifier may still produce too many false negatives. This paper explores these problems within the current nCoder and proposes adding a semantic component to it. A tool called “nCoder+” is presented with real data to demonstrate the usefulness of the semantic component. The possible ways of integrating this component and other natural language processing techniques into nCoder are discussed.

Journal ArticleDOI
TL;DR: This paper uses word and paragraph embedding models, learned by shallow neural networks from a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg and Italy), to identify transpositions.
Abstract: The automated identification of national implementations (NIMs) of European directives by text similarity techniques has shown promising preliminary results. Previous works have proposed and utilized unsupervised lexical and semantic similarity techniques based on vector space models, latent semantic analysis and topic models. However, these techniques were evaluated on a small multilingual corpus of directives and NIMs. In this paper, we utilize word and paragraph embedding models learned by shallow neural networks from a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg and Italy) to develop unsupervised semantic similarity systems to identify transpositions. We evaluate these models and compare their results with the previous unsupervised methods on a multilingual test corpus of 43 Directives and their corresponding NIMs. We also develop supervised machine learning models to identify transpositions and compare their performance with different feature sets.
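The paragraph-embedding idea, learning document vectors for legal texts and ranking candidate national provisions by similarity to a directive provision, can be sketched with gensim's Doc2Vec as below; the toy provisions and hyper-parameters are assumptions, not the paper's multilingual legal corpus or tuned models.

```python
# Hedged sketch: Doc2Vec vectors for national provisions, queried with a directive
# provision to rank candidate transpositions (illustrative data only).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

provisions = {
    "nim_1": "member states shall ensure consumers receive clear contract information",
    "nim_2": "operators must report energy consumption data annually",
    "nim_3": "suppliers shall provide consumers with transparent contractual terms",
}
train = [TaggedDocument(text.split(), [tag]) for tag, text in provisions.items()]
model = Doc2Vec(train, vector_size=50, min_count=1, epochs=100, seed=1)

directive = "consumers shall be given clear and transparent contract terms".split()
query_vec = model.infer_vector(directive)
print(model.dv.most_similar([query_vec], topn=2))  # candidate transpositions, ranked
```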

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This study proposes a method that can predict the next most appropriate word in the Bangla language and can also suggest the corresponding sentence, contributing to the technology of word prediction systems.
Abstract: Textual information exchange, typing information and sending it to the other end, is one of the most prominent mediums of communication throughout the world. People spend a lot of time sending emails or other information on social networking sites, where typing everything out is redundant and time-consuming. To make textual information exchange faster and easier, word prediction systems have been developed that can predict the next most likely word, so that people do not have to type the next word but can select it from the suggested words. In this study, we propose a method that can predict the next most appropriate word in the Bangla language and can also suggest the corresponding sentence, as a contribution to this technology of word prediction systems. The proposed approach uses a GRU (Gated Recurrent Unit) based RNN (Recurrent Neural Network) on an n-gram dataset to create language models that can predict the word(s) from the input sequence provided. We use a corpus dataset, collected from different Bangla-language sources, to run the experiments. Compared to other methods, such as an LSTM (Long Short-Term Memory) based RNN on the n-gram dataset and Naive Bayes with Latent Semantic Analysis, our proposed approach gives better performance. It gives average accuracies of 99.70% for the 5-gram model, 99.24% for the 4-gram model, 95.84% for the tri-gram model, and 78.15% and 32.17% for the bi-gram and uni-gram models, respectively.
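For readers unfamiliar with the architecture, a minimal GRU next-word predictor of the kind described, an embedding layer, a GRU, and a softmax over the vocabulary, can be sketched in Keras as below. The tiny English corpus and the layer sizes are placeholders; the actual system is trained on a Bangla n-gram corpus.

```python
# Hedged sketch of a GRU-based next-word language model (placeholder corpus and
# hyper-parameters, not the paper's Bangla setup).
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

corpus = ["i am going home now", "i am going to school", "she is going home early"]
tok = Tokenizer()
tok.fit_on_texts(corpus)
vocab = len(tok.word_index) + 1

# Build (prefix -> next word) training pairs from every sentence.
seqs = []
for line in corpus:
    ids = tok.texts_to_sequences([line])[0]
    seqs += [ids[: i + 1] for i in range(1, len(ids))]
maxlen = max(len(s) for s in seqs)
seqs = pad_sequences(seqs, maxlen=maxlen)
X, y = seqs[:, :-1], seqs[:, -1]

model = Sequential([Embedding(vocab, 16), GRU(32), Dense(vocab, activation="softmax")])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=300, verbose=0)

prefix = pad_sequences(tok.texts_to_sequences(["i am going"]), maxlen=maxlen - 1)
next_id = int(np.argmax(model.predict(prefix, verbose=0)))
print("predicted next word:", tok.index_word.get(next_id, "?"))
```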

Journal ArticleDOI
TL;DR: A new summarization approach is proposed that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis, and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts.
Abstract: Sentence-based summarization aims at extracting concise summaries of collections of textual documents. Summaries consist of a worthwhile subset of document sentences. The most effective multilingual strategies rely on Latent Semantic Analysis (LSA) and on frequent itemset mining, respectively. LSA-based summarizers pick the document sentences that cover the most important concepts. Concepts are modeled as combinations of single-document terms and are derived from a term-by-sentence matrix by exploiting Singular Value Decomposition (SVD). Itemset-based summarizers pick the sentences that contain the largest number of frequent itemsets, which represent combinations of frequently co-occurring terms. The main drawbacks of existing approaches are (i) the inability of LSA to consider the correlation between combinations of multiple-document terms and the underlying concepts, (ii) the inherent redundancy of frequent itemsets because similar itemsets may be related to the same concept, and (iii) the inability of itemset-based summarizers to correlate itemsets with the underlying document concepts. To overcome the issues of both of the abovementioned algorithms, we propose a new summarization approach that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts. The summarizer selects the sentences that cover the latent concepts with minimal redundancy. We tested the summarization algorithm on both multilingual and English-language benchmark document collections. The proposed approach performed significantly better than both itemset- and LSA-based summarizers, and better than most of the other state-of-the-art approaches.

Journal ArticleDOI
TL;DR: A new Soft Voting Technique (SVT) is proposed to improve the performance of the Global Filter-based Feature Selection Scheme (GFSS); it improves the numerical discrimination of words to identify their positive and negative membership in a class.
Abstract: In text classification, the Global Filter-based Feature Selection Scheme (GFSS) selects the top-N ranked words as features. It discards the low-ranked features from some classes either partially or completely. The low rank is usually due to the varying occurrence of the words (terms) in the classes. Latent Semantic Analysis (LSA) can be used to address this issue, as it eliminates redundant terms. It assigns an equal rank to terms that represent similar concepts or meanings; e.g., the four terms “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent a similar concept, i.e. “cancer”. Thus, whichever of these four terms the algorithm selects does not affect the classifier performance. However, this does not guarantee that the top-N LSA-ranked terms selected by GFSS are representative terms of each class. The Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative terms from all the classes. However, it has two issues. First, it assigns the class label and membership of each term on the basis of an individual vote of the Odds Ratio (OR) method, thereby limiting the decision-making capability. Second, the ratio of selected terms is determined empirically by the IGFSS, and a common ratio is applied to all the classes to assign the positive and negative membership of the terms. However, the ratio of positive- and negative-nature terms varies from one class to another, and it may be very low for one class and high for others. Thus, the single common negative-feature ratio used by the IGFSS affects those classes of a dataset in which there is an imbalance between positive- and negative-nature words. To address these issues of the IGFSS, a new Soft Voting Technique (SVT) is proposed to improve the performance of GFSS. There are two main contributions in this paper: (i) the weighted average score (soft vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficient (GSS), improves the numerical discrimination of words to identify their positive and negative membership in a class; (ii) a mathematical expression is incorporated in the IGFSS that computes a varying ratio of positive and negative memberships of the terms for each class, based on the occurrence of the terms in the classes. The proposed SVT is evaluated using four standard classifiers applied on five benchmark datasets. The experimental results based on Macro_F1 and Micro_F1 measures show that SVT achieves a significant improvement in classifier performance in comparison with the standard methods.

Journal ArticleDOI
TL;DR: A machine learning method to identify marketing intentions from large-scale We-Media data is proposed; the proposed Latent Semantic Analysis (LSI)-Word2vec model can reflect semantic features, and the decision tree model is simplified by pruning to save computing resources and reduce the time complexity.
Abstract: Social network services for self-media, such as Weibo, Blog, and WeChat Public, constitute a powerful medium that allows users to publish posts every day. Due to insufficient information transparency, malicious marketing on the Internet through self-media posts imposes potential harm on society. Therefore, it is necessary to identify news with marketing intentions. We follow the idea of text classification to identify marketing intentions. Although there are some current methods to address intention detection, the challenge is how to make text feature extraction reflect semantic information and how to improve the time complexity and space complexity of the recognition model. To this end, this paper proposes a machine learning method to identify marketing intentions from large-scale We-Media data. First, the proposed Latent Semantic Analysis (LSI)-Word2vec model can reflect the semantic features. Second, the decision tree model is simplified by decision tree pruning to save computing resources and reduce the time complexity. Third, this paper examines the effects of classifier associations and uses the optimal configuration to help people efficiently identify marketing intention. Finally, the detailed experimental evaluation on several metrics shows that our approaches are effective and efficient. The F1 value can be increased by about 5%, and the running time is increased by 20%, which proves that the newly proposed method can effectively improve the accuracy of marketing news recognition.

Proceedings ArticleDOI
18 Dec 2019
TL;DR: This study shows that LDA gives a better result than LSA; LSA does not consider the relationships between documents in the corpus, while LDA does.
Abstract: The industrial world has entered the era of Industrial Revolution 4.0. In this era, there is an urgent need for data from the community to support service policies. Because of that, the Surabaya Government created Media Center Surabaya. This medium is used to accommodate the aspirations of Surabaya citizens, who can access it through Twitter. The topics discussed on Twitter are important information, and this information can be used to improve the performance of Surabaya Government services. Twitter data are text data consisting of thousands of variables. Text mining is frequently used to analyze this kind of data, including topic modeling and sentiment analysis. This study works on topic modeling, focusing on the Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) algorithms. The evaluation of algorithm performance uses topic coherence. As unstructured data, the Twitter data need preprocessing before the analysis; the stages of preprocessing include cleansing, stemming, and stop-word removal. The advantages of LSA are that it is fast and easy to implement. LSA, on the other hand, does not consider the relationships between documents in the corpus, while LDA does. This study shows that LDA gives a better result than LSA.
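The comparison itself, fitting LSA and LDA topic models on the same corpus and scoring them with topic coherence, can be sketched with gensim as below; the handful of toy "tweets", the topic count, and the u_mass coherence choice are assumptions, not the study's Indonesian Twitter data or its exact evaluation.

```python
# Hedged sketch: LSA vs. LDA topic models scored with topic coherence
# (placeholder tweets; the study preprocesses real Twitter data first).
from gensim.corpora import Dictionary
from gensim.models import LsiModel, LdaModel, CoherenceModel

tweets = [
    "street lights broken near the park".split(),
    "please fix the broken street lights".split(),
    "garbage collection late again this week".split(),
    "late garbage trucks in my neighborhood".split(),
]
dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

lsa = LsiModel(corpus, id2word=dictionary, num_topics=2)
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0, passes=20)

for name, model in [("LSA", lsa), ("LDA", lda)]:
    cm = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary,
                        coherence="u_mass", topn=5)
    print(name, "coherence:", round(cm.get_coherence(), 3))
```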

Proceedings ArticleDOI
01 Nov 2019
TL;DR: Two content-based recommendation approaches are presented which automatically detect and recommend dependencies between requirements that are defined on a textual level, by exploiting document classification techniques.
Abstract: There is a high demand for intelligent decision support systems which assist stakeholders in requirements engineering tasks. Examples of such tasks are the elicitation of requirements, release planning, and the identification of requirement-dependencies. In particular, the detection of dependencies between requirements is a major challenge for stakeholders. In this paper, we present two content-based recommendation approaches which automatically detect and recommend such dependencies. The first approach identifies potential dependencies between requirements which are defined on a textual level by exploiting document classification techniques (based on Linear SVM, Naive Bayes, Random Forest, and k-Nearest Neighbors). This approach uses two different feature types (TF-IDF features vs. probabilistic features). The second recommendation approach is based on Latent Semantic Analysis and defines the baseline for the evaluation with a real-world data set. The evaluation shows that the recommendation approach based on Random Forest using probabilistic features achieves the best prediction quality of all approaches (F1: 0.89).
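One plausible shape of the first, classification-based approach is sketched below: represent a requirement pair as the concatenation of both texts, extract TF-IDF features, and let a Random Forest predict whether a dependency exists. The pairing strategy, toy requirements, and labels are assumptions, not the paper's exact feature design (which also evaluates probabilistic features and an LSA baseline).

```python
# Hedged sketch: TF-IDF features over concatenated requirement pairs, classified
# by a Random Forest as dependent / not dependent (illustrative data only).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pairs = [
    ("the system shall encrypt stored data", "the system shall manage encryption keys"),
    ("the ui shall support dark mode", "reports shall be exportable as pdf"),
    ("users shall log in with single sign on", "the system shall integrate with the identity provider"),
    ("the app shall cache map tiles offline", "invoices shall be emailed monthly"),
]
labels = [1, 0, 1, 0]  # 1 = dependency, 0 = no dependency (toy labels)

texts = [a + " " + b for a, b in pairs]
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(texts, labels)

query = "the system shall hash passwords " + "the system shall store password salts"
print(clf.predict([query]))
```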

Journal ArticleDOI
TL;DR: A gravitational search algorithm, which works on the basis of the law of gravity, is adopted to optimize the summary of the document; the experimental results are better than those of existing state-of-the-art methods in terms of various performance metrics.
Abstract: Text summarization is the extraction of important text from the original document. The objective of any automatic text summarization system, especially in the legal domain, is to produce a summary which is close to human-generated summaries. In this article, we present the summarization of legal documents as a binary optimization problem where the fitness of a solution is derived from the weighting of individual statistical features of each sentence, such as sentence length, sentence position, degree of similarity, term frequency–inverse sentence frequency and keywords, to generate the summary of the document. A gravitational search algorithm, which works on the basis of the law of gravity, is adopted to optimize the summary. To show the efficacy of the proposed method, we compare the experimental results with particle swarm optimization, genetic algorithm, TextRank, latent semantic analysis, MEAD, MS-Word, and SumBasic using ROUGE evaluation metrics on the FIRE-2014 data set. The experimental results show that the proposed method performs better than the existing state-of-the-art methods in terms of various performance metrics.

Book ChapterDOI
28 Mar 2019
TL;DR: Fuzzy ontology is used to check the consistency and coherence of the essay as it is the best way to overcome the vagueness of the language, and the system will also provide a score with feedback to the student.
Abstract: Recent learning research has shown that creativity is an essential concern in education. Essay questions are among the best means to evaluate learning outcomes and students’ creativity. However, evaluating these questions is a time-consuming task, and subjectivity in scoring assessments remains inevitable. Automated essay evaluation (AEE) systems provide a cost-effective and consistent alternative to human marking. Therefore, numerous automatic essay-grading systems have been developed to lessen the demands of manual essay grading. However, these systems concentrate on syntax and vocabulary, and no consideration is paid to the semantics and coherence of the essay. Moreover, few of the existing systems are able to give students informative feedback based on extensive domain knowledge. In this paper, a system is developed that uses latent semantic analysis (LSA) and fuzzy ontology to evaluate essays, where LSA is responsible for checking the semantics. Fuzzy ontology is used to check the consistency and coherence of the essay, as it is the best way to overcome the vagueness of language, and the system also provides a score with feedback to the student. Experimental results were good in evaluating the essay both syntactically and semantically.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: Three computational approaches based on Latent Semantic Analysis, Latent Dirichlet Allocation and term frequency–inverse document frequency were tested; the TF-IDF-based measure correlated best with expert evaluation, but none of the approaches match human judgement well enough to replace it.
Abstract: On crowdsourcing ideation websites, companies can easily collect a large number of ideas. Screening such a volume of ideas is very costly and challenging, necessitating automatic approaches. It would be particularly useful to automatically evaluate idea novelty, since companies commonly seek novel ideas. Three computational approaches were tested, based on Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and term frequency–inverse document frequency (TF-IDF), respectively. These three approaches were applied to three sets of ideas, and the computed idea novelty was compared with human expert evaluation. The TF-IDF based measure correlated better with expert evaluation than the other two measures. However, our results show that these approaches do not match human judgement well enough to replace it.
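One plausible reading of the TF-IDF-based novelty measure is to score each idea by one minus its highest cosine similarity to the other ideas in the pool, as sketched below; this is an illustrative formulation with made-up ideas, not necessarily the paper's exact metric.

```python
# Hedged sketch of a TF-IDF novelty score: an idea is novel if it is dissimilar to
# every other idea in the pool (toy ideas; assumed formulation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "a phone case with a built in bottle opener",
    "a phone case that doubles as a bottle opener",
    "a solar powered backpack that charges devices while hiking",
]
X = TfidfVectorizer().fit_transform(ideas)
sim = cosine_similarity(X)
np.fill_diagonal(sim, 0.0)             # ignore self-similarity
novelty = 1.0 - sim.max(axis=1)        # high score = unlike every other idea
for idea, score in zip(ideas, novelty):
    print(round(float(score), 3), idea)
```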


Journal ArticleDOI
TL;DR: The lsemantica command is presented, which implements latent semantic analysis in Stata using truncated singular value decomposition and provides a simple command for latent semantic analysis as well as complementary commands for text similarity comparison.
Abstract: In this article, I present the lsemantica command, which implements latent semantic analysis in Stata. Latent semantic analysis is a machine learning algorithm for word and text similarity comparison and uses truncated singular value decomposition to derive the hidden semantic relationships between words and texts. lsemantica provides a simple command for latent semantic analysis as well as complementary commands for text similarity comparison.
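For readers outside Stata, the underlying computation, truncated SVD over a document-term matrix followed by similarity comparison in the latent space, can be sketched in Python as below; this is the general technique, not the lsemantica command's own syntax or defaults.

```python
# Hedged Python sketch of LSA via truncated SVD plus text similarity comparison
# (the general technique behind lsemantica, not the Stata command itself).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "monetary policy and inflation expectations",
    "central bank decisions on interest rates",
    "school funding and student achievement",
]
X = TfidfVectorizer().fit_transform(texts)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # latent semantic space
print(cosine_similarity(Z).round(2))  # pairwise text similarity in that space
```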

Journal ArticleDOI
TL;DR: A small but growing group of accounts in psychology, linguistics, and information retrieval treats semantic models as exemplar-based, borrowing many of the ideas that have led to the prominence of exemplar models in fields such as categorisation.
Abstract: Abstraction is a core principle of Distributional Semantic Models (DSMs) that learn semantic representations for words by applying dimensional reduction to statistical redundancies in language...

Proceedings ArticleDOI
24 Apr 2019
TL;DR: A unifying approach representing topic-based models is proposed, from which state-of-the-art semantic relatedness measures are divided into two distinct types: topic-based and ontology-based models.
Abstract: Over the last decades, a multitude of semantic relatedness measures have been proposed. Despite an extensive amount of work dedicated to this area of research, the understanding of their foundations is still limited in real-world applications. In this paper, a unifying approach representing topic-based models is proposed, from which the state-of-the-art semantic relatedness measures are divided into two distinct types: topic-based and ontology-based models. Despite extensive research on ontology-based models, topic-based models have not received comparable attention. The unified approach is able to highlight equivalences among these models and propose bridges between their theoretical bases. Moreover, presenting a comprehensive unifying approach to topic-based models gives readers a common understanding of them despite the differences and complexities of their architectures and configuration details. In order to evaluate topic-based models in comparison to ontology-based models, comprehensive experiments on the semantic relatedness of geographic phrases were conducted. Empirical results demonstrate that topic-based models not only face fewer restrictions in the real world than ontology-based models, but their performance in computing the semantic relatedness of geographic phrases is also significantly superior.

Proceedings ArticleDOI
Songqian Li, Kun Ma, Xuewei Niu, Yufeng Wang, Ke Ji, Ziqiang Yu, Zhenxiang Chen
01 Aug 2019
TL;DR: Experimental analysis of real-world data demonstrates that the proposed pipeline, which combines preprocessing, feature extraction and model fusion for more accurate and automated prediction, achieves higher accuracy than existing approaches.
Abstract: With the arrival of the self-media age, anyone can author content in the era of big data. This has caused a mass of fake news to appear on the network. Authors of such fake news mislead the public by spreading it, which brings them economic and social benefits. Existing work focuses on using various types of article features in the hope of finding a way to accurately identify fake news, but this undermines generality. In this paper, we propose a pipeline that combines preprocessing, feature extraction and model fusion for a more accurate and automated prediction. Specifically, we fuse the results of latent semantic analysis (LSA) and ensemble learning models using stacking. Experimental analysis of real-world data demonstrates that our pipeline achieves higher accuracy than existing approaches.
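The fusion step, stacking an LSA-based classifier with an ensemble learner under a meta-learner, could be sketched as follows; the toy headlines, the particular base models, and the meta-learner are assumptions, not the paper's tuned pipeline.

```python
# Hedged sketch of stacking an LSA-based branch with an ensemble branch
# (placeholder data and models, not the paper's configuration).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

headlines = [
    "scientists confirm water found on distant exoplanet",
    "miracle pill melts fat overnight doctors hate it",
    "city council approves new budget for public transit",
    "secret cure for all diseases hidden by the government",
    "university study links exercise to better sleep",
    "aliens built the pyramids claims anonymous insider",
]
labels = [0, 1, 0, 1, 0, 1]  # 0 = credible, 1 = fake (toy placeholder labels)

lsa_branch = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=3, random_state=0),
                           LogisticRegression())
forest_branch = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))

stack = StackingClassifier(
    estimators=[("lsa", lsa_branch), ("forest", forest_branch)],
    final_estimator=LogisticRegression(),
    cv=2,
)
stack.fit(headlines, labels)
print(stack.predict(["shocking trick reverses aging instantly"]))
```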

Book ChapterDOI
TL;DR: The main goal of this paper is to explore latent topic analysis (LTA), in the context of quantum information retrieval, with results suggesting that the quantum-motivated representation is an alternative for geometrical latent topic modeling worthy of further exploration.
Abstract: The main goal of this paper is to explore latent topic analysis (LTA) in the context of quantum information retrieval. LTA is a valuable technique for document analysis and representation, which has been extensively used in information retrieval and machine learning. Different LTA techniques have been proposed, some based on geometrical modeling (such as latent semantic analysis, LSA) and others based on a strong statistical foundation. However, these two different approaches are not usually mixed. Quantum information retrieval has the remarkable virtue of combining both geometry and probability in a common principled framework. We built on this quantum framework to propose a new LTA method, which has a clear geometrical motivation but also supports a well-founded probabilistic interpretation. Initial exploratory experiments were performed on three standard data sets. The results show that the proposed method outperforms LSA on two of the three datasets. These results suggest that the quantum-motivated representation is an alternative for geometrical latent topic modeling worthy of further exploration.

Journal ArticleDOI
TL;DR: It is established, via cross-sectional regression with selection, that the identified themes have a statistically significant effect on reporting firms’ return-on-assets (ROA), even after controlling for factors known to explain the cross-sectional variation in ROA and the self-selectivity of firms that engage in CSR practices.
Abstract: We propose a novel and objective statistical method known as latent semantic analysis (LSA), used in search engine procedures and information retrieval applications, as a methodological alternative for textual analysis in corporate social responsibility (CSR) research. LSA is a language processing technique that allows recognition of textual associative patterns and permits statistical extraction of common textual themes that characterize an entire set of documents, as well as tracking the relative prevalence of each theme over time and across entities. LSA possesses all the advantages of quantitative textual analysis methods (reliability control and bias reduction), is automated (meaning it can process numerous documents in minutes, as opposed to the time and resources needed to perform subjective scoring of text passages) and can be combined in mixed-method research approaches. To demonstrate the method, our empirical application analyzes the CSR reports of Hellenic companies, and first testifies that eight (five) recurring and common textual themes can explain about 50% (40%) of the variation in their CSR reports. We further establish—via cross-sectional regression with selection—that the identified themes have a statistically significant effect on reporting firms’ return-on-assets (ROA), even after controlling for factors known to explain the cross-sectional variation in ROA and the self-selectivity of firms that engage in CSR practices.