
Showing papers on "Latent semantic analysis" published in 2019


Journal ArticleDOI
TL;DR: This article identifies common misconceptions that arise from incomplete descriptions, outdated arguments, and unclear distinctions between theory and implementation in models of semantic representation, and clarifies and amends these points to provide a theoretical basis for future research and discussion on vector models of semantic representation.
Abstract: Models that represent meaning as high-dimensional numerical vectors, such as latent semantic analysis (LSA), hyperspace analogue to language (HAL), bound encoding of the aggregate language environment (BEAGLE), topic models, global vectors (GloVe), and word2vec, have been introduced as extremely powerful machine-learning proxies for human semantic representations and have seen an explosive rise in popularity over the past two decades. However, despite their considerable advancements and spread in the cognitive sciences, one can observe problems associated with the adequate presentation and understanding of some of their features. Indeed, when these models are examined from a cognitive perspective, a number of unfounded arguments tend to appear in the psychological literature. In this article, we review the most common of these arguments and discuss (a) what exactly these models represent at the implementational level and their plausibility as a cognitive theory, (b) how they deal with various aspects of meaning such as polysemy or compositionality, and (c) how they relate to the debate on embodied and grounded cognition. We identify common misconceptions that arise as a result of incomplete descriptions, outdated arguments, and unclear distinctions between theory and implementation of the models. We clarify and amend these points to provide a theoretical basis for future research and discussions on vector models of semantic representation.

132 citations


Journal ArticleDOI
TL;DR: An integrated framework that bridges the gap between lexicon-based and machine learning approaches is proposed to achieve better accuracy and scalability, together with a novel genetic algorithm (GA)-based feature reduction technique that solves the scalability issue arising as the feature set grows.
Abstract: Due to the rapid development of Internet technologies and social media, sentiment analysis has become an important opinion mining technique. Recent research work has described the effectiveness of different sentiment classification techniques ranging from simple rule-based and lexicon-based approaches to more complex machine learning algorithms. While lexicon-based approaches have suffered from the lack of dictionaries and labeled data, machine learning approaches have fallen short in terms of accuracy. This paper proposes an integrated framework which bridges the gap between lexicon-based and machine learning approaches to achieve better accuracy and scalability. To solve the scalability issue that arises as the feature set grows, a novel genetic algorithm (GA)-based feature reduction technique is proposed. By using this hybrid approach, we are able to reduce the feature-set size by up to 42% without compromising the accuracy. The comparison of our feature reduction technique with the more widely used principal component analysis (PCA) and latent semantic analysis (LSA) based feature reduction techniques has shown up to 15.4% increased accuracy over PCA and up to 40.2% increased accuracy over LSA. Furthermore, we also evaluate our sentiment analysis framework on other metrics including precision, recall, F-measure, and feature size. In order to demonstrate the efficacy of GA-based designs, we also propose the novel cross-disciplinary area of geopolitics as a case study application for our sentiment analysis framework. The experimental results show that the framework accurately measures public sentiment and views regarding various topics such as terrorism, global conflicts, and social issues. We envisage the applicability of our proposed work in various areas including security and surveillance, law-and-order, and public administration.
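To make the LSA-based feature-reduction baseline concrete, here is a minimal sketch of TF-IDF features reduced with truncated SVD (the usual LSA formulation) feeding a sentiment classifier, written with scikit-learn. The toy documents, labels, component count, and classifier choice are illustrative assumptions, not the authors' GA-based method or experimental setup; a PCA baseline would look the same except that the sparse TF-IDF matrix must first be densified.

```python
# A minimal sketch (assumed setup, not the paper's): TF-IDF -> truncated SVD (LSA)
# -> classifier, i.e. the kind of LSA-based feature reduction the GA technique is
# compared against.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

docs = [
    "the policy sparked outrage and protests online",
    "citizens praised the new peace initiative",
    "analysts condemned the escalating conflict",
    "the summit was welcomed as a hopeful step",
]
labels = [0, 1, 0, 1]  # 0 = negative, 1 = positive (toy placeholder labels)

lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # LSA-style reduction of the feature set
    LogisticRegression(),
)
lsa_clf.fit(docs, labels)
print(lsa_clf.predict(["people welcomed the initiative"]))
```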

91 citations


Journal ArticleDOI
TL;DR: This article proposes incorporating different categories of linguistic features into distributed word representations in order to simultaneously learn writing-style representations from unlabeled texts for AA, allowing topical, lexical, syntactic, and character-level feature vectors of each document to be extracted as stylometric features.
Abstract: Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.

85 citations


Journal ArticleDOI
TL;DR: This study examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency; a companion study of essay ratings found that global semantic similarity as reported by word2vec was an important predictor of coherence ratings.
Abstract: This article introduces the second version of the Tool for the Automatic Analysis of Cohesion (TAACO 2.0). Like its predecessor, TAACO 2.0 is a freely available text analysis tool that works on the Windows, Mac, and Linux operating systems; is housed on a user's hard drive; is easy to use; and allows for batch processing of text files. TAACO 2.0 includes all the original indices reported for TAACO 1.0, but it adds a number of new indices related to local and global cohesion at the semantic level, reported by latent semantic analysis, latent Dirichlet allocation, and word2vec. The tool also includes a source overlap feature, which calculates lexical and semantic overlap between a source and a response text (i.e., cohesion between the two texts based on measures of text relatedness). In the first study in this article, we examined the effects that cohesion features, prompt, essay elaboration, and enhanced cohesion had on expert ratings of text coherence, finding that global semantic similarity as reported by word2vec was an important predictor of coherence ratings. A second study was conducted to examine the source and response indices. In this study we examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency. The results indicated that the percentage of keywords found in both the source and response and the similarity between the source document and the response, as reported by word2vec, were significant predictors of speaking quality. Combined, these findings help validate the new indices reported for TAACO 2.0.
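As a rough illustration of the kind of word2vec-based source-response similarity index described above (not TAACO's actual code), the sketch below trains a tiny word2vec model with gensim and compares averaged word vectors of a source and a response; the toy corpus, texts, and vector size are assumptions.

```python
# Hedged sketch of a word2vec-style source/response semantic-overlap score;
# the corpus, texts, and hyper-parameters are placeholders, not TAACO 2.0 internals.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "the lecture explained how glaciers shape valleys".split(),
    "students summarized the talk about glaciers and valleys".split(),
    "erosion by ice carves deep valleys over long periods".split(),
]
model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50, seed=1)

def doc_vector(tokens, kv):
    # Average the vectors of in-vocabulary tokens.
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0)

source = "the lecture explained how glaciers shape valleys".split()
response = "the talk described glaciers shaping deep valleys".split()
s, r = doc_vector(source, model.wv), doc_vector(response, model.wv)
cosine = float(np.dot(s, r) / (np.linalg.norm(s) * np.linalg.norm(r)))
print(f"source-response semantic similarity: {cosine:.3f}")
```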

66 citations



Journal ArticleDOI
TL;DR: This method uses Word2vec to construct a context sentence vector and sense-definition vectors, then gives each word sense a score using the cosine similarity between those sentence vectors; it outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
Abstract: Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive way is to select the sense whose definition has the highest similarity to the context, with sense definitions provided by a large lexical database of English, WordNet. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms interlinked through conceptual-semantic and lexical relations. Traditional unsupervised approaches compute similarity by counting overlapping words between the context and sense definitions, which must match exactly. Similarity should instead be computed based on how words are related rather than on exact overlap, by representing the context and sense definitions in a vector space model and analyzing the distributional semantic relationships among them using latent semantic analysis (LSA). However, as the corpus of text becomes more massive, LSA consumes much more memory and does not scale well to training on a huge corpus. A word-embedding approach has an advantage here. Word2vec is a popular word-embedding approach that represents words in a fixed-size vector space through either the skip-gram or continuous bag-of-words (CBOW) model. Word2vec also captures semantic and syntactic word similarities from a huge corpus of text more effectively than LSA. Our method uses Word2vec to construct a context sentence vector and sense-definition vectors, then gives each word sense a score using the cosine similarity between those sentence vectors. The sense definitions are also expanded with sense relations retrieved from WordNet. If the score is not higher than a specific threshold, it is combined with the probability of that sense's distribution learned from a large sense-tagged corpus, SEMCOR. The senses with the highest scores are taken as the possible answers. Our method's result (50.9%, or 48.7% without the probability of sense distribution) is higher than the baselines (i.e., original, simplified, adapted and LSA Lesk) and outperforms many unsupervised systems that participated in the SENSEVAL-3 English lexical sample task.
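The core scoring step, cosine similarity between an averaged context vector and averaged sense-definition vectors, can be sketched as below. This is a simplified reading of the method: it uses publicly available GloVe vectors via gensim's downloader instead of the authors' word2vec model, and it omits the gloss expansion with WordNet relations and the SEMCOR-based back-off.

```python
# Hedged sketch: rank WordNet senses of a word by cosine similarity between the
# averaged context vector and the averaged sense-definition vector. Model choice
# and tokenization are assumptions; the paper's gloss expansion and sense-frequency
# back-off are omitted.
import numpy as np
import nltk
import gensim.downloader as api
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
kv = api.load("glove-wiki-gigaword-50")  # small pretrained vectors (downloaded once)

def avg_vec(tokens):
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else None

def disambiguate(word, context):
    ctx = avg_vec(context.lower().split())
    best = None
    for sense in wn.synsets(word):
        gloss = avg_vec(sense.definition().lower().split())
        if gloss is None:
            continue
        score = float(np.dot(ctx, gloss) / (np.linalg.norm(ctx) * np.linalg.norm(gloss)))
        if best is None or score > best[1]:
            best = (sense, score)
    return best

sense, score = disambiguate("bank", "he deposited cash at the bank before noon")
print(sense.name(), "|", sense.definition(), "|", round(score, 3))
```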

44 citations


Journal ArticleDOI
TL;DR: In this paper, Latent Semantic Analysis (LSA) is used to extract knowledge from a corpus of 503 abstracts of academic papers published in various journals and conference proceedings.
Abstract: In recent years, Industry 4.0 has received immense attention from the academic community, practitioners and governments across nations, resulting in explosive growth in the publication of articles and thereby making it imperative to reveal and discern the core research areas and research themes of the extant Industry 4.0 literature. The purpose of this paper is to discuss research dynamics and to propose a taxonomy of the Industry 4.0 research landscape along with future research directions. A data-driven text mining approach, Latent Semantic Analysis (LSA), is used to review and extract knowledge from a corpus of 503 abstracts of academic papers published in various journals and conference proceedings. The adopted technique extracts several latent factors that characterise the emerging pattern of research. A cross-loading analysis of high-loaded papers is performed to identify the semantic links between research areas and themes. The LSA results uncover 13 principal research areas and 100 research themes. The study identifies “smart factory” and “new business model” as dominant research areas. A taxonomy is developed which contains five topical areas of the Industry 4.0 field. The data set is based on a systematic article-refining process which includes keyword searches in selected electronic databases and is limited to articles in English only; there is therefore a possibility that related work published in other databases or in languages other than English is not captured in the data set. To the best of the authors’ knowledge, this study is the first of its kind to use the LSA technique to reveal research trends in the Industry 4.0 domain. This review will help scholars and practitioners understand the diversity of, and draw a roadmap for, Industry 4.0 research. The taxonomy and the outlined future research agenda could help practitioners and academicians position their research work.
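The LSA step described here, deriving latent factors from a term-document matrix of abstracts and reading off the top-loading terms, can be sketched as follows; the toy abstracts and the number of factors are placeholders for the study's corpus of 503 abstracts and its 13 extracted research areas.

```python
# Hedged sketch of LSA-based theme extraction from abstracts (illustrative data,
# not the study's corpus or factor count).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

abstracts = [
    "smart factory automation with cyber physical production systems",
    "new business model innovation driven by digital platforms",
    "cyber physical systems and automation in the smart factory",
    "servitization and business model change in manufacturing firms",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)          # term-by-document style matrix

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = vec.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[::-1][:4]        # highest-loading terms for this latent factor
    print(f"latent factor {i}:", ", ".join(terms[j] for j in top))
```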

41 citations


Journal ArticleDOI
TL;DR: To automate and expedite the analysis tasks, this study deployed natural language processing (NLP) and two commonly used unsupervised learning methods for text classification, namely latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).

39 citations


Proceedings ArticleDOI
25 Apr 2019
TL;DR: This paper examines three different text feature extraction approaches for classifying short sentences and phrases into categories with a neural network, in order to find out which method is best at capturing text features and allows the classifier to achieve the highest accuracy.
Abstract: In this paper, we examine the results of applying three different text feature extraction approaches while classifying short sentences and phrases into categories with a neural network, in order to find out which method is best at capturing text features and allows the classifier to achieve the highest accuracy. The examined feature extraction methods include a plain Term Frequency Inverse Document Frequency (TF-IDF) approach and its two modifications obtained by applying different dimensionality reduction techniques: Latent Semantic Analysis (LSA) and Linear Discriminant Analysis (LDA). The results show that the TF-IDF feature extraction approach outperforms the other methods, allowing the classifier to achieve the highest accuracy when working with larger datasets. Furthermore, the results show that TF-IDF in combination with LSA allows the classifier to achieve similar accuracy while working with smaller datasets.
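A minimal version of this comparison, plain TF-IDF features versus TF-IDF reduced with LSA, each feeding the same small neural classifier, might look like the following; the toy phrases, labels, and layer sizes are assumptions, not the paper's datasets or network.

```python
# Hedged sketch: TF-IDF vs. TF-IDF + LSA (truncated SVD) features into the same
# neural classifier; placeholder data and hyper-parameters.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

phrases = ["reset my password", "invoice is overdue", "cannot log in", "question about billing"]
labels = ["account", "billing", "account", "billing"]

tfidf_clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)

for name, clf in [("tf-idf", tfidf_clf), ("tf-idf + lsa", lsa_clf)]:
    clf.fit(phrases, labels)
    print(name, clf.predict(["forgot my password", "billing question"]))
```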

38 citations


Journal ArticleDOI
TL;DR: This research develops a patent mining approach based on the novelty detection statistical technique to identify unusual patents that may provide fresh ideas for potential opportunities; the approach is applied in the telehealth industry, and the findings can help telehealth firms formulate their technology strategies.

38 citations


Journal ArticleDOI
TL;DR: This study demonstrates that a CNN model with pre-trained medical concept vectors can accurately identify target report-pairs with overlapping body sites and potentially accelerate the retrieval process for imaging diagnosis quality measurement.
Abstract: Imaging examinations, such as ultrasonography, magnetic resonance imaging and computed tomography scans, play key roles in healthcare settings. To assess and improve the quality of imaging diagnosis, we need to manually find and compare the pre-existing reports of imaging and pathology examinations which contain overlapping exam body sites from electronic medical records (EMRs). The process of retrieving those reports is time-consuming. In this paper, we propose a convolutional neural network (CNN) based method which can better utilize the semantic information contained in report texts to accelerate the retrieval process. We included 16,354 imaging and pathology report-pairs from 1926 patients who were admitted to Shanghai Tongren Hospital and had ultrasonic examinations between 1st May 2017 and 31st July 2017. We adapted the CNN model to calculate the similarities among the report-pairs to identify target report-pairs with overlapping body sites, and compared the performance with six other conventional models, including keyword mapping, latent semantic analysis (LSA), latent Dirichlet allocation (LDA), Doc2Vec, Siamese long short-term memory (LSTM) and a model based on named entity recognition (NER). We also utilized a graph embedding method to enhance the word representation by capturing semantic relation information from medical ontologies. Additionally, we used the LIME algorithm to identify which features (or words) are decisive for the prediction results and to improve the model interpretability. Experimental results showed that our CNN model gained significant improvement compared to all other conventional models on area under the receiver operating characteristic curve (AUROC), precision, recall and F1-score in our test dataset. The AUROC of our CNN models gained approximately 3–7% improvement. The AUROC of the CNN model with graph-embedding and ontology-based medical concept vectors was 0.8% higher than that of the model with randomly initialized vectors and 1.5% higher than that of the one with pre-trained word vectors. Our study demonstrates that a CNN model with pre-trained medical concept vectors can accurately identify target report-pairs with overlapping body sites and potentially accelerate the retrieval process for imaging diagnosis quality measurement.

Book ChapterDOI
20 Oct 2019
TL;DR: This study focuses on text data and considers coding as a process of identifying words or phrases and categorizing them into codes to facilitate data analysis, and proposes adding a semantic component to the nCoder.
Abstract: Coding is a process of assigning meaning to a given piece of evidence. Evidence may be found in a variety of data types, including documents, research interviews, posts from social media, conversations from learning platforms, or any source of data that may provide insights for the questions under qualitative study. In this study, we focus on text data and consider coding as a process of identifying words or phrases and categorizing them into codes to facilitate data analysis. There are a number of different approaches to generating qualitative codes, such as grounded coding, a priori coding, or using both in an iterative process. However, both qualitative and quantitative analysts face the same coding problem: when the data size is large, manual coding becomes impractical. nCoder is a tool that helps researchers to discover and code key concepts in text data with minimal human judgement. Once reliability and validity are established, nCoder automatically applies the coding scheme to the dataset. However, for concepts that occur infrequently, even with acceptable reliability, the classifier may still produce too many false negatives. This paper explores these problems within the current nCoder and proposes adding a semantic component to it. A tool called “nCoder+” is presented with real data to demonstrate the usefulness of the semantic component. The possible ways of integrating this component and other natural language processing techniques into nCoder are discussed.

Journal ArticleDOI
TL;DR: This paper uses word and paragraph embedding models, learned by shallow neural networks from a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg and Italy), to identify transpositions.
Abstract: The automated identification of national implementations (NIMs) of European directives by text similarity techniques has shown promising preliminary results. Previous works have proposed and utilized unsupervised lexical and semantic similarity techniques based on vector space models, latent semantic analysis and topic models. However, these techniques were evaluated on a small multilingual corpus of directives and NIMs. In this paper, we utilize word and paragraph embedding models learned by shallow neural networks from a multilingual legal corpus of European directives and national legislation (from Ireland, Luxembourg and Italy) to develop unsupervised semantic similarity systems to identify transpositions. We evaluate these models and compare their results with the previous unsupervised methods on a multilingual test corpus of 43 Directives and their corresponding NIMs. We also develop supervised machine learning models to identify transpositions and compare their performance with different feature sets.
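The paragraph-embedding idea, learning document vectors for legal texts and ranking candidate national provisions by similarity to a directive provision, can be sketched with gensim's Doc2Vec as below; the toy provisions and hyper-parameters are assumptions, not the paper's multilingual legal corpus or tuned models.

```python
# Hedged sketch: Doc2Vec vectors for national provisions, queried with a directive
# provision to rank candidate transpositions (illustrative data only).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

provisions = {
    "nim_1": "member states shall ensure consumers receive clear contract information",
    "nim_2": "operators must report energy consumption data annually",
    "nim_3": "suppliers shall provide consumers with transparent contractual terms",
}
train = [TaggedDocument(text.split(), [tag]) for tag, text in provisions.items()]
model = Doc2Vec(train, vector_size=50, min_count=1, epochs=100, seed=1)

directive = "consumers shall be given clear and transparent contract terms".split()
query_vec = model.infer_vector(directive)
print(model.dv.most_similar([query_vec], topn=2))  # candidate transpositions, ranked
```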

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This study proposes a method that can predict the next most appropriate word in the Bangla language and can also suggest the corresponding sentence, contributing to the technology of word prediction systems.
Abstract: Textual information exchange, typing information and sending it to the other end, is one of the most prominent mediums of communication throughout the world. People spend a lot of time sending emails or other information on social networking sites, where typing everything out is redundant and time-consuming. To make textual information exchange faster and easier, word prediction systems have been developed that can predict the next most likely word, so that people do not have to type the next word but can select it from the suggested words. In this study, we propose a method that can predict the next most appropriate word in the Bangla language and can also suggest the corresponding sentence, as a contribution to this technology of word prediction systems. The proposed approach uses a GRU (Gated Recurrent Unit) based RNN (Recurrent Neural Network) on an n-gram dataset to create language models that can predict the word(s) from the input sequence provided. We use a corpus dataset, collected from different Bangla-language sources, to run the experiments. Compared to other methods, such as an LSTM (Long Short-Term Memory) based RNN on the n-gram dataset and Naive Bayes with Latent Semantic Analysis, our proposed approach gives better performance. It gives average accuracies of 99.70% for the 5-gram model, 99.24% for the 4-gram model, 95.84% for the tri-gram model, and 78.15% and 32.17% for the bi-gram and uni-gram models, respectively.
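For readers unfamiliar with the architecture, a minimal GRU next-word predictor of the kind described, an embedding layer, a GRU, and a softmax over the vocabulary, can be sketched in Keras as below. The tiny English corpus and the layer sizes are placeholders; the actual system is trained on a Bangla n-gram corpus.

```python
# Hedged sketch of a GRU-based next-word language model (placeholder corpus and
# hyper-parameters, not the paper's Bangla setup).
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

corpus = ["i am going home now", "i am going to school", "she is going home early"]
tok = Tokenizer()
tok.fit_on_texts(corpus)
vocab = len(tok.word_index) + 1

# Build (prefix -> next word) training pairs from every sentence.
seqs = []
for line in corpus:
    ids = tok.texts_to_sequences([line])[0]
    seqs += [ids[: i + 1] for i in range(1, len(ids))]
maxlen = max(len(s) for s in seqs)
seqs = pad_sequences(seqs, maxlen=maxlen)
X, y = seqs[:, :-1], seqs[:, -1]

model = Sequential([Embedding(vocab, 16), GRU(32), Dense(vocab, activation="softmax")])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=300, verbose=0)

prefix = pad_sequences(tok.texts_to_sequences(["i am going"]), maxlen=maxlen - 1)
next_id = int(np.argmax(model.predict(prefix, verbose=0)))
print("predicted next word:", tok.index_word.get(next_id, "?"))
```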

Journal ArticleDOI
TL;DR: A new summarization approach is proposed that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis, and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts.
Abstract: Sentence-based summarization aims at extracting concise summaries of collections of textual documents. Summaries consist of a worthwhile subset of document sentences. The most effective multilingual strategies rely on Latent Semantic Analysis (LSA) and on frequent itemset mining, respectively. LSA-based summarizers pick the document sentences that cover the most important concepts. Concepts are modeled as combinations of single-document terms and are derived from a term-by-sentence matrix by exploiting Singular Value Decomposition (SVD). Itemset-based summarizers pick the sentences that contain the largest number of frequent itemsets, which represent combinations of frequently co-occurring terms. The main drawbacks of existing approaches are (i) the inability of LSA to consider the correlation between combinations of multiple-document terms and the underlying concepts, (ii) the inherent redundancy of frequent itemsets because similar itemsets may be related to the same concept, and (iii) the inability of itemset-based summarizers to correlate itemsets with the underlying document concepts. To overcome the issues of both of the abovementioned algorithms, we propose a new summarization approach that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts. The summarizer selects the sentences that cover the latent concepts with minimal redundancy. We tested the summarization algorithm on both multilingual and English-language benchmark document collections. The proposed approach performed significantly better than both itemset- and LSA-based summarizers, and better than most of the other state-of-the-art approaches.

Journal ArticleDOI
TL;DR: A new Soft Voting Technique (SVT) is proposed to improve the performance of the Global Filter-based Feature Selection Scheme (GFSS); it improves the numerical discrimination of words to identify their positive and negative membership in a class.
Abstract: In text classification, the Global Filter-based Feature Selection Scheme (GFSS) selects the top-N ranked words as features. It discards the low-ranked features from some classes either partially or completely. The low rank is usually due to the varying occurrence of the words (terms) in the classes. Latent Semantic Analysis (LSA) can be used to address this issue, as it eliminates redundant terms. It assigns an equal rank to terms that represent similar concepts or meanings; e.g., the four terms “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent a similar concept, i.e. “cancer”. Thus, whichever of these four terms the algorithm selects does not affect the classifier performance. However, this does not guarantee that the top-N LSA-ranked terms selected by GFSS are representative terms of each class. The Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative terms from all the classes. However, it has two issues. First, it assigns the class label and membership of each term on the basis of an individual vote of the Odds Ratio (OR) method, thereby limiting the decision-making capability. Second, the ratio of selected terms is determined empirically by the IGFSS, and a common ratio is applied to all the classes to assign the positive and negative membership of the terms. However, the ratio of positive- and negative-nature terms varies from one class to another, and it may be very low for one class and high for others. Thus, the single common negative-feature ratio used by the IGFSS affects those classes of a dataset in which there is an imbalance between positive- and negative-nature words. To address these issues of the IGFSS, a new Soft Voting Technique (SVT) is proposed to improve the performance of GFSS. There are two main contributions in this paper: (i) the weighted average score (soft vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficient (GSS), improves the numerical discrimination of words to identify their positive and negative membership in a class; (ii) a mathematical expression is incorporated in the IGFSS that computes a varying ratio of positive and negative memberships of the terms for each class, based on the occurrence of the terms in the classes. The proposed SVT is evaluated using four standard classifiers applied on five benchmark datasets. The experimental results based on Macro_F1 and Micro_F1 measures show that SVT achieves a significant improvement in classifier performance in comparison with the standard methods.

Journal ArticleDOI
TL;DR: A machine learning method to identify marketing intentions from large-scale We-Media data is proposed; the proposed Latent Semantic Analysis (LSI)-Word2vec model can reflect semantic features, and the decision tree model is simplified by pruning to save computing resources and reduce the time complexity.
Abstract: Social network services for self-media, such as Weibo, Blog, and WeChat Public, constitute a powerful medium that allows users to publish posts every day. Due to insufficient information transparency, malicious marketing on the Internet through self-media posts imposes potential harm on society. Therefore, it is necessary to identify news with marketing intentions. We follow the idea of text classification to identify marketing intentions. Although there are some current methods to address intention detection, the challenge is how to make text feature extraction reflect semantic information and how to improve the time complexity and space complexity of the recognition model. To this end, this paper proposes a machine learning method to identify marketing intentions from large-scale We-Media data. First, the proposed Latent Semantic Analysis (LSI)-Word2vec model can reflect the semantic features. Second, the decision tree model is simplified by decision tree pruning to save computing resources and reduce the time complexity. Third, this paper examines the effects of classifier associations and uses the optimal configuration to help people efficiently identify marketing intention. Finally, the detailed experimental evaluation on several metrics shows that our approaches are effective and efficient. The F1 value can be increased by about 5%, and the running time is increased by 20%, which proves that the newly proposed method can effectively improve the accuracy of marketing news recognition.

Proceedings ArticleDOI
18 Dec 2019
TL;DR: This study shows that LDA gives a better result than LSA; LSA does not consider the relationships between documents in the corpus, while LDA does.
Abstract: The industrial world has entered the era of Industrial Revolution 4.0. In this era, there is an urgent need for data from the community to support service policies. Because of that, the Surabaya Government created Media Center Surabaya. This medium is used to accommodate the aspirations of Surabaya citizens, who can access it through Twitter. The topics discussed on Twitter are important information, and this information can be used to improve the performance of Surabaya Government services. Twitter data are text data consisting of thousands of variables. Text mining is frequently used to analyze this kind of data, including topic modeling and sentiment analysis. This study works on topic modeling, focusing on the Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) algorithms. The evaluation of algorithm performance uses topic coherence. As unstructured data, the Twitter data need preprocessing before the analysis; the stages of preprocessing include cleansing, stemming, and stop-word removal. The advantages of LSA are that it is fast and easy to implement. LSA, on the other hand, does not consider the relationships between documents in the corpus, while LDA does. This study shows that LDA gives a better result than LSA.
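The comparison itself, fitting LSA and LDA topic models on the same corpus and scoring them with topic coherence, can be sketched with gensim as below; the handful of toy "tweets", the topic count, and the u_mass coherence choice are assumptions, not the study's Indonesian Twitter data or its exact evaluation.

```python
# Hedged sketch: LSA vs. LDA topic models scored with topic coherence
# (placeholder tweets; the study preprocesses real Twitter data first).
from gensim.corpora import Dictionary
from gensim.models import LsiModel, LdaModel, CoherenceModel

tweets = [
    "street lights broken near the park".split(),
    "please fix the broken street lights".split(),
    "garbage collection late again this week".split(),
    "late garbage trucks in my neighborhood".split(),
]
dictionary = Dictionary(tweets)
corpus = [dictionary.doc2bow(t) for t in tweets]

lsa = LsiModel(corpus, id2word=dictionary, num_topics=2)
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0, passes=20)

for name, model in [("LSA", lsa), ("LDA", lda)]:
    cm = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary,
                        coherence="u_mass", topn=5)
    print(name, "coherence:", round(cm.get_coherence(), 3))
```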

Proceedings ArticleDOI
01 Nov 2019
TL;DR: Two content-based recommendation approaches are presented which automatically detect and recommend dependencies between requirements that are defined on a textual level, by exploiting document classification techniques.
Abstract: There is a high demand for intelligent decision support systems which assist stakeholders in requirements engineering tasks. Examples of such tasks are the elicitation of requirements, release planning, and the identification of requirement-dependencies. In particular, the detection of dependencies between requirements is a major challenge for stakeholders. In this paper, we present two content-based recommendation approaches which automatically detect and recommend such dependencies. The first approach identifies potential dependencies between requirements which are defined on a textual level by exploiting document classification techniques (based on Linear SVM, Naive Bayes, Random Forest, and k-Nearest Neighbors). This approach uses two different feature types (TF-IDF features vs. probabilistic features). The second recommendation approach is based on Latent Semantic Analysis and defines the baseline for the evaluation with a real-world data set. The evaluation shows that the recommendation approach based on Random Forest using probabilistic features achieves the best prediction quality of all approaches (F1: 0.89).
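One plausible shape of the first, classification-based approach is sketched below: represent a requirement pair as the concatenation of both texts, extract TF-IDF features, and let a Random Forest predict whether a dependency exists. The pairing strategy, toy requirements, and labels are assumptions, not the paper's exact feature design (which also evaluates probabilistic features and an LSA baseline).

```python
# Hedged sketch: TF-IDF features over concatenated requirement pairs, classified
# by a Random Forest as dependent / not dependent (illustrative data only).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pairs = [
    ("the system shall encrypt stored data", "the system shall manage encryption keys"),
    ("the ui shall support dark mode", "reports shall be exportable as pdf"),
    ("users shall log in with single sign on", "the system shall integrate with the identity provider"),
    ("the app shall cache map tiles offline", "invoices shall be emailed monthly"),
]
labels = [1, 0, 1, 0]  # 1 = dependency, 0 = no dependency (toy labels)

texts = [a + " " + b for a, b in pairs]
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(texts, labels)

query = "the system shall hash passwords " + "the system shall store password salts"
print(clf.predict([query]))
```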

Journal ArticleDOI
TL;DR: A gravitational search algorithm, which works on the basis of the law of gravity, is adopted to optimize the summary of the document; the experimental results are better than those of existing state-of-the-art methods in terms of various performance metrics.
Abstract: Text summarization is the extraction of important text from the original document. The objective of any automatic text summarization system, especially in the legal domain, is to produce a summary which is close to human-generated summaries. In this article, we present the summarization of legal documents as a binary optimization problem where the fitness of a solution is derived from the weighting of individual statistical features of each sentence, such as sentence length, sentence position, degree of similarity, term frequency–inverse sentence frequency and keywords, to generate the summary of the document. A gravitational search algorithm, which works on the basis of the law of gravity, is adopted to optimize the summary. To show the efficacy of the proposed method, we compare the experimental results with particle swarm optimization, genetic algorithm, TextRank, latent semantic analysis, MEAD, MS-Word, and SumBasic using ROUGE evaluation metrics on the FIRE-2014 data set. The experimental results show that the proposed method performs better than the existing state-of-the-art methods in terms of various performance metrics.

Book ChapterDOI
28 Mar 2019
TL;DR: Fuzzy ontology is used to check the consistency and coherence of the essay as it is the best way to overcome the vagueness of the language, and the system will also provide a score with feedback to the student.
Abstract: Recent learning research has shown that creativity is an essential concern in education. Essay questions are among the best means to evaluate learning outcomes and students’ creativity. However, evaluating these questions is a time-consuming task, and subjectivity in scoring assessments remains inevitable. Automated essay evaluation (AEE) systems provide a cost-effective and consistent alternative to human marking. Therefore, numerous automatic essay-grading systems have been developed to lessen the demands of manual essay grading. However, these systems concentrate on syntax and vocabulary, and no consideration is paid to the semantics and coherence of the essay. Moreover, few of the existing systems are able to give students informative feedback based on extensive domain knowledge. In this paper, a system is developed that uses latent semantic analysis (LSA) and fuzzy ontology to evaluate essays, where LSA is responsible for checking the semantics. Fuzzy ontology is used to check the consistency and coherence of the essay, as it is the best way to overcome the vagueness of language, and the system also provides a score with feedback to the student. Experimental results were good in evaluating the essay both syntactically and semantically.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: Three computational approaches based on Latent Semantic Analysis, Latent Dirichlet Allocation and term frequency–inverse document frequency were tested; the TF-IDF-based measure correlated best with expert evaluation, but none of the approaches match human judgement well enough to replace it.
Abstract: On crowdsourcing ideation websites, companies can easily collect a large number of ideas. Screening such a volume of ideas is very costly and challenging, necessitating automatic approaches. It would be particularly useful to automatically evaluate idea novelty, since companies commonly seek novel ideas. Three computational approaches were tested, based on Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and term frequency–inverse document frequency (TF-IDF), respectively. These three approaches were applied to three sets of ideas, and the computed idea novelty was compared with human expert evaluation. The TF-IDF based measure correlated better with expert evaluation than the other two measures. However, our results show that these approaches do not match human judgement well enough to replace it.
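One plausible reading of the TF-IDF-based novelty measure is to score each idea by one minus its highest cosine similarity to the other ideas in the pool, as sketched below; this is an illustrative formulation with made-up ideas, not necessarily the paper's exact metric.

```python
# Hedged sketch of a TF-IDF novelty score: an idea is novel if it is dissimilar to
# every other idea in the pool (toy ideas; assumed formulation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "a phone case with a built in bottle opener",
    "a phone case that doubles as a bottle opener",
    "a solar powered backpack that charges devices while hiking",
]
X = TfidfVectorizer().fit_transform(ideas)
sim = cosine_similarity(X)
np.fill_diagonal(sim, 0.0)             # ignore self-similarity
novelty = 1.0 - sim.max(axis=1)        # high score = unlike every other idea
for idea, score in zip(ideas, novelty):
    print(round(float(score), 3), idea)
```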


Journal ArticleDOI
TL;DR: The lsemantica command is presented, which implements latent semantic analysis in Stata using truncated singular value decomposition and provides a simple command for latent semantic analysis as well as complementary commands for text similarity comparison.
Abstract: In this article, I present the lsemantica command, which implements latent semantic analysis in Stata. Latent semantic analysis is a machine learning algorithm for word and text similarity comparison and uses truncated singular value decomposition to derive the hidden semantic relationships between words and texts. lsemantica provides a simple command for latent semantic analysis as well as complementary commands for text similarity comparison.
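For readers outside Stata, the underlying computation, truncated SVD over a document-term matrix followed by similarity comparison in the latent space, can be sketched in Python as below; this is the general technique, not the lsemantica command's own syntax or defaults.

```python
# Hedged Python sketch of LSA via truncated SVD plus text similarity comparison
# (the general technique behind lsemantica, not the Stata command itself).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "monetary policy and inflation expectations",
    "central bank decisions on interest rates",
    "school funding and student achievement",
]
X = TfidfVectorizer().fit_transform(texts)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # latent semantic space
print(cosine_similarity(Z).round(2))  # pairwise text similarity in that space
```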

Journal ArticleDOI
TL;DR: A small but growing group of accounts in psychology, linguistics, and information retrieval treats semantic models as exemplar-based, borrowing many of the ideas that have led to the prominence of exemplar models in fields such as categorisation.
Abstract: Abstraction is a core principle of Distributional Semantic Models (DSMs) that learn semantic representations for words by applying dimensional reduction to statistical redundancies in language...

Proceedings ArticleDOI
24 Apr 2019
TL;DR: A unifying approach representing topic-based models is proposed, from which state-of-the-art semantic relatedness measures are divided into two distinct types: topic-based and ontology-based models.
Abstract: Over the last decades, a multitude of semantic relatedness measures have been proposed. Despite an extensive amount of work dedicated to this area of research, the understanding of their foundations is still limited in real-world applications. In this paper, a unifying approach representing topic-based models is proposed, from which the state-of-the-art semantic relatedness measures are divided into two distinct types: topic-based and ontology-based models. Despite extensive research on ontology-based models, topic-based models have not received comparable attention. The unified approach is able to highlight equivalences among these models and propose bridges between their theoretical bases. Moreover, presenting a comprehensive unifying approach to topic-based models gives readers a common understanding of them despite the differences and complexities of their architectures and configuration details. In order to evaluate topic-based models in comparison to ontology-based models, comprehensive experiments on the semantic relatedness of geographic phrases were conducted. Empirical results demonstrate that topic-based models not only face fewer restrictions in the real world than ontology-based models, but their performance in computing the semantic relatedness of geographic phrases is also significantly superior.

Proceedings ArticleDOI
Songqian Li, Kun Ma, Xuewei Niu, Yufeng Wang, Ke Ji, Ziqiang Yu, Zhenxiang Chen
01 Aug 2019
TL;DR: Experimental analysis of real-world data demonstrates that the proposed pipeline, which combines preprocessing, feature extraction and model fusion for more accurate and automated prediction, achieves higher accuracy than existing approaches.
Abstract: With the arrival of the self-media age, anyone can author content in the era of big data. This has caused a mass of fake news to appear on the network. Authors of such fake news mislead the public by spreading it, which brings them economic and social benefits. Existing work focuses on using various types of article features in the hope of finding a way to accurately identify fake news, but this undermines generality. In this paper, we propose a pipeline that combines preprocessing, feature extraction and model fusion for a more accurate and automated prediction. Specifically, we fuse the results of latent semantic analysis (LSA) and ensemble learning models using stacking. Experimental analysis of real-world data demonstrates that our pipeline achieves higher accuracy than existing approaches.
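The fusion step, stacking an LSA-based classifier with an ensemble learner under a meta-learner, could be sketched as follows; the toy headlines, the particular base models, and the meta-learner are assumptions, not the paper's tuned pipeline.

```python
# Hedged sketch of stacking an LSA-based branch with an ensemble branch
# (placeholder data and models, not the paper's configuration).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

headlines = [
    "scientists confirm water found on distant exoplanet",
    "miracle pill melts fat overnight doctors hate it",
    "city council approves new budget for public transit",
    "secret cure for all diseases hidden by the government",
    "university study links exercise to better sleep",
    "aliens built the pyramids claims anonymous insider",
]
labels = [0, 1, 0, 1, 0, 1]  # 0 = credible, 1 = fake (toy placeholder labels)

lsa_branch = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=3, random_state=0),
                           LogisticRegression())
forest_branch = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))

stack = StackingClassifier(
    estimators=[("lsa", lsa_branch), ("forest", forest_branch)],
    final_estimator=LogisticRegression(),
    cv=2,
)
stack.fit(headlines, labels)
print(stack.predict(["shocking trick reverses aging instantly"]))
```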

Book ChapterDOI
TL;DR: The main goal of this paper is to explore latent topic analysis (LTA), in the context of quantum information retrieval, with results suggesting that the quantum-motivated representation is an alternative for geometrical latent topic modeling worthy of further exploration.
Abstract: The main goal of this paper is to explore latent topic analysis (LTA) in the context of quantum information retrieval. LTA is a valuable technique for document analysis and representation, which has been extensively used in information retrieval and machine learning. Different LTA techniques have been proposed, some based on geometrical modeling (such as latent semantic analysis, LSA) and others based on a strong statistical foundation. However, these two different approaches are not usually mixed. Quantum information retrieval has the remarkable virtue of combining both geometry and probability in a common principled framework. We built on this quantum framework to propose a new LTA method, which has a clear geometrical motivation but also supports a well-founded probabilistic interpretation. Initial exploratory experiments were performed on three standard data sets. The results show that the proposed method outperforms LSA on two of the three datasets. These results suggest that the quantum-motivated representation is an alternative for geometrical latent topic modeling worthy of further exploration.

Journal ArticleDOI
TL;DR: It is established, via cross-sectional regression with selection, that the identified themes have a statistically significant effect on reporting firms’ return-on-assets (ROA), even after controlling for factors known to explain the cross-sectional variation in ROA and the self-selectivity of firms that engage in CSR practices.
Abstract: We propose a novel and objective statistical method known as latent semantic analysis (LSA), used in search engine procedures and information retrieval applications, as a methodological alternative for textual analysis in corporate social responsibility (CSR) research. LSA is a language processing technique that allows recognition of textual associative patterns and permits statistical extraction of common textual themes that characterize an entire set of documents, as well as tracking the relative prevalence of each theme over time and across entities. LSA possesses all the advantages of quantitative textual analysis methods (reliability control and bias reduction), is automated (meaning it can process numerous documents in minutes, as opposed to the time and resources needed to perform subjective scoring of text passages) and can be combined in mixed-method research approaches. To demonstrate the method, our empirical application analyzes the CSR reports of Hellenic companies, and first testifies that eight (five) recurring and common textual themes can explain about 50% (40%) of the variation in their CSR reports. We further establish—via cross-sectional regression with selection—that the identified themes have a statistically significant effect on reporting firms’ return-on-assets (ROA), even after controlling for factors known to explain the cross-sectional variation in ROA and the self-selectivity of firms that engage in CSR practices.