
Showing papers on "Latent semantic analysis published in 2020"


Journal ArticleDOI
14 Jul 2020
TL;DR: Investigating the topic modeling subject and its common application areas, methods, and tools sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
Abstract: With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information or more on the topic being discussed from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques that have gained popularity in recent years. This paper investigates the topic modeling subject and its common application areas, methods, and tools. Also, we examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their benefits practically in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of included topic modeling methods based on the topic quality and some standard statistical evaluation metrics, like recall, precision, F-score, and topic coherence. As a result, latent Dirichlet allocation and non-negative matrix factorization methods delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
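The five methods compared above are all available in scikit-learn, so a minimal sketch of such a comparison might look as follows; the toy documents, component counts, and the use of TF-IDF throughout are illustrative assumptions, not the paper's datasets or settings.

```python
# Illustrative run of the five topic/feature models discussed above on a few
# toy short texts with scikit-learn (not the paper's datasets or settings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, NMF, PCA
from sklearn.random_projection import GaussianRandomProjection

docs = [
    "battery life of this phone is great",
    "screen cracked after one drop",
    "delivery was fast and packaging was fine",
    "customer service never replied to my email",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

models = {
    "LSA": TruncatedSVD(n_components=2, random_state=0),
    "LDA": LatentDirichletAllocation(n_components=2, random_state=0),  # usually fit on raw counts
    "NMF": NMF(n_components=2, init="nndsvda", random_state=0),
    "RP":  GaussianRandomProjection(n_components=2, random_state=0),
    "PCA": PCA(n_components=2, random_state=0),
}

for name, model in models.items():
    data = X.toarray() if name == "PCA" else X   # PCA requires a dense matrix
    doc_topic = model.fit_transform(data)
    print(name, doc_topic.shape)
```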

134 citations


Journal ArticleDOI
TL;DR: A hybrid bibliometric approach combining direct citation network analysis and text analytics is proposed to examine related research articles retrieved from the Web of Science database, enabling a rapid understanding of the overall research development of technology-enhanced learning (TEL) in higher education.

126 citations


Journal ArticleDOI
TL;DR: A knowledge-based method is proposed, modeling the problem with semantic space and semantic path hidden behind a given sentence, and achieving state-of-the-art performance in several WSD datasets.
Abstract: Word Sense Disambiguation (WSD) has been a basic and ongoing issue since its introduction to the natural language processing (NLP) community. Its applications lie in many different areas, including sentiment analysis, Information Retrieval (IR), machine translation and knowledge graph construction. Solutions to WSD are mostly categorized into supervised and knowledge-based approaches. In this paper, a knowledge-based method is proposed, modeling the problem with the semantic space and semantic path hidden behind a given sentence. The approach relies on the well-known Knowledge Base (KB) WordNet and models the semantic space and semantic path by Latent Semantic Analysis (LSA) and PageRank, respectively. Experiments have demonstrated the method's effectiveness, achieving state-of-the-art performance on several WSD datasets.
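As a rough illustration of the knowledge-based idea (WordNet senses ranked by PageRank over a sense graph), a simplified sketch is given below; the edge-building rule, the 0.1 similarity threshold, and the omission of the paper's LSA semantic-space component are all simplifying assumptions.

```python
# Simplified WordNet + PageRank sense ranking; this omits the paper's LSA
# semantic space and uses an arbitrary path-similarity threshold for edges.
# Requires: pip install nltk networkx; nltk.download("wordnet")
import networkx as nx
from nltk.corpus import wordnet as wn

def disambiguate(target, context_words):
    senses = {w: wn.synsets(w) for w in [target] + context_words}
    graph = nx.Graph()
    words = list(senses)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            for s1 in senses[w1]:
                for s2 in senses[w2]:
                    sim = s1.path_similarity(s2)
                    if sim is not None and sim > 0.1:   # assumed threshold
                        graph.add_edge(s1.name(), s2.name(), weight=sim)
    if graph.number_of_nodes() == 0:
        return None
    ranks = nx.pagerank(graph, weight="weight")
    candidates = [s.name() for s in senses[target] if s.name() in ranks]
    return max(candidates, key=ranks.get) if candidates else None

print(disambiguate("bank", ["river", "water", "shore"]))
```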

87 citations


Journal ArticleDOI
TL;DR: A new topic modeling method called W2V-LSA, based on Word2vec and spherical k-means clustering, better captures and represents the context of a corpus and can be a competitive alternative for topic modeling, providing direction for future research in technology trend analysis.
Abstract: Blockchain has become one of the core technologies in Industry 4.0. To help decision-makers establish action plans based on blockchain, analyzing trends in blockchain technology is an urgent task. However, most existing studies on blockchain trend analysis are based either on effort-demanding full-text investigation or on traditional bibliometric methods whose scope is limited to frequency-based statistical analysis. Therefore, in this paper, we propose a new topic modeling method called Word2vec-based Latent Semantic Analysis (W2V-LSA), which is based on Word2vec and spherical k-means clustering to better capture and represent the context of a corpus. We then used W2V-LSA to perform an annual trend analysis of blockchain research by country and time for 231 abstracts of blockchain-related papers published over the past five years. The performance of the proposed algorithm was compared to Probabilistic LSA, one of the common topic modeling techniques. The experimental results confirmed the usefulness of W2V-LSA in terms of the accuracy and diversity of topics, by quantitative and qualitative evaluation. The proposed method can be a competitive alternative for better topic modeling to provide direction for future research in technology trend analysis, and it is applicable to various expert systems related to text mining.
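A sketch of the Word2vec plus spherical k-means step is shown below, approximating spherical k-means by L2-normalizing the word vectors before standard k-means; the toy sentences, vector size, and cluster count are assumptions rather than the paper's 231-abstract corpus or settings.

```python
# Word2vec embeddings clustered with (approximate) spherical k-means:
# vectors are L2-normalised so that Euclidean k-means acts on cosine geometry.
from gensim.models import Word2Vec
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

sentences = [
    ["blockchain", "ledger", "consensus", "proof", "work"],
    ["smart", "contract", "ethereum", "token", "gas"],
    ["supply", "chain", "traceability", "provenance", "ledger"],
]

w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100, seed=1)
words = list(w2v.wv.index_to_key)
vectors = normalize(w2v.wv[words])            # unit-length word vectors

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(vectors)
for k in range(3):
    print("topic", k, [w for w, c in zip(words, labels) if c == k])
```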

79 citations


Book ChapterDOI
08 Apr 2020
TL;DR: The overall results of the study were that semantics is paramount in processing natural languages and aids in machine learning.
Abstract: Semantics is a branch of linguistics that investigates the meaning of language. Semantics deals with the meaning of sentences and words as fundamentals in the world. Semantic analysis within the framework of natural language processing evaluates and represents human language, analyzing texts written in English and other natural languages with an interpretation similar to that of human beings. This study aimed to critically review semantic analysis and revealed that explicit semantic analysis, latent semantic analysis, and sentiment analysis contribute to the learning of natural languages and texts, enable computers to process natural languages, and reveal opinion attitudes in texts. The future prospect lies in the domain of sentiment lexicons. The overall result of the study is that semantics is paramount in processing natural languages and aids in machine learning. The study covers various aspects, including Natural Language Processing (NLP), Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), and Sentiment Analysis (SA), in different sections; LSA is covered in detail with specific inputs from various sources. The study also highlights the future prospects of the semantic analysis domain and concludes with a results section, where areas of improvement are highlighted and recommendations are made for future research. The weaknesses and limitations of the study are addressed in the discussion (Sect. 4) and results (Sect. 5).

44 citations


Journal ArticleDOI
TL;DR: Experiments show that the proposed deeper version of the latent factor model significantly outperforms all state-of-the-art collaborative filtering techniques.

42 citations


Journal ArticleDOI
TL;DR: The main objective of this research paper is to design a system that generates a multimodal, nonparametric Bayesian model and multilayered probabilistic latent semantic analysis (pLSA)-based visual dictionary (BM-MpLSA).
Abstract: The main objective of this research paper is to design a system that generates a multimodal, nonparametric Bayesian model and multilayered probabilistic latent semantic analysis (pLSA)-based visual dictionary (BM-MpLSA). Advances in technology and the exuberance of sports lovers have created a requirement for automatic action recognition in the live video feeds of sports. The fundamental requirement for such a model is the creation of a visual dictionary for each sports domain. This multimodal nonparametric model has two novel co-occurrence matrix constructions: one for image feature vectors and the other for textual entities. This matrix provides a basic scaling parameter for the unobserved random variables, and it is an extension of multilayered pLSA-based visual dictionary creation. This paper concentrates specifically on the creation of a visual dictionary for basketball. From the sports event images, the extracted feature vectors are modified SIFT and MPEG-7-based dominant color, color layout, scalable color and edge histogram descriptors. After quantization and analysis of these vector values, the visual vocabulary is created by integrating them into a domain-specific visual ontology for semantic understanding. The accuracy rate of this work is evaluated with respect to the actions depicted in the images.

36 citations


Journal ArticleDOI
TL;DR: A new plagiarism detection technique between C++ and Java source codes, based on semantics in a multimedia-based e-Learning and smart assessment methodology, is proposed; the experimental results show better semantic similarity results for plagiarism detection in comparative evaluation.
Abstract: The multimedia-based e-Learning methodology provides virtual classrooms to students. The teacher uploads learning materials, programming assignments and quizzes to the university's Learning Management System (LMS). The students learn lessons from the uploaded videos and then solve the given programming tasks and quizzes. Source code plagiarism is a serious threat to academia; however, identifying similar source code fragments across different programming languages is a challenging task. To solve this problem, this paper proposes a new plagiarism detection technique between C++ and Java source codes based on semantics, within a multimedia-based e-Learning and smart assessment methodology. First, it transforms source codes into tokens to calculate semantic similarity in a token-by-token comparison. After that, it computes a scalar semantic similarity value for the complete source codes written in C++ and Java. For the experiment, we use a dataset consisting of four case studies, Factorial, Bubble Sort, Binary Search and the Stack data structure, in both C++ and Java. The entire experiment is done in R Studio with R version 3.4.2. The experimental results show better semantic similarity results for plagiarism detection in comparative evaluation.

36 citations


Journal ArticleDOI
TL;DR: The Semantic Scale Network is introduced, an easy-to-use online application that can support expert judgments on scale redundancy without access to empirical data or awareness of every potentially related scale.
Abstract: Psychological measurement and theory are afflicted with an ongoing proliferation of new constructs and scales. Given the often redundant nature of new scales, psychological science is struggling with arbitrary measurement, construct dilution, and disconnection between research groups. To address these issues, we introduce an easy-to-use online application: the Semantic Scale Network. The purpose of this application is to automatically detect semantic overlap between scales through latent semantic analysis. Authors and reviewers can enter the items of a new scale into the application, and receive quantifications of semantic overlap with related scales in the application's corpus. Contrary to traditional assessments of scale overlap, the application can support expert judgments on scale redundancy without access to empirical data or awareness of every potentially related scale. After a brief introduction to measures of semantic similarity in texts, we introduce the Semantic Scale Network and provide best practices for interpreting its outputs.
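The core overlap computation can be illustrated in a few lines: embed the items of two scales in a low-rank LSA space and average the cross-scale cosine similarities. The items, dimensionality, and aggregation rule below are toy assumptions, not the Semantic Scale Network's corpus or code.

```python
# Toy illustration of LSA-based semantic overlap between two questionnaire scales.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

scale_a = ["I feel exhausted at work", "My job drains my energy"]
scale_b = ["I am emotionally worn out by my job", "Work leaves me feeling drained"]

items = scale_a + scale_b
X = TfidfVectorizer().fit_transform(items)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # LSA space

overlap = cosine_similarity(Z[: len(scale_a)], Z[len(scale_a):]).mean()
print("mean cross-scale semantic overlap:", round(float(overlap), 3))
```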

33 citations


Journal ArticleDOI
TL;DR: This paper presents a comparison study on scientific unstructured text document classification (e-books) based on full text, applying the most popular topic modeling approaches (LDA, LSA) to cluster words into sets of topics that serve as important keywords for classification.
Abstract: With the rapid growth of information technology, the amount of unstructured text data in digital libraries has rapidly increased, and analyzing, organizing, and automatically classifying the text in e-research repositories in order to benefit from it has become a major challenge. Manual categorization of text documents requires considerable financial and human resources, so topic modeling is used to classify documents instead. This paper presents a comparison study on scientific unstructured text document classification (e-books) based on the full text, applying the most popular topic modeling approaches (LDA, LSA) to cluster words into sets of topics that serve as important keywords for classification. Our dataset consists of 300 books containing about 23 million words of full text. In the topic models used (LSA, LDA), each word in the corpus vocabulary is connected with one or more topics with a probability estimated by the model. Many LDA and LSA models were built with different coherence values, and the one producing the highest coherence value was selected. The results show that LDA performs better than LSA: the best LDA coherence value was 0.592179 with 20 topics, while the best LSA coherence value was 0.5773026 with 10 topics.
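The coherence-based model selection described above can be reproduced in outline with Gensim; the toy corpus, topic counts, and training passes below are placeholders for the paper's 300-book dataset and its sweep over topic numbers.

```python
# Outline of the LSA-vs-LDA coherence comparison with Gensim on a toy corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel, CoherenceModel

texts = [
    ["network", "protocol", "routing", "packet"],
    ["neural", "network", "training", "gradient"],
    ["database", "query", "index", "transaction"],
    ["gradient", "descent", "optimization", "loss"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
lsa = LsiModel(corpus, num_topics=2, id2word=dictionary)

for name, model in [("LDA", lda), ("LSA", lsa)]:
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence="c_v")
    print(name, "c_v coherence:", round(cm.get_coherence(), 4))
```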

30 citations


Journal ArticleDOI
TL;DR: This work aims to recommend relevant literature articles for datasets using an information retrieval paradigm, with the ultimate goal of increasing the productivity of researchers.

Journal ArticleDOI
TL;DR: A framework for personalized web page recommendation based on a hybridized strategy is proposed; personalization is achieved by prioritizing the web pages based on a prioritization vector.
Abstract: The World Wide Web is constantly evolving and is the most dynamic information repository that has ever existed. Since the information on the web changes continuously, and owing to the presence of a large number of similar web pages, retrieving the most relevant information is very challenging. With a large number of malicious and fake web pages, it is also necessary to retrieve web pages that are trustworthy. Personalized recommendation of web pages is necessary to estimate user interests and suggest web pages according to their choices. Moreover, the Web is tending towards a more organized Semantic Web, which primarily requires semantic techniques for recommending web pages. In this paper, a framework for personalized web page recommendation based on a hybridized strategy is proposed. Web pages are recommended based on the user query by analyzing the …

Journal ArticleDOI
TL;DR: A deep learning framework combining word embeddings, bi-directional long short-term memory (Bi-LSTM), and convolutional neural networks (CNN) to identify emotion labels from psychiatric social texts is proposed.
Abstract: Discussion features in online communities can be effectively used to diagnose depression and allow other users or experts to provide self-help resources to those in need. Automatic emotion identification models can quickly and effectively highlight indicators of emotional stress in the text of such discussions. Such communities also provide patients with important knowledge to help them better understand their condition. This study proposes a deep learning framework combining word embeddings, bi-directional long short-term memory (Bi-LSTM), and convolutional neural networks (CNN) to identify emotion labels from psychiatric social texts. The Bi-LSTM is a powerful mechanism for extracting features from sequential data, in which a sentence consists of multiple words in a particular sequence. The CNN is another powerful feature extractor that convolves over many blocks to capture important local features. Our proposed deep learning framework also applies word representation techniques to represent semantic relationships between words. The paper thus combines two powerful feature extraction methods with word embeddings to automatically identify indicators of emotional stress. Experimental results show that our proposed framework outperformed other models using traditional feature extraction such as bag-of-words (BOW), latent semantic analysis (LSA), independent component analysis (ICA), and LSA+ICA.
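A minimal tf.keras sketch of an embedding + Bi-LSTM + CNN classifier of the kind described above is shown below; the vocabulary size, sequence length, layer sizes, and the exact way the two extractors are stacked are assumptions, not the authors' architecture.

```python
# Minimal embedding + Bi-LSTM + CNN emotion classifier (illustrative sizes only).
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, n_labels = 20000, 100, 6   # assumed values

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, 128)(inputs)                         # word embeddings
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # sequential features
x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)            # local n-gram features
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(n_labels, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```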

Journal ArticleDOI
15 Apr 2020
TL;DR: The experiments show that, depending on the type of content and the metric, the performance of the feature extraction methods differs considerably: some methods are better in certain cases and worse in others.
Abstract: This paper analyses the capabilities of different techniques to build a semantic representation of educational digital resources. Educational digital resources are modeled using the Learning Object Metadata (LOM) standard, and these semantic representations can be obtained from different LOM fields, such as the title and description, in order to extract the features/characteristics of the digital resources. The feature extraction methods used in this paper are Best Matching 25 (BM25), Latent Semantic Analysis (LSA), Doc2Vec, and Latent Dirichlet Allocation (LDA). The features/descriptors they generate are tested on three types of educational digital resources (scientific publications, learning objects, patents), a paraphrase corpus, and two use cases: an information retrieval context and an educational recommendation system. For this analysis, unsupervised metrics are used to determine the quality of the features proposed by each method, namely two similarity functions and entropy. In addition, the paper presents tests of the techniques for the classification of paraphrases. The experiments show that, depending on the type of content and the metric, the performance of the feature extraction methods differs considerably; some methods are better in certain cases and worse in others.

Journal ArticleDOI
TL;DR: This work modifies the Weighted TF_IDF (Term Frequency Inverse Document Frequency) algorithm to summarize books into relevant keywords and finds that it is an efficient algorithm for automating text summarization, producing an effective summary which is then converted from text to speech.
Abstract: Owing to the phenomenal growth in communication technology, most of us hardly have time to read books. The habit of reading is slowly diminishing because of people's busy lives. For visually challenged people, the situation is even worse. To address this impediment, we develop a better and more accurate methodology than the existing ones. In this work, in order to save the effort of reading the complete text every time, we modify the Weighted TF_IDF (Term Frequency Inverse Document Frequency) algorithm to summarize books into relevant keywords. We then compare the modified algorithm with the existing TextRank, Luhn, LexRank, and Latent Semantic Analysis (LSA) algorithms. From the comparative analysis, we find that Weighted TF_IDF is an efficient algorithm for automating text summarization and producing an effective summary, which is then converted from text to speech. Thus, the proposed algorithm would be highly useful for blind people.
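Since the paper's specific weighting scheme is not spelled out here, the sketch below shows only the baseline idea it modifies: score each sentence by its average TF-IDF weight and keep the top-ranked sentences in their original order.

```python
# Baseline TF-IDF sentence-scoring summarizer (not the paper's weighted variant).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(text, n_sentences=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()    # average term weight per sentence
    keep = sorted(np.argsort(scores)[-n_sentences:])   # top sentences, original order
    return ". ".join(sentences[i] for i in keep) + "."

doc = ("Text summarization shortens long documents. It keeps the key ideas. "
       "Extractive methods pick existing sentences. Abstractive methods rewrite them. "
       "TF-IDF weights words by how distinctive they are.")
print(summarize(doc))
```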

Journal ArticleDOI
TL;DR: A metric to measure the diversity of a set of captions, derived from latent semantic analysis (LSA) and then kernelized using CIDEr, is proposed; the experimental results show that maximizing the determinant of the ensemble matrix outperforms other methods, considerably improving diversity and accuracy.
Abstract: In this paper, we first propose a metric to measure the diversity of a set of captions, which is derived from latent semantic analysis (LSA), and then kernelize LSA using CIDEr similarity. Compared with mBLEU, our proposed diversity metrics show a relatively strong correlation to human evaluation. We conduct extensive experiments, finding that the models that aim to generate captions with higher CIDEr scores normally obtain lower diversity scores, which generally learn to describe images using common words. To bridge this "diversity" gap, we consider several methods for training caption models to generate diverse captions. First, we show that balancing the cross-entropy loss and CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy. Second, we develop approaches that directly optimize our diversity metric and CIDEr score using reinforcement learning. Third, we combine accuracy and diversity into a single measure using an ensemble matrix and then maximize the determinant of the ensemble matrix via reinforcement learning to boost diversity and accuracy, which outperforms its counterparts on the oracle test. Finally, we develop a DPP selection algorithm to select a subset of captions from a large number of candidate captions.
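The exact LSA-based diversity metric is defined in the paper; as a rough stand-in for the intuition, the sketch below scores a caption set by the entropy of the normalized squared singular values of its term matrix, so that near-duplicate captions (low rank) score near zero. The formula and the toy captions are assumptions, not the published metric.

```python
# Illustrative LSA-style diversity score: entropy of the normalised squared
# singular values of the caption term matrix (a stand-in, not the paper's metric).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_diversity(captions):
    X = TfidfVectorizer().fit_transform(captions).toarray()
    s = np.linalg.svd(X, compute_uv=False)
    p = (s ** 2) / np.sum(s ** 2)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(len(captions)))   # roughly in [0, 1]

duplicates = ["a dog runs on the grass"] * 5
varied = ["a dog runs on the grass", "a puppy chases a ball",
          "two dogs play in a park", "a man walks his dog",
          "a dog sleeps on a couch"]
print(lsa_diversity(duplicates), lsa_diversity(varied))
```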

Journal ArticleDOI
TL;DR: In this article, a multi-stage approach consisting of feature engineering within natural language processing, lemmatization, feature selection, feature extraction, improved learning techniques for resampling and cross-validation, and the configuration of hyperparameters is proposed.
Abstract: A phishing attack is a threat based on fraudulent communication, usually by e-mail, where the cybercriminals, impersonating a trusted person or organization, try to lure and coax a target. Phishing detection approaches that obtain highly representational features from the text of these e-mails are a suitable strategy to counter these threats since these features can be used to train machine learning algorithms, thus generating models able to classify mail samples as phishing or legitimate messages. This paper proposes a multi-stage approach to detect phishing e-mail attacks using natural language processing and machine learning. The proposed multi-stage approach consists of feature engineering within natural language processing, lemmatization, feature selection, feature extraction, improved learning techniques for resampling and cross-validation, and the configuration of hyperparameters. We present two methods of the proposed approach, the first one exploiting the Chi-Square statistics and the Mutual Information to improve the dimensionality reduction, while the second method associates Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA). Both methods handle the problems of the “curse of dimensionality”, the sparsity, and the amount of information that must be obtained from the context in the Vector Space Model (VSM) representation. These methods yield reduced feature sets that, combined with the XGBoost and Random Forest machine learning algorithms, lead to an F1-measure of 100% success rate, for validation tests with the SpamAssassin Public Corpus and the Nazario Phishing Corpus datasets. Even considering just the text in e-mail bodies, the proposed multi-stage phishing detection approach outperforms state-of-the-art schemes for an accredited data set, requiring a much smaller number of features and presenting lower computational cost.
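In the spirit of the second method, a text pipeline that reduces TF-IDF features with LSA (TruncatedSVD) before a Random Forest can be sketched as follows; the toy e-mails, labels, component count, and the omission of the PCA branch and hyperparameter tuning are all simplifications.

```python
# TF-IDF -> LSA (TruncatedSVD) -> Random Forest, in the spirit of the approach
# above; data and hyperparameters are placeholders, and the PCA branch is omitted.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

emails = ["verify your account now at this link",
          "meeting notes attached for tomorrow",
          "your password will expire click here immediately",
          "lunch on friday with the project team"]
labels = [1, 0, 1, 0]   # 1 = phishing, 0 = legitimate (toy labels)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("lsa", TruncatedSVD(n_components=2, random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(emails, labels)
print(pipeline.predict(["click here to verify your password"]))
```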

Journal ArticleDOI
TL;DR: This paper proposes and assesses new extractive single-document summarization approaches based on analogical proportions, and suggests two algorithms to quantify the relevance/irrelevance of an extracted keyword from the input text, to build its summary.
Abstract: Automatic text summarization is the process of generating or extracting a brief representation of an input text. There are several algorithms for extractive summarization in the literature tested on English and other-language datasets; however, only a few extractive Arabic summarizers exist due to the lack of large collections in the Arabic language. This paper proposes and assesses new extractive single-document summarization approaches based on analogical proportions, which are statements of the form "a is to b as c is to d". The goal is to study the capability of analogical proportions to represent the relationship between documents and their corresponding summaries. For this purpose, we suggest two algorithms to quantify the relevance/irrelevance of an extracted keyword from the input text in order to build its summary. In the first algorithm, the analogical proportion representing this relationship is limited to checking the existence/non-existence of the keyword in any document or summary in a binary way, without considering keyword frequency in the text, whereas the analogical proportion of the second algorithm considers this frequency. We have assessed and compared these two algorithms with some language-independent summarizers (LexRank, TextRank, Luhn and LSA (Latent Semantic Analysis)) using our large corpus ANT (Arabic News Texts) and a small test collection EASC (Essex Arabic Summaries Corpus), computing the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (BiLingual Evaluation Understudy) metrics. The best-achieved results are ROUGE-1 = 0.96 and BLEU-1 = 0.65, corresponding to educational documents from the EASC collection, which outperform the best LexRank algorithm. The proposed algorithms are also compared with three other Arabic extractive summarizers, using the EASC collection, and show better results in terms of ROUGE-1 = 0.75 and BLEU-1 = 0.47 for the first algorithm, and ROUGE-1 = 0.74 and BLEU-1 = 0.49 for the second one. Experimental results show the interest of analogical proportions for text summarization. In particular, analogical summarizers significantly outperform three of the four language-independent summarizers in terms of BLEU-1 for the ANT collection, and they are not significantly outperformed by any other summarizer on the EASC collection.

Proceedings ArticleDOI
13 May 2020
TL;DR: The algorithm with better divergence is implemented to handle the organizational requirements by presenting the top areas that need improvement or attention, based on the analytics performed on the available discrete data; by applying visualization techniques, the results are also displayed in graphical format.
Abstract: The process of converting unstructured data into a structured, readable format is becoming harder day by day. To this day, every organization holds more than 80% of its operational data in an unreadable format. The proposed method helps in converting unreadable data into a readable, structured format with the help of machine learning, where classification and clustering play a crucial role in converting the operational data into data models and visualizing the processed information for the end-user. As organizations have specific requirements, we implement latent Dirichlet allocation (LDA) and latent semantic analysis (LSA), which are able to handle discrete data. A comparison is also made to test divergence, throughput, quality, and response time, as both methods can classify the data based on content by giving labels to each category. The algorithm with better divergence is implemented to handle the organizational requirements by presenting the top areas that need improvement or attention, based on the analytics performed on the available discrete data; by applying visualization techniques, the results are also displayed in graphical format.

Journal ArticleDOI
TL;DR: Experimental results prove that the proposed latent feature-based transfer learning (TL) strategy has a significant advantage in gear fault diagnosis, especially under varying working conditions.
Abstract: Gears are often operated under various working conditions, which may cause the training and testing data to have different but related distributions when conducting gear fault diagnosis. To address this issue, a latent feature-based transfer learning (TL) strategy is proposed in this paper. First, the bag-of-fault-words (BOFW) model combined with the continuous wavelet transform (CWT) method is developed to extract and represent every fault feature parameter as a histogram. Before identifying the gear fault, the latent feature-based TL strategy is carried out, which adopts joint dual-probabilistic latent semantic analysis (JD-PLSA) to model the shared and domain-specific latent features. After that, a mapping matrix between the two domains is constructed by using Pearson's correlation coefficients (PCCs) to effectively transfer the shared and mapped domain-specific latent knowledge and to reduce the gap between the two domains. Then, a Fisher kernel-based support vector machine (FSVM) is used to identify the gear fault types. To verify the effectiveness of the proposed approach, gear data sets gathered from Spectra Quest's drivetrain dynamics simulator (DDS) are analyzed. Experimental results prove that the proposed approach has a significant advantage in gear fault diagnosis, especially under varying working conditions.

Journal ArticleDOI
TL;DR: The results of this case study suggest that the clustering quality and knowledge map of the domain can be improved by considering the document similarity along with their co-citation strength by incorporating the semantic similarity using latent semantic analysis for the abstracts of the top-cited documents.
Abstract: Document co-citation analysis (DCA) is employed across various academic disciplines and contexts to characterise the structure of knowledge. Since the introduction of the method for DCA by Small (J Am Soc Inf Sci 24(4):265–269, 1973) a variety of modifications towards optimising its results have been proposed by several researchers. We recommend a new approach to improve the results of DCA by integrating the concept of the document similarity measure into it. Our proposed method modifies DCA by incorporating the semantic similarity using latent semantic analysis for the abstracts of the top-cited documents. The interaction of these two measures results in a new measure that we call as the semantic similarity adjusted co-citation index. The effectiveness of the proposed method is evaluated through an empirical study of the tourism supply chain (TSC), where we employ the techniques of the network and cluster analyses. The study also comprehensively explores the resulting knowledge structures from both the methods. The results of our case study suggest that the clustering quality and knowledge map of the domain can be improved by considering the document similarity along with their co-citation strength.
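One way to picture the semantic similarity adjusted co-citation index is to blend a co-citation matrix with the LSA similarity of the documents' abstracts; the element-wise product used below and the toy data are illustrative assumptions, not necessarily the combination rule defined in the paper.

```python
# Blending co-citation counts with LSA similarity of abstracts (illustrative rule).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

abstracts = ["tourism supply chain coordination and demand management",
             "supply chain contracts and coordination in the tourism sector",
             "hotel revenue management and dynamic pricing"]
cocitation = np.array([[0, 8, 1],
                       [8, 0, 2],
                       [1, 2, 0]])   # toy co-citation counts

X = TfidfVectorizer().fit_transform(abstracts)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
semantic = np.clip(cosine_similarity(Z), 0, 1)

adjusted = cocitation * semantic    # semantic-similarity adjusted co-citation strength
print(adjusted.round(2))
```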

Journal ArticleDOI
TL;DR: This study proposes an unsupervised sentiment lexicon learning methodology scalable to new domains of the same genre and achieves a maximum accuracy of 86% and outperforms methods recently presented in the literature.
Abstract: Sentiment lexicon learning is of paramount importance in sentiment analysis. One of the most considerable challenges in learning sentiment lexicons is their domain-specific behavior. Transferring knowledge acquired from a sentiment lexicon in one domain to another is an open research problem. In this study, we attempt to address this challenge by presenting a transfer learning approach that creates new learning insights for multiple domains of the same genre. We propose an unsupervised sentiment lexicon learning methodology scalable to new domains of the same genre. Through incremental learning, the methodology learns polarity seed words from corpora of multiple automatically selected source domains. This process then transfers its genre-level knowledge of corpus-learned seed words to the target domains. The corpus-learned seed words are used for sentiment lexicon generation for multiple target domains of the same genre. The sentiment lexicon learning process is based on the latent semantic analysis technique and uses unlabeled training data from the source and target domains. The experiment was performed using 24 domains of the same genre, i.e., consumer product reviews. The proposed model displays the best results using standard evaluation measures compared with the competitive baselines. The proposed genre-based unsupervised approach achieves a maximum accuracy of 86% and outperforms methods recently presented in the literature.
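A minimal version of the lexicon-induction step can be sketched by scoring candidate words against positive and negative seed words in an LSA space built from unlabeled reviews; the corpus, seed lists, and the difference-of-means scoring rule are toy assumptions, not the paper's transfer pipeline.

```python
# Scoring candidate words against sentiment seed words in a small LSA space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

reviews = ["great battery and excellent screen",
           "terrible battery and awful support",
           "excellent camera but poor battery",
           "awful screen and poor camera"]

vec = CountVectorizer()
X = vec.fit_transform(reviews)
# Word vectors live in the columns, so decompose the transposed term-document matrix.
W = TruncatedSVD(n_components=2, random_state=0).fit_transform(X.T)
W = W / np.linalg.norm(W, axis=1, keepdims=True)
index = {w: i for i, w in enumerate(vec.get_feature_names_out())}

pos_seeds, neg_seeds = ["great", "excellent"], ["terrible", "awful"]

def polarity(word):
    v = W[index[word]]
    pos = np.mean([v @ W[index[s]] for s in pos_seeds])
    neg = np.mean([v @ W[index[s]] for s in neg_seeds])
    return pos - neg   # > 0 leans positive, < 0 leans negative

for w in ["poor", "camera"]:
    print(w, round(polarity(w), 3))
```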

Journal ArticleDOI
TL;DR: A critical review of latent semantic analysis (LSA) is provided to clarify some of the misunderstandings regarding LSA and other space models, and the paper proposes using long LSA experience in other models, especially in predictive models such as word2vec.
Abstract: In recent years, latent semantic analysis (LSA) has reached a level of maturity at which its presence is ubiquitous in technology as well as in the simulation of cognitive processes. In spite of this, there has recently been a trend of subjecting LSA to criticism, usually because it is compared to other models in very specific tasks and conditions, sometimes without good knowledge of what the semantic representation of LSA means, and without exploiting all the possibilities of which LSA is capable beyond the cosine. This paper provides a critical review to clarify some of the misunderstandings regarding LSA and other space models. The review covers the historical stability of the predecessors of LSA, the representational structure of word meaning and the multiple topologies that can arise from a semantic space, the computation of similarity, the myth that LSA dimensions have no meaning, the computational and algorithmic plausibility of LSA to account for meaning acquisition (in contrast to other models based on online mechanisms), the possibilities of spatial models to substantiate recent proposals, and, in general, the characteristics of classic vector models and their ease and flexibility in simulating some cognitive phenomena. The review highlights the similarity between LSA and other techniques and proposes using long LSA experience in other models, especially in predictive models such as word2vec. In sum, it emphasizes the lessons that can be learned from comparing LSA-based models to other models, rather than making statements about "the best."

Proceedings ArticleDOI
14 Dec 2020
TL;DR: In this article, the authors investigate document semantic similarity based on Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) for automatic essay assessment.
Abstract: The demand for scoring natural language responses has created a need for new computational tools that can be applied to automatically grade student essays. Systems for automatic essay assessment have been commercially available since the 1990s. However, progress in the field has been obstructed by a lack of qualitative information regarding the effectiveness of such systems. Most research in automatic essay grading has been associated with English writing due to its widespread use and the availability of more learner collections and language processing software for the language. In addition, there is a large amount of commercial software for grading programming assignments automatically. In this work, we investigate document semantic similarity based on Latent Semantic Analysis (LSA) and on Latent Dirichlet Allocation (LDA). We use an open-source Python library, Gensim, to develop and implement an essay grading system able to compare an essay to an answer key and assign it a grade based on the semantic similarity between the two. We test our tool on variable-size essays and conduct experiments to compare the results obtained from a human grader (professor) with those obtained from the automatic grading system. Results show a high correlation between the professor's grades and the grades assigned by both modeling techniques. However, LSA-based modeling showed more promising results than the LDA-based method.
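The LSA side of such a grader can be outlined with Gensim by projecting the answer key and an essay into an LSI space and taking their cosine similarity as the raw score; the tokenization, topic count, and the mapping from similarity to a grade are simplified assumptions.

```python
# Gensim LSI sketch: cosine similarity between an essay and an answer key.
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity

answer_key = "photosynthesis converts light energy into chemical energy in plants"
essay = "plants use light to make chemical energy through photosynthesis"

docs = [answer_key.lower().split(), essay.lower().split()]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)
index = MatrixSimilarity(lsi[corpus], num_features=2)

similarity = index[lsi[corpus[1]]][0]   # essay vs. answer key
print("semantic similarity:", round(float(similarity), 3))
```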

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes a novel methodology of descriptive text mining, capable of offering accurate explanations in unsupervised settings and of quantifying the results based on their statistical significance, which produces useful explanations about the experiences of patients and caregivers.
Abstract: Though the strong evolution of knowledge learning models has characterized the last few years, the explanation of a phenomenon from text documents, called descriptive text mining, is still a difficult and poorly addressed problem. The need to work with unlabeled data, explainable approaches, unsupervised and domain independent solutions further increases the complexity of this task. Currently, existing techniques only partially solve the problem and have several limitations. In this paper, we propose a novel methodology of descriptive text mining, capable of offering accurate explanations in unsupervised settings and of quantifying the results based on their statistical significance. Considering the strong growth of patient communities on social platforms such as Facebook, we demonstrate the effectiveness of the contribution by taking the short social posts related to Esophageal Achalasia as a typical case study. Specifically, the methodology produces useful explanations about the experiences of patients and caregivers. Starting directly from the unlabeled patient’s posts, we derive correct scientific correlations among symptoms, drugs, treatments, foods and so on.


Journal ArticleDOI
01 Sep 2020
TL;DR: A probabilistic feature patterns (PFP) approach using feature transformation and selection is proposed for efficient data integration, utilizing the features latent semantic analysis (F-LSA) method for indexing unsupervised, multiple heterogeneous integrated clustered data sources.
Abstract: Big Data has received much attention in multi-domain industries. In the digital and computing world, information is generated and collected at a rate that quickly exceeds the boundaries of traditional systems. Traditional data integration systems interconnect a limited number of resources and are built with relatively stable, generally complex and time-consuming design activities. However, the rapid growth of these large data sets creates difficulties in learning heterogeneous data structures for integration and indexing, as well as in information retrieval for the various data analysis requirements. In this paper, a probabilistic feature patterns (PFP) approach using a feature transformation and selection method is proposed for efficient data integration, utilizing the features latent semantic analysis (F-LSA) method for indexing the unsupervised, multiple heterogeneous integrated cluster data sources. The PFP approach takes advantage of the feature transformation and selection mechanism to map and cluster the data for integration, and an analysis of the contextual relations of the data features using LSA provides the appropriate index for fast and accurate data extraction. A huge volume of BibTeX data from different publication sources is processed and evaluated to understand the effectiveness of the proposal. The analytical study and the results show the improvement in integration and indexing achieved by the work.

Journal ArticleDOI
01 Sep 2020
TL;DR: The generalized cross entropy approach has been applied for the first time on web data, adding prior information on the effect of semantic classes on the global sentiment, improving accuracy and adding detail to the analysis.
Abstract: In this paper, data concerning MILANO EXPO2015 is collected from the official Twitter page of the event before and after its opening. In order to extract a semi-supervised ontology and to evaluate the global sentiment around the event, a variety of language processing techniques has been applied to the collected "tweets": Latent Semantic Analysis and sentiment polarity tracking, along with gap analysis, have allowed the semantic evaluation of users' opinions. Moreover, the generalized cross-entropy approach has been applied for the first time to web data, adding prior information on the effect of semantic classes on the global sentiment, improving accuracy and adding detail to the analysis.

Posted Content
TL;DR: Five new APP methods are suggested to enhance the accuracy of APP from the text through ensemble modeling (stacking) based on a hierarchical attention network (HAN) as the meta-model.
Abstract: Human personality is significantly represented by the words a person uses in his/her speech or writing. With the spread of information infrastructures (specifically the Internet and social media), human communication has shifted notably away from face-to-face communication. Generally, Automatic Personality Prediction (or Perception) (APP) is the automated forecasting of personality from different types of human-generated/exchanged content (such as text, speech, images, video, etc.). The major objective of this study is to enhance the accuracy of APP from text. To this end, we suggest five new APP methods: term frequency vector-based, ontology-based, enriched ontology-based, latent semantic analysis (LSA)-based, and deep learning-based (BiLSTM) methods. These base methods contribute to each other to enhance APP accuracy through ensemble modeling (stacking) based on a hierarchical attention network (HAN) as the meta-model. The results show that ensemble modeling enhances the accuracy of APP.

Journal ArticleDOI
TL;DR: Important aspects of the possibilities of using latent semantic analysis were studied for the tasks of identifying scientific subject spaces and revealing how completely the results of science degree seekers' dissertation research are covered by their publications.
Abstract: The study considers the possibilities of using latent semantic analysis for the tasks of identifying scientific subject spaces and evaluating how completely the results of dissertation research are covered by the publications of science degree seekers. A probabilistic topic model was built to cluster the publications of scholars into scientific areas, taking into account the citation network, which was an important step in identifying scientific subject spaces. In constructing the model, the problem of the increasing instability of citation-graph clustering as the number of clusters decreases was solved; this problem arises when combining clusters built from citation-graph clustering with the similarity of the abstracts of scientific publications. The article describes the representation of text documents based on a probabilistic topic model using n-grams. A probabilistic topic model was also built for the task of determining how completely the materials of an author's dissertation research are covered in scientific publications. Approximate values of the threshold coefficients were calculated to evaluate whether an author's articles include the research provisions reflected in the text of the author's dissertation abstract. The probabilistic topic model for an author's publications was implemented using the BigARTM tool. Using the constructed model and a special regularizer, a matrix was found to evaluate the relevance of topics, specified by segments of an author's dissertation abstract, to documents produced from the author's publications. Important aspects of the possibilities of using latent semantic analysis were thus studied for identifying scientific subject spaces and revealing how completely the results of science degree seekers' dissertation research are covered.