
Showing papers on "Probabilistic latent semantic analysis" published in 2021


Journal ArticleDOI
TL;DR: A Context-aware Sparse Check-in Venue Prediction (CSCVP) scheme, inspired by natural language processing techniques, predicts venue category information and exploits similarity between users to address the data-sparsity challenge by significantly reducing the prediction space.
Abstract: The proliferation of online Location-Based Social Networks (LBSNs) has offered unprecedented opportunities for understanding the fine-grained spatio-temporal behaviors of users and developing new location-aware applications. In this article, we focus on the problem of “Sparse User Check-in Venue Prediction,” where the goal is to predict the next venue LBSN users will visit by exploiting their sparse online check-in traces and the latent decision contexts. While efforts have been made to predict users’ check-in traces on an LBSN, several important challenges still exist. First, check-in traces contributed by LBSN users are often too sparse to provide sufficient evidence for a reliable prediction, especially when the prediction space is huge (e.g., hundreds of thousands of venues in large cities). Second, the user's decision context for which venue to visit next is often latent and has not been incorporated by current venue prediction models. Third, the dynamic and non-deterministic dependency between check-ins is either ignored or replaced by a simplified “consecutiveness” assumption in existing solutions, leading to sub-optimal prediction results. In this article, we develop a Context-aware Sparse Check-in Venue Prediction (CSCVP) scheme inspired by natural language processing techniques to address the above challenges. In particular, CSCVP predicts the venue category information and explores the similarity between users to address the data-sparsity challenge by significantly reducing the prediction space. It also leverages the Probabilistic Latent Semantic Analysis (PLSA) model to incorporate the user decision context into the prediction model. Finally, we develop a novel Temporal Adaptive Ngram (TA-Ngram) model in CSCVP to capture the dynamic and non-deterministic dependency between check-ins. We evaluate CSCVP using three real-world LBSN datasets. The results show that our scheme improves the accuracy of state-of-the-art user check-in venue prediction solutions by 30.9 percent.
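Since this abstract leans on PLSA to model the latent decision context, a minimal sketch of PLSA fitted by expectation-maximization may help; this is a generic implementation, not the authors' code, and the toy count matrix below is an assumption.

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """Fit PLSA by EM on a document-term count matrix N (D x W).

    Model: P(w|d) = sum_z P(z|d) P(w|z). Returns P(z|d) and P(w|z).
    """
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) P(w|z)
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]        # shape (D, W, K)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from count-weighted responsibilities
        weighted = N[:, :, None] * post                        # n(d,w) P(z|d,w)
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Toy check: two blocks of co-occurring "venue categories" recovered as topics.
N = np.array([[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 3, 4], [1, 0, 2, 5]])
p_z_d, p_w_z = plsa(N, K=2)
```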

10 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present experiments with multiple topic modeling approaches, namely Latent Semantic Analysis (LSA), Probabilistic LSA (PLSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF), on 0.8 million Urdu tweets.
Abstract: Understanding and analyzing the content available on social media platforms such as Twitter and Facebook through topic modeling is an unsupervised task. Despite several existing conventional techniques, they have had limited success when applied directly to the filtering and quick comprehension of short texts, due to text sparseness and noise. It has thus always been challenging to discover reliable latent topics from online discussion texts, which exhibit low word co-occurrence, even when large social media benchmark datasets are available and even for resource-rich languages. The existing literature lacks such work for Urdu text, even with conventional topic models, mainly due to the lack of benchmark datasets, the limited availability of pre-processing tools and algorithms, and time and compute limitations on large datasets. This work presents experiments with multiple topic modeling approaches, namely Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF), on 0.8 million Urdu tweets. These tweets were collected through the Twitter API using various hashtags as queries, to avoid the dominance of a single topic in the dataset. In addition, we pre-processed the text of the tweets, prepared three variants of the collected dataset, and extracted multiple features to represent documents over different n-grams. All these techniques are compared and evaluated on the dataset variants using both qualitative and quantitative measures. We also present the results through visualization methods, graphs depicting tweet counts per topic, word clouds, and hashtag analysis, giving insight into the algorithms' performance on the final topics. The results reveal that NMF with TF-IDF feature vectors outperformed the other techniques on Urdu tweet text, while LDA performed best when short texts were merged into long pseudo-documents.
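As a rough sketch of how such a comparison can be run: scikit-learn provides LSA (via truncated SVD), LDA, and NMF, but no PLSA, so that method is omitted here; the corpus and topic count are placeholders standing in for the pre-processed tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, NMF

# Placeholder corpus standing in for the pre-processed tweet variants.
docs = [
    "cricket match team win series",
    "cricket team series score",
    "election vote campaign party",
    "election campaign party leader",
    "mobile internet package price",
    "internet package network price",
]
k = 2  # number of topics; the paper tunes this per dataset variant

tv, cv = TfidfVectorizer(), CountVectorizer()  # same tokenizer -> same vocab
tfidf, counts = tv.fit_transform(docs), cv.fit_transform(docs)

models = {
    "LSA": TruncatedSVD(n_components=k).fit(tfidf),               # on TF-IDF
    "NMF": NMF(n_components=k, init="nndsvd", max_iter=500).fit(tfidf),
    "LDA": LatentDirichletAllocation(n_components=k, random_state=0).fit(counts),
}
vocab = tv.get_feature_names_out()
for name, model in models.items():
    for t, comp in enumerate(model.components_):
        print(name, t, [vocab[i] for i in comp.argsort()[-4:][::-1]])
```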

9 citations


Journal ArticleDOI
TL;DR: Online customer feedback is mined for sentiment analysis to enable a more customized shopping experience, resulting in higher retention rates, and to support forecasting the scale of e-commerce transactions.

9 citations


Journal ArticleDOI
TL;DR: In this paper, a mathematical comparison between NMF, PLSA, and LDA for the analysis of mass spectrometry imaging (MSI) data is presented, including the first detailed evaluation of Kullback-Leibler NMF (KL-NMF) for MSI.
Abstract: RATIONALE: Non-negative matrix factorization (NMF) has been used extensively for the analysis of mass spectrometry imaging (MSI) data, visualizing simultaneously the spatial and spectral distributions present in a slice of tissue. The statistical framework offers two related NMF methods: probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), a generative model. This work offers a mathematical comparison between NMF, PLSA, and LDA, and includes a detailed evaluation of Kullback-Leibler NMF (KL-NMF) for MSI for the first time. We inspect the results for MSI data analysis, as these different mathematical approaches impose different characteristics on the data and the resulting decomposition. METHODS: The four methods (NMF, KL-NMF, PLSA, and LDA) are compared on seven samples: three from mouse pancreas and four from human lymph-node tissue, all obtained using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). RESULTS: While matrix factorization methods are often used for the analysis of MSI data, we find that each method has different implications for the exactness and interpretability of the results. We discovered promising results using KL-NMF, which has only rarely been used for MSI so far, improving on both NMF and PLSA, and have shown that the KL-NMF and PLSA algorithms, hitherto stated to be equivalent, do differ in the case of MSI data analysis. LDA, assumed to be the better method in the field of text mining, is shown to be outperformed by PLSA in the setting of MALDI-MSI. Additionally, the molecular results of the human lymph-node data have been thoroughly analyzed for better assessment of the methods under investigation. CONCLUSIONS: We present an in-depth comparison of multiple NMF-related factorization methods for MSI. We aim to provide fellow researchers in the field of MSI a clear understanding of the mathematical implications of each of these analytical techniques, which might affect the exactness and interpretation of the results.
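For reference, the KL-NMF variant evaluated here corresponds to NMF optimized under a Kullback-Leibler divergence. A minimal sketch using scikit-learn's multiplicative-update solver follows; the synthetic matrix is a stand-in for a real MSI pixels-by-m/z matrix, not the paper's data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic stand-in for an MSI data matrix: rows = pixels, columns = m/z bins.
rng = np.random.default_rng(0)
X = rng.poisson(lam=rng.gamma(2.0, 1.0, size=(200, 50)))  # non-negative counts

# Frobenius NMF vs KL-NMF; the KL loss requires the multiplicative-update solver.
nmf_fro = NMF(n_components=5, init="nndsvda", random_state=0).fit(X)
nmf_kl = NMF(n_components=5, init="nndsvda", solver="mu",
             beta_loss="kullback-leibler", max_iter=500, random_state=0).fit(X)

W_kl = nmf_kl.transform(X)   # per-pixel loadings (spatial distributions)
H_kl = nmf_kl.components_    # per-component spectra
# Each error is measured in its own divergence, so the values are not
# directly comparable across the two models.
print("Frobenius err:", nmf_fro.reconstruction_err_)
print("KL err:", nmf_kl.reconstruction_err_)
```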

6 citations


Journal ArticleDOI
TL;DR: This work applies two probabilistic graphical models, Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA), to generate latent topic terms as candidate aspects, improving the performance of machine learning classification algorithms in aspect-based sentiment analysis (ABSA).
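A minimal sketch of the underlying idea, using scikit-learn's LDA to surface topic terms as candidate aspects; the review snippets and parameters are placeholders, and the paper's actual ABSA pipeline is not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder review corpus; aspects such as "battery" or "delivery" should
# surface as high-probability topic terms.
reviews = ["battery life is great but the screen is dim",
           "screen quality excellent, battery drains fast",
           "delivery was slow, packaging damaged",
           "fast delivery and careful packaging"]

cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

vocab = cv.get_feature_names_out()
aspects = [[vocab[i] for i in topic.argsort()[-3:][::-1]]
           for topic in lda.components_]
print(aspects)  # top topic terms, treated as candidate aspects for ABSA
```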

6 citations


DOI
01 Oct 2021
TL;DR: In this paper, the authors apply parsing techniques to various websites to extract HTML and XML data, including the textual content, and apply preprocessing techniques to clean the data.
Abstract: Text classification and topic modelling are the backbone of text analysis over large corpora. With the increase in unstructured data around us, it is difficult to analyse such data easily, so methods are needed that can extract salient and semantic information from a corpus. Text classification is the organised categorisation of text for the interpretation of salient information, while topic modelling finds abstract topics for a collection of texts or documents and is frequently used to extract semantic information from textual data. In this paper we apply parsing techniques to various websites to extract HTML and XML data, including the textual content, and apply preprocessing techniques to clean it. For text classification, the machine learning classifiers used in our experiments are Naive Bayes and Logistic Regression. Document models are built using three topic modelling methods: Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. We then analyse and compare the performance of the models and classifiers on the processed textual data.
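A hedged sketch of the described scrape-then-classify pipeline; the paper does not name its tools, so requests, BeautifulSoup, and scikit-learn below are assumptions, and the labelled corpus is a placeholder.

```python
import requests
from bs4 import BeautifulSoup
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def page_text(url):
    """Fetch a page and keep only its textual content."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-textual markup
    return " ".join(soup.get_text(separator=" ").split())

# Placeholder labelled corpus; in the paper the texts come from scraped sites.
texts = ["football match final score goal", "cpu gpu benchmark memory speed"]
labels = ["sports", "tech"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())  # Naive Bayes baseline
clf.fit(texts, labels)
print(clf.predict(["new gpu released with faster memory"]))
```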

3 citations


Journal ArticleDOI
TL;DR: When the latent variable is ordinal and the manifest variables are nominal, an approach is given for handling the restrictions in latent class analysis of measurement-error models using log-linear models; this reduces overall uncertainty and makes inferences more precise.
Abstract: This article deals with the latent class analysis of models with error of measurement. When the latent variable is ordinal and the manifest variables are nominal, an approach is given for handling the restrictions in latent class analysis of measurement-error models using log-linear models. In this way, the ordinal nature of the latent variable is included in the analysis; overall uncertainty is therefore decreased, and inferences become more precise. The new approach is applied to a women's liberation data set. Latent class analysis is frequently used in the social sciences and education. The main aim of the analysis is to explain the association structure between manifest variables using unobserved variables, namely latent variables; latent class analysis is the categorical analogue of factor analysis when both the latent and manifest variables are categorical. Log-linear models are widely used for the analysis of contingency tables, and a latent class model can be represented as a log-linear model using conditional response probabilities. This representation is called the log-linear parametrization and is a special case of Formann's linear logistic latent class analysis (Formann, 1992). Error-of-measurement models are probabilistic versions of the Guttman scale (Guttman, 1950) and are considered restricted latent class models. In the log-linear parametrization of latent class models, the types of the manifest and latent variables are important, because latent class models specialize according to the typology of the variables. For example, if the latent variable is metrical and the manifest variables are nominal, the appropriate analysis is latent class analysis with linear restrictions or nominal response models (Heinen, 1996). In our concerned models with error of measurement, manifest variables
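For context, the unrestricted latent class model underlying this discussion has the standard textbook form (not quoted from the article itself):

```latex
% J nominal manifest variables Y_1,...,Y_J are conditionally independent
% given a latent class variable X with T classes:
P(Y_1 = y_1, \dots, Y_J = y_J)
  = \sum_{t=1}^{T} P(X = t) \prod_{j=1}^{J} P(Y_j = y_j \mid X = t)
```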

3 citations


Journal ArticleDOI
01 Mar 2021
TL;DR: The experimental results show that the sentiment analysis method integrating PLSA and K-means obtains higher classification accuracy than the PLSA model alone.
Abstract: To address the shortage of research on fine-grained sentiment classification of micro-blog short texts, a fine-grained sentiment classification method for micro-blog short texts based on the PLSA model and K-means clustering is proposed. PLSA is used to calculate the probability matrices between documents and topics, and between words and topics, in the corpus. Using the word-topic probability distributions, the K-means algorithm clusters the distributions of words over topics and merges similar topics. Based on a sentiment ontology library, emotion recognition is carried out on the merged topics. Then, according to the merged document-topic probability matrix, each document is assigned a sentiment category. The experimental results show that the sentiment analysis method integrating PLSA and K-means obtains higher classification accuracy than the PLSA model alone.
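A minimal sketch of the topic-merging step; the word-topic matrix here is a random placeholder, though in practice it would come from a PLSA fit such as the plsa() sketch earlier in this listing, and the cluster count is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# p_w_z: (K, W) word distributions per topic, e.g. from a PLSA fit.
p_w_z = np.random.default_rng(1).dirichlet(np.ones(200), size=12)  # placeholder

merge = KMeans(n_clusters=5, random_state=0).fit(p_w_z)  # group similar topics
merged = np.vstack([p_w_z[merge.labels_ == c].mean(0)    # average each group
                    for c in range(5)])
merged /= merged.sum(1, keepdims=True)                   # renormalize P(w|z')
```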

3 citations


Proceedings ArticleDOI
24 Mar 2021
TL;DR: Convolutional neural networks (CNNs) are employed together with probabilistic latent semantic analysis (PLSA) to mine the hidden semantics of images; the resulting representation is fed into a discriminative support vector machine (SVM) to build a classification model.
Abstract: An efficient medical image classification system has gained high interest in the scientific community. This paper presents a classification algorithm that aims at a high accuracy rate by addressing some of the typical challenges involved in classifying large medical datasets. Convolutional neural networks (CNNs) are employed together with probabilistic latent semantic analysis (PLSA), which is capable of mining the hidden semantics of images. This high-level semantic representation of the images is then fed into a discriminative support vector machine (SVM) to build a classification model. An ensemble of machine learning models is also employed to exploit classification models created from different sets of data. The evaluation is based on a medical image dataset consisting of 11,000 X-ray images from 116 distinct categories. The classification accuracy rate obtained by the proposed classification model is 94.5%. The results show that the proposed classification model outperformed the methods in the literature evaluated on the same benchmark dataset.
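A hedged sketch of the CNN-features-into-SVM stage: the paper's CNN architecture is not specified, so a pretrained ResNet-18 from torchvision is an assumption, the PLSA step on the semantic representation is omitted, and the image batch and labels are random stand-ins.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Pretrained CNN as a feature extractor (ResNet-18 is an assumption).
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()  # drop the classifier head, keep 512-d features
cnn.eval()

def features(batch):
    """batch: (N, 3, 224, 224) float tensor -> (N, 512) numpy features."""
    with torch.no_grad():
        return cnn(batch).numpy()

X_train = features(torch.randn(32, 3, 224, 224))  # stand-in for X-ray images
y_train = np.random.randint(0, 4, size=32)        # stand-in category labels
svm = SVC(kernel="rbf").fit(X_train, y_train)     # discriminative classifier
```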

1 citation


Posted Content
TL;DR: Wang et al. apply the methods of bow-tie structure and Hodge decomposition to locate users in the upstream, downstream, and core of the entire crypto flow.
Abstract: How crypto flows among Bitcoin users is an important question for understanding the structure and dynamics of the cryptoasset at a global scale. We compiled all the blockchain data of Bitcoin from its genesis to the year 2020, identified users from anonymous addresses of wallets, and constructed monthly snapshots of networks by focusing on regular users as big players. We apply the methods of bow-tie structure and Hodge decomposition in order to locate the users in the upstream, downstream, and core of the entire crypto flow. Additionally, we reveal principal components hidden in the flow by using non-negative matrix factorization, which we interpret as a probabilistic model. We show that the model is equivalent to a probabilistic latent semantic analysis in natural language processing, enabling us to estimate the number of such hidden components. Moreover, we find that the bow-tie structure and the principal components are quite stable among those big players. This study can be a solid basis on which one can further investigate the temporal change of crypto flow, entry and exit of big players, and so forth.
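A minimal sketch of the NMF-as-PLSA reading described here: factorize a flow matrix and renormalize the factors into probabilities. The flow matrix is synthetic, the component count is arbitrary, and the KL loss is chosen because it matches the probabilistic correspondence.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic monthly user-to-user flow matrix; entries stand in for BTC amounts.
rng = np.random.default_rng(0)
F = rng.gamma(1.0, 1.0, size=(100, 100)) * (rng.random((100, 100)) < 0.05)

nmf = NMF(n_components=4, solver="mu", beta_loss="kullback-leibler",
          max_iter=500, random_state=0)
W = nmf.fit_transform(F)  # senders x components
H = nmf.components_       # components x receivers

# Normalize WH into the PLSA form P(s, r) = sum_z P(z) P(s|z) P(r|z).
p_s_z = W / (W.sum(0, keepdims=True) + 1e-12)  # P(sender | component)
p_r_z = H / (H.sum(1, keepdims=True) + 1e-12)  # P(receiver | component)
p_z = W.sum(0) * H.sum(1)
p_z /= p_z.sum()                               # component weights P(z)
```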

1 citation


Posted Content
Gangli Liu
TL;DR: In this article, an extension called Semantic Center of Mass (SCOM) is proposed, and used to discover the abstract "topic" of a document, under a framework model called Understanding Map Supervised Topic Model (UM-S-TM).
Abstract: Inspired by the notion of the center of mass in physics, an extension called the Semantic Center of Mass (SCOM) is proposed and used to discover the abstract "topic" of a document. The notion sits within a framework model called the Understanding Map Supervised Topic Model (UM-S-TM). The design aim of UM-S-TM is to let both the document content and a semantic network (specifically, an Understanding Map) play a role in interpreting the meaning of a document. Based on different justifications, three possible methods are devised to discover the SCOM of a document. Experiments on artificial documents and Understanding Maps are conducted to test their outcomes. In addition, its ability to vectorize documents and to capture sequential information is tested. We also compare UM-S-TM with probabilistic topic models such as Latent Dirichlet Allocation (LDA) and probabilistic Latent Semantic Analysis (pLSA).
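The article's three SCOM constructions are not detailed in the abstract; purely as an illustration of the physics analogy, here is a hypothetical center-of-mass computation over concept vectors, with term frequencies as "masses" (all names and numbers below are invented).

```python
import numpy as np

# Illustrative only: concept nodes get positions, document term counts act as
# masses, and the nearest concept to the center of mass is read as the "topic".
concept_vecs = {"physics": np.array([1.0, 0.0]),
                "mass":    np.array([0.9, 0.3]),
                "poetry":  np.array([0.0, 1.0])}
doc_term_freq = {"physics": 5, "mass": 3, "poetry": 1}

total = sum(doc_term_freq.values())
scom = sum(doc_term_freq[w] * concept_vecs[w] for w in doc_term_freq) / total

topic = min(concept_vecs, key=lambda w: np.linalg.norm(concept_vecs[w] - scom))
print(scom, topic)
```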

Journal ArticleDOI
TL;DR: Big data analysis and semantic model analysis methods are adopted to construct semantic analysis models through PLSA calculations; the results show that the accuracy and applicability of the semantic analysis model increase and that accuracy on the data set improves.
Abstract: Due to the joint progress and interdependence of wireless sensor networks and language, Chinese semantic analysis under wireless sensor networks has become increasingly important. Although there are many research results on wireless networks and on Chinese semantics, there is little research on the influence of and relationship between the two. Wireless sensor networks are strongly application-driven, and the key technologies to be solved differ across application backgrounds. In order to reveal the basic laws and development trends of online Chinese semantic expression in the context of wireless sensor networks, this paper adopts big data analysis methods and semantic model analysis methods, constructs semantic analysis models through PLSA calculations, and studies the accuracy and applicability of the resulting models. Through word extraction over 1.05 million words of data from 1,103 documents on the Baidu Tieba, HowNet, and CiteULike websites, the material was integrated into a data set, and the PLSA model was validated on it. In addition, by constructing the wireless sensor network, semantic analysis results for Chinese behavioral expression are obtained. The results show that accuracy on the data set extracted from the 1,103 documents increases with the number of documents, and that applying the PLSA model for semantic analysis further improves accuracy. Compared with traditional semantic analysis, the model and the big data analysis framework have clear advantages. With the continuous development of Internet big data, the big data methods used to analyse Chinese semantics are constantly updated and their efficiency keeps improving. These updated semantic analysis models and statistical methods steadily reduce the uncertainty of modern online Chinese. The basic laws and development trends of statistical Chinese semantics also provide new application scenarios for online Chinese behavior and lay groundwork for subsequent scholars.


Proceedings ArticleDOI
09 Apr 2021
TL;DR: A Chinese FastText text classification method combining Term Frequency-Relevance Frequency (TF-RF) and an improved random walk model is proposed. The method applies TF-RF weighting to the N-gram-processed dictionaries at the input stage of the FastText model, performs semantic analysis with Probabilistic Latent Semantic Analysis (PLSA) to supplement the feature words, and then uses the improved random walk model to raise accuracy, making the model better suited to Chinese text classification.
Abstract: FastText is a text classification model from Facebook. As the model is simple in structure, it has the advantage of being fast and efficient. However, when the model is used for Chinese text classification, its accuracy decreases. To this end, a Chinese FastText text classification method combining Term Frequency-Relevance Frequency (TF-RF) and an improved random walk model is proposed in this paper. The method applies TF-RF weighting to the N-gram-processed dictionaries during the input stage of the FastText model, performs semantic analysis using Probabilistic Latent Semantic Analysis (PLSA), and supplements the feature words; it then utilizes the improved random walk model to raise accuracy, making the improved model better suited to Chinese text classification. Experimental results show that the improved model performs better on Chinese text classification.
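As a sketch of the TF-RF weighting used at the input stage, following the standard Lan et al. (2009) definition that this paper appears to build on; the toy tokens and labels are placeholders.

```python
import math
from collections import Counter

def tf_rf_weights(docs, labels, pos_label):
    """Supervised TF-RF term weighting: rf favours terms concentrated in the
    positive class. docs are token lists; returns per-document weight dicts."""
    a, c = Counter(), Counter()  # document frequencies per class
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            (a if y == pos_label else c)[t] += 1
    vocab = set(a) | set(c)
    rf = {t: math.log2(2 + a[t] / max(1, c[t])) for t in vocab}
    # Per-document weight: raw term frequency times the term's rf factor.
    return [{t: n * rf[t] for t, n in Counter(tokens).items()}
            for tokens in docs]

docs = [["好", "产品", "推荐"], ["差", "退货"], ["好", "满意"]]  # toy tokens
weights = tf_rf_weights(docs, labels=[1, 0, 1], pos_label=1)
print(weights)
```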