
Showing papers on "Probabilistic latent semantic analysis published in 2020"


Posted Content
TL;DR: This model does not require stop-word lists, stemming, or lemmatization; it automatically finds the number of topics, and the resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity.
Abstract: Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. In order to achieve optimal results, they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on bag-of-words representations of documents, which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present $\texttt{top2vec}$, which leverages joint document and word semantic embedding to find $\textit{topic vectors}$. This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that $\texttt{top2vec}$ finds topics which are significantly more informative and representative of the corpus they were trained on than those found by probabilistic generative models.
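As a rough illustration of this workflow, the sketch below uses the open-source top2vec package that accompanies the paper; the constructor and accessors follow its documented API, but the corpus loader is a hypothetical placeholder, and a real corpus must be fairly large for the underlying density-based clustering to find topics.

    from top2vec import Top2Vec

    # Raw documents: no stop-word lists, stemming, or lemmatization needed.
    docs = load_corpus()  # hypothetical loader returning a list of strings

    model = Top2Vec(documents=docs, speed="learn", workers=4)

    # The number of topics is discovered automatically rather than given up front.
    print(model.get_num_topics())

    # Topic vectors share the embedding space with document and word vectors,
    # so the nearest words to a topic vector describe that topic.
    topic_words, word_scores, topic_nums = model.get_topics()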

130 citations


Journal ArticleDOI
TL;DR: This survey conducts a comprehensive review of various short text topic modeling techniques proposed in the literature, and presents three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks.
Abstract: Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long-text topic modeling algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this problem very well, since only very limited word co-occurrence information is available in short texts. Therefore, short text topic modeling has already attracted much attention from the machine learning research community in recent years, aiming to overcome the problem of sparseness in short texts. In this survey, we conduct a comprehensive review of various short text topic modeling techniques proposed in the literature. We present three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks. We develop the first comprehensive open-source Java library, called STTM, which integrates all surveyed algorithms within a unified interface and provides benchmark datasets, to facilitate the expansion of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and versus long-text topic modeling algorithms.
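To make the sparseness problem concrete, the toy sketch below (an illustration, not code from the survey or the STTM library) counts the unique word co-occurrence pairs a document contributes; short texts supply far fewer pairs for word-co-occurrence-based models like PLSA and LDA to learn from.

    from itertools import combinations

    def cooccurrence_pairs(tokens):
        """Unique unordered word pairs observed within one document."""
        return set(combinations(sorted(set(tokens)), 2))

    short_doc = "screen cracked after update".split()
    long_doc = ("the screen cracked right after the latest update and the "
                "support team was unhelpful so i want a full refund").split()

    print(len(cooccurrence_pairs(short_doc)))  # 6 pairs
    print(len(cooccurrence_pairs(long_doc)))   # over a hundred pairs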

101 citations


Journal ArticleDOI
TL;DR: A Fisher kernel function based on Probabilistic Latent Semantic Analysis is proposed in this paper for sentiment analysis with a Support Vector Machine, and the results show that the method clearly improves on the baseline.
Abstract: In mainstream sentiment analysis methods, represented by the Support Vector Machine, the vocabulary and latent semantic information in the text are not well considered, and sentiment analysis depends overly on the statistics of sentiment words. Thus, a Fisher kernel function based on Probabilistic Latent Semantic Analysis is proposed in this paper for sentiment analysis with a Support Vector Machine. The kernel function is derived from the Probabilistic Latent Semantic Analysis model. By means of this method, latent semantic information involving probability characteristics can be used as classification features, improving the classification performance of the Support Vector Machine and addressing the problem of ignoring latent semantic features in text sentiment analysis. The results show that the proposed method clearly improves on the baseline method.
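For context, the classical Fisher kernel derived from the pLSA model (the general form from Hofmann's information-geometric treatment of document similarity; the exact variant used in this paper may differ) compares two documents $d_1$ and $d_2$ as

    K(d_1, d_2) = \sum_z \frac{P(z \mid d_1)\, P(z \mid d_2)}{P(z)}
                + \sum_w \hat{P}(w \mid d_1)\, \hat{P}(w \mid d_2)
                  \sum_z \frac{P(z \mid d_1, w)\, P(z \mid d_2, w)}{P(w \mid z)},

where $\hat{P}(w \mid d)$ denotes empirical word frequencies and the remaining terms come from the fitted pLSA model; this kernel is then plugged into a standard SVM.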

38 citations


Journal ArticleDOI
TL;DR: The main objective of this research paper is to design a system that generates a multimodal, nonparametric Bayesian model and a multilayered probabilistic latent semantic analysis (pLSA)-based visual dictionary (BM-MpLSA).
Abstract: The main objective of this research paper is to design a system that generates a multimodal, nonparametric Bayesian model and a multilayered probabilistic latent semantic analysis (pLSA)-based visual dictionary (BM-MpLSA). Advancement in technology and the exuberance of sports lovers have created a need for automatic action recognition in the live video feed of sports. The fundamental requirement for such a model is the creation of a visual dictionary for each sports domain. This multimodal nonparametric model involves the creation of two novel co-occurrence matrices: one for image feature vectors and the other for textual entities. This matrix provides a basic scaling parameter for the unobserved random variables, and it is an extension of multilayered pLSA-based visual dictionary creation. This paper concentrates specifically on the creation of a visual dictionary for basketball. From the sports event images, the extracted feature vectors comprise SIFT and MPEG-7-based dominant color, color layout, scalable color, and edge histogram descriptors. After quantization and analysis of these vector values, the visual vocabulary is created by integrating them into a domain-specific visual ontology for semantic understanding. The accuracy of this work is evaluated with respect to the actions depicted in the images.
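The generic bag-of-visual-words construction underlying such a dictionary can be sketched as below; this is the standard SIFT-plus-k-means pipeline, not the paper's BM-MpLSA model, and the image list and vocabulary size are assumptions.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:  # hypothetical list of basketball image files
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc)

    # Quantize the local descriptors into a vocabulary of 500 visual words;
    # each image then becomes a histogram over visual words, the input that
    # pLSA-style models factorize into latent topics.
    vocab = KMeans(n_clusters=500, n_init="auto").fit(np.vstack(descriptors))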

36 citations


Proceedings ArticleDOI
20 Apr 2020
TL;DR: A new method to overcome the overfitting issue of pLSI is provided by using amortized inference with word embeddings as input, instead of the Dirichlet prior in LDA.
Abstract: Existing topic modeling approaches possess several issues, including the overfitting issue of Probabilistic Latent Semantic Indexing (pLSI), the failure to capture the rich correlations among topics in Latent Dirichlet Allocation (LDA), and high inference complexity. In this paper, we provide a new method to overcome the overfitting issue of pLSI by using amortized inference with word embeddings as input, instead of the Dirichlet prior in LDA. For generative topic models, the large number of free latent variables is the root of overfitting. To reduce the number of parameters, amortized inference replaces the inference of latent variables with a function which possesses the shared (amortized) learnable parameters. The number of shared parameters is fixed and independent of the scale of the corpus. To overcome the limited applicability of amortized inference to independent and identically distributed (i.i.d.) data, a novel graph neural network, the Graph Attention TOpic Network (GATON), is proposed to model the topic structure of non-i.i.d. documents, based on the following two observations. First, pLSI can be interpreted as a stochastic block model (SBM) on a specific bipartite graph. Second, a graph attention network (GAT) can be explained as semi-amortized inference of the SBM, which relaxes the i.i.d. data assumption of vanilla amortized inference. GATON provides a novel scheme, i.e., a graph-convolution-based scheme, to integrate word similarity and the word co-occurrence structure. Specifically, the bag-of-words document representation is modeled as a bipartite graph topology. Meanwhile, word embedding, which captures word similarity, is modeled as the attribute of the word nodes, and the term frequency vector is adopted as the attribute of the document nodes. Based on the weighted (attention) graph convolution operation, the word co-occurrence structure and word similarity patterns are seamlessly integrated for topic identification. Extensive experiments demonstrate that the effectiveness of GATON on topic identification not only benefits document classification but also significantly refines the input word embeddings.

26 citations


Journal ArticleDOI
TL;DR: This paper proposes to build powerful semantic features using the probabilistic latent semantic analysis (pLSA) model, by employing pre-trained deep convolutional neural networks (CNNs) as feature extractors rather than relying on hand-crafted features.
Abstract: Scene classification is one of the most fundamental tasks in the interpretation of high-resolution remote sensing (HRRS) images. Many recent works show that probabilistic topic models, which are capable of mining the latent semantics of images, can be effectively applied to HRRS scene classification. However, the existing approaches based on topic models simply utilize low-level hand-crafted features to form semantic features, which severely limits the representative capability of the semantic features derived from topic models. To alleviate this problem, this paper proposes to build powerful semantic features using the probabilistic latent semantic analysis (pLSA) model, by employing pre-trained deep convolutional neural networks (CNNs) as feature extractors rather than relying on hand-crafted features. Specifically, we develop two methods to generate semantic features, called multi-scale deep semantic representation (MSDS) and multi-level deep semantic representation (MLDS), by extracting CNN features from different layers: (1) in MSDS, the final semantic features are learned by the pLSA with multi-scale features extracted from the convolutional layer of a pre-trained CNN; (2) in MLDS, we extract CNN features for densely sampled image patches at different size levels from the fully-connected layer of a pre-trained CNN, and concatenate the semantic features learned by the pLSA at each level. We comprehensively evaluate the two methods on two public HRRS scene datasets and achieve significant performance improvement over the state-of-the-art. The outstanding results demonstrate that the pLSA model is capable of discovering considerably discriminative semantic features from deep CNN features.
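A minimal sketch of the deep-feature-extraction step (the toy input tensor is an assumption, and VGG16 stands in for whichever pre-trained CNN is used): every spatial position of a convolutional feature map yields one local descriptor, which MSDS would quantize into visual words and feed to pLSA, e.g. via the NMF-based stand-in sketched later in this list.

    import torch
    import torchvision.models as models

    cnn = models.vgg16(weights="IMAGENET1K_V1").features.eval()

    images = torch.rand(8, 3, 224, 224)   # stand-in for HRRS scene patches
    with torch.no_grad():
        fmap = cnn(images)                # (8, 512, 7, 7) conv activations

    # Each of the 7x7 spatial positions contributes one 512-d local descriptor.
    local_feats = fmap.flatten(2).transpose(1, 2).reshape(-1, 512)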

23 citations


Journal ArticleDOI
TL;DR: Experimental results prove that the proposed latent feature-based transfer learning (TL) strategy has a significant advantage in gear fault diagnosis, especially under varying working conditions.
Abstract: Gears are often operated under various working conditions, which may cause the training and testing data to have different but related distributions when conducting gear fault diagnosis. To address this issue, a latent feature-based transfer learning (TL) strategy is proposed in this paper. First, the bag-of-fault-words (BOFW) model, combined with the continuous wavelet transform (CWT) method, is developed to extract and represent every fault feature parameter as a histogram. Before identifying the gear fault, the latent feature-based TL strategy is carried out, which adopts the joint dual-probabilistic latent semantic analysis (JD-PLSA) to model the shared and domain-specific latent features. After that, a mapping matrix between the two domains can be constructed by using Pearson's correlation coefficients (PCCs) to effectively transfer shared and mapped domain-specific latent knowledge and to reduce the gap between the two domains. Then, a Fisher kernel-based support vector machine (FSVM) is used to identify the gear fault types. To verify the effectiveness of the proposed approach, gear datasets gathered from Spectra Quest's drivetrain dynamics simulator (DDS) are analyzed. Experimental results prove that the proposed approach has a significant advantage in gear fault diagnosis, especially under varying working conditions.

17 citations


Journal ArticleDOI
Haiyu Song1, Pengjie Wang1, Jian Yun1, Wei Li1, Bo Xue1, Gang Wu1 
TL;DR: A novel annotation method based on a topic model, namely local learning-based probabilistic latent semantic analysis (LL-PLSA), that significantly outperforms the state-of-the-art, especially in terms of overall metrics.
Abstract: Automatic image annotation plays a significant role in image understanding, retrieval, classification, and indexing. Today, it is becoming increasingly important to annotate large-scale social media images from content-sharing websites and social networks. These social images are usually annotated by user-provided low-quality tags. The topic model is considered a promising method to describe these weakly labeled images by learning latent representations of training samples. Recent annotation methods based on topic models have two shortcomings. First, they are difficult to scale to large image datasets. Second, they cannot be applied to online image repositories, where new images and new tags are continuously added. In this paper, we propose a novel annotation method based on a topic model, namely local learning-based probabilistic latent semantic analysis (LL-PLSA), to solve the above problems. The key idea is to train a weighted topic model for a given test image on its semantic neighborhood, consisting of a fixed number of semantically and visually similar images. This method can scale to a large image database, as the training samples involved in modeling are a few nearest neighbors rather than the entire database. Moreover, the proposed topic model, customized online for the test image, naturally addresses the issue of continuous addition of new images and new tags to a database. Extensive experiments on three benchmark datasets demonstrate that the proposed method significantly outperforms the state-of-the-art, especially in terms of overall metrics.

15 citations


Proceedings ArticleDOI
01 Sep 2020
TL;DR: Adaptive Online Biterm Topic Model (AOBTM) is proposed to model topics in short texts adaptively; it alleviates the sparsity problem in short texts and considers the statistical data of an optimal number of previous time-slices.
Abstract: Analysis of mobile app reviews has shown its important role in requirement engineering, software maintenance, and the evolution of mobile apps. Mobile app developers check their users' reviews frequently to clarify the issues experienced by users or to capture new issues introduced by a recent app update. App reviews have a dynamic nature, and their discussed topics change over time. The changes in topics among the reviews collected for different versions of an app can reveal important issues about the app update. A main technique in this analysis is the use of topic modeling algorithms. However, app reviews are short texts, and it is challenging to unveil their latent topics over time. Conventional topic models such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) suffer from the sparsity of word co-occurrence patterns while inferring topics for short texts. Furthermore, these algorithms cannot capture topics over numerous consecutive time-slices (or versions). Online topic modeling algorithms such as Online LDA (OLDA) and Online Biterm Topic Model (OBTM) speed up the inference of topic models for the texts collected in the latest time-slice by saving a fraction of data from the previous time-slice. But these algorithms do not analyze the statistical data of all the previous time-slices, which can contribute to the topic distribution of the current time-slice. In this paper, we propose the Adaptive Online Biterm Topic Model (AOBTM) to model topics in short texts adaptively. AOBTM alleviates the sparsity problem in short texts and considers the statistical data of an optimal number of previous time-slices. We also propose parallel algorithms to automatically determine the optimal number of topics and the best number of previous versions to consider in the topic inference phase. Automatic evaluation on collections of app reviews and real-world short text datasets confirms that AOBTM can find more coherent topics and outperforms state-of-the-art baselines. For reproducibility of the results, we open-source all scripts.

14 citations


Journal ArticleDOI
TL;DR: A two-stage hybrid probabilistic topic model is proposed to improve the quality of automatic image annotation and achieves not only superior annotation accuracy but also better retrieval performance.
Abstract: Refining image annotation has become one of the core research topics in computer vision and pattern recognition due to its great potential in image retrieval. However, the field is still in its infancy and is not sophisticated enough to extract perfect semantic concepts from image low-level features alone. In this paper, we propose a two-stage hybrid probabilistic topic model to improve the quality of automatic image annotation. To start with, a probabilistic latent semantic analysis model with asymmetric modalities is learned to estimate the posterior probability of each annotation keyword, during which the image-to-word relation can be well established. Next, a label similarity graph is constructed by a weighted linear combination of label similarity and the visual similarity of the images associated with the corresponding labels. In this way, information from image low-level visual features and high-level semantic concepts can be seamlessly integrated by fully taking into account the word-to-word and image-to-image relations. Finally, rank-two relaxation heuristics are exploited to further mine the correlation of the candidate annotations so as to capture the refined results, which plays a critical role in semantic-based image retrieval. Extensive experiments show that the proposed model achieves not only superior annotation accuracy but also better retrieval performance.

13 citations


Journal ArticleDOI
TL;DR: A novel Bi-clustering based Memetic Algorithm for Recommender Systems (Bi-MARS), based on the collaborative behavior of memes, generates recommendations by finding the closest similarity vector of the target user, which contributes to the computational accuracy of the algorithm and thus the relevance of the recommended items.

Journal ArticleDOI
TL;DR: A multi-modal aggregated posterior aligning neural network based on Wasserstein Auto-encoders (WAE) which learns a shared latent space for visual features and semantic attributes and provides a reliable way to synthesize latent features for training classification models.
Abstract: The visual-semantic gap between the visual space (visual features) and semantic space (semantic attributes) is one of the main problems in the Generalized Zero-Shot Learning (GZSL) task. The essence of this problem is that the structure of the manifolds in these two spaces is inconsistent, which makes it difficult to learn embeddings that unify visual features and semantic attributes for similarity measurement. In this work, we tackle this problem by proposing a multi-modal aggregated posterior aligning neural network based on Wasserstein Auto-encoders (WAE), which learns a shared latent space for visual features and semantic attributes. The key to our approach is that the aggregated posterior distribution of the latent representations encoded from the visual features of each class is encouraged to align with a Gaussian distribution predicted by the corresponding semantic attribute in the latent space. On the one hand, requiring the latent manifolds of visual features and semantic attributes to be consistent preserves the inter-class association between seen and unseen classes. On the other hand, the aggregated posterior of each class is directly defined as a Gaussian in the latent space, which provides a reliable way to synthesize latent features for training classification models. Using the AWA1, AWA2, CUB, aPY, FLO, and SUN benchmark datasets, we conducted extensive comparative evaluations to demonstrate the advantages of our method over state-of-the-art approaches.

Journal ArticleDOI
30 Mar 2020-Entropy
TL;DR: This paper proposes a novel approach for analyzing the influence of different regularization types on results of topic modeling and concludes that regularization may introduce unpredictable distortions into topic models that need further research.
Abstract: Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA)—we first show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect such an optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy from the topic number optimum, an effect that is not observed for hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models that need further research.

Posted Content
TL;DR: The PCA-AE is a novel autoencoder whose latent space verifies two properties: Firstly, the dimensions are organised in decreasing importance with respect to the data at hand, and secondly, the components of the latent space are statistically independent.
Abstract: Autoencoders and generative models produce some of the most spectacular deep learning results to date. However, understanding and controlling the latent space of these models presents a considerable challenge. Drawing inspiration from principal component analysis and autoencoders, we propose the Principal Component Analysis Autoencoder (PCAAE). This is a novel autoencoder whose latent space verifies two properties. Firstly, the dimensions are organised in decreasing importance with respect to the data at hand. Secondly, the components of the latent space are statistically independent. We achieve this by progressively increasing the latent space during training, and with a covariance loss applied to the latent codes. The resulting autoencoder produces a latent space which separates the intrinsic attributes of the data into different components, in a completely unsupervised manner. We also describe an extension of our approach to the case of powerful, pre-trained GANs. We show results on both synthetic examples of shapes and on a state-of-the-art GAN. For example, we are able to separate the color shade scale of hair and skin, the pose of faces, and the gender in the CelebA dataset, without accessing any labels. We compare the PCAAE with other state-of-the-art approaches, in particular with respect to the ability to disentangle attributes in the latent space. We hope that this approach will contribute to a better understanding of the intrinsic latent spaces of powerful deep generative models.
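One plausible form of the covariance loss on latent codes mentioned here (an assumption; the paper's exact formulation may differ) penalizes the off-diagonal entries of the batch covariance so that latent components become decorrelated:

    import torch

    def covariance_loss(z: torch.Tensor) -> torch.Tensor:
        """Sum of squared off-diagonal covariances of latent codes z, shape (B, d)."""
        z = z - z.mean(dim=0, keepdim=True)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum()

    z = torch.randn(64, 8)        # a batch of 8-d latent codes
    print(covariance_loss(z))     # added to the reconstruction term during training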

Journal ArticleDOI
TL;DR: The possibilities of using latent semantic analysis were studied for identifying scientific subject spaces and for revealing how completely science degree seekers' publications cover the results of their dissertation research.
Abstract: The study considers the possibilities of using latent semantic analysis for the tasks of identifying scientific subject spaces and evaluating how completely the publications of science degree seekers cover the results of their dissertation research. A probabilistic thematic model was built to make it possible to cluster the publications of scholars into scientific areas, taking into account the citation network, which was an important step toward identifying scientific subject spaces. In constructing the model, we solved the problem of the increasing instability of citation graph clustering as the number of clusters decreases. This problem arises when merging clusters built from citation graph clustering according to the similarity of the abstracts of scientific publications. The article describes a representation of text documents based on a probabilistic thematic model using n-grams. A probabilistic thematic model was also built for the task of determining how completely an author's publications cover the material of the author's dissertation research. Approximate threshold coefficients were calculated to evaluate whether an author's articles include the research provisions reflected in the text of the author's dissertation abstract. The probabilistic thematic model of an author's publications was trained using the BigARTM tool. Using the constructed model and a special regularizer, a matrix was found to evaluate the relevance of topics, specified by segments of an author's dissertation abstract, to the documents formed by the author's publications. Overall, important aspects of using latent semantic analysis were studied for identifying scientific subject spaces and for revealing how completely the results of dissertation research are covered by science degree seekers' publications.

Journal ArticleDOI
16 May 2020-Entropy
TL;DR: The paper shows that the renormalization procedure allows for finding an approximation of the optimal number of topics at least 30 times faster than the grid search without significant loss of quality.
Abstract: In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves an extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of the renormalization procedure with the Renyi entropy approach allows for quick searching of the optimal number of topics. In this paper, the renormalization procedure is developed for probabilistic Latent Semantic Analysis (pLSA), the Latent Dirichlet Allocation model with a variational Expectation-Maximization algorithm (VLDA), and the Latent Dirichlet Allocation model with a granulated Gibbs sampling procedure (GLDA). The experiments were conducted on two test datasets with a known number of topics in two different languages and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure allows for finding an approximation of the optimal number of topics at least 30 times faster than the grid search, without significant loss of quality.
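For scale, the grid-search baseline that renormalization is reported to beat looks roughly like the sketch below; raw_docs and renyi_entropy are hypothetical placeholders (the paper's own entropy estimator is not reproduced here), and the point is simply that one full model must be fitted per candidate topic number.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    X = CountVectorizer(max_features=5000).fit_transform(raw_docs)

    scores = {}
    for k in range(2, 101):       # one full (expensive) fit per candidate K
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        scores[k] = renyi_entropy(lda)   # hypothetical entropy estimator

    best_k = min(scores, key=scores.get)  # minimum-entropy topic number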

Book ChapterDOI
01 Jan 2020
TL;DR: This chapter introduces three diversity and novelty boosting approaches, including maximal marginal relevance, probabilistic latent semantic analysis, and a relevance-novelty graphical model, and three diversity and novelty assessment measures, including weighted subtopic precision, α-nDCG, and geNov.
Abstract: In this chapter, we review biomedical information retrieval techniques, focusing on diversity and novelty boosting methods and their evaluation metrics. We introduce three diversity and novelty boosting approaches, including maximal marginal relevance, probabilistic latent semantic analysis, and a relevance-novelty graphical model, and three diversity and novelty assessment measures, including weighted subtopic precision, α-nDCG, and geNov. Experimental results on a set of state-of-the-art diversity and novelty evaluation metrics are also presented with regard to their sensitivity to ranking quality, their discriminative power, and their time efficiency. We also conduct experiments with a larger dataset to reexamine these diversity and/or novelty metrics and present the results.

Posted Content
TL;DR: The results indicate that a cost- and time-effective performance summary of an airline and its competitors can be obtained from OCR and provide implications for post-pandemic preparedness in the airline industry considering the unprecedented impact of coronavirus disease 2019 and predictions on similar pandemics in the future.
Abstract: To understand the important dimensions of service quality from the passenger's perspective and tailor service offerings for competitive advantage, airlines can capitalize on the abundantly available online customer reviews (OCR). The objective of this paper is to discover company- and competitor-specific intelligence from OCR using an unsupervised text analytics approach. First, the key aspects (or topics) discussed in the OCR are extracted using three topic models - probabilistic latent semantic analysis (pLSA) and two variants of Latent Dirichlet allocation (LDA-VI and LDA-GS). Subsequently, we propose an ensemble-assisted topic model (EA-TM), which integrates the individual topic models, to classify each review sentence to the most representative aspect. Likewise, to determine the sentiment corresponding to a review sentence, an ensemble sentiment analyzer (E-SA), which combines the predictions of three opinion mining methods (AFINN, SentiStrength, and VADER), is developed. An aspect-based opinion summary (AOS), which provides a snapshot of passenger-perceived strengths and weaknesses of an airline, is established by consolidating the sentiments associated with each aspect. Furthermore, a bi-gram analysis of the labeled OCR is employed to perform root cause analysis within each identified aspect. A case study involving 99,147 airline reviews of a US-based target carrier and four of its competitors is used to validate the proposed approach. The results indicate that a cost- and time-effective performance summary of an airline and its competitors can be obtained from OCR. Finally, besides providing theoretical and managerial implications based on our results, we also provide implications for post-pandemic preparedness in the airline industry considering the unprecedented impact of coronavirus disease 2019 (COVID-19) and predictions on similar pandemics in the future.
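A minimal sketch of the majority-vote idea behind E-SA using two of the three named analyzers (the afinn and vaderSentiment Python packages; SentiStrength is a separate Java tool and is omitted here). The tie-handling rule is an assumption, since the paper's exact combination scheme is not given in this abstract.

    from afinn import Afinn
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    afinn = Afinn()
    vader = SentimentIntensityAnalyzer()

    def ensemble_sentiment(sentence: str) -> str:
        votes = [
            "pos" if afinn.score(sentence) > 0 else "neg",
            "pos" if vader.polarity_scores(sentence)["compound"] > 0 else "neg",
        ]
        # Agreement wins; disagreement is flagged rather than guessed.
        return votes[0] if votes[0] == votes[1] else "mixed"

    print(ensemble_sentiment("The crew was friendly but the seat was cramped."))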

Journal ArticleDOI
TL;DR: This work uses Nonnegative Matrix Factorization with the Kullback–Leibler divergence to prove that, when the number of model components is large enough and a limit condition is reached, the Singular Value Decomposition and the Probabilistic Latent Semantic Analysis empirical distributions are arbitrarily close.
Abstract: Probabilistic Latent Semantic Analysis has been related to the Singular Value Decomposition. Several problems occur when this comparison is made. Data class restrictions and the existence of several local optima mask the relation, making it a formal analogy without any real significance. Moreover, the computational difficulty in terms of time and memory limits the technique's applicability. In this work, we use Nonnegative Matrix Factorization with the Kullback–Leibler divergence to prove that, when the number of model components is large enough and a limit condition is reached, the Singular Value Decomposition and the Probabilistic Latent Semantic Analysis empirical distributions are arbitrarily close. Under such conditions, the equality of Nonnegative Matrix Factorization and Probabilistic Latent Semantic Analysis is obtained. With this result, the Singular Value Decomposition of every matrix with nonnegative entries converges to the general-case Probabilistic Latent Semantic Analysis result and constitutes the unique probabilistic image. Moreover, a faster algorithm for Probabilistic Latent Semantic Analysis is provided.
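The practical upshot of this equivalence can be sketched with scikit-learn: fit NMF under the KL divergence and renormalize the factors into pLSA-style conditional distributions (the toy count matrix is an assumption; the paper's faster pLSA algorithm itself is not reproduced).

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(100, 400)).astype(float)  # toy doc-term counts

    nmf = NMF(n_components=10, beta_loss="kullback-leibler",
              solver="mu", max_iter=500)
    W = nmf.fit_transform(X)      # (documents, topics)
    H = nmf.components_           # (topics, words)

    # Row-normalize the factors to obtain pLSA-style conditionals.
    P_w_given_z = H / H.sum(axis=1, keepdims=True)   # P(w|z)
    P_z_given_d = W / W.sum(axis=1, keepdims=True)   # P(z|d)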

Proceedings ArticleDOI
01 Jul 2020
TL;DR: A new regularizer for the Theta and Phi matrices in the probabilistic Latent Semantic Analysis (pLSA) model is designed to increase the quality of topic models trained on unbalanced collections.
Abstract: This article proposes a new approach for building topic models on unbalanced collections, based on existing methods and our experiments with such methods. Real-world data collections contain topics in various proportions, and documents of a relatively small topic often become distributed over the larger topics instead of being grouped into one topic. To address this issue, we design a new regularizer for the Theta and Phi matrices in the probabilistic Latent Semantic Analysis (pLSA) model. We make sure this regularizer increases the quality of topic models trained on unbalanced collections. Besides, we support this regularizer conceptually with our experiments.
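For reference, in the additive-regularization framework that such regularizers fit into (the general ARTM form due to Vorontsov and colleagues, not this paper's specific regularizer), a regularizer R(\Phi, \Theta) enters the M-step of the EM algorithm as

    \phi_{wt} \propto \Bigl( n_{wt} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}} \Bigr)_{+},
    \qquad
    \theta_{td} \propto \Bigl( n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}} \Bigr)_{+},

where n_{wt} and n_{td} are expected counts from the E-step and (\cdot)_{+} truncates negative values to zero.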

Proceedings ArticleDOI
01 Jan 2020
TL;DR: An enhanced metadata extraction method for robust object search and person re-identification that can be applied to a wide range of surveillance systems, such as searching for missing children in a large public space and crowd monitoring systems.
Abstract: Recently, surveillance cameras have become ubiquitous for both real-time monitoring and recording important moments. Temporally seamless surveillance using multiple cameras requires an increasing amount of human effort and an enormous amount of storage. The use of dynamic cameras further requires advanced computer vision algorithms and is another challenge for intelligent visual surveillance. To solve those problems, we present an enhanced metadata extraction method for robust object search and person re-identification. More specifically, the proposed method accurately extracts an object region using a modified DeepLab version 3, and then extracts metadata including the representative color, size, aspect ratio, and moving trajectory of the object. The proposed metadata extraction method can be applied to a wide range of surveillance systems, such as searching for missing children in a large public space and crowd monitoring systems.

Proceedings ArticleDOI
23 Sep 2020
TL;DR: The results show that pLSA can perform clustering of multidimensional spectral data without compromising the original information, and that a Bayesian network is a powerful technique for visualizing relationships between multiple variables.
Abstract: In the agricultural field, crop diseases and physiological disorders have a large impact on crop yield and quality. Minimizing damage is very important, but most of the work, such as patrols for pest control, is done by humans, which is a heavy burden for farmers. Therefore, it is desirable to reduce the burden by using an automatic detection system for the diseases. In this paper, hyperspectral imaging and AI technology were applied to detect gray mold on tomato leaves. Using probabilistic latent semantic analysis (pLSA) and a Bayesian network, an optimal set of 8 wavelengths was selected for machine learning. The prediction accuracy using the selected wavelengths was not significantly decreased compared to using all wavelengths. The results show that pLSA can perform clustering of multidimensional spectral data without compromising the original information, and that a Bayesian network is a powerful technique for visualizing relationships between multiple variables.

Journal ArticleDOI
15 Mar 2020
TL;DR: This paper provides an overview of effective EM-like algorithms for learning latent Dirichlet allocation (LDA) and additively regularized topic models (ARTM) and reviews 14 effective implementations of topic modeling algorithms proposed in the literature over the past 10 years.
Abstract: Topic modeling is an area of natural language processing that has been actively developed in the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents. It defines each topic by a probability distribution over words and describes each document with a probability distribution over topics. The exploding volume of text data motivates the community to constantly upgrade topic modeling algorithms for multiprocessor systems. In this paper, we provide an overview of effective EM-like algorithms for learning latent Dirichlet allocation (LDA) and additively regularized topic models (ARTM). Firstly, we review 11 techniques for efficient topic modeling based on synchronous and asynchronous parallel computing, distributed data storage, streaming, batch processing, RAM optimization, and fault tolerance improvements. Secondly, we review 14 effective implementations of topic modeling algorithms proposed in the literature over the past 10 years, which use different combinations of the techniques above. Their comparison shows the lack of a perfect universal solution. All improvements described are applicable to all kinds of topic modeling algorithms: PLSA, LDA, MAP, VB, GS, and ARTM.
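As one concrete example of the streaming and batch-processing techniques the survey reviews, gensim's LdaModel implements online variational Bayes and consumes the corpus in chunks, so the whole collection never has to sit in RAM; the toy corpus below is an assumption.

    from gensim import corpora
    from gensim.models import LdaModel

    texts = [["topic", "model", "inference"],
             ["parallel", "batch", "streaming"],
             ["latent", "dirichlet", "allocation"]] * 100

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    # Online variational EM: documents are processed chunk by chunk.
    lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary,
                   chunksize=100, update_every=1, passes=1)
    print(lda.show_topics())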

Journal ArticleDOI
TL;DR: A multi-task topic analysis (MTTA) framework is presented to analyze cancer hallmark-specific topics from documents, together with a comprehensive performance evaluation comparing it with several approaches.
Abstract: The hallmarks of cancer represent an essential concept for discovering novel knowledge about cancer and for extracting the complexity of cancer. Due to the lack of topic analysis frameworks optimized specifically for cancer data, studies on topic modeling in cancer research still face a strong challenge. Recently, deep learning (DL) based approaches were successfully employed to learn semantic and contextual information from scientific documents using word embeddings according to the hallmarks of cancer (HoC). However, those are only applicable to labeled data, and only a comparatively small number of documents are labeled by experts, while a massive number of unlabeled documents are available online. In this paper, we present a multi-task topic analysis (MTTA) framework to analyze cancer hallmark-specific topics from documents. The MTTA framework consists of three main subtasks: (1) cancer hallmark learning (CHL), used to learn cancer hallmarks from existing labeled documents; (2) weak label propagation (WLP), used to classify a large number of unlabeled documents with the model pre-trained in the CHL task; and (3) topic modeling (ToM), used to discover topics for each hallmark category. In the CHL task, we employed a convolutional neural network (CNN) with pre-trained word embeddings that represent semantic meanings obtained from an unlabeled large corpus. In the ToM task, we employed latent topic models, namely the latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) models, to capture the semantic information learned by the CNN model for topic analysis. To evaluate the MTTA framework, we collected a large number of documents related to lung cancer in a case study. We also conducted a comprehensive performance evaluation of the MTTA framework, comparing it with several approaches.


Journal ArticleDOI
01 Nov 2020
TL;DR: A new approach for automatic learning of terminological ontologies from a textual corpus, based on probabilistic models, that captures semantic relationships between word-topic and topic-document pairs in terms of probability distributions in order to build a topic ontology and ontology graph with minimum human intervention.
Abstract: The ontology enrichment process is text-based, and the application domain at hand is circumscribed to the content of the related texts. However, the main challenge in ontology enrichment is its learning, since there is still a lack of relevant approaches able to achieve automatic enrichment from a textual corpus or dataset of various topics. In this paper, we describe a new approach for automatic learning of terminological ontologies from a textual corpus based on probabilistic models. In our approach, two topic modeling algorithms are explored, namely LDA and pLSA, for learning a topic ontology. The objective is to capture semantic relationships between word-topic and topic-document pairs in terms of probability distributions so as to build a topic ontology and ontology graph with minimum human intervention. Experimental analysis on building a topic ontology and retrieving the corresponding topic ontology for a user query demonstrates the effectiveness of the proposed approach.

Journal ArticleDOI
TL;DR: Simulations and MALDI spectra of a stroke-damaged rat brain show MS signals from pathological tissue can be quantified and linear Poisson modelling advances pLSA, giving covariances on model parameters and supporting χ2 testing for the presence/absence of MS signal components.
Abstract:
MOTIVATION: Probabilistic latent semantic analysis (pLSA) is commonly applied to describe mass spectra (MS) images. However, the method does not provide certain outputs necessary for the quantitative scientific interpretation of data. In particular, it lacks assessment of statistical uncertainty and the ability to perform hypothesis testing. We show how linear Poisson modelling advances pLSA, giving covariances on model parameters and supporting χ2 testing for the presence/absence of MS signal components. As an example, this is useful for the identification of pathology in MALDI biological samples. We also show potential wider applicability, beyond MS, using magnetic resonance imaging (MRI) data from colorectal xenograft models.
RESULTS: Simulations and MALDI spectra of a stroke-damaged rat brain show MS signals from pathological tissue can be quantified. MRI diffusion data of control and radiotherapy-treated tumours further show high-sensitivity hypothesis testing for treatment effects. Successful χ2 and degrees-of-freedom are computed, allowing null-hypothesis thresholding at high levels of confidence.
AVAILABILITY AND IMPLEMENTATION: Open-source image analysis software available from TINA Vision, www.tina-vision.net.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The proposed approach serves as a swift and simple analytical tool for the analysis of fluorescent mixtures without involving a pre-separation step, and it was quite precise in analysing both calibration and validation set samples.

Journal ArticleDOI
15 Jul 2020
TL;DR: Two methods to extract patterns of spatio-temporal activity are described that provide viewpoints for separating the activities of a workday into time segments of appropriate size, in order to grasp how the activities vary with the time of day.
Abstract: A pedestrian tracking system based on highly accurate laser scanners is an effective method for understanding the usage of facility space. While such a system is capable of gathering an enormous volume of tracking data, specialized skills and significant amounts of labor are needed to get a reliable bird's-eye view of the spatio-temporal characteristics of the observed data. In this paper, two methods to extract patterns of spatio-temporal activity are described. These can provide a broad overview of office workers' activities in the office throughout a workday, and an easily understood visualization that indicates what time segment, what location, and what activities are taking place. One is a time segment extraction model that identifies characteristic time intervals in the time series of office workers' activities, using a classification model based on information loss minimization. The other is a day scene extraction model that identifies daily scenes from simultaneous behavior patterns in spatio-temporal distributions, using a latent class model with PLSI (probabilistic latent semantic indexing). These methods provide viewpoints for separating the activities of a workday into time segments of appropriate size, in order to grasp how the activities vary with the time of day. Simultaneous behavior patterns in time, space, and activity are extracted, thereby allowing representation of typical scenes such as morning meetings and extended conversations between co-workers.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This work proposes a novel clustering technique for BA which can find hidden routines in ubiquitous data and capture the patterns in those routines, and which works efficiently on high-dimensional data without performing any computationally expensive reduction operations.
Abstract: Behavioral analysis (BA) on ubiquitous sensor data is the task of finding the latent distribution of features for modeling user-specific characteristics. These characteristics, in turn, can be used for a number of tasks including resource management, power efficiency, and smart home applications. In recent years, the employment of topic models for BA has been found to successfully extract the dynamics of the sensed data. Topic modeling is popularly performed on text data for mining inherent topics; the task of finding the latent topics in textual data is done in an unsupervised manner. In this work, we propose a novel clustering technique for BA which can find hidden routines in ubiquitous data and also capture the patterns in the routines. Our approach works efficiently on high-dimensional data for BA without performing any computationally expensive reduction operations. For a comparative study, we evaluate three different techniques, namely LDA, Non-negative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). We analyze the efficiency of the methods using performance indices like perplexity and silhouette on three real-world ubiquitous sensor datasets: the Intel Lab Data, Kyoto Data, and MERL data. Through rigorous experiments, we achieve silhouette scores of 0.7049 on the Intel Lab dataset, 0.6547 on the Kyoto dataset, and 0.8312 on the MERL dataset for clustering.
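A minimal sketch of the silhouette evaluation used for comparison here, with NMF document-topic weights as the clustering input (toy data standing in for sensor-event counts; not the authors' code):

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(300, 50)).astype(float)  # toy sensor-event counts

    doc_topic = NMF(n_components=4, max_iter=400).fit_transform(X)
    labels = doc_topic.argmax(axis=1)   # assign each window to its top routine

    # Silhouette lies in [-1, 1]; higher means tighter, better-separated routines.
    print(silhouette_score(doc_topic, labels))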