
Showing papers on "Probabilistic latent semantic analysis" published in 2019


Posted Content
TL;DR: This survey conducts a comprehensive review of various short text topic modeling techniques proposed in the literature, and presents three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks.
Abstract: Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long text topic modeling algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this problem well, since only very limited word co-occurrence information is available in short texts. Short text topic modeling, which aims to overcome the sparseness of short texts, has therefore attracted much attention from the machine learning research community in recent years. In this survey, we conduct a comprehensive review of the short text topic modeling techniques proposed in the literature. We present three categories of methods, based on Dirichlet multinomial mixtures, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks. We develop the first comprehensive open-source library, called STTM, written in Java, which integrates all surveyed algorithms within a unified interface and provides benchmark datasets to facilitate the development of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets, comparing their performance against one another and against long text topic modeling algorithms.
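
Since nearly every paper on this page builds on the PLSA formulation, a minimal numpy sketch of its EM updates may help fix ideas. This is a generic illustration, not code from the STTM library (which is written in Java); the function and variable names are invented for this sketch.

```python
import numpy as np

def plsa(n_dw, K, iters=100, seed=0):
    """Fit PLSA by EM on a (documents x vocabulary) count matrix n_dw."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]     # shape (D, W, K)
        resp = joint / joint.sum(axis=2, keepdims=True).clip(1e-12)
        # M-step: re-estimate both multinomials from expected counts
        expected = n_dw[:, :, None] * resp
        p_w_z = expected.sum(axis=0).T
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = expected.sum(axis=1)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z
```

On a short text corpus the count matrix n_dw is extremely sparse, which is exactly the failure mode the surveyed methods are designed to work around.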

98 citations


Journal ArticleDOI
TL;DR: Dynamical and uncertain data characteristics are both taken into consideration for regression modeling, and a linear dynamic system is introduced to incorporate the dynamical data feature.

Abstract: Dynamics and uncertainty are two main features of industrial process data and should be attended to when carrying out process data modeling and analytics. In this paper, both the dynamical and the uncertain data characteristics are taken into consideration for regression modeling. Based on the probabilistic latent variable modeling framework, a linear dynamic system is introduced to incorporate the dynamical data feature. The expectation–maximization algorithm is introduced for parameter learning of the dynamical probabilistic latent variable model, based on which a new soft sensing scheme is formulated for online prediction of key/quality variables in the process. An industrial case study illustrates the necessity and effectiveness of introducing dynamical data information into the probabilistic latent variable model.
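
For context, the linear dynamic system mentioned here is conventionally written in state-space form (standard notation, not necessarily the paper's exact parameterization):

$$\mathbf{x}_t = \mathbf{A}\,\mathbf{x}_{t-1} + \mathbf{w}_t, \qquad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$$
$$\mathbf{y}_t = \mathbf{C}\,\mathbf{x}_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$$

EM for such models typically alternates a Kalman smoothing E-step with closed-form M-step updates of $\mathbf{A}$, $\mathbf{C}$, $\mathbf{Q}$, and $\mathbf{R}$.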

65 citations


Journal ArticleDOI
01 Jan 2019
TL;DR: A novel maximum entropy-PLSA model is proposed, which uses probabilistic latent semantic analysis to extract seed emotion words from Wikipedia and the training corpus, and uses important emotional classification features to classify words.
Abstract: Sentiment analysis is an important field of study in natural language processing. Achieving highly accurate sentiment classification on massive, irregular data is a major challenge in sentiment analysis. To address this problem, a novel maximum entropy-PLSA model is proposed. In this model, we first use probabilistic latent semantic analysis to extract seed emotion words from Wikipedia and the training corpus. Features are then extracted from these seed emotion words and used as input for training the maximum entropy model. The test set is processed similarly and fed into the trained model for emotion classification. The training and test sets are divided using the K-fold method. The maximum entropy classification based on probabilistic latent semantic analysis classifies words using important emotional classification features, such as the relevance of words and parts of speech in the context, the relevance with degree adverbs, and the similarity with benchmark emotional words. The experiments show that the proposed classification method outperforms the compared methods.
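
As a rough illustration of the classification stage: a maximum entropy model is equivalent to multinomial logistic regression, so a minimal sketch can be built with scikit-learn. The documents, labels, and seed words below are toy stand-ins, not the paper's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great service lovely room", "dirty room terrible staff"]
labels = [1, 0]  # toy positive/negative labels

# Features restricted to hypothetical seed emotion words
vec = CountVectorizer(vocabulary=["great", "lovely", "dirty", "terrible"])
X = vec.fit_transform(docs)

clf = LogisticRegression(max_iter=1000)  # maximum entropy classifier
clf.fit(X, labels)
print(clf.predict(vec.transform(["great view dirty carpet"])))
```

In practice the paper's K-fold split would wrap this fit/predict cycle, e.g. with sklearn's cross-validation utilities.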

62 citations


Journal ArticleDOI
TL;DR: A collaborative recommendation algorithm based on an improved probabilistic latent semantic model is proposed in this paper, which introduces a popularity factor into Probabilistic Latent Semantic Analysis to derive a probabilistic matrix factorization model.
Abstract: In order to effectively solve the new-item problem and clearly improve the accuracy of the recommended results, we propose a collaborative recommendation algorithm based on an improved probabilistic latent semantic model, which introduces a popularity factor into probabilistic latent semantic analysis to derive a probabilistic matrix factorization model. The core idea is to integrate semantic knowledge into the recommendation process to overcome the shortcomings of traditional recommendation algorithms. We introduce the popularity factor to form a quintuple vector that captures user preference, and integrate probabilistic matrix factorization to solve the data sparsity problem on the basis of probabilistic latent semantic analysis; the probabilistic matrix factorization model is then adopted to construct a weighted similarity function to compute the recommendation result. An experimental study on real-world datasets demonstrates that our proposed method outperforms three state-of-the-art methods in recommendation accuracy.
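
For reference, a minimal sketch of plain probabilistic matrix factorization (MAP estimation with Gaussian priors on the factors) is shown below; the paper's popularity factor and PLSA-derived weighting are not reproduced, and all names are invented for the example.

```python
import numpy as np

def pmf(R, mask, k=10, lam=0.1, lr=0.005, iters=200, seed=0):
    """Factor a ratings matrix R (users x items) as U @ V.T, fitting only the
    observed entries (mask == 1) with L2 penalties from the Gaussian priors."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((R.shape[0], k))
    V = 0.1 * rng.standard_normal((R.shape[1], k))
    for _ in range(iters):
        E = mask * (R - U @ V.T)          # error on observed entries only
        U += lr * (E @ V - lam * U)       # gradient ascent on the MAP objective
        V += lr * (E.T @ U - lam * V)
    return U, V
```

The learned factors U and V then feed a similarity or prediction function; in the paper that function is weighted by the popularity-augmented semantic model.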

28 citations


Journal ArticleDOI
TL;DR: The research shows that applying individual-based spatial–temporal data to human mobility and space–time interaction studies can help analyse urban spatial structure and understand actual regional function from a new perspective.

Abstract: The urban system is shaped by the interactions between different regions as planned by the government, then reshaped by human activities and residents' needs. Understanding the changes of regi...

21 citations


Journal ArticleDOI
TL;DR: The researchers used word embeddings to obtain vector values for the deep learning method Long Short-Term Memory for sentiment classification, and proposed the Probabilistic Latent Semantic Analysis (PLSA) method to produce hidden topics.
Abstract: In the Industry 5.0 era, product reviews are necessary for the sustainability of a company. Product reviews are a User Generated Content (UGC) feature that describes customer satisfaction. The researchers used five hotel aspects, including location, meal, service, comfort, and cleanliness, to measure customer satisfaction. Each product review was preprocessed into a term-list document. In this context, we proposed the Probabilistic Latent Semantic Analysis (PLSA) method to produce hidden topics. Semantic similarity was used to classify topics into the five hotel aspects. The Term Frequency-Inverse Corpus Frequency (TF-ICF) method was used to weight each term list, which had been expanded from each cluster in the document. The researchers used word embeddings to obtain vector values for the deep learning method Long Short-Term Memory (LSTM) for sentiment classification. The results showed that the combination of the PLSA + TF-ICF 100% + semantic similarity method was superior, at 0.840, for categorization into the five hotel aspects; the Word Embedding + LSTM method performed best in sentiment classification, at 0.946; the service aspect received the highest positive sentiment value (45.545) among the aspects; and the comfort aspect received the highest negative sentiment value (12.871). Other results also showed that sentiment was affected by the aspects.
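
TF-ICF replaces the document-frequency statistic of TF-IDF with a corpus-frequency statistic. A minimal sketch under one common smoothed formulation is given below; the exact smoothing used by the paper may differ, and the per-cluster term lists are hypothetical inputs.

```python
import math
from collections import Counter

def tf_icf(doc_tokens, corpus_collection):
    """TF-ICF weights for one term list, under one common smoothed form:
    tf(t) * log((N + 1) / (cf(t) + 1)), where cf(t) counts how many reference
    corpora (here: the per-cluster term lists) contain term t."""
    N = len(corpus_collection)
    cf = Counter(t for corpus in corpus_collection for t in set(corpus))
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log((N + 1) / (cf[t] + 1)) for t in tf}

clusters = [["room", "bed", "clean"], ["food", "meal"], ["staff", "service"]]
print(tf_icf(["clean", "clean", "service"], clusters))
```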

20 citations


Journal ArticleDOI
05 Jul 2019-Entropy
TL;DR: It is demonstrated that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do.
Abstract: Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems, such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to partially solve the problem of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma-Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which can account for just one of the parameters of interest. We demonstrate that Sharma-Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical ground.
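
The two-parameter Sharma-Mittal entropy the paper relies on has, in its standard form for a probability vector p with orders q != 1 and r != 1, a short implementation; how the topic model's quantities are mapped onto q and r follows the paper and is not reproduced here.

```python
import numpy as np

def sharma_mittal(p, q, r, eps=1e-12):
    """Sharma-Mittal entropy: ((sum_i p_i^q)^((1-r)/(1-q)) - 1) / (1 - r).
    Renyi entropy is the r -> 1 limit and Tsallis the r -> q limit."""
    p = np.asarray(p, dtype=float)
    p = p[p > eps]                      # ignore zero-probability entries
    s = np.sum(p ** q)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

print(sharma_mittal([0.5, 0.3, 0.2], q=2.0, r=0.5))
```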

19 citations


Proceedings ArticleDOI
14 Jul 2019
TL;DR: A Multiterm Topic Model (MTM), which directly models the generative process of multiterms, can infer the word distributions of each topic and the topic distribution of each short text to alleviate the sparsity problem in short text modeling.
Abstract: Since effective semantic representations are utilized in many practical applications, inferring discriminative and coherent latent topics from short texts is a critical and basic task. Traditional topic models like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) do not perform well on short texts due to the data sparsity problem. One novel model, the Biterm Topic Model (BTM), which models unordered word pairs (i.e., biterms) drawn from the whole corpus, was proposed to solve this problem. However, both the performance and the efficiency of BTM are reduced by the many irrelevant and useless biterms. In this paper, we propose a Multiterm Topic Model (MTM) for short text topic modeling. MTM extracts variable-length and more correlated word patterns (i.e., multiterms) from the whole corpus. By directly modeling the generative process of multiterms, MTM can infer the word distributions of each topic and the topic distribution of each short text, alleviating the sparsity problem in short text modeling. With a proper amount of flexible multiterms, the learning process of MTM is enhanced. Through extensive experiments on two real-world short text collections, we show that MTM is more efficient and outperforms the baseline models in terms of topic coherence and text classification.
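
To make the biterm notion concrete, the sketch below enumerates the unordered word pairs of a single short text; BTM pools such pairs over the whole corpus, whereas MTM mines variable-length multiterms instead. The function name is invented.

```python
from itertools import combinations

def biterms(tokens):
    """All unordered word pairs (biterms) in one short text."""
    return [tuple(sorted(p)) for p in combinations(tokens, 2)]

print(biterms(["cheap", "flight", "deal"]))
# -> [('cheap', 'flight'), ('cheap', 'deal'), ('deal', 'flight')]
```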

17 citations


Journal ArticleDOI
TL;DR: Evaluation and comparison of hybrid topic models are presented in the experimental section, demonstrating their efficiency with different distance measures, including Euclidean distance, cosine distance, and multi-viewpoint cosine similarity.
Abstract: Social media, and in particular microblogs, are becoming an important data source for disease surveillance, behavioral medicine, and public healthcare. Topic models are widely used in microblog analytics for analyzing and integrating the textual data within a corpus. This paper uses health tweets as microblogs and attempts health data clustering with topic models. Traditional topic models, such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSI), Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and integer Joint NMF (intJNMF), are used for health data clustering; however, they offer no tractable way to assess the number of health topic clusters. Proper visualizations are essential for extracting information and identifying trends from the data, which may comprise thousands of documents and millions of words. For visualization of topic clouds and health tendencies in the document collection, we present hybrid topic models that integrate traditional topic models with VAT. The proposed hybrid topic models, viz., Visual Non-negative Matrix Factorization (VNMF), Visual Latent Dirichlet Allocation (VLDA), Visual Probabilistic Latent Semantic Indexing (VPLSI), and Visual Latent Semantic Indexing (VLSI), are promising methods for assessing health tendencies and visualizing topic clusters from benchmark and Twitter datasets. An evaluation and comparison of the hybrid topic models is presented in the experimental section, demonstrating their efficiency with different distance measures, including Euclidean distance, cosine distance, and multi-viewpoint cosine similarity.

16 citations


Journal ArticleDOI
TL;DR: This work describes a method for probabilistic topic analysis of images and text based on a new representation of graph-regularized PLSA (GPLSA), and proposes efficient multiplicative iterative algorithms for GPLSA with three popular regularizers, namely l1, l2, and symmetric KL divergence.

10 citations


Journal ArticleDOI
TL;DR: An unsupervised technique, probabilistic Latent Semantic Analysis (pLSA), combined with Bag of Visual Words, is proposed to discriminate diseased images from normal ones, achieving better performance measures than existing methods.

Journal ArticleDOI
TL;DR: Experimental results show that the parallel versions of DEpLSA and the traditional pLSA approach can provide accurate HU results fast enough for practical use, accelerating the corresponding serial versions by at least 30x on the GTX 1080 and up to 147x on the Tesla P100 GPU; these significant acceleration factors increase with image size, enabling fast processing of massive HS data repositories.
Abstract: Hyperspectral unmixing (HU) is an important task for remotely sensed hyperspectral (HS) data exploitation. It comprises the identification of pure spectral signatures (endmembers) and their corresponding fractional abundances in each pixel of the HS data cube. Several methods have been developed for (semi-)supervised and automatic identification of endmembers and abundances. Recently, the statistical dual-depth sparse probabilistic latent semantic analysis (DEpLSA) method was developed to tackle the HU problem as a latent topic-based approach in which both endmembers and abundances can be estimated simultaneously according to the semantics encapsulated by the latent topic space. However, statistical models usually lead to computationally demanding algorithms, and the computational time of DEpLSA is often too high for practical use, in particular when the dimensionality of the HS data cube is large. To mitigate this limitation, this article resorts to graphics processing units (GPUs) to provide a new parallel version of DEpLSA, developed using the NVIDIA Compute Unified Device Architecture (CUDA). Our experimental results, conducted using four well-known HS datasets and two different GPU architectures (GTX 1080 and Tesla P100), show that our parallel versions of DEpLSA and the traditional pLSA approach can provide accurate HU results fast enough for practical use, accelerating the corresponding serial versions by at least 30x on the GTX 1080 and up to 147x on the Tesla P100 GPU. These significant acceleration factors increase with the image size, opening the possibility of fast processing of massive HS data repositories.

Journal ArticleDOI
TL;DR: The proposed DpLSA is effective for face recognition with a single training sample and possesses a certain degree of robustness to illumination, pose, and occlusion.

Abstract: Face recognition is still a challenging issue due to intrinsic complexity, external variations, and the limited number of training samples. In this paper, a novel face recognition method based on the probabilistic latent semantic analysis (pLSA) model is developed, which mainly contains two stages: bag-of-words feature extraction and semantic representation learning. In the first stage, to extract more structural information, a region-specific dictionary strategy is employed, i.e., generating a dictionary for each region; the encoded and sum-pooled features of all regions are concatenated together. In the second stage, a discriminative pLSA (DpLSA) model is presented, which initializes the word-topic distribution P(w|z_k) with the center point of the training data from category k. As a result, the problem of choosing an appropriate number of topics in classical topic models is alleviated, and the training process of DpLSA is very fast, requiring only a few iterations. Moreover, the discovered topic-document distribution P(z|d) is discriminative and semantic, with the dominant topic entry corresponding to the category label of image d, which enables classification directly from P(z|d). Extensive experiments on four representative databases demonstrate that the proposed DpLSA is effective for face recognition with a single training sample and possesses a certain degree of robustness to illumination, pose, and occlusion.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A Fisher kernel function based on Probabilistic Latent Semantic Analysis is proposed for sentiment analysis with a Support Vector Machine; results show that the proposed method clearly improves on the comparison method.

Abstract: In mainstream sentiment analysis methods, represented by the Support Vector Machine, the vocabulary and the latent semantic information contained in the text are not well accounted for, and sentiment analysis of text depends excessively on the statistics of sentiment words. In this paper, a Fisher kernel function based on Probabilistic Latent Semantic Analysis is proposed for sentiment analysis with a Support Vector Machine. The Fisher kernel function is derived from the Probabilistic Latent Semantic Analysis model. By this means, latent semantic information in the form of probability features can be used as classification features, improving the classification performance of the support vector machine and addressing the neglect of latent semantic features in text sentiment analysis. The results show that, compared with the baseline method, the proposed method achieves a clear improvement.
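
For orientation, the generic Fisher kernel construction (due to Jaakkola and Haussler) that such a method instantiates with the PLSA likelihood is:

$$U_d = \nabla_{\theta} \log P(d \mid \theta), \qquad K(d_i, d_j) = U_{d_i}^{\top} F^{-1} U_{d_j}, \qquad F = \mathbb{E}\big[ U_d U_d^{\top} \big],$$

where, for PLSA, $\theta$ collects the multinomials $P(w|z)$ and $P(z|d)$, so the score vector $U_d$ is built from the E-step posteriors $P(z|d,w)$. The paper's exact derivation over the PLSA parameters may differ from this generic form.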

Journal ArticleDOI
TL;DR: The present work proposes the application of an Akaike information criterion (AIC)-assisted probabilistic latent semantic analysis (pLSA) algorithm to TSFS data sets.

Journal ArticleDOI
TL;DR: Zhang et al. proposed a new methodology for hierarchical photo organization into topics and topic-related categories by applying probabilistic latent semantic analysis, and automatically assigned a name to each topic by relying on a lexical database.

Book ChapterDOI
01 Jan 2019
TL;DR: This paper proposes a framework to analyze sentiment at the sentence level by applying independent component analysis (ICA) in coordination with probabilistic latent semantic analysis; the approach is efficient, as its outcome is more precise and accurate than lexicon-based sentiment analysis.

Abstract: Twitter is an ocean of diverse topics, while a sentiment classifier is always limited to a specific domain or topic. Twitter lacks data labeling and a mechanism to acquire sentiment labels, and sentiment words extracted from Twitter are generalized. It is important to derive the correct sentiment from a tweet; otherwise, the result may differ from the desired sentiment. Sentiment analysis work to date has a limitation: it is based on predefined lexicons, and the sentiment of a word based on such a lexicon does not generalize. Our work suggests a solution that bases sentiment analysis on sentence-level sentiment rather than on predefined lexicons alone. In this paper, we propose a framework to analyze sentiment from sentences by applying independent component analysis (ICA) in coordination with probabilistic latent semantic analysis. We view pLSA as a word categorization technique based on topics, wherein a given corpus is split among different topics. We further utilize these topics for tagging sentiment with the help of ICA; in this way, we are able to assign the sentiment of a sentence more accurately than existing approaches. The proposed work is efficient, as its outcome is more precise and accurate than lexicon-based sentiment analysis. With adequate unsupervised machine learning training, accurate outcomes with an average precision of 77.98% are accomplished.

Proceedings ArticleDOI
24 Jul 2019
TL;DR: The researchers use sentiment analysis with a machine learning approach, with Fuzzy K-Nearest Neighbor (FK-NN) as the classification method; the predicted results show that FK-NN sentiment analysis comes close to the results of the previous research method, Probabilistic Latent Semantic Analysis (PLSA).

Abstract: Social media has grown so rapidly that people can easily share their opinions, moments, etc. There are several types of research on social media, one of which is Sentiment Analysis (SA), also referred to as opinion mining (OM). Sentiment analysis focuses on classifying patterns derived from positive, negative, and neutral words. In this paper, the researchers perform sentiment analysis with a machine learning approach, using Fuzzy K-Nearest Neighbor (FK-NN) as the classification method. The dataset consists of English customer reviews, used to predict whether a review's sentiment is positive or negative. The predicted results show that FK-NN sentiment analysis comes close to the results of the previous research method, Probabilistic Latent Semantic Analysis (PLSA): FK-NN achieves 72.05% versus 76% for PLSA.
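
A minimal sketch of the Fuzzy K-NN decision rule (in the style of Keller et al., which this paper's FK-NN presumably follows) is shown below; feature extraction from the review text is assumed to happen upstream, and all names are invented.

```python
import numpy as np

def fuzzy_knn(X_train, U_train, x, k=5, m=2.0, eps=1e-9):
    """Fuzzy K-NN: class memberships of x are a distance-weighted average of
    the memberships of its k nearest neighbors. U_train has shape
    (n_train, n_classes) with rows summing to 1; m > 1 is the fuzzifier."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + eps)
    u = (U_train[nn] * w[:, None]).sum(axis=0) / w.sum()
    return u  # argmax over u gives the crisp predicted class
```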

Proceedings ArticleDOI
01 Jul 2019
TL;DR: A novel rotation invariant probabilistic Latent Semantic Analysis (RI-pLSA) model is proposed to learn latent semantic representations for object detection by imposing a rotation-invariant regularization term on the objective function of pLSA to enforce the learned representation from all rotations of the same sample to be as consistent as possible.
Abstract: Object detection in very high resolution (VHR) optical remote sensing images is a fundamental yet challenging problem in remote sensing image analysis. Detection performance depends heavily on the representation capability of the extracted features. Recently, convolutional neural networks (CNNs) have made breakthroughs in various applications on natural images. However, directly applying CNNs to object detection in VHR optical remote sensing images is problematic due to object rotation variations. To address this issue, a novel rotation-invariant probabilistic Latent Semantic Analysis (RI-pLSA) model is proposed to learn latent semantic representations for object detection. This is achieved by imposing a rotation-invariant regularization term on the objective function of pLSA, enforcing the representations learned from all rotations of the same sample to be as consistent as possible. Additionally, the proposed RI-pLSA model takes CNN features as input, which yields a more powerful semantic representation for object detection. Comprehensive experiments on a publicly available ten-class object detection dataset demonstrate the superiority and effectiveness of our method compared with state-of-the-art approaches.

Journal ArticleDOI
Yongjing Yin, Jiali Zeng, Hongji Wang, Keqing Wu, Bin Luo, Jinsong Su
TL;DR: The proposed lexical resource-constrained topic model is an extension of probabilistic latent semantic analysis that automatically learns word-level distributed representations for word relatedness measurement, with the generalized expectation maximization (GEM) algorithm introduced for statistical estimation.

Abstract: Word relatedness computation is an important supporting technology for many tasks in natural language processing. Traditionally, there have been two distinct strategies for word relatedness measurement: one utilizes corpus-based models, whereas the other leverages external lexical resources. However, each solution has its strengths and weaknesses. In this paper, we propose a lexical resource-constrained topic model to integrate the two complementary strategies effectively. Our model is an extension of probabilistic latent semantic analysis that automatically learns word-level distributed representations for word relatedness measurement. Furthermore, we introduce the generalized expectation maximization (GEM) algorithm for statistical estimation. The proposed model not only inherits the advantage of conventional topic models in dimension reduction, but also refines parameter estimation by using word pairs that are known to be related. Experimental results in different languages demonstrate the effectiveness of our model in topic extraction and word relatedness measurement.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: The study concluded that the PLSA algorithm is more efficient on preprocessed data than on raw web data, and that processing time was reduced when preprocessing was used to eliminate redundant latent variables.

Abstract: The recent explosion of the web as an information source has brought interesting challenges to document annotation and classification of web documents, as well as to the information and collaborative filtering world. Not only does it introduce performance issues due to the huge number of documents; there is an even bigger challenge in that most of the data are unlabeled. The arrival of social media and blogging has further increased web data. The machine learning community has therefore taken an interest in using unsupervised learning to classify such big data. A newer paradigm, semi-supervised learning, relies on the assumption that unlabeled data can surface interesting patterns that industry can use. The study uses the Probabilistic Latent Semantic Analysis algorithm, an unsupervised machine learning algorithm, to retrieve information based on latent classes of documents and terms. PLSA is the topic modeling tool used by the study to deduce hidden topics across the documents, by looking at the terms and documents and inferring, with the Expectation Maximization algorithm, which words belong to which topic. The model was used to discover whether given documents infer one or more topics. The study concluded that the PLSA algorithm is more efficient on preprocessed data than on raw web data, and processing time was reduced when preprocessing was used to eliminate redundant latent variables. Increasing k, the number of topics, improved topic quality.
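
As a rough illustration of scanning k as the study describes, the sketch below (reusing the hypothetical plsa() function from the first code block on this page) tracks the training log-likelihood as the number of topics grows; this is a generic proxy, not the study's own quality measure.

```python
import numpy as np

def log_likelihood(n_dw, p_z_d, p_w_z, eps=1e-12):
    """Training log-likelihood of the PLSA mixture P(w|d) = sum_z P(z|d)P(w|z)."""
    p_w_d = p_z_d @ p_w_z
    return float(np.sum(n_dw * np.log(p_w_d + eps)))

# n_dw: a (documents x vocabulary) count matrix built during preprocessing
# for k in (5, 10, 20, 40):
#     p_z_d, p_w_z = plsa(n_dw, K=k)
#     print(k, log_likelihood(n_dw, p_z_d, p_w_z))
```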

Proceedings ArticleDOI
01 Feb 2019
TL;DR: A novel topic modeling technique for NGS data analysis using Probabilistic Latent Semantic Analysis (PLSA) is proposed, and it is shown that PLSA outperforms the NNMF and LDA topic models.

Abstract: The generation of substantial quantities of low-cost, high-quality next-generation sequences (NGS) has empowered mainstream researchers to address various biological and medical research problems. The vast information delivered by NGS technologies presents a big challenge for data processing, analysis, data mining, and text mining. This paper proposes a novel topic modeling technique for NGS data analysis using Probabilistic Latent Semantic Analysis (PLSA). The proposed method has four tasks: NGS dataset construction, data preprocessing, topic modeling, and text mining using the PLSA topic outputs. NGS data of Salmonella enterica strains were used as the dataset in this procedure. Topic modeling performance is measured using standard clustering comparison measures such as the Adjusted Rand Index, Normalized Mutual Information, Normalized Information Distance, and Normalized Variation of Information. The performance of PLSA topic modeling on NGS data was compared with the Non-negative Matrix Factorization (NNMF) algorithm and the existing Latent Dirichlet Allocation (LDA) algorithm. The evaluations show that PLSA outperforms the NNMF and LDA topic models.
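
Two of the four comparison measures are available directly in scikit-learn, as sketched below with toy labelings; Normalized Information Distance and Normalized Variation of Information are not built in, but they can be derived from the same mutual-information and entropy quantities.

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score)

true = [0, 0, 0, 1, 1, 2]   # e.g., known strain groupings (toy labels)
pred = [0, 0, 1, 1, 1, 2]   # e.g., dominant PLSA topic per sequence document
print(adjusted_rand_score(true, pred))            # Adjusted Rand Index
print(normalized_mutual_info_score(true, pred))   # Normalized Mutual Information
```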

Proceedings ArticleDOI
01 Jul 2019
TL;DR: An Open Multi-Processing (OpenMP) implementation of the pLSA algorithm for unsupervised Synthetic Aperture Radar and Multi-Spectral Imaging image categorization is presented, suggesting that multi-core systems are an important architecture for the efficient processing of both SAR and MSI datasets.
Abstract: The probabilistic Latent Semantic Analysis (pLSA) model has recently shown great potential to uncover highly descriptive semantic features from limited amounts of remote sensing data. Nonetheless, the high computational cost of this algorithm often constrains its operational application to land cover categorization tasks. In this scenario, this paper presents an Open Multi-Processing (OpenMP) implementation of the pLSA algorithm for unsupervised Synthetic Aperture Radar (SAR) and Multi-Spectral Imaging (MSI) image categorization. The experimental results suggest that multi-core systems are an important architecture for the efficient processing of both SAR and MSI datasets. Specifically, the proposed approach is able to handle a real scenario, exhibiting good results in terms of both accuracy and performance.

Patent
22 Feb 2019
TL;DR: A mining area distribution thematic information extraction method based on multi-source remote sensing images is proposed: land subsidence information is extracted by calculating uplift characteristics of the target area from the radiant values of remote sensing image interferograms, candidate mining areas are delineated from sudden changes in uplift and further identified in combination with radar images; a thematic semantic model of the mining area is then established, with the probabilistic latent semantic analysis method used to model the latent semantics, and the mining area scene is expressed by the feature vector of latent semantics.

Abstract: The invention provides a mining area distribution thematic information extraction method based on multi-source remote sensing images, comprising: extracting land subsidence information by calculating uplift characteristics of the target area from the radiant values of remote sensing image interferograms of the target area, delineating possible mining areas according to sudden change characteristics of uplift, and further identifying them in combination with radar images; establishing a thematic semantic model of the mining area, adopting the probabilistic latent semantic analysis method to model the latent semantics, and expressing the mining area scene by the feature vector of latent semantics; and extracting information from multi-source remote sensing images of the mining area. By combining multi-source optical images and radar images, the invention greatly reduces cost, is convenient to operate, and greatly improves working efficiency. To handle the influence of vegetation, the vegetation index is used to extract thematic information from non-vegetation-covered areas.

Posted Content
TL;DR: For very large corpora, where the number of documents can be on the order of billions, a neural auto-encoder based document embedding is more scalable than the lookup table embedding classically used.

Abstract: In this paper we present a model for unsupervised topic discovery in text corpora. The proposed model uses document, word, and topic lookup-table embeddings as neural network parameters to build probabilities of words given topics and probabilities of topics given documents. These probabilities are used to recover, by marginalization, the probabilities of words given documents. For very large corpora, where the number of documents can be on the order of billions, a neural auto-encoder based document embedding is more scalable than a lookup table embedding as classically done. We thus extend the lookup-based document embedding model to a continuous auto-encoder based model. Our models are trained using probabilistic latent semantic analysis (PLSA) assumptions. We evaluated our models on six datasets with a rich variety of content. The experiments demonstrate that the proposed neural topic models are very effective in capturing relevant topics. Furthermore, on the perplexity metric, our evaluation benchmarks show that our topic models outperform the latent Dirichlet allocation (LDA) model classically used for topic discovery tasks.
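
A minimal sketch of the lookup-table variant, under the stated PLSA assumptions, might look as follows in PyTorch; the class name, sizes, and training snippet are invented for illustration, and the auto-encoder extension is not shown.

```python
import torch
import torch.nn as nn

class LookupTopicModel(nn.Module):
    """P(z|d) comes from a per-document embedding row, P(w|z) from a
    topic-word matrix; the model recovers P(w|d) = sum_z P(w|z) P(z|d)."""
    def __init__(self, n_docs, n_words, n_topics):
        super().__init__()
        self.doc_topic = nn.Parameter(0.01 * torch.randn(n_docs, n_topics))
        self.topic_word = nn.Parameter(0.01 * torch.randn(n_topics, n_words))

    def forward(self, doc_ids):
        p_z_d = torch.softmax(self.doc_topic[doc_ids], dim=-1)   # P(z|d)
        p_w_z = torch.softmax(self.topic_word, dim=-1)           # P(w|z)
        return p_z_d @ p_w_z                                     # P(w|d)

model = LookupTopicModel(n_docs=1000, n_words=5000, n_topics=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
counts = torch.zeros(4, 5000)      # toy word-count rows for 4 documents
counts[0, 7] = 3.0                 # e.g., word 7 occurs 3 times in doc 0
p_w_d = model(torch.tensor([0, 1, 2, 3]))
loss = -(counts * torch.log(p_w_d + 1e-12)).sum()  # PLSA-style likelihood
loss.backward(); opt.step()
```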

Dissertation
01 Jan 2019
TL;DR: Image emotion recognition using region-based multi-level features.

Abstract: According to psychology studies, human emotion can be invoked by different kinds of visual stimuli. Recognizing human emotion automatically from visual content has been studied for years. Emotion recognition is an essential component of human-computer interaction and is involved in many applications, such as advertisement, entertainment, education, and accommodation systems. Compared to other computer vision tasks, visual emotion recognition is more challenging, as it involves analyzing abstract emotional states marked by complexity and subjectivity. Regarding complexity, emotion can be evoked by different kinds of visual content, and the same kind of visual content may evoke various kinds of emotions. Regarding subjectivity, people from different cultural backgrounds may experience different emotions in response to the same visual content. An automatic visual emotion recognition system consists of several tuned processing steps integrated into a pipeline. Previous methods often rely on hand-tuned features, which can introduce strong assumptions about the properties of human emotion. However, the vague assumptions attached to the abstract concept of emotion, together with learning the processing pipeline from limited data, often narrow the generalization of the visual emotion recognition system. Considering the two challenges of complexity and subjectivity mentioned above, more information should be used for image-based emotion analysis. Features from different levels need to be taken into consideration, including low-level visual features such as color, shape, line, and texture; mid-level image aesthetics and composition; and high-level image semantics. Local information extracted from emotion-related image regions can provide further support for image emotion classification. In recent years, deep learning methods have achieved great success in many computer vision tasks. State-of-the-art deep learning methods can achieve performance slightly below or even above human performance on some challenging tasks, such as facial recognition and object detection. The Convolutional Neural Networks applied in deep learning methods consist of hierarchical structures that can learn increasingly abstract concepts, from local to global views, better than hand-crafted features. This observation suggests exploring the application of CNN structures to image emotion classification. This thesis is based on three articles, which contribute to the field of image emotion classification. The first article is an in-depth analysis of the impact of emotional regions in images on image emotion classification. In the model, multi-scale blocks are first extracted from the image to cover different emotional regions. Then, in order to bridge the gap between low-level visual features and high-level emotions, a mid-level representation exploiting Probabilistic Latent Semantic Analysis (pLSA) is introduced to learn a set of mid-level representations as latent topics from affective images. Finally, Multiple Instance Learning (MIL), based on the multi-scale…

Book ChapterDOI
12 Dec 2019
TL;DR: Zhang et al. proposed a new model named HRGM-pLSA to learn latent semantic attributes, which exploits the prior knowledge of predefined attributes by using a hypergraph to represent the complex correlations of the attributes in images.
Abstract: Semantic attributes have been proven to help improve the performance of a series of applications in the field of computer vision. Semantic attributes are usually defined by humans and labeled manually, but whether these attributes discriminate images or videos well is doubtful, because the attributes are not optimized. To solve this problem, we propose a new model, named HRGM-pLSA, to learn latent semantic attributes. Our model employs GM-pLSA to learn the latent topics in images. To utilize the prior knowledge of predefined attributes, we use a hypergraph to represent the complex correlations of the attributes in the images. We construct a regularization term from the hypergraph and integrate it into GM-pLSA so that the learned latent semantic attributes perform better than the primary attributes. In this paper, we evaluate the quality of the attributes in a practical application, image retrieval; the performance of our model in this application demonstrates its value.