Proceedings Article

A Hybrid Document Feature Extraction Method Using Latent Dirichlet Allocation and Word2Vec

Abstract
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that discovers latent topics in a document collection and describes each document as a probability distribution over the discovered topics. It defines a global hierarchical relationship from words to topics and then from topics to documents. Word2Vec is a word-embedding model that predicts a target word from its surrounding context words. In this paper, we propose a hybrid approach that extracts document features as a bag-of-distances in a semantic space. By using both Word2Vec and LDA, our hybrid method not only captures the relationships between documents and topics, but also integrates the contextual relationships among words. Experimental results indicate that document features generated by our hybrid method improve classification performance by consolidating both global and local relationships.
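
The abstract does not spell out how the bag-of-distances is computed, so the following is only a minimal sketch under stated assumptions: each topic is placed in the embedding space as the probability-weighted average of its word vectors, each document as the plain average of its word vectors, and the feature vector is the Euclidean distance from the document to every topic. The helper names (`topic_vectors`, `document_vector`, `bag_of_distances`) and the toy vectors and topic-word probabilities are hypothetical; in practice they would come from trained Word2Vec and LDA models.

```python
import numpy as np

def topic_vectors(topic_word_probs, word_vectors):
    """Place each topic in the embedding space (assumed: weighted word average)."""
    weights = topic_word_probs / topic_word_probs.sum(axis=1, keepdims=True)
    return weights @ word_vectors

def document_vector(tokens, vocab_index, word_vectors):
    """Place a document in the embedding space (assumed: mean of known words)."""
    ids = [vocab_index[t] for t in tokens if t in vocab_index]
    if not ids:
        return np.zeros(word_vectors.shape[1])
    return word_vectors[ids].mean(axis=0)

def bag_of_distances(doc_vec, topic_vecs):
    """One feature per topic: Euclidean distance from the document to that topic."""
    return np.linalg.norm(topic_vecs - doc_vec, axis=1)

# Toy stand-ins for trained models (hypothetical values).
vocab = ["goal", "match", "team", "stock", "market", "bank"]
vocab_index = {word: i for i, word in enumerate(vocab)}
word_vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],   # sports-like words
    [0.1, 1.0], [0.0, 0.9], [0.2, 0.8],   # finance-like words
])
topic_word_probs = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],       # topic 0: sports
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],       # topic 1: finance
])

topics = topic_vectors(topic_word_probs, word_vectors)
doc_vec = document_vector(["team", "match", "goal"], vocab_index, word_vectors)
features = bag_of_distances(doc_vec, topics)
```

The resulting `features` array has one entry per topic and could be passed to any standard classifier. A sports-like document lands closer to the sports topic than to the finance topic, which is the kind of combined global (topic) and local (word context) signal the abstract describes.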


Citations
Journal Article

A Review of Text Corpus-Based Tourism Big Data Mining

TL;DR: A detailed and up-to-date review of text mining techniques that have been, or have the potential to be, applied to modern tourism big data analysis and their applications in tourist profiling, destination image analysis, market demand, etc.
Proceedings Article

LDA Meets Word2Vec: A Novel Model for Academic Abstract Clustering

TL;DR: A novel clustering model that uses abstract text instead of keywords for clustering, because keywords may be ambiguous and, as previous work shows, lead to unsatisfactory clustering results. Experimental results show that the clustering results of PW-LDA are much more accurate and stable than those of state-of-the-art techniques.
Journal Article

Tourism Review Sentiment Classification Using a Bidirectional Recurrent Neural Network with an Attention Mechanism and Topic-Enriched Word Vectors

TL;DR: A bidirectional gated recurrent unit neural network model (BiGRULA) is proposed for sentiment analysis by combining a topic model (lda2vec) and an attention mechanism that allows for more coherent topics from these reviews and achieves good performance in sentiment classification.
Journal Article

Exploring the donation allocation of online charitable crowdfunding based on topical and spatial analysis: Evidence from the Tencent GongYi

TL;DR: A comparative analysis of four types of crowdfunding projects to examine differences in their general characteristics and donation allocation suggests that the success rates for these four different types of online charitable crowdfunding vary and that the key influencing factor among them is the type of project executors.
Journal Article

Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms

TL;DR: This study provides the implementation of various machine learning models to measure the polarity of the sentiments presented in user reviews on the IMDb website and indicates that the SVM obtains the highest accuracy when used with TF-IDF features and achieves an accuracy of 89.55%.