
Showing papers on "Word embedding published in 2015"


Proceedings Article
25 Jan 2015
TL;DR: A recurrent convolutional neural network for text classification is introduced that requires no human-designed features; its recurrent structure captures contextual information as far as possible when learning word representations, which may introduce considerably less noise than traditional window-based neural networks.
Abstract: Text classification is a foundational task in many NLP applications. Traditional text classifiers often rely on many human-designed features, such as dictionaries, knowledge bases and special tree kernels. In contrast to traditional methods, we introduce a recurrent convolutional neural network for text classification without human-designed features. In our model, we apply a recurrent structure to capture contextual information as far as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks. We also employ a max-pooling layer that automatically judges which words play key roles in text classification to capture the key components in texts. We conduct experiments on four commonly used datasets. The experimental results show that the proposed method outperforms the state-of-the-art methods on several datasets, particularly on document-level datasets.

1,981 citations


Journal ArticleDOI
TL;DR: It is revealed that much of the performance gain of word embeddings is due to certain system design choices and hyperparameter optimizations rather than the embedding algorithms themselves, and that these modifications can be transferred to traditional distributional models, yielding similar gains.
Abstract: Recent trends suggest that neural-network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.

1,374 citations


Journal ArticleDOI
TL;DR: Several important RNN architectures, including Elman, Jordan, and hybrid variants, are implemented with the publicly available Theano neural network toolkit and compared in experiments on the well-known airline travel information system (ATIS) benchmark.
Abstract: Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this paper, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the Entertainment domain, and 6.7% for the movies domain.

562 citations


Journal ArticleDOI
TL;DR: A machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

495 citations


Proceedings ArticleDOI
01 Jan 2015
TL;DR: A solution is proposed which normalizes the word vectors on a hypersphere and constrains the linear transform to be an orthogonal transform, offering better performance on a word similarity task and an English-to-Spanish word translation task.
Abstract: Word embedding has been found to be highly effective for translating words from one language to another by a simple linear transform. However, we found some inconsistency among the objective functions of the embedding and the transform learning, as well as the distance measurement. This paper proposes a solution which normalizes the word vectors on a hypersphere and constrains the linear transform to be an orthogonal transform. The experimental results confirmed that the proposed solution can offer better performance on a word similarity task and an English-to-Spanish word translation task.
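The normalize-then-rotate idea described above can be sketched as orthogonal Procrustes over a seed dictionary: length-normalize both embedding sets, then obtain the rotation from an SVD. This is a minimal illustration with random stand-in data, not the authors' released code; all variable names are hypothetical.

```python
import numpy as np

def normalize_rows(M):
    """Project embeddings onto the unit hypersphere (row-wise length normalization)."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def learn_orthogonal_map(X_src, Y_tgt):
    """Solve min_W ||X_src W - Y_tgt||_F with W orthogonal (orthogonal Procrustes),
    where rows of X_src / Y_tgt are embeddings of seed-dictionary translation pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Toy stand-ins for real source/target embeddings of dictionary pairs
rng = np.random.default_rng(0)
X = normalize_rows(rng.normal(size=(500, 100)))
Y = normalize_rows(rng.normal(size=(500, 100)))
W = learn_orthogonal_map(X, Y)
# A source word x is translated by the nearest cosine neighbour of x @ W
# among the (normalized) target-language vectors.
```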

436 citations


Proceedings ArticleDOI
17 Oct 2015
TL;DR: This work proposes to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings, and derives multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings.
Abstract: Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like string matching, do not capture semantic similarity beyond a trivial level. Furthermore, handcrafted patterns and external sources of structured semantic knowledge cannot be assumed to be available in all circumstances and for all domains. Lastly, approaches depending on parse trees are restricted to syntactically well-formed texts, typically of one sentence in length. We investigate whether determining short text similarity is possible using only semantic features---where by semantic we mean, pertaining to a representation of meaning---rather than relying on similarity in lexical or syntactic representations. We use word embeddings, vector representations of terms, computed from unlabelled data, that represent terms in a semantic space in which proximity of vectors can be interpreted as semantic similarity. We propose to go from word-level to text-level semantics by combining insights from methods based on external sources of semantic knowledge with word embeddings. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated. We derive multiple types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use the trained model at testing time to predict the semantic similarity of new, unlabelled pairs of short texts. We show on a publicly available evaluation set commonly used for the task of semantic similarity that our method outperforms baseline methods that work under the same conditions.
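A rough sketch of the meta-feature idea for one embedding set: compare the mean vectors of the two short texts and summarize word-to-word cosines, then feed the resulting features to any supervised learner on labelled pairs. The specific features below are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def mean_vector(tokens, emb, dim):
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(tokens_a, tokens_b, emb, dim=300):
    """A few meta-features for one short-text pair from one word-embedding set."""
    a = mean_vector(tokens_a, emb, dim)
    b = mean_vector(tokens_b, emb, dim)
    cos_means = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    # All pairwise word-word cosines between the two texts, summarized
    ww = []
    for ta in tokens_a:
        for tb in tokens_b:
            if ta in emb and tb in emb:
                va, vb = emb[ta], emb[tb]
                ww.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))
    ww = ww or [0.0]
    return np.array([cos_means, np.max(ww), np.mean(ww), np.min(ww)])

# Features from several embedding sets can simply be concatenated and used to
# train any off-the-shelf supervised model on labelled short-text pairs.
```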

426 citations


Proceedings Article
25 Jan 2015
TL;DR: The experimental results show that the TWE models outperform typical word embedding models including the multi-prototype version on contextual word similarity, and also exceed latent topic models and other representative document models on text classification.
Abstract: Most word embedding models typically represent each word using a single vector, which makes these models indiscriminative for ubiquitous homonymy and polysemy. In order to enhance discriminativeness, we employ latent topic models to assign topics for each word in the text corpus, and learn topical word embeddings (TWE) based on both words and their topics. In this way, contextual word embeddings can be flexibly obtained to measure contextual word similarity. We can also build document representations, which are more expressive than some widely-used document models such as latent topic models. In the experiments, we evaluate the TWE models on two tasks, contextual word similarity and text classification. The experimental results show that our models outperform typical word embedding models including the multi-prototype version on contextual word similarity, and also exceed latent topic models and other representative document models on text classification. The source code of this paper can be obtained from https://github.com/largelymfs/topical_word_embeddings.
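The simplest reading of the topical word embedding idea can be sketched with off-the-shelf tools: assign a topic to each token with LDA, then train skip-gram over word#topic pseudo-words. The per-token topic assignment below (document's dominant topic) is a deliberate simplification of the paper's LDA inference, and the toy corpus is made up.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# Tiny toy corpus; in practice this would be the full tokenized collection.
docs = [["stock", "market", "prices", "fell", "sharply"],
        ["apple", "released", "a", "new", "phone"],
        ["the", "market", "for", "phone", "apps", "grew"]]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

# Crude per-token topic assignment: tag every token with its document's dominant topic.
tagged = []
for d, b in zip(docs, bow):
    topic = max(lda.get_document_topics(b, minimum_probability=0.0),
                key=lambda t: t[1])[0]
    tagged.append([f"{w}#{topic}" for w in d])

# Skip-gram over word#topic pseudo-words yields topic-specific word vectors.
twe = Word2Vec(tagged, vector_size=50, sg=1, window=3, min_count=1, epochs=20)
```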

414 citations


Proceedings ArticleDOI
01 Jul 2015
TL;DR: PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0's heuristic rankings.
Abstract: We present a new release of the Paraphrase Database. PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0's heuristic rankings. Each paraphrase pair in the database now also includes fine-grained entailment relations, word embedding similarities, and style annotations.

321 citations


Proceedings Article
Xinxiong Chen1, Lei Xu1, Zhiyuan Liu1, Maosong Sun1, Huanbo Luan1 
25 Jul 2015
TL;DR: A character-enhanced word embedding model (CWE) is presented to address the issues of character ambiguity and non-compositional words, and the effectiveness of CWE on word relatedness computation and analogical reasoning is evaluated.
Abstract: Most word embedding methods take a word as a basic unit and learn embeddings according to words' external contexts, ignoring the internal structures of words. However, in some languages such as Chinese, a word is usually composed of several characters and contains rich internal information. The semantic meaning of a word is also related to the meanings of its composing characters. Hence, we take Chinese for example, and present a character-enhanced word embedding model (CWE). In order to address the issues of character ambiguity and non-compositional words, we propose multiple prototype character embeddings and an effective word selection method. We evaluate the effectiveness of CWE on word relatedness computation and analogical reasoning. The results show that CWE outperforms other baseline methods which ignore internal character information. The codes and data can be accessed from https://github.com/Leonard-Xu/CWE.
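The core composition step of a character-enhanced embedding can be sketched in a few lines: combine a word's own vector with the mean of its characters' vectors. This ignores the paper's multiple-prototype handling of character ambiguity and non-compositional words; names and the 0.5 mixing weight are illustrative assumptions.

```python
import numpy as np

def cwe_vector(word, word_emb, char_emb, dim=100):
    """Character-enhanced word vector: average the word vector with the mean of
    its characters' vectors (a simplified view of CWE; the full model also uses
    multiple prototype character embeddings and word selection)."""
    w = word_emb.get(word, np.zeros(dim))
    chars = [char_emb[c] for c in word if c in char_emb]
    if not chars:
        return w
    return 0.5 * (w + np.mean(chars, axis=0))
```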

265 citations


Proceedings Article
01 Jan 2015
TL;DR: This article proposed density-based distributed embeddings and presented a method for learning representations in the space of Gaussian distributions, which can capture uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity.
Abstract: Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.
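One concrete payoff of representing words as densities is an asymmetric similarity such as KL divergence between the two Gaussians, which point vectors with cosine similarity cannot express. Below is the standard closed-form KL between diagonal Gaussians as a worked sketch; it is not the authors' training code.

```python
import numpy as np

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1))).

    An asymmetric score like this is what makes Gaussian embeddings attractive
    for entailment-style relations: a specific word's density should sit
    "inside" a more general word's density, and the KL in the two directions differs."""
    d = mu0.shape[0]
    return 0.5 * (np.sum(var0 / var1)
                  + np.sum((mu1 - mu0) ** 2 / var1)
                  - d
                  + np.sum(np.log(var1) - np.log(var0)))
```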

262 citations


Proceedings ArticleDOI
09 Aug 2015
TL;DR: A generalized language model is constructed, where the mutual independence between a pair of words (say t and t') no longer holds and the vector embeddings of the words are made use of to derive the transformation probabilities between words.
Abstract: Word2vec, a state-of-the-art word embedding technique has gained a lot of interest in the NLP community. The embedding of the word vectors helps to retrieve a list of words that are used in similar contexts with respect to a given word. In this paper, we focus on using the word embeddings for enhancing retrieval effectiveness. In particular, we construct a generalized language model, where the mutual independence between a pair of words (say t and t') no longer holds. Instead, we make use of the vector embeddings of the words to derive the transformation probabilities between words. Specifically, the event of observing a term t in the query from a document d is modeled by two distinct events, that of generating a different term t', either from the document itself or from the collection, respectively, and then eventually transforming it to the observed query term t. The first event of generating an intermediate term from the document intends to capture how well does a term contextually fit within a document, whereas the second one of generating it from the collection aims to address the vocabulary mismatch problem by taking into account other related terms in the collection. Our experiments, conducted on the standard TREC collection, show that our proposed method yields significant improvements over LM and LDA-smoothed LM baselines.
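The term-transformation step can be sketched as turning embedding similarities into a probability distribution over translating terms: shift cosine similarities to be non-negative and normalize them. This is a hedged illustration of the idea, not the paper's exact estimator or smoothing scheme.

```python
import numpy as np

def translation_probs(term, vocab, emb):
    """P(term | t') for every candidate t' in vocab, derived from embedding
    cosine similarities (negative similarities clipped, then normalized)."""
    v = emb[term]
    sims = np.array([
        emb[t] @ v / (np.linalg.norm(emb[t]) * np.linalg.norm(v) + 1e-9)
        for t in vocab
    ])
    sims = np.clip(sims, 0.0, None)      # keep only positive associations
    return sims / (sims.sum() + 1e-9)
```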

Posted Content
TL;DR: The authors analyze three critical components in training word embeddings: model, corpus, and training parameters, and evaluate each word embedding in three ways: analyzing its semantic properties, using it as a feature for supervised tasks, and using it to initialize neural networks.
Abstract: We analyze three critical components of word embedding training: the model, the corpus, and the training parameters. We systematize existing neural-network-based word embedding algorithms and compare them using the same corpus. We evaluate each word embedding in three ways: analyzing its semantic properties, using it as a feature for supervised tasks and using it to initialize neural networks. We also provide several simple guidelines for training word embeddings. First, we discover that corpus domain is more important than corpus size. We recommend choosing a corpus in a suitable domain for the desired task; after that, using a larger corpus yields better results. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large. Third, the early stopping metric for iterating should rely on the development set of the desired task rather than the validation loss of embedding training.

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper proposes to learn continuous word embeddings with metadata of category information within cQA pages for question retrieval, employing the Fisher kernel framework to deal with the variable size of word embedding vectors.
Abstract: Community question answering (cQA) has become an important issue due to the popularity of cQA archives on the web. This paper is concerned with the problem of question retrieval. Question retrieval in cQA archives aims to find the existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem brings about a new challenge for question retrieval in cQA. In this paper, we propose to learn continuous word embeddings with metadata of category information within cQA pages for question retrieval. To deal with the variable size of word embedding vectors, we employ the Fisher kernel framework to aggregate them into fixed-length vectors. Experimental results on a large-scale real-world cQA data set show that our approach can significantly outperform state-of-the-art translation models and topic-based models for question retrieval.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: Objects2action is a semantic word embedding spanned by a skip-gram model of thousands of object categories; a mechanism is proposed to exploit multiple-word descriptions of actions and objects, and the zero-shot approach is shown to extend to the spatio-temporal localization of actions in video.
Abstract: The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: A simple wrapper method that uses off-the-shelf word embedding algorithms to learn task-specific bilingual word embeddings that is independent of the choice of embedding algorithm, does not require parallel data, and can be adapted to specific tasks by re-defining the equivalence classes.
Abstract: We introduce a simple wrapper method that uses off-the-shelf word embedding algorithms to learn task-specific bilingual word embeddings. We use a small dictionary of easily-obtainable task-specific word equivalence classes to produce mixed context-target pairs that we use to train off-the-shelf embedding models. Our model has the advantage that it (a) is independent of the choice of embedding algorithm, (b) does not require parallel data, and (c) can be adapted to specific tasks by re-defining the equivalence classes. We show how our method outperforms off-the-shelf bilingual embeddings on the task of unsupervised cross-language partof-speech (POS) tagging, as well as on the task of semi-supervised cross-language super sense (SuS) tagging.
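A simplified sketch of the wrapper idea: map the seed-dictionary words in both monolingual corpora to shared equivalence-class tokens, concatenate the corpora, and train an unmodified embedding model, so that translations end up sharing contexts. The example dictionary and parameters are made up, and the paper's actual mixing of context-target pairs is richer than this.

```python
from gensim.models import Word2Vec

# Hypothetical seed dictionary of task-specific equivalence classes
equiv = {"house": "CLS_house", "casa": "CLS_house",
         "dog": "CLS_dog", "perro": "CLS_dog"}

english = [["the", "dog", "sleeps", "in", "the", "house"]]
spanish = [["el", "perro", "duerme", "en", "la", "casa"]]

# Replace dictionary words with shared class tokens and concatenate the corpora;
# an off-the-shelf word2vec run then places translations in one space because
# they now share (mixed) contexts.
mixed = [[equiv.get(w, w) for w in sent] for sent in english + spanish]
model = Word2Vec(mixed, vector_size=50, sg=1, min_count=1, window=3)
```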

Proceedings ArticleDOI
22 Jun 2015
TL;DR: It is demonstrated that word embedding vectors perform better than binary vectors as a representation of the tags associated with an image, and the CCA model is compared to a simple CNN-based linear regression model, which allows the CNN layers to be trained using back-propagation.
Abstract: We propose simple and effective models for the image annotation that make use of Convolutional Neural Network (CNN) features extracted from an image and word embedding vectors to represent their associated tags. Our first set of models is based on the Canonical Correlation Analysis (CCA) framework that helps in modeling both views - visual features (CNN feature) and textual features (word embedding vectors) of the data. Results on all three variants of the CCA models, namely linear CCA, kernel CCA and CCA with k-nearest neighbor (CCA-KNN) clustering, are reported. The best results are obtained using CCA-KNN which outperforms previous results on the Corel-5k and the ESP-Game datasets and achieves comparable results on the IAPRTC-12 dataset. In our experiments we evaluate CNN features in the existing models which bring out the advantages of it over dozens of handcrafted features. We also demonstrate that word embedding vectors perform better than binary vectors as a representation of the tags associated with an image. In addition we compare the CCA model to a simple CNN based linear regression model, which allows the CNN layers to be trained using back-propagation.
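The linear-CCA variant can be sketched directly with scikit-learn: learn a shared space between CNN image features and the mean word vectors of each image's tags, then annotate by proximity in that space. The data below are random stand-ins and the dimensionalities are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Stand-ins for real data: CNN features per image and the mean word vector of its tags
rng = np.random.default_rng(0)
cnn_feats = rng.normal(size=(200, 512))   # one row of visual features per training image
tag_feats = rng.normal(size=(200, 300))   # mean of the tags' word embeddings per image

cca = CCA(n_components=50)
cca.fit(cnn_feats, tag_feats)
img_proj, txt_proj = cca.transform(cnn_feats, tag_feats)
# Annotation: project a test image into the shared space and rank candidate tag
# vectors (projected the same way) by cosine similarity, or vote over the
# k nearest training images (the CCA-KNN variant).
```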

Proceedings Article
25 Jul 2015
TL;DR: It is pointed out that SGNS is essentially a representation learning method, which learns to represent the co-occurrence vector for a word, and that extended supervised word embedding can be established based on the proposed representation learning view.
Abstract: Recently significant advances have been witnessed in the area of distributed word representations based on neural networks, which are also known as word embeddings. Among the new word embedding models, skip-gram negative sampling (SGNS) in the word2vec toolbox has attracted much attention due to its simplicity and effectiveness. However, the principles of SGNS remain not well understood, except for a recent work that explains SGNS as an implicit matrix factorization of the pointwise mutual information (PMI) matrix. In this paper, we provide a new perspective for further understanding SGNS. We point out that SGNS is essentially a representation learning method, which learns to represent the co-occurrence vector for a word. Based on the representation learning view, SGNS is in fact an explicit matrix factorization (EMF) of the words' co-occurrence matrix. Furthermore, extended supervised word embedding can be established based on our proposed representation learning view.
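The matrix-factorization view connects neural embeddings to the count-based pipeline: build a word-context co-occurrence matrix, weight it with (positive) PMI, and factorize it. The sketch below uses plain SVD on PPMI as the count-based analogue of what SGNS factorizes; it is an illustration of the connection, not the paper's EMF algorithm.

```python
import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.maximum(pmi, 0)                      # positive PMI; -inf from zero counts clipped to 0

U, S, Vt = np.linalg.svd(ppmi)
dim = 5
word_vectors = U[:, :dim] * np.sqrt(S[:dim])   # low-rank word representations
```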

Proceedings ArticleDOI
01 Jun 2015
TL;DR: A simple model for lexical substitution, based on the popular skip-gram word embedding model, which is efficient, very simple to implement, and at the same time achieves state-of-the-art results in an unsupervised setting.
Abstract: The lexical substitution task requires identifying meaning-preserving substitutes for a target word instance in a given sentential context. Since its introduction in SemEval-2007, various models addressed this challenge, mostly in an unsupervised setting. In this work we propose a simple model for lexical substitution, which is based on the popular skip-gram word embedding model. The novelty of our approach is in leveraging explicitly the context embeddings generated within the skip-gram model, which were so far considered only as an internal component of the learning process. Our model is efficient, very simple to implement, and at the same time achieves state-of-the-art results on lexical substitution tasks in an unsupervised setting.
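A sketch of one way to score substitutes with both the word (target) embeddings and the context embeddings of a skip-gram model, in the spirit of the approach described above. Which toolkit matrix actually holds the context vectors is toolkit-specific and left as an assumption here; the function and variable names are hypothetical.

```python
import numpy as np

def add_cos_score(sub, target, context_words, W, Ctx, vocab):
    """Average-cosine substitute score: cosine of the substitute with the target
    word vector plus its cosines with the context words' *context* vectors.

    W    : word (input) embedding matrix, rows indexed via vocab
    Ctx  : context (output) embedding matrix from the same skip-gram model
           (where this matrix lives depends on the toolkit; check before use).
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    s = W[vocab[sub]]
    score = cos(s, W[vocab[target]])
    for c in context_words:
        score += cos(s, Ctx[vocab[c]])
    return score / (len(context_words) + 1)
```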

Proceedings ArticleDOI
08 Dec 2015
TL;DR: This paper used neural word embeddings within the well known translation language model for information retrieval, which captures implicit semantic relations between the words in queries and those in relevant documents, thus producing more accurate estimations of document relevance.
Abstract: Recent advances in neural language models have contributed new methods for learning distributed vector representations of words (also called word embeddings). Two such methods are the continuous bag-of-words model and the skipgram model. These methods have been shown to produce embeddings that capture higher order relationships between words that are highly effective in natural language processing tasks involving the use of word similarity and word analogy. Despite these promising results, there has been little analysis of the use of these word embeddings for retrieval. Motivated by these observations, in this paper, we set out to determine how these word embeddings can be used within a retrieval model and what the benefit might be. To this aim, we use neural word embeddings within the well known translation language model for information retrieval. This language model captures implicit semantic relations between the words in queries and those in relevant documents, thus producing more accurate estimations of document relevance. The word embeddings used to estimate neural language models produce translations that differ from previous translation language model approaches; differences that deliver improvements in retrieval effectiveness. The models are robust to choices made in building word embeddings and, even more so, our results show that embeddings do not even need to be produced from the same corpus being used for retrieval.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: This paper proposes a novel approach to train word embeddings to capture antonyms by utilizing supervised synonym and antonym information from thesauri, as well as distributional information from large-scale unlabelled text data.
Abstract: This paper proposes a novel approach to train word embeddings to capture antonyms. Word embeddings have shown to capture synonyms and analogies. Such word embeddings, however, cannot capture antonyms since they depend on the distributional hypothesis. Our approach utilizes supervised synonym and antonym information from thesauri, as well as distributional information from large-scale unlabelled text data. The evaluation results on the GRE antonym question task show that our model outperforms the state-of-the-art systems and it can answer the antonym questions in the F-score of 89%.

Proceedings ArticleDOI
01 Jan 2015
TL;DR: Experimental results show that, in combination with a back-off method based on string similarity, word embeddings outperform a method using count-based distributional similarity.
Abstract: This paper presents the first attempt to use word embeddings to predict the compositionality of multiword expressions. We consider both single- and multi-prototype word embeddings. Experimental results show that, in combination with a back-off method based on string similarity, word embeddings outperform a method using count-based distributional similarity. Our best results are competitive with, or superior to, state-of-the-art methods over three standard compositionality datasets, which include two types of multiword expressions and two languages.
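For the single-prototype case, the basic compositionality measure reduces to comparing the vector learned for the whole expression with the composition of its parts. The sketch below uses additive composition and cosine similarity; it is a generic illustration of this family of measures, not the paper's exact scoring function.

```python
import numpy as np

def compositionality_score(mwe_vec, component_vecs):
    """Cosine between the vector learned for the multiword expression (trained as a
    single token) and the sum of its components' vectors. Low scores suggest
    idiomatic expressions (e.g. "couch potato"); high scores suggest compositional
    ones (e.g. "climate change")."""
    comp = np.sum(component_vecs, axis=0)
    return float(mwe_vec @ comp /
                 (np.linalg.norm(mwe_vec) * np.linalg.norm(comp) + 1e-9))
```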

Posted Content
TL;DR: This study proposes to use a BLSTM-RNN with word embeddings for the part-of-speech (POS) tagging task, and without morphological features it achieves performance comparable with the Stanford POS tagger.
Abstract: Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents, while word embedding has been demonstrated to be a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use a BLSTM-RNN with word embeddings for the part-of-speech (POS) tagging task. When tested on the Penn Treebank WSJ test set, a state-of-the-art tagging accuracy of 97.40% is achieved. Without using morphological features, this approach can also achieve performance comparable with the Stanford POS tagger.
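A minimal PyTorch sketch of this kind of tagger, assuming a pretrained embedding matrix is available: a bidirectional LSTM over embedded tokens followed by a per-token linear layer. The sizes and names are illustrative; training (cross-entropy over tags) is omitted.

```python
import numpy as np
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Bidirectional-LSTM tagger over pretrained word embeddings."""
    def __init__(self, emb_matrix, n_tags, hidden=200):
        super().__init__()
        weights = torch.tensor(emb_matrix, dtype=torch.float)
        self.emb = nn.Embedding.from_pretrained(weights, freeze=False)
        self.lstm = nn.LSTM(emb_matrix.shape[1], hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):                  # (batch, seq_len) of token indices
        h, _ = self.lstm(self.emb(token_ids))      # (batch, seq_len, 2*hidden)
        return self.out(h)                         # (batch, seq_len, n_tags) tag scores

# Toy usage with a random stand-in for a pretrained embedding matrix
emb = np.random.randn(1000, 100).astype("float32")
model = BLSTMTagger(emb, n_tags=45)
scores = model(torch.randint(0, 1000, (2, 7)))     # two sentences of length 7
```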

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This work compares the performance of two state-of-the-art word embedding methods, namely word2vec and GloVe, on a basic task of reflecting semantic similarity and relatedness of biomedical concepts.
Abstract: Recently there has been a surge of interest in learning vector representations of words from huge corpora in an unsupervised manner. Such word vector representations, also known as word embeddings, have been shown to improve the performance of machine learning models in several NLP tasks. However, the efficiency of such representations has not been systematically evaluated in the biomedical domain. In this work our aim is to compare the performance of two state-of-the-art word embedding methods, namely word2vec and GloVe, on a basic task of reflecting semantic similarity and relatedness of biomedical concepts. For this, vector representations of all unique words in a corpus of more than 1 million full-length research articles in the biomedical domain are obtained from the two methods. These word vectors are evaluated for their ability to reflect semantic similarity and semantic relatedness of word pairs in a benchmark data set of manually curated semantically similar and related words available at http://rxinformatics.umn.edu. We observe that the parameters of these models do affect their ability to capture lexico-semantic properties, and that word2vec with a particular language-modeling setting seems to perform better than the others.
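The evaluation protocol used in studies like this one is straightforward to reproduce: compute cosine similarities for the benchmark word pairs and correlate them with the human ratings using Spearman's rho. A minimal sketch, assuming the embeddings are available as a dict of vectors:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, emb):
    """Spearman correlation between embedding cosine similarities and human
    similarity/relatedness ratings for (word1, word2) pairs. Pairs with
    out-of-vocabulary words are skipped (a common, if debatable, choice)."""
    model_scores, gold = [], []
    for (w1, w2), h in zip(pairs, human_scores):
        if w1 in emb and w2 in emb:
            a, b = emb[w1], emb[w2]
            model_scores.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            gold.append(h)
    rho, _ = spearmanr(model_scores, gold)
    return rho
```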

Proceedings Article
30 Jul 2015
TL;DR: A hybrid matrix factorisation model representing users and items as linear combinations of their content features' latent factors outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios, and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant.
Abstract: I present a hybrid matrix factorisation model representing users and items as linear combinations of their content features’ latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant. Additionally, feature embeddings produced by the model encode semantic information in a way reminiscent of word embedding approaches, making them useful for a range of related tasks such as tag recommendations.

Posted Content
TL;DR: This work proposes to use BLSTM-RNN for a unified tagging solution that can be applied to various tagging tasks including part-of-speech tagging, chunking and named entity recognition, requiring no task specific knowledge or sophisticated feature engineering.
Abstract: Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g. speech utterances or handwritten documents. In this study, we propose to use BLSTM-RNN for a unified tagging solution that can be applied to various tagging tasks including part-of-speech tagging, chunking and named entity recognition. Instead of exploiting specific features carefully optimized for each task, our solution only uses one set of task-independent features and internal representations learnt from unlabeled text for all tasks. Requiring no task-specific knowledge or sophisticated feature engineering, our approach gets nearly state-of-the-art performance in all these three tagging tasks.

Proceedings Article
05 Nov 2015
TL;DR: The results from both 2010 i2b2 and 2014 Semantic Evaluation data showed that the binarized word embedding features outperformed other strategies for deriving distributed word representations and can be adapted to any other clinical natural language processing research.
Abstract: Clinical Named Entity Recognition (NER) is a critical task for extracting important patient information from clinical text to support clinical and translational research. This study explored the neural word embeddings derived from a large unlabeled clinical corpus for clinical NER. We systematically compared two neural word embedding algorithms and three different strategies for deriving distributed word representations. Two neural word embeddings were derived from the unlabeled Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) II corpus (403,871 notes). The results from both 2010 i2b2 and 2014 Semantic Evaluation (SemEval) data showed that the binarized word embedding features outperformed other strategies for deriving distributed word representations. The binarized embedding features improved the F1-score of the Conditional Random Fields based clinical NER system by 2.3% on i2b2 data and 2.4% on SemEval data. The combined feature from the binarized embeddings and the Brown clusters improved the F1-score of the clinical NER system by 2.9% on i2b2 data and 2.7% on SemEval data. Our study also showed that the distributed word embedding features derived from a large unlabeled corpus can be better than the widely used Brown clusters. Further analysis found that the neural word embeddings captured a wide range of semantic relations, which could be discretized into distributed word representations to benefit the clinical NER system. The low-cost distributed feature representation can be adapted to any other clinical natural language processing research.
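One plausible way to turn continuous embeddings into the kind of discrete features a CRF tagger consumes is to threshold each dimension against its corpus statistics. The sketch below is a guess at what "binarized" features could look like, clearly not the paper's exact recipe, and the 0.5-standard-deviation thresholds are assumptions.

```python
import numpy as np

def binarize_embeddings(emb_matrix):
    """Discretize embeddings for CRF-style features: per dimension, mark whether
    a word's value is well above or well below the corpus mean. Each word then
    yields features such as "d17=HIGH" / "d17=LOW" instead of raw floats."""
    mu = emb_matrix.mean(axis=0)
    sd = emb_matrix.std(axis=0)
    hi = emb_matrix > (mu + 0.5 * sd)
    lo = emb_matrix < (mu - 0.5 * sd)
    return hi, lo
```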

Proceedings ArticleDOI
01 Jul 2015
TL;DR: Evaluation using the clinical abbreviation datasets from both the Vanderbilt University and the University of Minnesota showed that neural word embedding features improved the performance of the SVM-based clinical abbreviation disambiguation system.
Abstract: This study examined the use of neural word embeddings for clinical abbreviation disambiguation, a special case of word sense disambiguation (WSD). We investigated three different methods for deriving word embeddings from a large unlabeled clinical corpus: one existing method called Surrounding based embedding feature (SBE), and two newly developed methods: Left-Right surrounding based embedding feature (LR_SBE) and MAX surrounding based embedding feature (MAX_SBE). We then added these word embeddings as additional features to a Support Vector Machines (SVM) based WSD system. Evaluation using the clinical abbreviation datasets from both the Vanderbilt University and the University of Minnesota showed that neural word embedding features improved the performance of the SVM-based clinical abbreviation disambiguation system. More specifically, the new MAX_SBE method outperformed the other two methods and achieved the state-of-the-art performance on both clinical abbreviation datasets.
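Going by the feature names above, a surrounding-based embedding feature aggregates the vectors of the words around the ambiguous abbreviation; the MAX variant presumably uses an element-wise maximum. The sketch below is that reading, stated as an assumption rather than the paper's exact definition.

```python
import numpy as np

def max_sbe_features(left_words, right_words, emb, dim=100):
    """MAX surrounding-based embedding feature (one plausible interpretation):
    element-wise maximum over the embeddings of the words surrounding the
    ambiguous abbreviation, used as an additional SVM feature vector."""
    vecs = [emb[w] for w in left_words + right_words if w in emb]
    if not vecs:
        return np.zeros(dim)
    return np.max(np.stack(vecs), axis=0)
```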

Journal ArticleDOI
TL;DR: This paper presents an oversampling method based on word embedding compositionality which produces meaningful balanced training data and achieves improved results for both sentiment and emotion classification.
Abstract: Text classification often faces the problem of imbalanced training data. This is true in sentiment analysis and particularly prominent in emotion classification where multiple emotion categories are very likely to produce naturally skewed training data. Different sampling methods have been proposed to improve classification performance by reducing the imbalance ratio between training classes. However, data sparseness and the small disjunct problem remain obstacles in generating new samples for minority classes when the data are skewed and limited. Methods to produce meaningful samples for smaller classes rather than simple duplication are essential in overcoming this problem. In this paper, we present an oversampling method based on word embedding compositionality which produces meaningful balanced training data. We first use a large corpus to train a continuous skip-gram model to form a word embedding model maintaining the syntactic and semantic integrity of the word features. Then, a compositional algorithm based on recursive neural tensor networks is used to construct sentence vectors based on the word embedding model. Finally, we use the SMOTE algorithm as an oversampling method to generate samples for the minority classes and produce a fully balanced training set. Evaluation results on two quite different tasks show that the feature composition method and the oversampling method are both important in obtaining improved classification results. Our method effectively addresses the data imbalance issue and consequently achieves improved results for both sentiment and emotion classification.
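The oversampling stage of such a pipeline can be sketched with imbalanced-learn's SMOTE applied to composed sentence vectors. The sentence vectors here are random stand-ins (the paper composes them with a recursive neural tensor network), and the classifier choice is an assumption.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.svm import LinearSVC

# X: sentence vectors composed from word embeddings; y: skewed emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = np.array([0] * 100 + [1] * 20)            # imbalanced toy labels

# SMOTE interpolates new minority-class sentence vectors to balance the classes.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
clf = LinearSVC().fit(X_bal, y_bal)           # train on the fully balanced set
```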

Proceedings Article
25 Jan 2015
TL;DR: This paper proposes a representation learning approach to automatically learn useful features for aspect category detection and achieves the state-of-the-art performance and outperforms the best participating team as well as a few strong baselines.
Abstract: User-generated reviews are valuable resources for decision making. Identifying the aspect categories discussed in a given review sentence (e.g., "food" and "service" in restaurant reviews) is an important task of sentiment analysis and opinion mining. Given a predefined aspect category set, most previous research leverages handcrafted features and a classification algorithm to accomplish the task. The crucial step to achieve better performance is feature engineering, which consumes much human effort and may be unstable when the product domain changes. In this paper, we propose a representation learning approach to automatically learn useful features for aspect category detection. Specifically, a semi-supervised word embedding algorithm is first proposed to obtain continuous word representations on a large set of reviews with noisy labels. Afterwards, we propose to generate deeper and hybrid features through neural networks stacked on the word vectors. A logistic regression classifier is finally trained with the hybrid features to predict the aspect category. The experiments are carried out on a benchmark dataset released by SemEval-2014. Our approach achieves the state-of-the-art performance and outperforms the best participating team as well as a few strong baselines.
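Stripped of the stacked neural layers, the end of this pipeline is a logistic regression over embedding-derived sentence features. The sketch below uses plain mean word vectors as the features and random toy embeddings; the paper's semi-supervised embeddings and deeper hybrid features are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens, emb, dim=100):
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy embeddings and labels: 0 = "food", 1 = "service" (illustrative only).
emb = {w: np.random.randn(100) for w in ["great", "pasta", "slow", "waiter"]}
sentences = [["great", "pasta"], ["slow", "waiter"]]
X = np.array([sentence_vector(s, emb) for s in sentences])
y = np.array([0, 1])

clf = LogisticRegression().fit(X, y)   # aspect category classifier over sentence vectors
```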

Proceedings ArticleDOI
Arne Köhn1
01 Sep 2015
TL;DR: It is shown that all embedding approaches behave similarly in this task, with dependency-based embeddings performing best; this effect is even more pronounced when generating low-dimensional embeddings.
Abstract: In the last two years, there has been a surge of word embedding algorithms and research on them. However, evaluation has mostly been carried out on a narrow set of tasks, mainly word similarity/relatedness and word relation similarity and on a single language, namely English. We propose an approach to evaluate embeddings on a variety of languages that also yields insights into the structure of the embedding space by investigating how well word embeddings cluster along different syntactic features. We show that all embedding approaches behave similarly in this task, with dependency-based embeddings performing best. This effect is even more pronounced when generating low dimensional embeddings.