Proceedings Article

Deep Learning for Chinese Word Segmentation and POS Tagging

01 Oct 2013, pp. 647-657
TL;DR: This study explores the feasibility of performing Chinese word segmentation and POS tagging by deep learning, and describes a perceptron-style algorithm for training the neural networks, as an alternative to the maximum-likelihood method, to speed up training and make the learning algorithm easier to implement.
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover features relevant to the tasks. We leverage large-scale unlabeled data to improve the internal representations of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to the maximum-likelihood method, to speed up the training process and make the learning algorithm easier to implement.
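
The abstract's perceptron-style alternative can be made concrete with a short sketch. Below is a minimal NumPy illustration, assuming a tagger network with a `forward` method that returns per-character tag scores and a `backward` method that takes a gradient on those scores; this interface, the transition matrix, and the learning rate are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def viterbi(emit, trans):
    """Best tag path given per-character tag scores emit (T x K)
    and tag-transition scores trans (K x K)."""
    T, K = emit.shape
    back = np.zeros((T, K), dtype=int)
    score = emit[0].copy()
    for t in range(1, T):
        cand = score[:, None] + trans + emit[t]  # cand[i, j]: prev tag i -> tag j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def perceptron_step(net, chars, gold, trans, lr=0.01):
    """One perceptron-style update: decode with the current parameters,
    and only when the decoded path differs from the gold path, take a
    step that raises the gold path's score and lowers the predicted
    path's score -- no partition function is ever computed."""
    emit = net.forward(chars)            # T x K tag scores (assumed API)
    pred = viterbi(emit, trans)
    if pred != list(gold):
        grad = np.zeros_like(emit)       # gradient of s(gold) - s(pred) w.r.t. emit
        grad[np.arange(len(gold)), gold] += 1.0
        grad[np.arange(len(pred)), pred] -= 1.0
        net.backward(grad, lr)           # backpropagate through the network (assumed API)
        for t in range(1, len(gold)):    # additive updates for transition scores
            trans[gold[t - 1], gold[t]] += lr
            trans[pred[t - 1], pred[t]] -= lr
```

Compared with maximizing the sentence-level log-likelihood, this update only ever touches two tag paths per sentence, which is what makes it fast and simple to implement.
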
Citations
Journal ArticleDOI
TL;DR: This paper reviews significant deep learning-related models and methods that have been employed for numerous NLP tasks and provides a walk-through of their evolution.
Abstract: Deep learning methods employ multiple processing layers to learn hierarchical representations of data, and have produced state-of-the-art results in many domains. Recently, a variety of model designs and methods have blossomed in the context of natural language processing (NLP). In this paper, we review significant deep learning-related models and methods that have been employed for numerous NLP tasks and provide a walk-through of their evolution. We also summarize, compare and contrast the various models and put forward a detailed understanding of the past, present and future of deep learning in NLP.

2,466 citations

Proceedings Article
01 Aug 2014
TL;DR: A new deep convolutional neural network is proposed that exploits from character- to sentence-level information to perform sentiment analysis of short texts and achieves state-of-the-art results for single sentence sentiment prediction.
Abstract: Sentiment analysis of short texts such as single sentences and Twitter messages is challenging because of the limited contextual information that they normally contain. Effectively solving this task requires strategies that combine the small text content with prior knowledge and use more than just bag-of-words. In this work we propose a new deep convolutional neural network that exploits from character- to sentence-level information to perform sentiment analysis of short texts. We apply our approach to two corpora from two different domains: the Stanford Sentiment Treebank (SSTb), which contains sentences from movie reviews; and the Stanford Twitter Sentiment corpus (STS), which contains Twitter messages. For the SSTb corpus, our approach achieves state-of-the-art results for single sentence sentiment prediction in both binary positive/negative classification, with 85.7% accuracy, and fine-grained classification, with 48.3% accuracy. For the STS corpus, our approach achieves a sentiment prediction accuracy of 86.4%.
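
A rough PyTorch sketch of the character- to sentence-level idea follows; the class name `CharSCNN`, the layer sizes, kernel widths, and the single-sentence batch shape are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CharSCNN(nn.Module):
    """Character-level convolution builds a morphology-aware word vector;
    a sentence-level convolution over the joined word representations
    then feeds a small classifier."""
    def __init__(self, n_chars, n_words, n_classes,
                 d_chr=10, d_wrd=50, d_chr_conv=30, d_sent=100):
        super().__init__()
        self.chr_emb = nn.Embedding(n_chars, d_chr)
        self.wrd_emb = nn.Embedding(n_words, d_wrd)
        self.chr_conv = nn.Conv1d(d_chr, d_chr_conv, kernel_size=3, padding=1)
        self.sent_conv = nn.Conv1d(d_wrd + d_chr_conv, d_sent,
                                   kernel_size=3, padding=1)
        self.out = nn.Linear(d_sent, n_classes)

    def forward(self, word_ids, char_ids):
        # word_ids: (n_words,); char_ids: (n_words, n_chars_per_word)
        c = self.chr_emb(char_ids).transpose(1, 2)  # words x d_chr x chars
        c = self.chr_conv(c).max(dim=2).values      # max over characters
        w = torch.cat([self.wrd_emb(word_ids), c], dim=1)
        s = self.sent_conv(w.t().unsqueeze(0))      # 1 x d_sent x words
        s = torch.tanh(s.max(dim=2).values)         # max over words
        return self.out(s)                          # 1 x n_classes
```

The two max-pooling steps are what let a fixed-size classifier handle words and sentences of arbitrary length.
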

1,170 citations


Cites background from "Deep Learning for Chinese Word Segm..."

  • ...Recent work has shown that large improvements in terms of model accuracy can be obtained by performing unsupervised pre-training of word embeddings (Collobert et al., 2011; Luong et al., 2013; Zheng et al., 2013; Socher et al., 2013a)....

Proceedings ArticleDOI
01 Jun 2014
TL;DR: Three neural networks are developed to effectively incorporate the supervision from the sentiment polarity of text (e.g. sentences or tweets) in their loss functions, and the performance is further improved by concatenating SSWE with an existing feature set.
Abstract: In this paper, we present a method that learns word embeddings for Twitter sentiment classification. Most existing algorithms for learning continuous word representations typically only model the syntactic context of words but ignore the sentiment of text. This is problematic for sentiment analysis as they usually map words with similar syntactic context but opposite sentiment polarity, such as good and bad, to neighboring word vectors. We address this issue by learning sentiment-specific word embedding (SSWE), which encodes sentiment information in the continuous representation of words. Specifically, we develop three neural networks to effectively incorporate the supervision from the sentiment polarity of text (e.g. sentences or tweets) in their loss functions. To obtain large-scale training corpora, we learn the sentiment-specific word embedding from massive distant-supervised tweets collected by positive and negative emoticons. Experiments on applying SSWE to a benchmark Twitter sentiment classification dataset in SemEval 2013 show that (1) the SSWE feature performs comparably with hand-crafted features in the top-performing system; (2) the performance is further improved by concatenating SSWE with an existing feature set.
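
As a rough illustration of how sentiment supervision can enter the loss, here is a sketch in the spirit of the unified variant the abstract describes; the two-score output convention, the `polarity` encoding, and the `alpha` weighting are assumptions for illustration, not the paper's exact loss.

```python
import torch

def sswe_u_loss(f_ngram, f_corrupt, polarity, alpha=0.5):
    """f_ngram / f_corrupt: 2-element score vectors produced by the
    embedding network for a real ngram and a corrupted one (one word
    replaced at random) -- index 0 is read as a syntactic
    (language-model) score, index 1 as a sentiment score.
    polarity is +1 for a positive tweet, -1 for a negative one."""
    # ranking loss on context: the real ngram should outscore the corrupted one
    loss_c = torch.clamp(1 - f_ngram[0] + f_corrupt[0], min=0)
    # ranking loss on sentiment, signed by the distant-supervision label
    loss_s = torch.clamp(1 - polarity * (f_ngram[1] - f_corrupt[1]), min=0)
    # alpha trades off the two objectives
    return alpha * loss_c + (1 - alpha) * loss_s
```

Because the sentiment term is signed by the emoticon-derived label, words like good and bad receive opposing gradients even when their syntactic contexts match.
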

1,157 citations


Cites background from "Deep Learning for Chinese Word Segm..."

  • ...It is meaningful for some tasks such as pos-tagging (Zheng et al., 2013) as the two words have similar usages and grammatical roles, but it becomes a disaster for sentiment analysis as they have the opposite sentiment polarity....

Proceedings Article
21 Jun 2014
TL;DR: A deep neural network is proposed that learns character-level representations of words and associates them with usual word representations to perform POS tagging, producing state-of-the-art POS taggers for two languages.
Abstract: Distributed word representations have recently been proven to be an invaluable resource for NLP. These representations are normally learned using neural networks and capture syntactic and semantic information about words. Information about word morphology and shape is normally ignored when learning word representations. However, for tasks like part-of-speech tagging, intra-word information is extremely useful, especially when dealing with morphologically rich languages. In this paper, we propose a deep neural network that learns character-level representations of words and associates them with usual word representations to perform POS tagging. Using the proposed approach, while avoiding the use of any handcrafted features, we produce state-of-the-art POS taggers for two languages: English, with 97.32% accuracy on the Penn Treebank WSJ corpus; and Portuguese, with 97.47% accuracy on the Mac-Morpho corpus, where the latter represents an error reduction of 12.2% over the best previously known result.
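
A compact PyTorch sketch of the core idea, joining a character-level convolutional embedding with a lookup word embedding for window-based tagging; the class name, dimensions, and window handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    """Scores the POS tags of the centre word of a window, using
    [word embedding ; max-pooled character convolution] per word."""
    def __init__(self, n_words, n_chars, n_tags,
                 d_wrd=100, d_chr=10, d_cnv=50, window=5, hidden=300):
        super().__init__()
        self.wrd = nn.Embedding(n_words, d_wrd)
        self.chr = nn.Embedding(n_chars, d_chr)
        self.conv = nn.Conv1d(d_chr, d_cnv, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(
            nn.Linear(window * (d_wrd + d_cnv), hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_tags),
        )

    def forward(self, word_ids, char_ids):
        # word_ids: (window,); char_ids: (window, max_word_len)
        c = self.chr(char_ids).transpose(1, 2)      # window x d_chr x len
        c = self.conv(c).max(dim=2).values          # window x d_cnv
        x = torch.cat([self.wrd(word_ids), c], dim=1)
        return self.mlp(x.flatten())                # tag scores for centre word
```

The max pooling over character positions is what makes the representation robust to word length while still capturing prefixes and suffixes.
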

627 citations


Cites background from "Deep Learning for Chinese Word Segm..."

  • ...Recent work has shown that large improvements in terms of model accuracy can be obtained by performing unsupervised pre-training of word embeddings (Collobert et al., 2011; Luong et al., 2013; Zheng et al., 2013; Socher et al., 2013)....

References
Proceedings Article
28 Jun 2001
TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Abstract: We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
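
Since the quoted passage below notes that a CRF maximizes the same log-likelihood as the paper's network (its Eq. 10), a small sketch of that quantity may help. The following NumPy/SciPy function computes the linear-chain log-likelihood via the forward algorithm; the dense score matrices are an illustrative stand-in for the CRF's linear feature functions.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_likelihood(emit, trans, tags):
    """Log-likelihood of one tag sequence under a linear-chain CRF.
    emit (T x K) holds per-position label scores and trans (K x K)
    transition scores; in a real CRF both would be linear functions
    of hand-crafted features."""
    T = emit.shape[0]
    tags = np.asarray(tags)
    # unnormalized score of the observed path
    path = emit[np.arange(T), tags].sum() + trans[tags[:-1], tags[1:]].sum()
    # log partition function over all paths, via the forward algorithm
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = emit[t] + logsumexp(alpha[:, None] + trans, axis=0)
    return path - logsumexp(alpha)
```

Zheng et al. keep exactly this form but replace the linear scores with a neural network's outputs.
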

13,190 citations


"Deep Learning for Chinese Word Segm..." refers methods in this paper

  • ...In fact, a CRF maximizes the same log-likelihood (Lafferty et al., 2001) by using a linear model instead of a nonlinear neural network....

Journal Article
TL;DR: A unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling is proposed.
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

6,734 citations


"Deep Learning for Chinese Word Segm..." refers background or methods or result in this paper

  • ...In order to make learning algorithms less dependent on the feature engineering, we chose to use a variant of the neural network architecture first proposed by (Bengio et al., 2003) for probabilistic language model, and reintroduced later by (Collobert et al., 2011) for multiple NLP tasks....

  • ...Taking the log, the conditional probability of the true path t is given by: log p(t|c, θ) = s(c, t, θ) − log Σ_{t′} exp{s(c, t′, θ)}  (10)....

  • ...We did not use the stochastic gradient ascent algorithm (Bottou, 1991) to train the network as (Collobert et al., 2011)....

  • ...Following (Bengio et al., 2003; Collobert et al., 2011), we want semantically and syntactically similar characters to be close in the embedding space....

Proceedings ArticleDOI
Michael Collins1
06 Jul 2002
TL;DR: Experimental results on part-of-speech tagging and base noun phrase chunking are given, in both cases showing improvements over results for a maximum-entropy tagger.
Abstract: We describe new algorithms for training tagging models, as an alternative to maximum-entropy models or conditional random fields (CRFs). The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithms through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger.
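
A self-contained toy version of the recipe this abstract describes (Viterbi decoding plus simple additive updates) is sketched below. It assumes one feature id per token and unit-sized updates, a drastic simplification of the paper's feature templates and averaged parameters; the `viterbi` helper is the standard dynamic program, repeated so the sketch runs on its own.

```python
import numpy as np

def viterbi(emit, trans):
    """Best path under per-position scores emit (T x K) and
    transition scores trans (K x K)."""
    T, K = emit.shape
    back = np.zeros((T, K), dtype=int)
    score = emit[0].copy()
    for t in range(1, T):
        cand = score[:, None] + trans + emit[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def train_structured_perceptron(data, n_feats, n_tags, epochs=5):
    """data: iterable of (feat_ids, gold_tags), one feature id per token."""
    W = np.zeros((n_feats, n_tags))      # (feature, tag) weights
    trans = np.zeros((n_tags, n_tags))   # tag-transition weights
    for _ in range(epochs):
        for feat_ids, gold in data:
            pred = viterbi(W[feat_ids], trans)
            if pred != list(gold):
                for t, f in enumerate(feat_ids):
                    W[f, gold[t]] += 1.0   # promote features on the gold path
                    W[f, pred[t]] -= 1.0   # demote features on the predicted path
                for t in range(1, len(gold)):
                    trans[gold[t - 1], gold[t]] += 1.0
                    trans[pred[t - 1], pred[t]] -= 1.0
    return W, trans
```

Zheng et al.'s contribution quoted below is essentially this update carried over from a linear model to a neural scorer, with the additive step replaced by a gradient step.
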

2,221 citations


"Deep Learning for Chinese Word Segm..." refers background or methods in this paper

  • ...As an alternative to maximum-likelihood method, we propose the following training algorithm inspired by the work of (Collins, 2002)....

  • ...Intuitively it can be achieved by combining the theorems of convergence for the perceptron applied to the tagging problem from (Collins, 2002) with the convergence results of the backpropagation algorithm from (Rumelhart et al....

  • ...Note that the perceptron algorithm of (Collins, 2002) was designed for discriminatively training an...

Proceedings Article
28 Jun 2011
TL;DR: A max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences is introduced.
Abstract: Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure helps us to not only identify the units that an image or sentence contains but also how they interact to form a whole. We introduce a max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. For segmentation and annotation our algorithm obtains a new level of state-of-the-art performance on the Stanford background dataset (78.1%). The features from the image parse tree outperform Gist descriptors for scene classification by 4%.
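
A toy sketch of the greedy core of this idea: repeatedly merge the adjacent pair whose composed vector scores highest. The max-margin training, beam search, and category labels of the paper are omitted; `W`, `Ws`, and the tanh composition are illustrative assumptions.

```python
import numpy as np

def greedy_rnn_parse(leaves, W, Ws):
    """leaves: list of d-dim word vectors; W: d x 2d composition
    matrix; Ws: d-dim scoring vector. Builds a binary tree by always
    merging the best-scoring adjacent pair and returns the root vector."""
    nodes = list(leaves)
    while len(nodes) > 1:
        merged = [np.tanh(W @ np.concatenate([a, b]))   # compose each adjacent pair
                  for a, b in zip(nodes, nodes[1:])]
        scores = [float(Ws @ m) for m in merged]        # plausibility of each merge
        i = int(np.argmax(scores))
        nodes[i:i + 2] = [merged[i]]                    # replace the pair by its parent
    return nodes[0]
```

The max-margin objective then trains W and Ws so that gold trees outscore all alternative bracketings by a structured margin.
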

1,409 citations


"Deep Learning for Chinese Word Segm..." refers background in this paper

  • ...Several works have investigated how to use deep learning for NLP applications (Bengio et al., 2003; Collobert et al., 2011; Collobert, 2011; Socher et al., 2011)....
