Weakly Supervised POS Tagging without Disambiguation

warwick.ac.uk/lib-publications
Manuscript version: Author’s Accepted Manuscript
The version presented in WRAP is the author’s accepted manuscript and may differ from the
published version or Version of Record.
Persistent WRAP URL:
http://wrap.warwick.ac.uk/108974
How to cite:
Please refer to published version for the most recent bibliographic citation information.
If a published version is known, the repository item page linked to above will contain
details on accessing it.
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the
University of Warwick available open access under the following conditions.
Copyright © and all moral rights to the version of the paper presented here belong to the
individual author(s) and/or other copyright owners. To the extent reasonable and
practicable the material made available in WRAP has been checked for eligibility before
being made available.
Copies of full items can be used for personal research or study, educational, or not-for-profit
purposes without prior permission or charge. Provided that the authors, title and full
bibliographic details are credited, a hyperlink and/or URL is given for the original metadata
page and the content is not changed in any way.
Publisher’s statement:
Please refer to the repository item page, publisher’s statement section, for further
information.
For more information, please contact the WRAP Team at: wrap@warwick.ac.uk.

A
Weakly Supervised Part-of-Speech (POS) Tagging without
Disambiguation
Deyu Zhou, Zhikai Zhang, Min-Ling Zhang, School of Computer Science and Engineering,
Southeast University, China
Yulan He, School of Engineering and Applied Science, Aston University, UK
Weakly supervised part-of-speech (POS) tagging aims to predict the POS tag for a given word in context
by making use of partially annotated data instead of fully tagged corpora. As POS tagging is crucial for
downstream natural language processing (NLP) tasks such as named entity recognition and information
extraction, weakly supervised POS tagging is specifically attractive in languages where tagged corpora are
mostly unavailable. In this paper, we propose a novel framework for weakly supervised POS tagging where
no annotated corpora are available and the only supervision information comes from a dictionary of words,
each associated with a list of possible POS tags. Our approach is built upon error-correcting
output codes (ECOC), in which each POS tag is assigned a unique L-bit binary vector. For a total of O
POS tags, we therefore have a coding matrix M of size O × L with values in {+1, -1}. Each column of the coding
matrix M specifies a dichotomy over the tag space to learn a binary classifier. For each binary classifier, its
training data is generated in the following way: a word will be considered as a positive training example only
if the whole set of its possible tags falls into the positive dichotomy specified by the column coding, and
similarly for negative training examples. Given a word, its POS tag is predicted by concatenating the predictive
outputs of the L binary classifiers and choosing the tag whose codeword is closest under some distance
measure. By incorporating the ECOC strategy, the set of all possible tags for each word is treated as an entirety
without the need of performing disambiguation. Moreover, instead of manual feature engineering employed
in most previous POS tagging approaches, features for training and testing in the proposed framework are
automatically generated using neural language modeling. The proposed framework has been evaluated on
three corpora for English, Italian and Malagasy POS tagging, achieving accuracies of 93.21%, 90.9% and
84.5% respectively, which shows a significant improvement compared to the state-of-the-art approaches.
1. INTRODUCTION
Part-of-speech (POS) tagging is the task of assigning a POS tag to a word in text based on its
context. It is crucial for downstream natural language processing (NLP) tasks such as
named entity recognition [Finkel et al. 2005], syntactic parsing [Cer et al. 2010] and
information extraction [Zhou et al. 2015]. Methods for POS tagging in general fall into
two categories: rule-based and machine-learning-based. Rule-based approaches rely on
manually designed rules while machine-learning approaches require a large amount
of annotated data for training.
In low-resource languages such as Malagasy, annotated data are mostly unavailable.
It is thus attractive to explore weakly-supervised POS tagging approaches where the
supervision information comes from other sources rather than the annotated data. As
the ground-truth POS tag of a word in a sentence is not directly accessible, weakly-
supervised approaches are more difficult to train compared to supervised approaches.
One common way to address the problem of lack of annotated data is to make use of
a dictionary of words with each one associated with a set of possible POS tags. The
actual POS tag of a word in a sentence is treated as a latent variable which is identified
via an iterative refinement procedure. Thus, a typical setup for weakly-supervised
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned
by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or repub-
lish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
© YYYY ACM. 2375-4699/YYYY/01-ARTA $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000
ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. V, No. N, Article A, Publication date: January YYYY.

Table I. An example of input and output of weakly supervised POS tagging. (PRP denotes personal pronoun, DT
determiner, JJ adjective, VB verb base form, CD cardinal number, and so on.)

Dictionary (each word is associated with a list of possible POS tags):
  you: PRP; these: DT; events: NNS; took: VBD; 35: CD; years: NNS; ago: IN RB; to: IN JJ TO;
  place: NN VB VBP; recognize: VB VBP; that: DT IN NN RB VBP WDT; have: JJ VBD VBN VBP; ...

Input:  You have to recognize that these events took place 35 years ago .
Output: You/PRP have/VBP to/TO recognize/VB that/IN these/DT events/NNS took/VBD place/NN 35/CD years/NNS ago/IN ./.
POS tagging is that given a dictionary of words with their possible POS tags, we aim
to generate a correct POS tag sequence for any unannotated input sentence. This is
illustrated in Table I.
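For illustration, the tag dictionary of Table I can be represented as a plain mapping from words to candidate tag sets. This is a hypothetical sketch of the setup, not the authors' actual data format:

```python
# Hypothetical representation of the Table I tag dictionary:
# each word maps to the set of its possible POS tags.
tag_dictionary = {
    "you": {"PRP"}, "these": {"DT"}, "events": {"NNS"},
    "took": {"VBD"}, "35": {"CD"}, "years": {"NNS"},
    "ago": {"IN", "RB"}, "to": {"IN", "JJ", "TO"},
    "place": {"NN", "VB", "VBP"}, "recognize": {"VB", "VBP"},
    "that": {"DT", "IN", "NN", "RB", "VBP", "WDT"},
    "have": {"JJ", "VBD", "VBN", "VBP"},
}

# Ambiguous words (more than one candidate tag) are exactly the ones a
# weakly supervised tagger must resolve in context.
ambiguous = sorted(w for w, tags in tag_dictionary.items() if len(tags) > 1)
print(ambiguous)  # ['ago', 'have', 'place', 'recognize', 'that', 'to']
```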
Previous weakly-supervised POS tagging approaches are largely based on expectation
maximization (EM) parameter estimation using hidden Markov models (HMMs)
or conditional random fields (CRFs). For example, Merialdo [1994a] used maximum
likelihood estimation (MLE) to train a trigram HMM. Banko and Moore [2004] modi-
fied the basic HMM structure to incorporate the context on both sides of the word to
be tagged. Smith and Eisner [2005] proposed to train CRFs using contrastive estima-
tion for POS tagging. It can be observed that most of the aforementioned approaches
essentially perform disambiguation on a set of possible candidate POS tags for a word
in a sentence. Although disambiguation is an intuitive and reasonable strategy
for training weakly-supervised POS taggers, its effectiveness is largely affected by
errors introduced in earlier training iterations. That is, false positive
tags identified in early iterations are propagated to later iterations, which
makes it difficult for the model to identify the correct POS tag.
In this paper, we propose a novel framework for weakly supervised POS tagging
without the need of disambiguating among a set of possible POS tags, built upon error-
correcting output codes (ECOC) [Dietterich and Bakiri 1995], one of the multi-class
learning techniques. A unique L-bit vector is assigned to each POS tag. For a total of
O POS tags, a coding matrix M of size O × L can be constructed where each cell of
M has a value in {+1, -1}. Each column of M specifies a dichotomy over the tag space
to learn a binary classifier. For example, given a set of POS tags {VB, DT, VBP, NN},
the column [-1, +1, -1, +1] of M separates the tag space into the negative dichotomy {VB,
VBP} and the positive dichotomy {DT, NN}. The key adaptation lies in how the binary
classifiers corresponding to the ECOC coding matrix M are built. For each column
of the binary coding matrix, a binary classifier is built based on training examples
derived from the dictionary of the words with their possible POS tags. Specifically, the
word will be regarded as a positive or negative training example only if all its possible
tags fall into the positive or negative dichotomy specified by the column coding. In
this way, the set of possible tags is treated as an entirety without resorting to any
disambiguation procedure. Moreover, the choice of features is a critical success factor
for POS tagging. Most of the state-of-the-art POS tagging systems extract features
based on the lexical context of the words to be tagged and their letter structures (e.g.,
presence of suffixes, capitalization and hyphenation). Obviously, such feature design
needs domain knowledge and expertise. In this paper, features employed for weakly
supervised POS tagging are generated based on neural language modelling without
manual processing. The proposed approach has been evaluated on three corpora for
English, Italian and Malagasy POS tagging, and shows a significant improvement in
accuracy compared to the state-of-the-art approaches.
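To make the training-example generation concrete, the following sketch shows how one column of the coding matrix turns a word-to-tag-set dictionary into binary training data. The function and variable names are hypothetical; the actual feature representation and classifiers are described later in the paper:

```python
# Sketch of ECOC training-example generation (illustrative only).
# For one column of the coding matrix, a word becomes a training
# example only if ALL of its possible tags fall on the same side
# of the dichotomy; otherwise it is skipped (no disambiguation).

def column_examples(tag_dictionary, column):
    """column maps each POS tag to +1 or -1 (one column of M)."""
    examples = []
    for word, tags in tag_dictionary.items():
        sides = {column[t] for t in tags}
        if sides == {+1}:
            examples.append((word, +1))   # all tags in positive dichotomy
        elif sides == {-1}:
            examples.append((word, -1))   # all tags in negative dichotomy
        # mixed sides: the word is not used to train this classifier
    return examples

column = {"VB": -1, "DT": +1, "VBP": -1, "NN": +1}
tag_dictionary = {"these": {"DT"}, "recognize": {"VB", "VBP"},
                  "place": {"NN", "VB", "VBP"}}
print(column_examples(tag_dictionary, column))
# [('these', 1), ('recognize', -1)]  -- 'place' straddles both sides, so it is skipped
```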
The main contributions of the paper are summarized below:
We proposed a novel framework based on constrained ECOC for weakly supervised
POS tagging. In this way, the set of a word’s possible tags is treated as an entirety
without resorting to any disambiguation procedure. It thus avoids the problem of
iterative training based on disambiguation, which is commonly used for existing ap-
proaches to weakly supervised POS tagging.
We developed a POS tagging system without human intervention. Features employed
for POS tagging are generated automatically based on neural language modelling.
We evaluated the proposed framework on three corpora for English, Italian and
Malagasy POS tagging, and observed a significant improvement in accuracy com-
pared to the state-of-the-art approaches.
2. RELATED WORK
Supervised POS tagging has achieved very good results with per-token accuracies over
97% on the English Penn Treebank. However, there are more than 50 low-density
languages where both tagged corpora and language speakers are mostly unavail-
able [Christodoulopoulos et al. 2010]. Some of them are even dead. Therefore, POS
tagging without using any fully annotated corpora has attracted increasing interest.
Generally, based on whether supervised information is used and where it
comes from, there are three directions for handling the task: POS induc-
tion, where no prior knowledge is used; POS disambiguation, where a dictionary of
words and their possible tags is assumed to be available; and prototype-driven ap-
proaches where a small set of prototypes for each POS tag is provided instead of a
dictionary.
For fully unsupervised POS tagging or POS induction, many approaches cast the
identification of POS tags as a knowledge-free clustering problem. Brown et al. [1992]
proposed an n-gram model based on classes of words through optimizing the corpus
probability p(w_1|c_1) ∏_{i=2}^{n} p(w_i|c_i) p(c_i|c_{i-1}) using greedy hierarchical clustering.
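As an illustration (not the authors' implementation), the class-based bigram factorization can be computed directly once each word is assigned to a class; the toy probabilities below are made up for demonstration:

```python
# Illustrative class-based bigram model:
# p(corpus) = p(w1|c1) * prod_{i>=2} p(wi|ci) * p(ci|c_{i-1}).

def sentence_prob(words, word_class, p_word_given_class, p_class_given_class):
    classes = [word_class[w] for w in words]
    prob = p_word_given_class[(words[0], classes[0])]
    for i in range(1, len(words)):
        prob *= p_word_given_class[(words[i], classes[i])]
        prob *= p_class_given_class[(classes[i], classes[i - 1])]
    return prob

word_class = {"the": "DET", "dog": "N"}
p_word_given_class = {("the", "DET"): 1.0, ("dog", "N"): 0.5}
p_class_given_class = {("N", "DET"): 0.8}
print(sentence_prob(["the", "dog"], word_class,
                    p_word_given_class, p_class_given_class))  # 0.4
```

Brown-style clustering greedily merges classes to maximize exactly this corpus probability, which is why morphologically or syntactically similar words end up in the same class.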
Following this way, Clark [2003] incorporated morphological information into cluster-
ing so that morphologically similar words are clustered together. Based on a standard
trigram HMM, Goldwater and Griffiths [2007] proposed a fully Bayesian approach
which allowed the use of priors. A collapsed Gibbs sampler was used to infer the
hidden POS tags. Johnson [2007] also experimented with variational Bayesian EM
apart from Gibbs sampling and his results showed that variational Bayesian EM
converges faster than Gibbs sampling for POS tagging. Using the structure of a stan-
dard HMM, Berg-Kirkpatrick et al. [2010] turned each component multinomial of the
HMM into a miniature logistic regression. By doing so, features can be easily added
to standard generative models for unsupervised learning, without requiring complex
new training methods. Different from the previous approaches, a graph clustering ap-
proach based on contextual similarity was proposed in [Biemann 2006] so that the
number of POS tags (clusters) could be induced automatically. Based on the theory of
prototypes, Abend et al. [2010] first clustered the most frequent words based on some
morphological representations. They then defined landmark clusters which served as
the cores of the induced POS categories and finally mapped the rest of the words to these
categories. Kairit et al. [2014] presented an approach for inducing POS classes by combining
morphological and distributional information in a non-parametric Bayesian generative
model based on the distance-dependent Chinese restaurant process. As pointed
out in [Christodoulopoulos et al. 2010], due to a lack of standard and informative
evaluation techniques, it is difficult to compare the effectiveness of different clustering
methods.
For weakly-supervised POS tagging, many researchers focused on POS disambigua-
tion using tag dictionaries. Brill [1992] described a rule-based POS tagger, which cap-
tured the learned knowledge into a set of simple deterministic rules instead of a large
table of statistics. He later proposed an unsupervised learning algorithm for automat-
ically training a rule-based POS tagger [Brill 1995]. Considering POS tags as latent
variables, there have been quite a few approaches relying on EM parameter estimation
using HMMs or CRFs. For example, given a sentence W = {w_1, w_2, ..., w_n} and
a sequence of tags T = {t_1, t_2, ..., t_n} of the same length, a trigram model defined as
p(W, T) = ∏_{i=1}^{n} p(w_i|t_i) p(t_i|t_{i-2}, t_{i-1}) was proposed in [Merialdo 1994b]. Following this
way, some improvements were achieved by modifying the statistical models or employ-
ing better parameter estimation techniques. For example, Banko and Moore [2004]
modified the basic HMM structure to incorporate the context on both sides of the
word to be tagged. Smith and Eisner [2005] used contrastive estimation on CRFs for
POS tagging. Toutanova et al. [2007] proposed a Bayesian model that extended latent
Dirichlet allocation (LDA) and incorporated the intuition that words’ distributions over
tags are sparse. Naseem et al. [2009] proposed multilingual learning by combining
cues from multiple languages in two ways: directly merging tag structures for a pair of
languages into a single sequence, and incorporating multilingual context using latent
variables. Markov Chain Monte Carlo sampling techniques were used for estimating
the parameters of hierarchical Bayesian models. Ravi and Knight [2009] proposed using
Integer Programming (IP) to search for the smallest bigram POS tag set and used this
set to constrain the training of EM. Their approach achieved an accuracy of 91.6% on
the 24k English Penn Treebank test set, but could not handle larger datasets. To address
the deficiency of IP, Ravi et al. [2010] proposed a two-stage greedy minimization
approach that ran much faster while maintaining the tagging performance. Yatbaz
and Yuret [2010] chose unambiguous substitutes for each occurrence of an ambiguous
word based on its context. Their approach achieved an accuracy of 92.25% using a
standard HMM model on the 24k test set. To further improve the performance, several
heuristics were used in [Garrette and Baldridge 2012], which achieved an accuracy
of 88.52% using an incomplete dictionary. Ravi et al. [2014] proposed a distributed
minimum label cover which could parallelize the algorithm while preserving approx-
imation guarantees. The approach achieved an accuracy of 91.4% on the 24k test set
and 88.15% using an incomplete dictionary.
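For concreteness, the trigram HMM factorization underlying many of these EM-based approaches can be sketched as follows. The boundary padding with "<s>" tags and the toy probabilities are illustrative assumptions, not details from any of the cited systems:

```python
# Illustrative trigram HMM: p(W, T) = prod_i p(w_i|t_i) * p(t_i | t_{i-2}, t_{i-1}).
# Boundary tags "<s>" pad the tag history for the first two positions.

def joint_prob(words, tags, emission, transition):
    history = ["<s>", "<s>"]
    prob = 1.0
    for w, t in zip(words, tags):
        prob *= emission[(w, t)]                      # p(w_i | t_i)
        prob *= transition[(t, history[-2], history[-1])]  # p(t_i | t_{i-2}, t_{i-1})
        history.append(t)
    return prob

emission = {("the", "DT"): 0.5, ("dog", "NN"): 0.1}
transition = {("DT", "<s>", "<s>"): 0.4, ("NN", "<s>", "DT"): 0.6}
print(joint_prob(["the", "dog"], ["DT", "NN"], emission, transition))  # ~0.012
```

In the weakly supervised setting the tags T are unobserved, so EM iteratively re-estimates the emission and transition tables, restricting each word's tags to those licensed by the dictionary.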
Instead of using tag dictionaries, a few canonical examples of each POS tag could
be used in prototype-driven learning [Haghighi and Klein 2006]. The provided proto-
type information could be propagated across a corpus using distributional similarity
features in a log-linear generative model. In a similar vein, a closed-class lexicon
specifying possible tags was used to learn a model for disambiguating the
occurrences of words in context [Zhao and Marcus 2009].
Our work is similar to approaches to weakly-supervised learning using tag dictionar-
ies since we also assume the availability of such a dictionary consisting of words with
each associated with a list of possible POS tags. However, most previous approaches
try to disambiguate the word’s possible tags by identifying the ground-truth tag it-
eratively. This disambiguation is prone to be misled by false positive tags within
the set of possible tags. In this paper, we propose a novel approach for weakly supervised
POS tagging. The set of possible tags is treated as an entirety without the need of
disambiguation. From the perspective of machine learning, our approach falls into the
partial label learning framework [Zhang 2014] in which each training instance is as-
sociated with a set of candidate labels, among which only one is correct. However,
our problem setting here is different. The only supervision information we have is a
POS tag dictionary which lists all possible POS tags for each word. The annotations
of training instances need to be generated based on the POS tag dictionary. Moreover,
the tag dictionary is equally applied to both the training and testing instances. Such
constraints are applied to the test data using constrained ECOC.
References

Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C. and Mercer, R. L. Class-based n-gram models of natural language.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P. Natural Language Processing (Almost) from Scratch.
Dietterich, T. G. and Bakiri, G. Solving multiclass learning problems via error-correcting output codes.
Marcus, M. P., Santorini, B. and Marcinkiewicz, M. A. Building a large annotated corpus of English: the Penn Treebank.
Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "Weakly supervised part-of-speech (pos) tagging without disambiguation" ?

This paper proposed a weakly supervised POS tagging approach for low-resource languages such as Malagasy, where the supervision information comes from other sources rather than the annotated data. 

In the future, the authors will investigate other ways to generate the coding matrix for possible performance improvement. 

To represent the context features of a target word, the authors concatenate the word embedding of the first left word, the target word and the first right word to form a 192-dimensional vector of [wi−1, wi, wi+1] and use it as the feature vector of the target word. 

To represent the context features of a target word, the authors concatenate the word embedding of the first left word, the target word and first right word to form a 150-dimensional vector of [wi−1, wi, wi+1] and use it as the feature vector of the target word. 

The authors observe that the accuracy on words with 2 possible tags is less than 90% but the accuracy on words with 3 possible tags is around 90%. 
