Weakly Supervised POS Tagging without Disambiguation

warwick.ac.uk/lib-publications
Manuscript version: Author’s Accepted Manuscript
The version presented in WRAP is the author’s accepted manuscript and may differ from the
published version or Version of Record.
Persistent WRAP URL:
http://wrap.warwick.ac.uk/108974
How to cite:
Please refer to published version for the most recent bibliographic citation information.
If a published version is known, the repository item page linked to above will contain
details on accessing it.
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the
University of Warwick available open access under the following conditions.
Copyright © and all moral rights to the version of the paper presented here belong to the
individual author(s) and/or other copyright owners. To the extent reasonable and
practicable the material made available in WRAP has been checked for eligibility before
being made available.
Copies of full items can be used for personal research or study, educational, or not-for-profit
purposes without prior permission or charge. Provided that the authors, title and full
bibliographic details are credited, a hyperlink and/or URL is given for the original metadata
page and the content is not changed in any way.
Publisher’s statement:
Please refer to the repository item page, publisher’s statement section, for further
information.
For more information, please contact the WRAP Team at: wrap@warwick.ac.uk.

A
Weakly Supervised Part-of-Speech (POS) Tagging without
Disambiguation
Deyu Zhou, Zhikai Zhang, Min-Ling Zhang, School of Computer Science and Engineering,
Southeast University, China
Yulan He, School of Engineering and Applied Science, Aston University, UK
Weakly supervised part-of-speech (POS) tagging aims to predict the POS tag for a given word in context
by making use of partially annotated data instead of fully tagged corpora. As POS tagging is crucial for
downstream natural language processing (NLP) tasks such as named entity recognition and information
extraction, weakly supervised POS tagging is specifically attractive in languages where tagged corpora are
mostly unavailable. In this paper, we propose a novel framework for weakly supervised POS tagging where
no annotated corpora are available and the only supervision information comes from a dictionary of words,
each associated with a list of possible POS tags. Our approach is built upon error-correcting
output codes (ECOC), in which each POS tag is assigned a unique L-bit binary vector. For a total of O
POS tags, we therefore have a coding matrix M of size O × L with values in {+1, -1}. Each column of the coding
matrix M specifies a dichotomy over the tag space to learn a binary classifier. For each binary classifier, its
training data is generated in the following way: a word will be considered as a positive training example only
if the whole set of its possible tags falls into the positive dichotomy specified by the column coding, and
similarly for negative training examples. Given a word, its POS tag is predicted by concatenating the predictive
outputs of the L binary classifiers and choosing the tag whose codeword is closest under some distance
measure. By incorporating the ECOC strategy, the set of all possible tags for each word is treated as an entirety
without the need of performing disambiguation. Moreover, instead of manual feature engineering employed
in most previous POS tagging approaches, features for training and testing in the proposed framework are
automatically generated using neural language modeling. The proposed framework has been evaluated on
three corpora for English, Italian and Malagasy POS tagging, achieving accuracies of 93.21%, 90.9% and
84.5% respectively, which shows a significant improvement compared to the state-of-the-art approaches.
1. INTRODUCTION
Part-of-speech (POS) tagging is the task of assigning a POS tag to a word in text based on its
context. It is crucial for downstream natural language processing (NLP) tasks such as
named entity recognition [Finkel et al. 2005], syntactic parsing [Cer et al. 2010] and
information extraction [Zhou et al. 2015]. Methods for POS tagging in general fall into
two categories: rule-based and machine-learning-based. Rule-based approaches rely on
manually designed rules while machine-learning approaches require a large amount
of annotated data for training.
In low-resource languages such as Malagasy, annotated data are mostly unavailable.
It is thus attractive to explore weakly-supervised POS tagging approaches where the
supervision information comes from other sources rather than the annotated data. As
the ground-truth POS tag of a word in a sentence is not directly accessible, weakly-
supervised approaches are more difficult to train compared to supervised approaches.
One common way to address the problem of lack of annotated data is to make use of
a dictionary of words with each one associated with a set of possible POS tags. The
actual POS tag of a word in a sentence is treated as a latent variable which is identified
via an iterative refinement procedure. Thus, a typical setup for weakly-supervised
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned
by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or repub-
lish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
© YYYY ACM. 2375-4699/YYYY/01-ARTA $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000
ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. V, No. N, Article A, Publication date: January YYYY.

Table I. An example of input and output of weakly supervised POS tagging. (PRP denotes personal pronoun, DT
determiner, JJ adjective, VB verb base form, CD cardinal number, and so on.)

Dictionary (each word is associated with a list of possible POS tags):
  you: PRP; these: DT; events: NNS; took: VBD; 35: CD; years: NNS; ago: IN RB; to: IN JJ TO;
  place: NN VB VBP; recognize: VB VBP; that: DT IN NN RB VBP WDT; have: JJ VBD VBN VBP; ...

Input:  You have to recognize that these events took place 35 years ago .
Output: You/PRP have/VBP to/TO recognize/VB that/IN these/DT events/NNS took/VBD place/NN 35/CD years/NNS ago/IN ./.
POS tagging is that given a dictionary of words with their possible POS tags, we aim
to generate a correct POS tag sequence for any unannotated input sentence. This is
illustrated in Table I.
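For illustration, the tag dictionary of Table I can be represented as a plain mapping from words to candidate tag sets. This is a hypothetical sketch of the setup, not the authors' actual data format:

```python
# Hypothetical representation of the Table I tag dictionary:
# each word maps to the set of its possible POS tags.
tag_dictionary = {
    "you": {"PRP"}, "these": {"DT"}, "events": {"NNS"},
    "took": {"VBD"}, "35": {"CD"}, "years": {"NNS"},
    "ago": {"IN", "RB"}, "to": {"IN", "JJ", "TO"},
    "place": {"NN", "VB", "VBP"}, "recognize": {"VB", "VBP"},
    "that": {"DT", "IN", "NN", "RB", "VBP", "WDT"},
    "have": {"JJ", "VBD", "VBN", "VBP"},
}

# Ambiguous words (more than one candidate tag) are exactly the ones a
# weakly supervised tagger must resolve in context.
ambiguous = sorted(w for w, tags in tag_dictionary.items() if len(tags) > 1)
print(ambiguous)  # ['ago', 'have', 'place', 'recognize', 'that', 'to']
```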
Previous weakly-supervised POS tagging approaches are largely based on expectation
maximization (EM) parameter estimation using hidden Markov models (HMMs)
or conditional random fields (CRFs). For example, Merialdo [1994a] used maximum
likelihood estimation (MLE) to train a trigram HMM. Banko and Moore [2004] modi-
fied the basic HMM structure to incorporate the context on both sides of the word to
be tagged. Smith and Eisner [2005] proposed to train CRFs using contrastive estima-
tion for POS tagging. It can be observed that most of the aforementioned approaches
essentially perform disambiguation on a set of possible candidate POS tags for a word
in a sentence. Although disambiguation is an intuitive and reasonable strategy
for training weakly-supervised POS taggers, its effectiveness is largely affected by
errors introduced in earlier training iterations. That is, false positive
tags identified in early iterations are propagated to later iterations, which
makes it difficult for the model to identify the correct POS tag.
In this paper, we propose a novel framework for weakly supervised POS tagging
without the need of disambiguating among a set of possible POS tags, built upon error-
correcting output codes (ECOC) [Dietterich and Bakiri 1995], one of the multi-class
learning techniques. A unique L-bit vector is assigned to each POS tag. For a total of
O POS tags, a coding matrix M of size O × L can be constructed where each cell of
M has a value in {+1, -1}. Each column of M specifies a dichotomy over the tag space
to learn a binary classifier. For example, given a set of POS tags {VB, DT, VBP, NN},
the column [-1, +1, -1, +1] of M separates the tag space into the negative dichotomy {VB,
VBP} and the positive dichotomy {DT, NN}. The key adaptation lies in how the binary
classifiers corresponding to the ECOC coding matrix M are built. For each column
of the binary coding matrix, a binary classifier is built based on training examples
derived from the dictionary of the words with their possible POS tags. Specifically, the
word will be regarded as a positive or negative training example only if all its possible
tags fall into the positive or negative dichotomy specified by the column coding. In
this way, the set of possible tags is treated as an entirety without resorting to any
disambiguation procedure. Moreover, the choice of features is a critical success factor
for POS tagging. Most of the state-of-the-art POS tagging systems extract features
based on the lexical context of the words to be tagged and their letter structures (e.g.,
presence of suffixes, capitalization and hyphenation). Obviously, such feature design
needs domain knowledge and expertise. In this paper, features employed for weakly
supervised POS tagging are generated based on neural language modelling without
manual processing. The proposed approach has been evaluated on three corpora for
English, Italian and Malagasy POS tagging, and shows a significant improvement in
accuracy compared to the state-of-the-art approaches.
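To make the training-example generation concrete, the following sketch shows how one column of the coding matrix turns a word-to-tag-set dictionary into binary training data. The function and variable names are hypothetical; the actual feature representation and classifiers are described later in the paper:

```python
# Sketch of ECOC training-example generation (illustrative only).
# For one column of the coding matrix, a word becomes a training
# example only if ALL of its possible tags fall on the same side
# of the dichotomy; otherwise it is skipped (no disambiguation).

def column_examples(tag_dictionary, column):
    """column maps each POS tag to +1 or -1 (one column of M)."""
    examples = []
    for word, tags in tag_dictionary.items():
        sides = {column[t] for t in tags}
        if sides == {+1}:
            examples.append((word, +1))   # all tags in positive dichotomy
        elif sides == {-1}:
            examples.append((word, -1))   # all tags in negative dichotomy
        # mixed sides: the word is not used to train this classifier
    return examples

column = {"VB": -1, "DT": +1, "VBP": -1, "NN": +1}
tag_dictionary = {"these": {"DT"}, "recognize": {"VB", "VBP"},
                  "place": {"NN", "VB", "VBP"}}
print(column_examples(tag_dictionary, column))
# [('these', 1), ('recognize', -1)]  -- 'place' straddles both sides, so it is skipped
```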
The main contributions of the paper are summarized below:
We proposed a novel framework based on constrained ECOC for weakly supervised
POS tagging. In this way, the set of a word’s possible tags is treated as an entirety
without resorting to any disambiguation procedure. It thus avoids the problem of
iterative training based on disambiguation, which is commonly used for existing ap-
proaches to weakly supervised POS tagging.
We developed a POS tagging system without human intervention. Features employed
for POS tagging are generated automatically based on neural language modelling.
We evaluated the proposed framework on three corpora for English, Italian and
Malagasy POS tagging, and observed a significant improvement in accuracy com-
pared to the state-of-the-art approaches.
2. RELATED WORK
Supervised POS tagging has achieved very good results with per-token accuracies over
97% on the English Penn Treebank. However, there are more than 50 low-density
languages where both tagged corpora and language speakers are mostly unavail-
able [Christodoulopoulos et al. 2010]. Some of them are even dead. Therefore, POS
tagging without using any fully annotated corpora has attracted increasing interest.
Generally, based on whether supervised information is used and where it
comes from, there are three directions for handling the task: POS induc-
tion, where no prior knowledge is used; POS disambiguation, where a dictionary of
words and their possible tags is assumed to be available; and prototype-driven ap-
proaches where a small set of prototypes for each POS tag is provided instead of a
dictionary.
For fully unsupervised POS tagging or POS induction, many approaches cast the
identification of POS tags as a knowledge-free clustering problem. Brown et al. [1992]
proposed an n-gram model based on classes of words through optimizing the corpus
probability p(w_1|c_1) ∏_{i=2}^{n} p(w_i|c_i) p(c_i|c_{i-1}) using greedy hierarchical clustering.
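As an illustration (not the authors' implementation), the class-based bigram factorization can be computed directly once each word is assigned to a class; the toy probabilities below are made up for demonstration:

```python
# Illustrative class-based bigram model:
# p(corpus) = p(w1|c1) * prod_{i>=2} p(wi|ci) * p(ci|c_{i-1}).

def sentence_prob(words, word_class, p_word_given_class, p_class_given_class):
    classes = [word_class[w] for w in words]
    prob = p_word_given_class[(words[0], classes[0])]
    for i in range(1, len(words)):
        prob *= p_word_given_class[(words[i], classes[i])]
        prob *= p_class_given_class[(classes[i], classes[i - 1])]
    return prob

word_class = {"the": "DET", "dog": "N"}
p_word_given_class = {("the", "DET"): 1.0, ("dog", "N"): 0.5}
p_class_given_class = {("N", "DET"): 0.8}
print(sentence_prob(["the", "dog"], word_class,
                    p_word_given_class, p_class_given_class))  # 0.4
```

Brown-style clustering greedily merges classes to maximize exactly this corpus probability, which is why morphologically or syntactically similar words end up in the same class.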
Following this way, Clark [2003] incorporated morphological information into cluster-
ing so that morphologically similar words are clustered together. Based on a standard
trigram HMM, Goldwater and Griffiths [2007] proposed a fully Bayesian approach
which allowed the use of priors. A collapsed Gibbs sampler was used to infer the
hidden POS tags. Johnson [2007] also experimented with variational Bayesian EM
apart from Gibbs sampling and his results showed that variational Bayesian EM
converges faster than Gibbs sampling for POS tagging. Using the structure of a stan-
dard HMM, Berg-Kirkpatrick et al. [2010] turned each component multinomial of the
HMM into a miniature logistic regression. By doing so, features can be easily added
to standard generative models for unsupervised learning, without requiring complex
new training methods. Different from the previous approaches, a graph clustering ap-
proach based on contextual similarity was proposed in [Biemann 2006] so that the
number of POS tags (clusters) could be induced automatically. Based on the theory of
prototypes, Abend et al. [2010] first clustered the most frequent words based on some
morphological representations. They then defined landmark clusters which served as
the cores of the induced POS categories and finally mapped the rest of the words to these
categories. Kairit et al. [2014] presented an approach for inducing POS classes by combining
morphological and distributional information in a non-parametric Bayesian generative
model based on the distance-dependent Chinese restaurant process. As pointed
out in [Christodoulopoulos et al. 2010], due to a lack of standard and informative
evaluation techniques, it is difficult to compare the effectiveness of different clustering
methods.
For weakly-supervised POS tagging, many researchers focused on POS disambigua-
tion using tag dictionaries. Brill [1992] described a rule-based POS tagger, which cap-
tured the learned knowledge into a set of simple deterministic rules instead of a large
table of statistics. He later proposed an unsupervised learning algorithm for automat-
ically training a rule-based POS tagger [Brill 1995]. Considering POS tags as latent
variables, there have been quite a few approaches relying on EM parameter estimation
using HMMs or CRFs. For example, given a sentence W = {w_1, w_2, ..., w_n} and
a sequence of tags T = {t_1, t_2, ..., t_n} of the same length, a trigram model defined as
p(W, T) = ∏_{i=1}^{n} p(w_i|t_i) p(t_i|t_{i-2}, t_{i-1}) was proposed in [Merialdo 1994b]. Following this
way, some improvements were achieved by modifying the statistical models or employ-
ing better parameter estimation techniques. For example, Banko and Moore [2004]
modified the basic HMM structure to incorporate the context on both sides of the
word to be tagged. Smith and Eisner [2005] used contrastive estimation on CRFs for
POS tagging. Toutanova et al. [2007] proposed a Bayesian model that extended latent
Dirichlet allocation (LDA) and incorporated the intuition that words’ distributions over
tags are sparse. Naseem et al. [2009] proposed multilingual learning by combining
cues from multiple languages in two ways: directly merging tag structures for a pair of
languages into a single sequence, and incorporating multilingual context using latent
variables. Markov Chain Monte Carlo sampling techniques were used for estimating
the parameters of hierarchical Bayesian models. Ravi and Knight [2009] proposed using
Integer Programming (IP) to search for the smallest bigram POS tag set and used this
set to constrain the training of EM. Their approach achieved an accuracy of 91.6% on
the 24k English Penn Treebank test set, but could not handle larger datasets. To address
the deficiency of IP, Ravi et al. [2010] proposed a two-stage greedy minimization
approach that ran much faster while maintaining the tagging performance. Yatbaz
and Yuret [2010] chose unambiguous substitutes for each occurrence of an ambiguous
word based on its context. Their approach achieved an accuracy of 92.25% using a
standard HMM model on the 24k test set. To further improve the performance, several
heuristics were used in [Garrette and Baldridge 2012], which achieved an accuracy
of 88.52% using an incomplete dictionary. Ravi et al. [2014] proposed a distributed
minimum label cover which could parallelize the algorithm while preserving approx-
imation guarantees. The approach achieved an accuracy of 91.4% on the 24k test set
and 88.15% using an incomplete dictionary.
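For concreteness, the trigram HMM factorization underlying many of these EM-based approaches can be sketched as follows. The boundary padding with "<s>" tags and the toy probabilities are illustrative assumptions, not details from any of the cited systems:

```python
# Illustrative trigram HMM: p(W, T) = prod_i p(w_i|t_i) * p(t_i | t_{i-2}, t_{i-1}).
# Boundary tags "<s>" pad the tag history for the first two positions.

def joint_prob(words, tags, emission, transition):
    history = ["<s>", "<s>"]
    prob = 1.0
    for w, t in zip(words, tags):
        prob *= emission[(w, t)]                      # p(w_i | t_i)
        prob *= transition[(t, history[-2], history[-1])]  # p(t_i | t_{i-2}, t_{i-1})
        history.append(t)
    return prob

emission = {("the", "DT"): 0.5, ("dog", "NN"): 0.1}
transition = {("DT", "<s>", "<s>"): 0.4, ("NN", "<s>", "DT"): 0.6}
print(joint_prob(["the", "dog"], ["DT", "NN"], emission, transition))  # ~0.012
```

In the weakly supervised setting the tags T are unobserved, so EM iteratively re-estimates the emission and transition tables, restricting each word's tags to those licensed by the dictionary.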
Instead of using tag dictionaries, a few canonical examples of each POS tag could
be used in prototype-driven learning [Haghighi and Klein 2006]. The provided proto-
type information could be propagated across a corpus using distributional similarity
features in a log-linear generative model. In a similar vein, a closed-class lexicon
specifying possible tags was used to learn a model for disambiguating the
occurrences of words in context [Zhao and Marcus 2009].
Our work is similar to approaches to weakly-supervised learning using tag dictionar-
ies since we also assume the availability of such a dictionary consisting of words with
each associated with a list of possible POS tags. However, most previous approaches
try to disambiguate the word’s possible tags by identifying the ground-truth tag it-
eratively. This disambiguation is prone to be misled by false positive tags within
the set of possible tags. In this paper, we propose a novel approach for weakly supervised
POS tagging. The set of possible tags is treated as an entirety without the need of
disambiguation. From the perspective of machine learning, our approach falls into the
partial label learning framework [Zhang 2014] in which each training instance is as-
sociated with a set of candidate labels, among which only one is correct. However,
our problem setting here is different. The only supervision information we have is a
POS tag dictionary which lists all possible POS tags for each word. The annotations
of training instances need to be generated based on the POS tag dictionary. Moreover,
the tag dictionary is equally applied to both the training and testing instances. Such
constraints are applied to the test data using constrained ECOC.
References

Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C. and Mercer, R. L. Class-based n-gram models of natural language.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P. Natural Language Processing (Almost) from Scratch.
Dietterich, T. G. and Bakiri, G. Solving multiclass learning problems via error-correcting output codes.
Marcus, M. P., Santorini, B. and Marcinkiewicz, M. A. Building a large annotated corpus of English: the Penn Treebank.
Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "Weakly supervised part-of-speech (pos) tagging without disambiguation" ?

This paper proposed a weakly supervised POS tagging approach for low-resource languages such as Malagasy, where the supervision information comes from other sources rather than the annotated data. 

In the future, the authors will investigate other ways to generate the coding matrix for possible performance improvement. 

To represent the context features of a target word, the authors concatenate the word embedding of the first left word, the target word and the first right word to form a 192-dimensional vector of [wi−1, wi, wi+1] and use it as the feature vector of the target word. 

To represent the context features of a target word, the authors concatenate the word embedding of the first left word, the target word and first right word to form a 150-dimensional vector of [wi−1, wi, wi+1] and use it as the feature vector of the target word. 

The authors observe that the accuracy on words with 2 possible tags is less than 90% but the accuracy on words with 3 possible tags is around 90%. 
