Open Access Journal Article

Controlling complexity in part-of-speech induction

TLDR
This work refines the model and modifies the learning objective to control its capacity via parametric and non-parametric constraints, and develops an efficient learning algorithm that is not much more computationally intensive than standard training.
Abstract
We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model for this task performs poorly, because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via parametric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training. We also provide an open-source implementation of the algorithm. Our experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish) achieve significant improvements compared with previous methods for the same task.
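
The abstract describes an HMM-based approach whose capacity is restricted, in part, by keeping word-category associations sparse. As a rough, self-contained illustration of that general idea only (not the paper's algorithm), the Python sketch below trains a first-order HMM with EM on a toy corpus and, after each M-step, truncates each word's emission probabilities to its top_k most likely tags. The toy corpus, the number of tags K, the top_k truncation rule, and the smoothing constants are all illustrative assumptions; the paper's actual parametric and non-parametric constraints, morphological and orthographic features, and learning algorithm are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled corpus: each sentence is a list of word indices.
vocab = ["the", "a", "dog", "cat", "runs", "sleeps"]
w2i = {w: i for i, w in enumerate(vocab)}
corpus = [["the", "dog", "runs"],
          ["a", "cat", "sleeps"],
          ["the", "cat", "runs"]]
X = [[w2i[w] for w in s] for s in corpus]

K, V = 3, len(vocab)   # number of latent tags, vocabulary size (assumed)
top_k = 1              # each word may keep at most top_k tags (illustrative)

pi = rng.dirichlet(np.ones(K))            # start probabilities
A = rng.dirichlet(np.ones(K), size=K)     # A[i, j] = P(tag j | tag i)
B = rng.dirichlet(np.ones(V), size=K)     # B[k, w] = P(word w | tag k)

def forward_backward(obs):
    """Scaled forward-backward; returns state posteriors and transition counts."""
    T = len(obs)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                   # P(tag at position t | sentence)
    xi = np.zeros((K, K))                  # expected transition counts
    for t in range(T - 1):
        xi += (alpha[t][:, None] * A *
               (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / scale[t + 1]
    return gamma, xi

for _ in range(50):                        # EM iterations
    pi_c = np.zeros(K); A_c = np.zeros((K, K)); B_c = np.zeros((K, V))
    for obs in X:                          # E-step: accumulate expected counts
        gamma, xi = forward_backward(obs)
        pi_c += gamma[0]; A_c += xi
        for t, w in enumerate(obs):
            B_c[:, w] += gamma[t]
    # M-step: renormalize expected counts (tiny smoothing for stability).
    pi = (pi_c + 1e-6) / (pi_c + 1e-6).sum()
    A = (A_c + 1e-6) / (A_c + 1e-6).sum(axis=1, keepdims=True)
    B = (B_c + 1e-6) / (B_c + 1e-6).sum(axis=1, keepdims=True)
    # Sparsity step (illustration only): keep each word's top_k most likely
    # tags and zero the rest, limiting word-category associations.
    for w in range(V):
        keep = np.argsort(B[:, w])[-top_k:]
        mask = np.zeros(K); mask[keep] = 1.0
        B[:, w] *= mask
    B = (B + 1e-8) / (B + 1e-8).sum(axis=1, keepdims=True)

# Decode one sentence by per-position argmax of the tag posterior.
gamma, _ = forward_backward(X[0])
print([int(np.argmax(g)) for g in gamma])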



Citations
Journal Article

Urdu language processing: a survey

TL;DR: The goal of this paper is to organize Urdu language processing (ULP) work so that it can serve as a platform for future ULP research, and to describe in detail the recent increase in interest and progress in Urdu language processing research.
Proceedings Article

Wiki-ly Supervised Part-of-Speech Tagging

TL;DR: This paper shows that it is possible to build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary.
Proceedings Article

The PASCAL Challenge on Grammar Induction

TL;DR: This is the first challenge to evaluate unsupervised grammar induction systems, a rapidly growing sub-field of syntax research, and it made use of 10 different treebanks annotated in a range of linguistic formalisms and covering 9 languages.
Journal Article

A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging

TL;DR: This paper proposes an incremental knowledge acquisition method in which rules are stored in an exception structure and new rules are added only to correct the errors of existing rules, allowing systematic control of the interaction between rules.
Journal Article

Urdu part of speech tagging using conditional random fields

TL;DR: This work is the first application of linear-chain conditional random fields (CRFs) to Urdu POS tagging, and it employs a strong, stable, and balanced feature set combining language-independent and language-dependent features.
References
Book

Numerical Optimization

TL;DR: Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization, responding to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems.
Book

Nonlinear Programming

Report

Building a large annotated corpus of English: the penn treebank

TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.