Open Access Journal Article

Controlling complexity in part-of-speech induction

TLDR
This work refines the model and modifies the learning objective to control its capacity via parametric and non-parametric constraints, and develops an efficient learning algorithm that is not much more computationally intensive than standard training.
Abstract
We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model for this task performs poorly, because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via parametric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training. We also provide an open-source implementation of the algorithm. Our experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish) achieve significant improvements compared with previous methods for the same task.
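
The abstract describes an HMM-based approach whose capacity is restricted, in part, by keeping word-category associations sparse. As a rough, self-contained illustration of that general idea only (not the paper's algorithm), the Python sketch below trains a first-order HMM with EM on a toy corpus and, after each M-step, truncates each word's emission probabilities to its top_k most likely tags. The toy corpus, the number of tags K, the top_k truncation rule, and the smoothing constants are all illustrative assumptions; the paper's actual parametric and non-parametric constraints, morphological and orthographic features, and learning algorithm are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled corpus: each sentence is a list of word indices.
vocab = ["the", "a", "dog", "cat", "runs", "sleeps"]
w2i = {w: i for i, w in enumerate(vocab)}
corpus = [["the", "dog", "runs"],
          ["a", "cat", "sleeps"],
          ["the", "cat", "runs"]]
X = [[w2i[w] for w in s] for s in corpus]

K, V = 3, len(vocab)   # number of latent tags, vocabulary size (assumed)
top_k = 1              # each word may keep at most top_k tags (illustrative)

pi = rng.dirichlet(np.ones(K))            # start probabilities
A = rng.dirichlet(np.ones(K), size=K)     # A[i, j] = P(tag j | tag i)
B = rng.dirichlet(np.ones(V), size=K)     # B[k, w] = P(word w | tag k)

def forward_backward(obs):
    """Scaled forward-backward; returns state posteriors and transition counts."""
    T = len(obs)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                   # P(tag at position t | sentence)
    xi = np.zeros((K, K))                  # expected transition counts
    for t in range(T - 1):
        xi += (alpha[t][:, None] * A *
               (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / scale[t + 1]
    return gamma, xi

for _ in range(50):                        # EM iterations
    pi_c = np.zeros(K); A_c = np.zeros((K, K)); B_c = np.zeros((K, V))
    for obs in X:                          # E-step: accumulate expected counts
        gamma, xi = forward_backward(obs)
        pi_c += gamma[0]; A_c += xi
        for t, w in enumerate(obs):
            B_c[:, w] += gamma[t]
    # M-step: renormalize expected counts (tiny smoothing for stability).
    pi = (pi_c + 1e-6) / (pi_c + 1e-6).sum()
    A = (A_c + 1e-6) / (A_c + 1e-6).sum(axis=1, keepdims=True)
    B = (B_c + 1e-6) / (B_c + 1e-6).sum(axis=1, keepdims=True)
    # Sparsity step (illustration only): keep each word's top_k most likely
    # tags and zero the rest, limiting word-category associations.
    for w in range(V):
        keep = np.argsort(B[:, w])[-top_k:]
        mask = np.zeros(K); mask[keep] = 1.0
        B[:, w] *= mask
    B = (B + 1e-8) / (B + 1e-8).sum(axis=1, keepdims=True)

# Decode one sentence by per-position argmax of the tag posterior.
gamma, _ = forward_backward(X[0])
print([int(np.argmax(g)) for g in gamma])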



Citations
Journal Article

Urdu language processing: a survey

TL;DR: The goal of this paper is to organize Urdu language processing (ULP) work so that it can serve as a platform for future ULP research, and to describe in detail the recent increase in interest and progress in Urdu language processing research.
Proceedings Article

Wiki-ly Supervised Part-of-Speech Tagging

TL;DR: This paper shows that it is possible to build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary.
Proceedings Article

The PASCAL Challenge on Grammar Induction

TL;DR: This is the first challenge to evaluate unsupervised grammar induction systems, a rapidly growing sub-field of syntax research, and it made use of 10 different treebanks annotated in a range of linguistic formalisms and covering 9 languages.
Journal Article

A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging

TL;DR: This paper proposes an incremental knowledge acquisition method in which rules are stored in an exception structure and new rules are added only to correct the errors of existing rules, allowing systematic control of the interaction between rules.
Journal Article

Urdu part of speech tagging using conditional random fields

TL;DR: This work is the first application of linear-chain conditional random fields (CRFs) to Urdu POS tagging, and it employs a strong, stable, and balanced feature set combining language-independent and language-dependent features.
References
Book

Numerical Optimization

TL;DR: Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization, responding to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems.
Book

Nonlinear Programming

Report

Building a large annotated corpus of English: the penn treebank

TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.