Controlling complexity in part-of-speech induction
TLDR
This work refines the model and modifies the learning objective to control its capacity via parametric and non-parametric constraints, and develops an efficient learning algorithm that is not much more computationally intensive than standard training.
Abstract
We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model for this task performs poorly, because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via parametric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training. We also provide an open-source implementation of the algorithm. Our experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish) achieve significant improvements compared with previous methods for the same task.
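The baseline the abstract criticizes is a maximum-likelihood HMM trained with the Baum-Welch (EM) algorithm. As a hedged illustration only — this is not the authors' implementation, and it omits their sparsity constraints, morphological features, and rare-word handling — a minimal scaled forward-backward EM step over a word-id corpus might look like this; all function names and the toy setup are invented for the example:

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass for one sentence.

    obs : sequence of word ids; pi : (K,) initial state probs;
    A : (K, K) transition probs; B : (K, V) emission probs.
    Returns per-position state posteriors (gamma), expected
    transition counts (xi), and the sentence log-likelihood.
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta = np.zeros((T, K))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                   # P(state_t = k | sentence)
    xi = np.zeros((K, K))                  # expected transition counts
    for t in range(T - 1):
        xi += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / scale[t + 1]
    return gamma, xi, np.log(scale).sum()

def em_step(corpus, pi, A, B):
    """One EM iteration over a list of sentences (word-id sequences)."""
    K, V = B.shape
    pi_acc, A_acc, B_acc = np.zeros(K), np.zeros((K, K)), np.zeros((K, V))
    log_lik = 0.0
    for obs in corpus:
        gamma, xi, ll = forward_backward(obs, pi, A, B)
        log_lik += ll
        pi_acc += gamma[0]
        A_acc += xi
        for t, w in enumerate(obs):
            B_acc[:, w] += gamma[t]
    # M-step: renormalize expected counts into probabilities.
    return (pi_acc / pi_acc.sum(),
            A_acc / A_acc.sum(axis=1, keepdims=True),
            B_acc / B_acc.sum(axis=1, keepdims=True),
            log_lik)
```

Each EM step cannot decrease the corpus log-likelihood; the paper's point is that this unconstrained objective alone induces poor categories, so it constrains the model's capacity (sparse word-category associations, morphological and orthographic features, pruned rare-word parameters) on top of this training loop.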
Citations
Journal ArticleDOI
Urdu language processing: a survey
Ali Daud,Wahab Khan,Dunren Che +2 more
TL;DR: The goal of this paper is to organize ULP work so that it can provide a platform for future ULP research activities, and to describe in detail the recent increase in interest and progress in Urdu language processing research.
Proceedings Article
Wiki-ly Supervised Part-of-Speech Tagging
Shen Li,João Graça,Ben Taskar +2 more
TL;DR: This paper shows that it is possible to build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary.
Proceedings Article
The PASCAL Challenge on Grammar Induction
TL;DR: This is the first challenge to evaluate unsupervised induction systems, a sub-field of syntax that is rapidly becoming very popular; it made use of 10 different treebanks annotated in a range of linguistic formalisms and covering 9 languages.
Journal ArticleDOI
A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging
TL;DR: This paper proposes an incremental knowledge acquisition method in which rules are stored in an exception structure and new rules are added only to correct the errors of existing rules, thus allowing systematic control of the interaction between the rules.
Journal ArticleDOI
Urdu part of speech tagging using conditional random fields
Wahab Khan,Ali Daud,Jamal Abdul Nasir,Tehmina Amjad,Sachi Arafat,Naif Radi Aljohani,Fahd Saleh Alotaibi +6 more
TL;DR: This work is the first instance of a CRF approach for Urdu POS tagging; it uses linear-chain conditional random fields and employs a strong, stable, and balanced feature set that is both language-independent and language-dependent.
References
Journal ArticleDOI
Maximum likelihood from incomplete data via the EM algorithm
Book
Numerical Optimization
Jorge Nocedal,Stephen J. Wright +1 more
TL;DR: Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization, responding to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems.
ReportDOI
Building a large annotated corpus of English: the Penn Treebank
TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.