Journal ArticleDOI

Unsupervised Part-of-Speech Tagging in the Large

Chris Biemann
01 Dec 2009 - Research on Language and Computation (Springer Netherlands) - Vol. 7, Iss. 2, pp. 101-135
TL;DR: A method is presented that constructs a statistical syntactic tagger model from a large amount of unlabeled text data. Its application yields corpus annotation comparable to what POS-taggers provide, but with slightly different categories than those assumed by a linguistically motivated POS-tagger.
Abstract: Syntactic preprocessing is a step that is widely used in NLP applications. Traditionally, rule-based or statistical Part-of-Speech (POS) taggers are employed that need either considerable rule development time or a sufficient amount of manually labeled data. To alleviate this acquisition bottleneck and to enable preprocessing for minority languages and specialized domains, a method is presented that constructs a statistical syntactic tagger model from a large amount of unlabeled text data. The method presented here is called unsupervised POS-tagging, as its application results in corpus annotation comparable to what POS-taggers provide. Nevertheless, its application results in slightly different categories than those assumed by a linguistically motivated POS-tagger. These differences hamper evaluation procedures that compare the output of the unsupervised POS-tagger to a tagging with a supervised tagger. To measure the extent to which unsupervised POS-tagging can contribute in application-based settings, the system is evaluated in supervised POS-tagging, word sense disambiguation, named entity recognition and chunking. Unsupervised POS-tagging has been explored since the beginning of the 1990s. Unlike in previous approaches, the kind and number of different tags are here generated by the method itself. Another difference from other methods is that not all words above a certain frequency rank are assigned a tag; rather, the method is allowed to exclude words from the clustering if their distribution does not match closely enough that of other words. The lexicon size is considerably larger than in previous approaches, resulting in a lower out-of-vocabulary (OOV) rate and in a more consistent tagging. The system presented here is available for download as open-source software along with tagger models for several languages, so the contributions of this work can easily be incorporated into other applications.
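The abstract describes the approach only at a high level: words are clustered by their distributional contexts, and words whose distribution fits no cluster well enough are left untagged. As a rough, hedged illustration of that core idea (this is NOT Biemann's actual pipeline, which is graph-based and trains a Viterbi tagger on the induced tagset), the following sketch clusters left/right neighbor-context vectors with scikit-learn's KMeans; all function and parameter names are invented:

```python
# Minimal illustration of the core idea only: cluster words by their
# left/right neighbor contexts and leave rarely-seen words untagged.
# Not the paper's algorithm; all names here are invented.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans  # purely for illustration

def induce_word_classes(sentences, n_feature_words=200, n_classes=20,
                        min_count=5):
    """sentences: list of token lists from a raw, unannotated corpus."""
    freq = Counter(tok for sent in sentences for tok in sent)
    feature_words = [w for w, _ in freq.most_common(n_feature_words)]
    feat_idx = {w: i for i, w in enumerate(feature_words)}
    vocab = [w for w, c in freq.items() if c >= min_count]

    # Context vectors: counts of frequent words directly to the left
    # (first half of the vector) and to the right (second half).
    vecs = {w: np.zeros(2 * n_feature_words) for w in vocab}
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok not in vecs:
                continue
            if i > 0 and sent[i - 1] in feat_idx:
                vecs[tok][feat_idx[sent[i - 1]]] += 1
            if i + 1 < len(sent) and sent[i + 1] in feat_idx:
                vecs[tok][n_feature_words + feat_idx[sent[i + 1]]] += 1

    # Words with too few observed contexts are excluded from clustering,
    # mirroring the paper's idea of leaving ill-fitting words untagged.
    kept = [w for w in vocab if vecs[w].sum() >= min_count]
    X = np.array([vecs[w] / np.linalg.norm(vecs[w]) for w in kept])
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    return dict(zip(kept, labels))
```

The returned word-to-class map plays the role of the induced tagset; in the paper, a tagger model is then trained so that unseen text can be annotated with these classes.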
Citations
Book
28 Sep 2012
TL;DR: This book aims to give an introduction to NLP for historical texts and an overview of the state of the art in this field, including specific methods, such as creating part-of-speech taggers for historical languages or handling spelling variation.
Abstract: More and more historical texts are becoming available in digital form. Digitization of paper documents is motivated by the aim of preserving cultural heritage and making it more accessible, both to laypeople and scholars. As digital images cannot be searched for text, digitization projects increasingly strive to create digital text, which can be searched and otherwise automatically processed, in addition to facsimiles. Indeed, the emerging field of digital humanities heavily relies on the availability of digital text for its studies. Together with the increasing availability of historical texts in digital form, there is a growing interest in applying natural language processing (NLP) methods and tools to historical texts. However, the specific linguistic properties of historical texts -- the lack of standardized orthography, in particular -- pose special challenges for NLP. This book aims to give an introduction to NLP for historical texts and an overview of the state of the art in this field. The book starts with an overview of methods for the acquisition of historical texts (scanning and OCR), discusses text encoding and annotation schemes, and presents examples of corpora of historical texts in a variety of languages. The book then discusses specific methods, such as creating part-of-speech taggers for historical languages or handling spelling variation. A final chapter analyzes the relationship between NLP and the digital humanities. Certain recently emerging textual genres, such as SMS, social media, and chat messages, or newsgroup and forum postings, share a number of properties with historical texts, for example, nonstandard orthography and grammar, and profuse use of abbreviations. The methods and techniques required for the effective processing of historical texts are thus also of interest for research in other domains.

171 citations

Journal ArticleDOI
22 Jul 2013
TL;DR: A new metaphor of two-dimensional text for data-driven semantic modeling of natural language is proposed, which provides an entirely new angle on the representation of text: not only syntagmatic relations are annotated in the text, but also paradigmatic relations are made explicit by generating lexical expansions.
Abstract: A new metaphor of two-dimensional text for data-driven semantic modeling of natural language is proposed, which provides an entirely new angle on the representation of text: not only syntagmatic relations are annotated in the text, but also paradigmatic relations are made explicit by generating lexical expansions. We operationalize distributional similarity in a general framework for large corpora, and describe a new method to generate similar terms in context. Our evaluation shows that distributional similarity is able to produce high-quality lexical resources in an unsupervised and knowledge-free way, and that our highly scalable similarity measure yields better scores in a WordNet-based evaluation than previous measures for very large corpora. Evaluating on a lexical substitution task, we find that our contextualization method improves over a non-contextualized baseline across all parts of speech, and we show how the metaphor can be applied successfully to part-of-speech tagging. A number of ways to extend and improve the contextualization method within our framework are discussed. As opposed to comparable approaches, our framework defines a model of lexical expansions in context that can generate the expansions as opposed to ranking a given list, and thus does not require existing lexical-semantic resources.
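The distributional thesaurus at the heart of this framework rests on a simple principle: two words are paradigmatically similar if they share many salient context features. A minimal sketch of that principle follows, assuming a precomputed word-to-feature map; the paper's scalable, significance-based measure and its contextualization step are considerably more refined, and all names here are illustrative:

```python
# Hedged sketch: rank lexical expansions by the number of shared context
# features. Inverting the feature index avoids comparing all word pairs,
# which is the basic trick that makes such measures scale.
from collections import defaultdict

def build_thesaurus(word_features, top_k=10):
    """word_features: dict mapping word -> set of context features
    (e.g. neighboring words or dependency edges)."""
    by_feature = defaultdict(set)
    for word, feats in word_features.items():
        for f in feats:
            by_feature[f].add(word)

    def expansions(word):
        overlap = defaultdict(int)
        for f in word_features[word]:
            for other in by_feature[f]:
                if other != word:
                    overlap[other] += 1  # raw count of shared features
        return sorted(overlap, key=overlap.get, reverse=True)[:top_k]

    return expansions
```

In practice a significance weighting of features replaces the raw overlap count, which is one of the points where the paper's measure differs from this sketch.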

146 citations


Cites background or methods from "Unsupervised Part-of-Speech Tagging..."

  • ...Chris Biemann and Eugenie Giesbrecht (2011), Distributional Semantics and Compositionality 2011: Shared Task Description and Results, in Proceedings of the Workshop on Distributional Semantics and Compositionality, pp....

  • ...Matthias Richter, Uwe Quasthoff, Erla Hallsteinsdóttir, and Chris Biemann (2006), Exploiting the Leipzig Corpora Collection, in Proceedings of the IS-LTC 2006, Ljubljana, Slovenia, http://nl.ijs.si/is-ltc06/proc/13_Richter.pdf....

  • ...Chris Biemann, Stefanie Roos, and Karsten Weihe (2012), Quantifying Semantics Using Complex Network Analysis, in Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India, http://aclweb....

  • ...[9] The verbs, nouns and adjectives are lemmatized, using a Compact Patricia Trie classifier (Biemann et al., 2008) trained on the verbs, nouns and adjectives. [10] As produced by the Stanford parser. [11] For a comparison of measures, see e.g. Evert (2005) and Bordag (2008)....


Journal ArticleDOI
TL;DR: In this paper, the authors provide guidance on the complex issue of special education for LEP students, particularly the referral process, focusing on the difficulty of recognizing learning versus language difficulties, that is, how to identify a nonnative-speaking child's need for special education services.
Abstract: specifically on the difficulty of recognizing learning versus language difficulties, that is, how to identify a nonnative-speaking child's need for special education services. They propose a model that administrators can employ that minimizes bias. In a similar vein, the next chapter, by Jeffery Braden and Sandra Fradd, suggests ways that administrators can anticipate difficulties and intervene before such referrals are necessary. William Tikunoff, in the eighth chapter, focuses on instructional leadership. He discusses the characteristics of an effective principal and targets specific areas, such as effective time management. The final chapter, by Beatrice Ward, addresses the greatest resource of any educational institution: the teachers. She describes the clinical approach to teacher development and how it can be implemented. Overall, this book fills a need for basic, factual information about legal requirements, program types, and effective instructional and leadership strategies with respect to the LEP population. Furthermore, it provides guidance on the complex issue of special education for LEP students, particularly the referral process. An additional chapter exploring different models for assessment and program design for these students would have provided depth and balance. Although there is necessarily some overlap between the chapters, it is reinforcing, not repetitive. In addition to being extremely useful to administrators, this book would be of value to school personnel such as psychologists, special education consultants, LEP consultants, instructors: in short, for anyone committed to the design and delivery of effective instructional programs for LEP students.

146 citations

Journal ArticleDOI
TL;DR: This work presents a multilingual Named Entity Recognition approach based on a robust and general feature set that works across languages and datasets, combining shallow local information with semi-supervised cluster features induced from large amounts of unlabeled text.
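The recipe summarized here, shallow token-level features plus cluster features induced from unlabeled text, typically looks like the following per-token feature extractor. This is a hedged sketch: `clusters`, the bit-string values, and the prefix lengths are assumed inputs in the style of Brown clusters, not details taken from the paper.

```python
# Illustrative sketch: combine shallow surface features with Brown-cluster
# bit-string prefixes. `clusters` maps lowercased words to bit strings and
# is a hypothetical precomputed resource.
def token_features(sent, i, clusters):
    w = sent[i]
    feats = {
        "lower": w.lower(),                    # shallow local information
        "is_title": w.istitle(),
        "is_digit": w.isdigit(),
        "suffix3": w.lower()[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<S>",
        "next": sent[i + 1].lower() if i + 1 < len(sent) else "</S>",
    }
    bits = clusters.get(w.lower())
    if bits:
        # Prefixes of the bit string expose the cluster hierarchy at
        # several granularities, the usual way such features are added.
        for p in (4, 6, 10):
            feats[f"brown_{p}"] = bits[:p]
    return feats
```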

78 citations


Cites background or methods from "Unsupervised Part-of-Speech Tagging..."

  • ...In other sequence labeling tasks, Biemann [9] reports a slight improvement (from 97....

  • ...More specifically, it had been previously shown how to apply Brown clusters [10] for Chinese Word Segmentation [36], dependency parsing [34], NERC [55] and POS tagging [9]....

  • ...Biemann, C., 2009....

  • ...Benikova, D., Yimam, S.M., Santhanam, P., Biemann, C., 2015....

  • ...Finally, and inspired by previous work (Koo et al., 2008; Biemann, 2009) we measure how much supervision is required to obtain state of the art results....

Proceedings ArticleDOI
01 Jun 2016
TL;DR: The IIT-TUDA participation in the SemEval 2016 shared Task 5 on Aspect Based Sentiment Analysis (ABSA), subtask 1, is reported; the system placed first in sentiment polarity classification for the English laptop domain and for Spanish and Turkish restaurant reviews, as well as in opinion target expression for Dutch and French restaurant reviews.
Abstract: This paper reports the IIT-TUDA participation in the SemEval 2016 shared Task 5 of Aspect Based Sentiment Analysis (ABSA) for subtask 1. We describe our system, which incorporates domain dependency graph features, a distributional thesaurus and unsupervised lexical induction using an unlabeled external corpus for aspect based sentiment analysis. Overall, we submitted 29 runs, covering 7 languages and 4 different domains. Our system placed first in sentiment polarity classification for the English laptop domain and for Spanish and Turkish restaurant reviews, as well as in opinion target expression for Dutch and French in the restaurant domain, and scores in the medium ranks for aspect category identification and opinion target extraction.
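As a hedged sketch of how induced (unsupervised) PoS tags and lexicon scores can enter such a system as features: the names `unsup_pos` and `polarity_lexicon` below are hypothetical stand-ins, and the actual IIT-TUDA feature set is considerably richer.

```python
# Illustrative only: surface features plus induced word-class tags and an
# aggregated polarity-lexicon score. All inputs are hypothetical.
from collections import Counter

def sentence_features(tokens, unsup_pos, polarity_lexicon):
    """tokens: list of words; unsup_pos: word -> induced class ID;
    polarity_lexicon: word -> polarity score in [-1, 1]."""
    feats = Counter(f"word={t.lower()}" for t in tokens)
    # Induced word classes act as a coarse, language-independent PoS layer.
    feats.update(f"class={unsup_pos.get(t.lower(), 'OOV')}" for t in tokens)
    # One real-valued feature aggregating lexicon polarity.
    feats["polarity_sum"] = sum(polarity_lexicon.get(t.lower(), 0.0)
                                for t in tokens)
    return feats
```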

57 citations


Cites methods from "Unsupervised Part-of-Speech Tagging..."

  • ...We also added two new features including unsupervised PoS tags (Biemann, 2009) as the feature for all the languages and SentiWordNet score for English language....

References
Proceedings Article
28 Jun 2001
TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Abstract: We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
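For orientation, the decoding side of a linear-chain CRF is compact enough to show: given per-position label scores and label-transition scores, Viterbi recovers the highest-scoring label sequence. The sketch below covers only decoding; parameter estimation, which the paper's algorithms address, is omitted, and the array names are ours:

```python
# Viterbi decoding for a linear-chain model (CRF, HMM, or MEMM scores all
# fit this interface once expressed as additive log-scores).
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, L) array of per-position, per-label scores.
    transitions: (L, L) array; transitions[i, j] scores label i -> j."""
    T, L = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # cand[i, j] = best score ending in label i at t-1, then j at t.
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Follow back-pointers from the best final label.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]
```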

13,190 citations


"Unsupervised Part-of-Speech Tagging..." refers background or methods in this paper

  • ...…systems for application-based evaluation were chosen to cover two different machine learning paradigms: kernel methods in a word sense disambiguation (WSD) system and Conditional Random Fields (CRFs, see Lafferty et al. 2001) for supervised POS, named entity recognition (NER) and chunking....

  • ...The second approach is statistical, for example HMM-taggers (Charniak et al. 1993, inter alia) or taggers based on conditional random fields (Lafferty et al. 2001)....


Journal ArticleDOI
TL;DR: Standard alphabetical procedures for organizing lexical information put together words that are spelled alike and scatter words with similar or related meanings haphazardly through the list.
Abstract: Standard alphabetical procedures for organizing lexical information put together words that are spelled alike and scatter words with similar or related meanings haphazardly through the list. Unfortunately, there is no obvious alternative, no other simple way for lexicographers to keep track of what has been done or for readers to find the word they are looking for. But a frequent objection to this solution is that finding things on an alphabetical list can be tedious and time-consuming. Many people who would like to refer to a dictionary decide not to bother with it because finding the information would interrupt their work and break their train of thought.

5,038 citations


"Unsupervised Part-of-Speech Tagging..." refers background in this paper

  • ...The senses are provided by a sense inventory (usually WordNet, Miller et al. 1990)....

Proceedings ArticleDOI
27 May 2003
TL;DR: A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.
Abstract: We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ, an error reduction of 4.4% on the best previous single automatically learned tagging result.
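Ideas (ii) and (iv), broad lexical features and fine-grained unknown-word features, translate into feature templates roughly like the sketch below (names invented; the bidirectional dependency network of idea (i) is the model structure itself and is not reproduced here):

```python
# Illustrative feature templates: neighboring words, a joint two-word
# context, and unknown-word features (affixes, character classes).
import re

def lexical_features(words, i):
    w = words[i]
    feats = {
        "word": w,
        "prev_word": words[i - 1] if i > 0 else "<S>",
        "next_word": words[i + 1] if i + 1 < len(words) else "</S>",
        # Jointly conditioning on multiple consecutive words (idea ii).
        "prev_bigram": " ".join(words[max(0, i - 2):i]) or "<S>",
    }
    # Fine-grained unknown-word features (idea iv).
    feats["has_digit"] = bool(re.search(r"\d", w))
    feats["has_hyphen"] = "-" in w
    feats["init_cap"] = w[0].isupper()
    for k in range(1, 5):   # character prefixes/suffixes up to length 4
        feats[f"pre{k}"] = w[:k]
        feats[f"suf{k}"] = w[-k:]
    return feats
```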

3,466 citations


"Unsupervised Part-of-Speech Tagging..." refers background in this paper

  • ...While there exist high precision supervised POS taggers and elaborate feature sets have been worked out (see Toutanova et al. 2003 for state-of-the-art POS tagging on the Penn Treebank), it does not seem necessary to create an unsupervised tagger in the presence of training data....

Journal ArticleDOI
TL;DR: This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
Abstract: We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
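The class-based factorization itself is compact: P(w_i | w_{i-1}) is approximated as P(c(w_i) | c(w_{i-1})) * P(w_i | c(w_i)), which replaces a V x V table of word-bigram parameters with a much smaller C x C class table plus per-word emission probabilities. A sketch, assuming the word-to-class mapping and the counts have already been collected from a corpus (input names are ours):

```python
# Class-based bigram probability, Brown et al. (1992) style:
# P(w | w_prev) ~ P(c(w) | c(w_prev)) * P(w | c(w)).
def class_bigram_prob(w_prev, w, word2class,
                      class_bigrams, class_counts, word_counts):
    """word2class: word -> class; class_bigrams: counts of
    (prev_class, class) pairs; class_counts / word_counts: occurrence
    counts of classes / words (smoothing omitted for brevity)."""
    c_prev, c = word2class[w_prev], word2class[w]
    p_class = class_bigrams[(c_prev, c)] / class_counts[c_prev]
    p_word = word_counts[w] / class_counts[c]   # P(w | its class)
    return p_class * p_word
```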

3,336 citations


Additional excerpts

  • ...This realises a class-based N-gram model (Brown et al. 1992)....
