Journal ArticleDOI

The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

01 Jun 2005-Natural Language Engineering (Cambridge University Press)-Vol. 11, Iss: 2, pp 207-238
TL;DR: Discusses several Chinese linguistic issues and their implications for treebanking efforts, explains how the annotation guidelines address those issues, and describes engineering strategies to improve annotation speed while ensuring quality.
Abstract: With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.
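The segmented, POS-tagged, and bracketed format described above can be illustrated with a minimal reader for Penn-style bracketed strings. This is only a sketch: the labels (IP, NP, VV) loosely follow CTB conventions, but the sentence is invented (with English placeholder words), not drawn from the corpus.

```python
# Minimal sketch: reading a Penn-style bracketed parse into nested tuples.
# The example tree is illustrative, not taken from CTB itself.

def parse_bracketed(s):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:                      # a leaf word under a POS tag
                children.append(tokens[pos])
                pos += 1
        pos += 1                       # consume ")"
        return (label, children)

    return read()

tree = parse_bracketed("(IP (NP (NN example)) (VP (VV works)))")
print(tree)
# → ('IP', [('NP', [('NN', ['example'])]), ('VP', [('VV', ['works'])])])
```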
Citations
Journal ArticleDOI
TL;DR: Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
Abstract: Parsing unrestricted text is useful for many language technology applications but requires parsing methods that are both robust and efficient. MaltParser is a language-independent system for data-driven dependency parsing that can be used to induce a parser for a new language from a treebank sample in a simple yet flexible manner. Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
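MaltParser's data-driven approach builds dependency trees with a learned sequence of shift-reduce transitions. The sketch below shows the arc-standard transition system in isolation; a real system predicts each transition with a classifier trained on treebank data, whereas here the sequence is supplied by hand for illustration.

```python
# Sketch of the transition-based (shift-reduce) idea behind data-driven
# dependency parsing. The transition sequence below is hand-supplied; a real
# parser would learn to choose it from treebank data.

def arc_standard(words, transitions):
    """Apply arc-standard transitions; return arcs as (head, dependent) pairs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":          # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":         # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

words = ["news", "attracted", "attention"]
arcs = arc_standard(words, ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"])
print(arcs)  # → [(1, 0), (1, 2)]: "attracted" heads both "news" and "attention"
```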

801 citations


Cites background or methods from "The Penn Chinese TreeBank: Phrase s..."

  • ...The Chinese data are taken from the Penn Chinese Treebank (CTB), version 5.1 (Xue et al. 2005), and the texts are mostly from Xinhua newswire, Sinorama news magazine and Hong Kong News....


Journal ArticleDOI
TL;DR: Surveys the explosion of deep learning models that has propelled the field of natural language processing forward over the last several years, covering several core linguistic processing issues as well as many applications of computational linguistics.
Abstract: Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This article provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to many applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.

783 citations

Proceedings Article
13 Jul 2012
TL;DR: Describes the OntoNotes annotation (coreference and other layers) and the parameters of the shared task, including the format, pre-processing information, and evaluation criteria, and presents and discusses the results achieved by the participating systems.
Abstract: The CoNLL-2012 shared task involved predicting coreference in three languages -- English, Chinese and Arabic -- using OntoNotes data. It was a follow-on to the English-only task organized in 2011. Until the creation of the OntoNotes corpus, resources in this subfield of language processing have tended to be limited to noun phrase coreference, often on a restricted set of entities, such as ACE entities. OntoNotes provides a large-scale corpus of general anaphoric coreference not restricted to noun phrases or to a specified set of entity types and covering multiple languages. OntoNotes also provides additional layers of integrated annotation, capturing additional shallow semantic structure. This paper briefly describes the OntoNotes annotation (coreference and other layers) and then describes the parameters of the shared task including the format, pre-processing information, evaluation criteria, and presents and discusses the results achieved by the participating systems. Being a task that has a complex evaluation history, and multiple evaluation conditions, it has, in the past, been difficult to judge the improvement in new algorithms over previously reported results. Having a standard test set and evaluation parameters, all based on a resource that provides multiple integrated annotation layers (parses, semantic roles, word senses, named entities and coreference) that could support joint models, should help to energize ongoing research in the task of entity and event coreference.
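Coreference evaluations like this one score system output against gold mention chains with several metrics. As one illustration, here is a minimal sketch of the B-cubed metric (one of the metrics used in the CoNLL evaluations); the mention ids and chains below are invented.

```python
# Minimal sketch of the B-cubed coreference metric. B-cubed scores each mention
# by the overlap between its cluster in one partition and its cluster in the
# other, then averages over mentions. Chains below are made-up examples.

def b_cubed(key, response):
    """key/response: lists of sets of mention ids. Returns (precision, recall)."""
    def score(chains_a, chains_b):
        total, n = 0.0, 0
        for a in chains_a:
            for m in a:
                # the cluster in chains_b containing m (singleton if unseen)
                b = next((c for c in chains_b if m in c), {m})
                total += len(a & b) / len(a)
                n += 1
        return total / n
    return score(response, key), score(key, response)

key = [{1, 2, 3}, {4, 5}]
response = [{1, 2}, {3, 4, 5}]
p, r = b_cubed(key, response)
print(round(p, 3), round(r, 3))  # → 0.733 0.733
```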

773 citations

Proceedings ArticleDOI
30 May 2016
TL;DR: A robust methodology for quantifying semantic change is developed by evaluating word embeddings against known historical changes and it is revealed that words that are more polysemous have higher rates of semantic change.
Abstract: Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity---the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation---independent of frequency, words that are more polysemous have higher rates of semantic change.
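The underlying change measure in this line of work is the distance between a word's embedding vectors at two time points, computed after aligning the embedding spaces. A minimal sketch, with made-up vectors standing in for diachronic embeddings:

```python
# Sketch of the semantic-change measure: cosine distance between a word's
# embedding in two time periods. The vectors are invented for illustration;
# real studies first align the two embedding spaces before comparing.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

v_early = [0.9, 0.1, 0.0]   # hypothetical embedding in an early corpus
v_late = [0.5, 0.5, 0.1]    # hypothetical embedding in a later corpus
print(round(cosine_distance(v_early, v_late), 3))  # → 0.227
```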

633 citations

Proceedings ArticleDOI
04 Jun 2009
TL;DR: The 2009 shared task combines the shared tasks of the previous five years under a unified dependency-based formalism similar to the 2008 task; the paper describes how the data sets were created and reports their quantitative properties.
Abstract: For the 11th straight year, the Conference on Computational Natural Language Learning has been accompanied by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2009, the shared task was dedicated to the joint parsing of syntactic and semantic dependencies in multiple languages. This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task. In this paper, we define the shared task, describe how the data sets were created and show their quantitative properties, report the results and summarize the approaches of the participating systems.
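Shared-task data in this tradition is distributed in CoNLL-style column format: one token per line, tab-separated fields, and a blank line between sentences. The reader below is a hedged sketch over a simplified four-column subset (id, form, head, deprel), not the full 2009 shared-task format.

```python
# Hedged sketch of reading CoNLL-style data. The four-column layout here
# (id, form, head, deprel) is a simplification for illustration, not the
# actual column inventory of the 2009 shared task.

def read_conll(text):
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():           # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        idx, form, head, deprel = line.split("\t")
        current.append({"id": int(idx), "form": form,
                        "head": int(head), "deprel": deprel})
    if current:
        sentences.append(current)
    return sentences

sample = "1\tNews\t2\tSBJ\n2\tspreads\t0\tROOT\n"
for sent in read_conll(sample):
    print([(tok["form"], tok["head"], tok["deprel"]) for tok in sent])
```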

531 citations


Cites methods from "The Penn Chinese TreeBank: Phrase s..."

  • ...The Chinese data used in the shared task is based on Chinese Treebank 6.0 and the Chinese Proposition Bank 2.0, both of which are publicly available via the Linguistic Data Consortium....

  • ...The version of the Chinese Treebank used in this shared task, CTB 6.0, includes newswire, magazine articles, and transcribed broadcast news....

  • ...The Chinese Proposition Bank adds a layer of semantic annotation to the syntactic parses in the Chinese Treebank....

  • ...The Chinese Treebank and the Chinese Proposition Bank were funded by DOD, NSF and DARPA....

  • ...The data sources of the Chinese Treebank range from Xinhua newswire (mainland China), Hong Kong news, and Sinorama Magazine (Taiwan)....

References
ReportDOI
TL;DR: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Abstract: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

8,377 citations

Book
01 Jan 1981

7,936 citations


"The Penn Chinese TreeBank: Phrase s..." refers background in this paper

  • ...While the influence of Government and Binding (GB) theory (Chomsky 1981) and X-bar theory (Jackendoff 1977) is obvious in our corpus, we do not adopt the whole package....

Journal ArticleDOI
TL;DR: An automatic system for semantic role tagging trained on the corpus is described and the effect on its performance of various types of information is discussed, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty trace categories of the treebank.
Abstract: The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and many other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated.We discuss the criteria used to define the sets of semantic roles used in the annotation process and to analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty ''trace'' categories of the treebank.
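The PropBank layer can be pictured as predicate-argument labels attached to spans of a parsed sentence. The snippet below is only an illustration of the data model; the sentence and spans are invented, though the Arg0/Arg1 labels loosely follow PropBank's numbered-argument convention.

```python
# Illustrative sketch of the PropBank idea: a layer of predicate-argument
# labels over a sentence. Sentence and spans are invented; Arg0/Arg1 follow
# the numbered-argument convention (Arg0 ~ agent, Arg1 ~ patient) loosely.
annotation = {
    "sentence": "The company bought the factory",
    "predicate": "bought",
    "roles": {
        "Arg0": "The company",   # the buyer
        "Arg1": "the factory",   # the thing bought
    },
}
print(annotation["roles"]["Arg0"])  # → The company
```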

2,416 citations

Proceedings Article
Eugene Charniak1
29 Apr 2000
TL;DR: A new parser for parsing down to Penn tree-bank style parse trees that achieves 90.1% average precision/recall for sentences of length 40 and less and 89.5% when trained and tested on the previously established sections of the Wall Street Journal treebank is presented.
Abstract: We present a new parser for parsing down to Penn tree-bank style parse trees that achieves 90.1% average precision/recall for sentences of length 40 and less, and 89.5% for sentences of length 100 and less when trained and tested on the previously established [5, 9, 10, 15, 17] "standard" sections of the Wall Street Journal treebank. This represents a 13% decrease in error rate over the best single-parser results on this corpus [9]. The major technical innovation is the use of a "maximum-entropy-inspired" model for conditioning and smoothing that let us successfully test and combine many different conditioning events. We also present some partial results showing the effects of different conditioning information, including a surprising 2% improvement due to guessing the lexical head's pre-terminal before guessing the lexical head.

1,709 citations


"The Penn Chinese TreeBank: Phrase s..." refers background in this paper

  • ...Most notably, the Penn English Treebank (Marcus, Santorini and Marcinkiewicz 1993) has proven to be a crucial resource in the recent success of English Part-Of-Speech (POS) taggers and parsers (Collins 1997, 2000; Charniak 2000), as it provides common training and testing material so that different algorithms can be compared and progress be gauged....
