Journal ArticleDOI

The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

01 Jun 2005-Natural Language Engineering (Cambridge University Press)-Vol. 11, Iss: 2, pp 207-238
TL;DR: Discusses several Chinese linguistic issues and their implications for treebanking efforts, explains how the annotation guidelines address those issues, and describes engineering strategies to improve annotation speed while ensuring quality.
Abstract: With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.
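The segmented, POS-tagged, and bracketed format described above can be illustrated with a minimal reader for Penn-style bracketed strings. This is only a sketch: the labels (IP, NP, VV) loosely follow CTB conventions, but the sentence is invented (with English placeholder words), not drawn from the corpus.

```python
# Minimal sketch: reading a Penn-style bracketed parse into nested tuples.
# The example tree is illustrative, not taken from CTB itself.

def parse_bracketed(s):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:                      # a leaf word under a POS tag
                children.append(tokens[pos])
                pos += 1
        pos += 1                       # consume ")"
        return (label, children)

    return read()

tree = parse_bracketed("(IP (NP (NN example)) (VP (VV works)))")
print(tree)
# → ('IP', [('NP', [('NN', ['example'])]), ('VP', [('VV', ['works'])])])
```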
Citations
Journal ArticleDOI
TL;DR: Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
Abstract: Parsing unrestricted text is useful for many language technology applications but requires parsing methods that are both robust and efficient. MaltParser is a language-independent system for data-driven dependency parsing that can be used to induce a parser for a new language from a treebank sample in a simple yet flexible manner. Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
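MaltParser's data-driven approach builds dependency trees with a learned sequence of shift-reduce transitions. The sketch below shows the arc-standard transition system in isolation; a real system predicts each transition with a classifier trained on treebank data, whereas here the sequence is supplied by hand for illustration.

```python
# Sketch of the transition-based (shift-reduce) idea behind data-driven
# dependency parsing. The transition sequence below is hand-supplied; a real
# parser would learn to choose it from treebank data.

def arc_standard(words, transitions):
    """Apply arc-standard transitions; return arcs as (head, dependent) pairs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":          # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":         # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

words = ["news", "attracted", "attention"]
arcs = arc_standard(words, ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"])
print(arcs)  # → [(1, 0), (1, 2)]: "attracted" heads both "news" and "attention"
```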

801 citations


Cites background or methods from "The Penn Chinese TreeBank: Phrase s..."

  • ...The Chinese data are taken from the Penn Chinese Treebank (CTB), version 5.1 (Xue et al. 2005), and the texts are mostly from Xinhua newswire, Sinorama news magazine and Hong Kong News....


Journal ArticleDOI
TL;DR: Surveys the explosion of deep learning models that has propelled the field of natural language processing forward over the last several years, covering several core linguistic processing issues as well as many applications of computational linguistics.
Abstract: Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This article provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to many applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.

783 citations

Proceedings Article
13 Jul 2012
TL;DR: Describes the OntoNotes annotation (coreference and other layers) and the parameters of the shared task, including the format, pre-processing information, and evaluation criteria, and presents and discusses the results achieved by the participating systems.
Abstract: The CoNLL-2012 shared task involved predicting coreference in three languages -- English, Chinese and Arabic -- using OntoNotes data. It was a follow-on to the English-only task organized in 2011. Until the creation of the OntoNotes corpus, resources in this subfield of language processing have tended to be limited to noun phrase coreference, often on a restricted set of entities, such as ACE entities. OntoNotes provides a large-scale corpus of general anaphoric coreference not restricted to noun phrases or to a specified set of entity types and covering multiple languages. OntoNotes also provides additional layers of integrated annotation, capturing additional shallow semantic structure. This paper briefly describes the OntoNotes annotation (coreference and other layers) and then describes the parameters of the shared task including the format, pre-processing information, evaluation criteria, and presents and discusses the results achieved by the participating systems. Being a task that has a complex evaluation history, and multiple evaluation conditions, it has, in the past, been difficult to judge the improvement in new algorithms over previously reported results. Having a standard test set and evaluation parameters, all based on a resource that provides multiple integrated annotation layers (parses, semantic roles, word senses, named entities and coreference) that could support joint models, should help to energize ongoing research in the task of entity and event coreference.
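Coreference evaluations like this one score system output against gold mention chains with several metrics. As one illustration, here is a minimal sketch of the B-cubed metric (one of the metrics used in the CoNLL evaluations); the mention ids and chains below are invented.

```python
# Minimal sketch of the B-cubed coreference metric. B-cubed scores each mention
# by the overlap between its cluster in one partition and its cluster in the
# other, then averages over mentions. Chains below are made-up examples.

def b_cubed(key, response):
    """key/response: lists of sets of mention ids. Returns (precision, recall)."""
    def score(chains_a, chains_b):
        total, n = 0.0, 0
        for a in chains_a:
            for m in a:
                # the cluster in chains_b containing m (singleton if unseen)
                b = next((c for c in chains_b if m in c), {m})
                total += len(a & b) / len(a)
                n += 1
        return total / n
    return score(response, key), score(key, response)

key = [{1, 2, 3}, {4, 5}]
response = [{1, 2}, {3, 4, 5}]
p, r = b_cubed(key, response)
print(round(p, 3), round(r, 3))  # → 0.733 0.733
```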

773 citations

Proceedings ArticleDOI
30 May 2016
TL;DR: A robust methodology for quantifying semantic change is developed by evaluating word embeddings against known historical changes and it is revealed that words that are more polysemous have higher rates of semantic change.
Abstract: Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity---the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation---independent of frequency, words that are more polysemous have higher rates of semantic change.
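The underlying change measure in this line of work is the distance between a word's embedding vectors at two time points, computed after aligning the embedding spaces. A minimal sketch, with made-up vectors standing in for diachronic embeddings:

```python
# Sketch of the semantic-change measure: cosine distance between a word's
# embedding in two time periods. The vectors are invented for illustration;
# real studies first align the two embedding spaces before comparing.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

v_early = [0.9, 0.1, 0.0]   # hypothetical embedding in an early corpus
v_late = [0.5, 0.5, 0.1]    # hypothetical embedding in a later corpus
print(round(cosine_distance(v_early, v_late), 3))  # → 0.227
```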

633 citations

Proceedings ArticleDOI
04 Jun 2009
TL;DR: The 2009 shared task combines the shared tasks of the previous five years under a unified dependency-based formalism similar to the 2008 task; the paper describes how the data sets were created and reports their quantitative properties.
Abstract: For the 11th straight year, the Conference on Computational Natural Language Learning has been accompanied by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2009, the shared task was dedicated to the joint parsing of syntactic and semantic dependencies in multiple languages. This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task. In this paper, we define the shared task, describe how the data sets were created and show their quantitative properties, report the results and summarize the approaches of the participating systems.
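Shared-task data in this tradition is distributed in CoNLL-style column format: one token per line, tab-separated fields, and a blank line between sentences. The reader below is a hedged sketch over a simplified four-column subset (id, form, head, deprel), not the full 2009 shared-task format.

```python
# Hedged sketch of reading CoNLL-style data. The four-column layout here
# (id, form, head, deprel) is a simplification for illustration, not the
# actual column inventory of the 2009 shared task.

def read_conll(text):
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():           # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        idx, form, head, deprel = line.split("\t")
        current.append({"id": int(idx), "form": form,
                        "head": int(head), "deprel": deprel})
    if current:
        sentences.append(current)
    return sentences

sample = "1\tNews\t2\tSBJ\n2\tspreads\t0\tROOT\n"
for sent in read_conll(sample):
    print([(tok["form"], tok["head"], tok["deprel"]) for tok in sent])
```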

531 citations


Cites methods from "The Penn Chinese TreeBank: Phrase s..."

  • ...The Chinese data used in the shared task is based on Chinese Treebank 6.0 and the Chinese Proposition Bank 2.0, both of which are publicly available via the Linguistic Data Consortium....

  • ...The version of the Chinese Treebank used in this shared task, CTB 6.0, includes newswire, magazine articles, and transcribed broadcast news....

  • ...The Chinese Proposition Bank adds a layer of semantic annotation to the syntactic parses in the Chinese Treebank....

  • ...The Chinese Treebank and the Chinese Proposition Bank were funded by DOD, NSF and DARPA....

  • ...The data sources of the Chinese Treebank range from Xinhua newswire (mainland China), Hong Kong news, and Sinorama Magazine (Taiwan)....

References
ReportDOI
TL;DR: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Abstract: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

8,377 citations

Book
01 Jan 1981

7,936 citations


"The Penn Chinese TreeBank: Phrase s..." refers background in this paper

  • ...While the influence of Government and Binding (GB) theory (Chomsky 1981) and X-bar theory (Jackendoff 1977) is obvious in our corpus, we do not adopt the whole package....

Journal ArticleDOI
TL;DR: An automatic system for semantic role tagging trained on the corpus is described and the effect on its performance of various types of information is discussed, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty trace categories of the treebank.
Abstract: The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and many other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated.We discuss the criteria used to define the sets of semantic roles used in the annotation process and to analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty ''trace'' categories of the treebank.
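The PropBank layer can be pictured as predicate-argument labels attached to spans of a parsed sentence. The snippet below is only an illustration of the data model; the sentence and spans are invented, though the Arg0/Arg1 labels loosely follow PropBank's numbered-argument convention.

```python
# Illustrative sketch of the PropBank idea: a layer of predicate-argument
# labels over a sentence. Sentence and spans are invented; Arg0/Arg1 follow
# the numbered-argument convention (Arg0 ~ agent, Arg1 ~ patient) loosely.
annotation = {
    "sentence": "The company bought the factory",
    "predicate": "bought",
    "roles": {
        "Arg0": "The company",   # the buyer
        "Arg1": "the factory",   # the thing bought
    },
}
print(annotation["roles"]["Arg0"])  # → The company
```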

2,416 citations

Proceedings Article
Eugene Charniak1
29 Apr 2000
TL;DR: A new parser for parsing down to Penn tree-bank style parse trees that achieves 90.1% average precision/recall for sentences of length 40 and less and 89.5% when trained and tested on the previously established sections of the Wall Street Journal treebank is presented.
Abstract: We present a new parser for parsing down to Penn tree-bank style parse trees that achieves 90.1% average precision/recall for sentences of length 40 and less, and 89.5% for sentences of length 100 and less when trained and tested on the previously established [5, 9, 10, 15, 17] "standard" sections of the Wall Street Journal treebank. This represents a 13% decrease in error rate over the best single-parser results on this corpus [9]. The major technical innovation is the use of a "maximum-entropy-inspired" model for conditioning and smoothing that let us successfully test and combine many different conditioning events. We also present some partial results showing the effects of different conditioning information, including a surprising 2% improvement due to guessing the lexical head's pre-terminal before guessing the lexical head.

1,709 citations


"The Penn Chinese TreeBank: Phrase s..." refers background in this paper

  • ...Most notably, the Penn English Treebank (Marcus, Santorini and Marcinkiewicz 1993) has proven to be a crucial resource in the recent success of English Part-Of-Speech (POS) taggers and parsers (Collins 1997, 2000; Charniak 2000), as it provides common training and testing material so that different algorithms can be compared and progress be gauged....
