
Showing papers in "Computational Linguistics in 2001"


Journal ArticleDOI
TL;DR: The learning approach to coreference resolution of noun phrases in unrestricted text is presented, indicating that on the general noun phrase coreference task, the learning approach holds promise and achieves accuracy comparable to that of nonlearning approaches.
Abstract: In this paper, we present a learning approach to coreference resolution of noun phrases in unrestricted text. The approach learns from a small, annotated corpus and the task includes resolving not just a certain type of noun phrase (e.g., pronouns) but rather general noun phrases. It also does not restrict the entity types of the noun phrases; that is, coreference is assigned whether they are of "organization," "person," or other types. We evaluate our approach on common data sets (namely, the MUC-6 and MUC-7 coreference corpora) and obtain encouraging results, indicating that on the general noun phrase coreference task, the learning approach holds promise and achieves accuracy comparable to that of nonlearning approaches. Our system is the first learning-based system that offers performance comparable to that of state-of-the-art nonlearning systems on these data sets.
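
A learning approach of this kind is usually pictured as a classify-then-link pipeline over mention pairs. The sketch below is only an illustration of that pipeline, not the authors' system: the feature set, the scikit-learn decision tree, and the closest-first linking rule are assumptions standing in for the full method.

```python
# Minimal mention-pair coreference sketch (illustrative only).
# Assumes mentions are dicts with head, gender, number, and position fields,
# and that training labels are 1 for coreferent pairs, 0 otherwise.
from sklearn.tree import DecisionTreeClassifier

def pair_features(antecedent, anaphor):
    """Toy feature vector for a candidate antecedent/anaphor pair."""
    return [
        int(antecedent["head"].lower() == anaphor["head"].lower()),  # string match
        int(antecedent["gender"] == anaphor["gender"]),              # gender agreement
        int(antecedent["number"] == anaphor["number"]),              # number agreement
        anaphor["position"] - antecedent["position"],                # distance in mentions
    ]

def train(pairs, labels):
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit([pair_features(a, b) for a, b in pairs], labels)
    return clf

def resolve(clf, mentions):
    """Link each mention to the closest preceding mention classified as coreferent."""
    links = {}
    for j, anaphor in enumerate(mentions):
        for i in range(j - 1, -1, -1):          # scan right to left
            if clf.predict([pair_features(mentions[i], anaphor)])[0] == 1:
                links[j] = i
                break
    return links
```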

1,059 citations



Journal ArticleDOI
TL;DR: This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words.
Abstract: This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly develop a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar matches well the analysis that would be developed by a human morphologist. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of evaluation metric in early generative grammar.
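
The core quantity in any MDL analysis is a single cost: the length of the grammar plus the length of the corpus encoded with that grammar, and a modification is adopted only if it lowers that total. The fragment below is a schematic cost function under simplifying assumptions (uniform character coding, a unigram code for morphs); it is not the paper's actual heuristics.

```python
import math
from collections import Counter

def description_length(segmented_corpus, bits_per_char=5.0):
    """Total MDL cost: model cost (morph inventory) + data cost (corpus given model).

    segmented_corpus: list of words, each a list of morph strings,
    e.g. [["walk", "ing"], ["walk", "ed"], ["jump", "ing"]].
    """
    morph_counts = Counter(m for word in segmented_corpus for m in word)
    total_tokens = sum(morph_counts.values())

    # Model cost: spell out each distinct morph once.
    model_cost = sum(len(m) * bits_per_char for m in morph_counts)

    # Data cost: -log2 P(morph) for every morph token, under a unigram model.
    data_cost = sum(
        -count * math.log2(count / total_tokens) for count in morph_counts.values()
    )
    return model_cost + data_cost

# A candidate re-segmentation is kept only if it lowers the total cost:
# if description_length(new_segmentation) < description_length(old_segmentation): adopt it
```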

789 citations


Journal ArticleDOI
Brian Roark1
TL;DR: The authors proposed a probabilistic top-down parser for language modeling for speech recognition, which performs very well in terms of both the accuracy of returned parses and the efficiency with which they are found, relative to the best broad-coverage statistical parsers.
Abstract: This paper describes the functioning of a broad-coverage probabilistic top-down parser, and its application to the problem of language modeling for speech recognition. The paper first introduces key notions in language modeling and probabilistic parsing, and briefly reviews some previous approaches to using syntactic structure for language modeling. A lexicalized probabilistic top-down parser is then presented, which performs very well, in terms of both the accuracy of returned parses and the efficiency with which they are found, relative to the best broad-coverage statistical parsers. A new language model that utilizes probabilistic top-down parsing is then outlined, and empirical results show that it improves upon previous work in test corpus perplexity. Interpolation with a trigram model yields an exceptional improvement relative to the improvement observed by other models, demonstrating the degree to which the information captured by our parsing model is orthogonal to that captured by a trigram model. A small recognition experiment also demonstrates the utility of the model.
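
At its simplest, the interpolation the abstract mentions is a per-word mixture of the parser's conditional probability and a trigram probability, with perplexity computed from the mixed stream. The sketch below only illustrates that arithmetic; the fixed mixture weight and the two probability functions are placeholders, not the paper's model.

```python
import math

def interpolated_logprob(words, p_parser, p_trigram, lam=0.6):
    """Per-word interpolation: P(w|h) = lam * P_parser(w|h) + (1 - lam) * P_trigram(w|h).

    p_parser and p_trigram are callables returning conditional probabilities;
    lam is an interpolation weight (fixed here; normally tuned on held-out data).
    """
    total = 0.0
    for i, w in enumerate(words):
        history = words[:i]
        p = lam * p_parser(w, history) + (1 - lam) * p_trigram(w, history)
        total += math.log2(p)
    return total

def perplexity(words, p_parser, p_trigram, lam=0.6):
    """Test-corpus perplexity of the interpolated model."""
    return 2 ** (-interpolated_logprob(words, p_parser, p_trigram, lam) / len(words))
```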

336 citations


Journal ArticleDOI
TL;DR: This work reports on supervised learning experiments to automatically classify three major types of English verbs, based on their argument structure, specifically the thematic roles they assign to participants, using linguistically-motivated statistical indicators extracted from large annotated corpora to train the classifier.
Abstract: Automatic acquisition of lexical knowledge is critical to a wide range of natural language processing tasks. Especially important is knowledge about verbs, which are the primary source of relational information in a sentence: the predicate-argument structure that relates an action or state to its participants (i.e., who did what to whom). In this work, we report on supervised learning experiments to automatically classify three major types of English verbs, based on their argument structure, specifically the thematic roles they assign to participants. We use linguistically-motivated statistical indicators extracted from large annotated corpora to train the classifier, achieving 69.8% accuracy for a task whose baseline is 34%, and whose expert-based upper bound we calculate at 86.5%. A detailed analysis of the performance of the algorithm and of its errors confirms that the proposed features capture properties related to the argument structure of the verbs. Our results validate our hypotheses that knowledge about thematic relations is crucial for verb classification, and that it can be gleaned from a corpus by automatic means. We thus demonstrate an effective combination of deeper linguistic knowledge with the robustness and scalability of statistical techniques.
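
The statistical indicators are, in essence, corpus counts summarizing how a verb behaves syntactically, which then feed a standard supervised classifier. The sketch below shows one hypothetical way to turn per-occurrence observations into such features; the specific features and the decision-tree learner are illustrative assumptions, not the authors' exact indicator set.

```python
from sklearn.tree import DecisionTreeClassifier

def verb_features(usages):
    """usages: list of dicts describing one occurrence of a verb, e.g.
    {"transitive": True, "passive": False, "subject_animate": True, "past_tense": True}.
    Returns relative-frequency indicators of the verb's argument-structure behavior."""
    n = len(usages)
    return [
        sum(u["transitive"] for u in usages) / n,        # transitive use
        sum(u["passive"] for u in usages) / n,           # passive voice
        sum(u["subject_animate"] for u in usages) / n,   # animate subject
        sum(u["past_tense"] for u in usages) / n,        # simple past vs. participle use
    ]

def train_verb_classifier(verb_usage_lists, class_labels):
    """class_labels: one verb-class label per verb, e.g. 'unergative', 'unaccusative', 'object-drop'."""
    X = [verb_features(usages) for usages in verb_usage_lists]
    clf = DecisionTreeClassifier()
    clf.fit(X, class_labels)
    return clf
```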

216 citations


Journal ArticleDOI
TL;DR: It is examined how differences in language models, learned by different data-driven systems performing the same NLP task, can be exploited to yield a higher accuracy than the best individual system.
Abstract: We examine how differences in language models, learned by different data-driven systems performing the same NLP task, can be exploited to yield a higher accuracy than the best individual system. We do this by means of experiments involving the task of morphosyntactic word class tagging, on the basis of three different tagged corpora. Four well-known tagger generators (hidden Markov model, memory-based, transformation rules, and maximum entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second-stage classifiers. All combination taggers outperform their best component. The reduction in error rate varies with the material in question, but can be as high as 24.3% with the LOB corpus.
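
The simplest of the combination strategies described here is plain majority voting over the component taggers' outputs, with ties broken in favor of one designated tagger. The sketch below illustrates only that baseline; the weighted-voting and second-stage-classifier variants in the paper are not shown.

```python
from collections import Counter

def majority_vote(tag_sequences, tiebreak_index=0):
    """tag_sequences: one tag sequence per component tagger, all of equal length.
    Returns the combined sequence; ties go to the tagger at tiebreak_index."""
    combined = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        best_tag, best_count = counts.most_common(1)[0]
        if list(counts.values()).count(best_count) > 1:   # tie between taggers
            best_tag = position_tags[tiebreak_index]
        combined.append(best_tag)
    return combined

# Example: four taggers disagree on the second token.
print(majority_vote([["DT", "NN"], ["DT", "VB"], ["DT", "NN"], ["DT", "NN"]]))
# ['DT', 'NN']
```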

208 citations


Journal ArticleDOI
TL;DR: The authors used suffix arrays to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora, an English corpus of 50 million words of Wall Street Journal and a Japanese corpus of 216 million characters of Mainichi Shimbun.
Abstract: Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora, an English corpus of 50 million words of Wall Street Journal and a Japanese corpus of 216 million characters of Mainichi Shimbun. The second half of the paper uses these frequencies to find "interesting" substrings. Lexicographers have been interested in n-grams with high mutual information (MI) where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption) whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (which tend to exhibit nonrandom distributions over documents). The combination of both MI and RIDF is better than either by itself in a Japanese word extraction task.
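
Two of the quantities in this entry are easy to state compactly: residual IDF compares a term's observed document frequency with what a Poisson model would predict from its term frequency, and pointwise mutual information compares a phrase's joint frequency with the product of its parts. A naive suffix-array construction is also shown; it is quadratic in memory and is meant only to make the idea concrete, not to scale to 50-million-word corpora.

```python
import math

def suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def ridf(tf, df, num_docs):
    """Residual IDF = observed IDF minus the IDF expected under a Poisson model
    with rate tf / num_docs."""
    observed_idf = -math.log2(df / num_docs)
    poisson_rate = tf / num_docs
    expected_idf = -math.log2(1.0 - math.exp(-poisson_rate))
    return observed_idf - expected_idf

def pointwise_mi(freq_xy, freq_x, freq_y, n):
    """PMI of a two-word phrase: log2 of P(xy) / (P(x) * P(y)), with n total tokens."""
    return math.log2((freq_xy / n) / ((freq_x / n) * (freq_y / n)))
```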

207 citations


Journal ArticleDOI
TL;DR: This work presents a sense tagger which uses several knowledge sources and attempts to disambiguate all content words in running text rather than limiting itself to treating a restricted vocabulary of words.
Abstract: Word sense disambiguation (WSD) is a computational linguistics task likely to benefit from the tradition of combining different knowledge sources in artificial intelligence research. An important step in the exploration of this hypothesis is to determine which linguistic knowledge sources are most useful and whether their combination leads to improved results. We present a sense tagger which uses several knowledge sources. Tested accuracy exceeds 94% on our evaluation corpus. Our system attempts to disambiguate all content words in running text rather than limiting itself to treating a restricted vocabulary of words. It is argued that this approach is more likely to assist the creation of practical systems.

182 citations


Journal ArticleDOI
TL;DR: It is demonstrated that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections, and an integrated approach to layout, text, and diagram generation is introduced.
Abstract: Combining elements appropriately within a coherent page layout is a well-recognized and crucial aspect of sophisticated information presentation. The precise function and nature of layout has not, however, been sufficiently addressed within computational approaches; attention is often restricted to relatively local issues of typography and text-formatting, leaving broader issues of layout unaddressed. In this paper we focus on the selection and function of layout in pages that appropriately combine textual and graphical representation styles to yield coherent presentation designs. We demonstrate that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections. We also introduce an integrated approach to layout, text, and diagram generation. Our approach is developed on the basis of a preliminary empirical investigation of professionally produced layouts, followed by implementation within a prototype information system in the area of art history.

119 citations


Journal ArticleDOI
TL;DR: In this article, a statistical model for segmentation and word discovery in continuous speech is presented, and an incremental unsupervised learning algorithm to infer word boundaries based on this model is described.
Abstract: A statistical model for segmentation and word discovery in continuous speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described. Results are also presented of empirical tests showing that the algorithm is competitive with other models that have been used for similar tasks.
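
An incremental segmenter of this general kind can be reduced to a Viterbi-style dynamic program: score every way of cutting the next utterance into words under the current lexicon, keep the best one, and update the lexicon counts from it. The sketch below follows that recipe with a crude unigram/novel-word score; it is a toy stand-in, not the paper's model.

```python
import math
from collections import Counter

lexicon = Counter()   # counts of word types discovered so far
total_tokens = 0

def word_score(word):
    """Negative log probability: seen words scored by relative frequency,
    unseen words penalized per character (a crude novel-word model)."""
    if total_tokens and lexicon[word]:
        return -math.log(lexicon[word] / total_tokens)
    return len(word) * math.log(30.0)     # assumes roughly a 30-symbol alphabet

def segment(utterance):
    """Viterbi over cut points: best[i] = lowest cost of segmenting utterance[:i]."""
    n = len(utterance)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):          # cap candidate word length at 12
            cost = best[j] + word_score(utterance[j:i])
            if cost < best[i]:
                best[i], back[i] = cost, j
    words, i = [], n
    while i > 0:
        words.append(utterance[back[i]:i])
        i = back[i]
    return list(reversed(words))

def observe(utterance):
    """Incremental learning: segment the utterance, then update the lexicon with the result."""
    global total_tokens
    words = segment(utterance)
    lexicon.update(words)
    total_tokens += len(words)
    return words
```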

118 citations


Journal ArticleDOI
TL;DR: A probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units is presented and a significant reduction in error is achieved by combining the prosodic and word-based knowledge sources.
Abstract: We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources.

Journal ArticleDOI
TL;DR: A centering algorithm (Left-Right Centering) is introduced that adheres to the constraints and rules of centering theory and is an alternative to Brennan, Friedman, and Pollard's (1987) algorithm.
Abstract: In this paper we compare pronoun resolution algorithms and introduce a centering algorithm (Left-Right Centering) that adheres to the constraints and rules of centering theory and is an alternative to Brennan, Friedman, and Pollard's (1987) algorithm. We then use the Left-Right Centering algorithm to see if two psycholinguistic claims on Cf-list ranking will actually improve pronoun resolution accuracy. Our results from this investigation lead to the development of a new syntax-based ranking of the Cf-list and corpus-based evidence that contradicts the psycholinguistic claims.
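
In outline, a left-to-right centering resolver works incrementally: when a pronoun is encountered, it first searches the entities already realized to the pronoun's left in the current utterance, then the ranked Cf-list of the previous utterance, taking the first candidate that passes agreement checks. The sketch below is a schematic rendering of that search order under assumed data structures, not the authors' implementation.

```python
def agrees(candidate, pronoun):
    """Morphological filter; entities are dicts with gender and number fields."""
    return (candidate["gender"] == pronoun["gender"]
            and candidate["number"] == pronoun["number"])

def resolve_pronoun(pronoun, current_cf_partial, previous_cf_ranked):
    """Search the partial Cf-list of the current utterance (left-to-right),
    then the ranked Cf-list of the previous utterance; return the first match."""
    for candidate in current_cf_partial + previous_cf_ranked:
        if agrees(candidate, pronoun):
            return candidate
    return None
```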

Journal ArticleDOI
TL;DR: The purpose of this paper is to suggest the use of a complementary backup method to increase the robustness of any hand-crafted or machine-learning-based NE tagger, and to explore the effectiveness of using more fine-grained evidence, namely syntactic and semantic contextual knowledge, in classifying NEs.
Abstract: Proper nouns form an open class, making the incompleteness of manually or automatically learned classification rules an obvious problem. The purpose of this paper is twofold: first, to suggest the use of a complementary "backup" method to increase the robustness of any hand-crafted or machine-learning-based NE tagger; and second, to explore the effectiveness of using more fine-grained evidence, namely syntactic and semantic contextual knowledge, in classifying NEs.

Journal ArticleDOI
TL;DR: An algorithm for identifying noun phrase antecedents of third person personal pronouns, demonstrative pronouns, reflexive pronouns, and omitted pronouns (zero pronouns) in unrestricted Spanish texts is presented.
Abstract: This paper presents an algorithm for identifying noun phrase antecedents of third person personal pronouns, demonstrative pronouns, reflexive pronouns, and omitted pronouns (zero pronouns) in unrestricted Spanish texts. We define a list of constraints and preferences for different types of pronominal expressions, and we document in detail the importance of each kind of knowledge (lexical, morphological, syntactic, and statistical) in anaphora resolution for Spanish. The paper also provides a definition for syntactic conditions on Spanish NP-pronoun noncoreference using partial parsing. The algorithm has been evaluated on a corpus of 1,677 pronouns and achieved a success rate of 76.8%. We have also implemented four competitive algorithms and tested their performance in a blind evaluation on the same test corpus. This new approach could easily be extended to other languages such as English, Portuguese, Italian, or Japanese.
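
The constraints-and-preferences architecture described here amounts to a two-stage decision: hard constraints eliminate candidate antecedents, then weighted preferences rank the survivors. The sketch below shows that control flow with invented constraint and preference functions; the actual lexical, morphological, syntactic, and statistical knowledge in the paper is far richer.

```python
def resolve(pronoun, candidates, constraints, preferences):
    """constraints: predicates(candidate, pronoun) that must all hold.
    preferences: (weight, scorer(candidate, pronoun)) pairs used to rank survivors."""
    survivors = [
        c for c in candidates
        if all(constraint(c, pronoun) for constraint in constraints)
    ]
    if not survivors:
        return None
    return max(
        survivors,
        key=lambda c: sum(w * scorer(c, pronoun) for w, scorer in preferences),
    )

# Hypothetical knowledge sources: agreement as a constraint, recency and
# subjecthood as preferences.
constraints = [lambda c, p: c["number"] == p["number"] and c["gender"] == p["gender"]]
preferences = [(1.0, lambda c, p: -abs(p["position"] - c["position"])),   # recency
               (0.5, lambda c, p: 1.0 if c["role"] == "subject" else 0.0)]
```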

Journal ArticleDOI
TL;DR: This elicit-build-test technique compiles lexical and inflectional information elicited from a human into a finite-state transducer lexicon and combines this with a sequence of morphographemic rewrite rules that is induced using transformation-based learning from the elicited examples.
Abstract: This paper presents a semiautomatic technique for developing broad-coverage finite-state morphological analyzers for use in natural language processing applications. It consists of three components: elicitation of linguistic information from humans, a machine learning bootstrapping scheme, and a testing environment. The three components are applied iteratively until a threshold of output quality is attained. The initial application of this technique is for the morphology of low-density languages in the context of the Expedition project at NMSU Computing Research Laboratory. This elicit-build-test technique compiles lexical and inflectional information elicited from a human into a finite-state transducer lexicon and combines this with a sequence of morphographemic rewrite rules that is induced using transformation-based learning from the elicited examples. The resulting morphological analyzer is then tested against a test set, and any corrections are fed back into the learning procedure, which then builds an improved analyzer.

Journal ArticleDOI
TL;DR: It is shown how the DSG formalism, which is designed to inherit many of the characteristics of LTAG, can be used to express a variety of linguistic analyses not available in LTAG.
Abstract: There is considerable interest among computational linguists in lexicalized grammatical frameworks; lexicalized tree adjoining grammar (LTAG) is one widely studied example. In this paper, we investigate how derivations in LTAG can be viewed not as manipulations of trees but as manipulations of tree descriptions. Changing the way the lexicalized formalism is viewed raises questions as to the desirability of certain aspects of the formalism. We present a new formalism, d-tree substitution grammar (DSG). Derivations in DSG involve the composition of d-trees, special kinds of tree descriptions. Trees are read off from derived d-trees. We show how the DSG formalism, which is designed to inherit many of the characteristics of LTAG, can be used to express a variety of linguistic analyses not available in LTAG.

Journal ArticleDOI
TL;DR: A new formulation of Rule 2 of centering theory is proposed that incorporates these principles as well as a streamlined version of Strube and Hahn's (1999) notion of cheapness, and is argued that this formulation provides a natural way to handle topic switches that appear to violate the canonical preference ordering.
Abstract: The standard preference ordering on the well-known centering transitions Continue, Retain, Shift is argued to be unmotivated: a partial, context-dependent ordering emerges from the interaction between principles dubbed cohesion (maintaining the same center of attention) and salience (realizing the center of attention as the most prominent NP). A new formulation of Rule 2 of centering theory is proposed that incorporates these principles as well as a streamlined version of Strube and Hahn's (1999) notion of cheapness. It is argued that this formulation provides a natural way to handle "topic switches" that appear to violate the canonical preference ordering.

Journal ArticleDOI
TL;DR: The ROSANA approach, which generalizes the verification of coindexing restrictions in order to make it applicable to the deficient syntactic descriptions that are provided by a robust state-of-the-art parser, and proves that the robust implementation of syntactic disjoint reference is nearly optimal.
Abstract: Syntactic coindexing restrictions are by now known to be of central importance to practical anaphor resolution approaches. Since, in particular due to structural ambiguity, the assumption of the availability of a unique syntactic reading proves to be unrealistic, robust anaphor resolution relies on techniques to overcome this deficiency.This paper describes the ROSANA approach, which generalizes the verification of coindexing restrictions in order to make it applicable to the deficient syntactic descriptions that are provided by a robust state-of-the-art parser. By a formal evaluation on two corpora that differ with respect to text genre and domain, it is shown that ROSANA achieves high-quality robust coreference resolution. Moreover, by an in-depth analysis, it is proven that the robust implementation of syntactic disjoint reference is nearly optimal. The study reveals that, compared with approaches that rely on shallow preprocessing, the largely nonheuristic disjoint reference algorithmization opens up the possibility for a slight improvement. Furthermore, it is shown that more significant gains are to be expected elsewhere, particularly from a text-genre-specific choice of preference strategies.The performance study of the ROSANA system crucially rests on an enhanced evaluation methodology for coreference resolution systems, the development of which constitutes the second major contribution of the paper. As a supplement to the model-theoretic scoring scheme that was developed for the Message Understanding Conference (MUC) evaluations, additional evaluation measures are defined that, on one hand, support the developer of anaphor resolution systems, and, on the other hand, shed light on application aspects of pronoun interpretation.

Journal ArticleDOI
TL;DR: The drive toward knowledge-poor and robust approaches was further motivated by the emergence of cheaper and more reliable corpus-based NLP tools such as part-of-speech taggers and shallow parsers, alongside the increasing availability of corpora and other NLP resources.
Abstract: Anaphora accounts for cohesion in texts and is a phenomenon under active study in formal and computational linguistics alike. The correct interpretation of anaphora is vital for natural language processing (NLP). For example, anaphora resolution is a key task in natural language interfaces, machine translation, text summarization, information extraction, question answering, and a number of other NLP applications. After considerable initial research, followed by years of relative silence in the early 1980s, anaphora resolution has attracted the attention of many researchers in the last 10 years and a great deal of successful work on the topic has been carried out. Discourse-oriented theories and formalisms such as Discourse Representation Theory and Centering Theory inspired new research on the computational treatment of anaphora. The drive toward corpus-based robust NLP solutions further stimulated interest in alternative and/or data-enriched approaches. Last, but not least, application-driven research in areas such as automatic abstracting and information extraction independently highlighted the importance of anaphora and coreference resolution, boosting research in this area. Much of the earlier work in anaphora resolution heavily exploited domain and linguistic knowledge (Sidner 1979; Carter 1987; Rich and LuperFoy 1988; Carbonell and Brown 1988), which was difficult both to represent and to process, and which required considerable human input. However, the pressing need for the development of robust and inexpensive solutions to meet the demands of practical NLP systems encouraged many researchers to move away from extensive domain and linguistic knowledge and to embark instead upon knowledge-poor anaphora resolution strategies. A number of proposals in the 1990s deliberately limited the extent to which they relied on domain and/or linguistic knowledge and reported promising results in knowledge-poor operational environments (Dagan and Itai 1990, 1991; Lappin and Leass 1994; Nasukawa 1994; Kennedy and Boguraev 1996; Williams, Harvey, and Preston 1996; Baldwin 1997; Mitkov 1996, 1998b). The drive toward knowledge-poor and robust approaches was further motivated by the emergence of cheaper and more reliable corpus-based NLP tools such as part-of-speech taggers and shallow parsers, alongside the increasing availability of corpora and other NLP resources (e.g., ontologies). In fact, the availability of corpora, both raw and annotated with coreferential links, provided a strong impetus to anaphora resolution.

Journal ArticleDOI
TL;DR: A new reporting standard is proposed that improves the exposition of individual results and the possibility for readers to compare techniques across studies, and an informative new performance metric, the resolution rate, is proposed for use in addition to precision and recall.
Abstract: Pronoun resolution studies compute performance inconsistently and describe results incompletely. We propose a new reporting standard that improves the exposition of individual results and the possibility for readers to compare techniques across studies. We also propose an informative new performance metric, the resolution rate, for use in addition to precision and recall.
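
Under one common reading of these metrics, precision is computed over the pronouns a system attempts, recall over the pronouns it should have attempted, and the resolution rate over every pronoun in the corpus regardless of what the system did. The sketch below encodes that reading; the precise definitions are the paper's, and this formulation is an assumption made for illustration.

```python
def pronoun_metrics(correct, attempted, in_scope, total_in_corpus):
    """correct: pronouns resolved to the right antecedent.
    attempted: pronouns the system tried to resolve.
    in_scope: pronouns the evaluation says should have been resolved.
    total_in_corpus: every pronoun in the evaluation texts."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / in_scope if in_scope else 0.0
    resolution_rate = correct / total_in_corpus if total_in_corpus else 0.0
    return {"precision": precision, "recall": recall, "resolution_rate": resolution_rate}

# Example: 60 correct out of 80 attempted, 90 in scope, 100 pronouns in the corpus.
print(pronoun_metrics(60, 80, 90, 100))
# precision 0.75, recall ~0.667, resolution_rate 0.6
```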





Journal ArticleDOI
TL;DR: An extensive analysis of the alignment procedure used in the MUC-6 evaluation of information extraction technology reveals effects that interfere with the stated goals of the evaluation, and argues strongly for the use of accurate alignment criteria in natural language evaluations.
Abstract: As evaluations of computational linguistics technology progress toward higher-level interpretation tasks, the problem of determining alignments between system responses and answer key entries may become less straightforward. We present an extensive analysis of the alignment procedure used in the MUC-6 evaluation of information extraction technology, which reveals effects that interfere with the stated goals of the evaluation. These effects are shown to be pervasive enough that they have the potential to adversely impact the technology development process. These results argue strongly for the use of accurate alignment criteria in natural language evaluations, and for maintaining the independence of alignment criteria and mechanisms used to calculate scores.

Journal Article
TL;DR: Language as a multifunctional system and process and the construal of experience: consciousness in daily life and in cognitive science and the making of meaning.
Abstract: Part I: Introduction 1. Theoretical preliminaries Part II: The ideation base 2. Overview of the general ideational potential 3. Sequences 4. Figures 5. Elements 6. Grammatical metaphors 7. Comparison with Chinese Part III: The meaning base as a resource in language processing systems 8. Building an ideation base 9. Using the ideation base in text processing Part IV: Theoretical and descriptive alternatives 10. Alternative approaches to meaning. 11. Distortion and transformation 12. Figures and processes Part V: Language and the construal of experience 13. Language as a multifunctional system and process 14. Construing ideational models: consciousness in daily life and in cognitive science 15. Language and the making of meaning.



