
Showing papers in "Natural Language Engineering in 2002"


Journal ArticleDOI
TL;DR: This work argues that with a systematic incremental methodology one can go beyond shallow parsing to deeper language analysis, while preserving robustness, and describes a generic system based on such a methodology and designed for building robust analyzers that tackle deeper linguistic phenomena than those traditionally handled by the now widespread shallow parsers.
Abstract: Robustness is a key issue for natural language processing in general and parsing in particular, and many approaches have been explored in the last decade for the design of robust parsing systems. Among those approaches is shallow or partial parsing, which produces minimal and incomplete syntactic structures, often in an incremental way. We argue that with a systematic incremental methodology one can go beyond shallow parsing to deeper language analysis, while preserving robustness. We describe a generic system based on such a methodology and designed for building robust analyzers that tackle deeper linguistic phenomena than those traditionally handled by the now widespread shallow parsers. The rule formalism allows the recognition of n-ary linguistic relations between words or constituents on the basis of global or local structural, topological and/or lexical conditions. It offers the advantage of accepting various types of inputs, ranging from raw to chunked or constituent-marked texts, so for instance it can be used to process existing annotated corpora, or to perform a deeper analysis on the output of an existing shallow parser. It has been successfully used to build a deep functional dependency parser, as well as for the task of co-reference resolution, in a modular way.
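
To make the incremental idea concrete, here is a minimal sketch in Python (our illustration, not the authors' rule formalism): each rule layer reads a chunked input and adds relations on top of earlier results, never destroying them. The chunk categories, rule set and example sentence are invented assumptions.

```python
# Minimal sketch of incremental rule layers over a chunked input.
# Each layer only adds relations; earlier decisions are never undone.
from typing import List, Tuple

Chunk = Tuple[str, str]          # (category, head word), e.g. ("NP", "committee")
Relation = Tuple[str, str, str]  # (label, governor, dependent)

def subject_rule(chunks: List[Chunk]) -> List[Relation]:
    """Attach an NP head as SUBJ of the head of an immediately following VP."""
    return [("SUBJ", h2, h1)
            for (c1, h1), (c2, h2) in zip(chunks, chunks[1:])
            if c1 == "NP" and c2 == "VP"]

def object_rule(chunks: List[Chunk]) -> List[Relation]:
    """Attach an NP head as OBJ of the head of an immediately preceding VP."""
    return [("OBJ", h1, h2)
            for (c1, h1), (c2, h2) in zip(chunks, chunks[1:])
            if c1 == "VP" and c2 == "NP"]

def parse(chunks: List[Chunk]) -> List[Relation]:
    relations: List[Relation] = []
    for rule in (subject_rule, object_rule):  # layers fire in a fixed order
        relations += rule(chunks)
    return relations

print(parse([("NP", "committee"), ("VP", "approved"), ("NP", "report")]))
# [('SUBJ', 'approved', 'committee'), ('OBJ', 'approved', 'report')]
```

Because each layer consumes the same chunk stream and only ever adds relations, a failure in one rule degrades coverage rather than breaking the parse, which is the robustness property the abstract describes.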

321 citations


Journal ArticleDOI
TL;DR: Results obtained at the SENSEVAL-2 initiative confirm that for a significant subset of words domain information can be used to disambiguate with a very high level of precision.
Abstract: This paper explores the role of domain information in word sense disambiguation. The underlying hypothesis is that domain labels, such as MEDICINE, ARCHITECTURE and SPORT, provide a useful way to establish semantic relations among word senses, which can be profitably used during the disambiguation process. Results obtained at the SENSEVAL-2 initiative confirm that for a significant subset of words domain information can be used to disambiguate with a very high level of precision.
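
The underlying mechanism can be pictured in a few lines of Python. The sense inventory and domain lexicon below are invented stand-ins for the kind of domain-labelled resources the paper describes, so this is only an illustrative sketch:

```python
# Toy domain-driven disambiguation: pick the sense whose domain label
# is most strongly supported by the domains of the context words.
SENSE_DOMAINS = {
    "bank#1": "ECONOMY",    # financial institution
    "bank#2": "GEOGRAPHY",  # sloping land beside a river
}
WORD_DOMAINS = {
    "loan": "ECONOMY", "interest": "ECONOMY", "deposit": "ECONOMY",
    "river": "GEOGRAPHY", "water": "GEOGRAPHY", "shore": "GEOGRAPHY",
}

def disambiguate(senses, context_words):
    def domain_support(sense):
        return sum(WORD_DOMAINS.get(w) == SENSE_DOMAINS[sense]
                   for w in context_words)
    return max(senses, key=domain_support)

print(disambiguate(["bank#1", "bank#2"], ["the", "river", "water", "near"]))
# bank#2
```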

164 citations


Journal ArticleDOI
TL;DR: Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high, and the evaluation methods used in the SUMMAC evaluation are of interest both for summarization evaluation and for the evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs.
Abstract: The TIPSTER Text Summarization Evaluation (SUMMAC) has developed several new extrinsic and intrinsic methods for evaluating summaries. It has established definitively that automatic text summarization is very effective in relevance assessment tasks on news articles. Summaries as short as 17% of full text length sped up decision-making by almost a factor of 2 with no statistically significant degradation in accuracy. Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high. Systems that performed most accurately in the production of indicative and informative topic-related summaries used term frequency and co-occurrence statistics, and vocabulary overlap comparisons between text passages. However, in the absence of a topic, these statistical methods do not appear to provide any additional leverage: in the case of generic summaries, the systems were indistinguishable in accuracy. The paper discusses some of the tradeoffs and challenges faced by the evaluation, and also lists some of the lessons learned, impacts, and possible future directions. The evaluation methods used in the SUMMAC evaluation are of interest both for summarization evaluation and for the evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs, with no automatic way to compare them.
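
The style of system the evaluation found strongest for topic-related summaries, term-frequency scoring with vocabulary overlap, is easy to sketch. The following Python fragment is purely illustrative (no participating system is reproduced); the 17% ratio echoes the summary length reported above:

```python
# Toy frequency-based extractive summarizer: score each sentence by the
# average document frequency of its words, keep the top fraction in order.
from collections import Counter

def extract_summary(sentences, ratio=0.17):
    freqs = Counter(w for s in sentences for w in s)
    def score(s):
        return sum(freqs[w] for w in s) / (len(s) or 1)
    keep = max(1, round(len(sentences) * ratio))
    top = sorted(sentences, key=score, reverse=True)[:keep]
    return [s for s in sentences if s in top]  # preserve document order

doc = [["cats", "sleep"], ["cats", "and", "dogs", "play"], ["dogs", "sleep", "a", "lot"]]
print(extract_summary(doc))  # [['cats', 'sleep']]
```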

145 citations


Journal ArticleDOI
TL;DR: An architectural system that contributes to engineering robustness and low-overhead systems development (GATE, a General Architecture for Text Engineering) is presented and results from the development of a multi-purpose cross-genre Named Entity recognition system are presented.
Abstract: We discuss robustness in LE systems from the perspective of engineering, and the predictability of both outputs and construction process that this entails. We present an architectural system that contributes to engineering robustness and low-overhead systems development (GATE, a General Architecture for Text Engineering). To verify our ideas we present results from the development of a multi-purpose cross-genre Named Entity recognition system. This system aims to be robust across diverse input types and to reduce the need for costly and time-consuming adaptation of systems to new applications, through its capability to process texts from widely differing domains and genres.

121 citations


Journal ArticleDOI
TL;DR: This paper presents a comprehensive empirical exploration and evaluation of a diverse range of data characteristics which influence word sense disambiguation performance, including three variants of Bayesian classifiers, a cosine model, non-hierarchical decision lists, and an extension of the transformation-based learning model.
Abstract: This paper presents a comprehensive empirical exploration and evaluation of a diverse range of data characteristics which influence word sense disambiguation performance. It focuses on a set of six core supervised algorithms, including three variants of Bayesian classifiers, a cosine model, non-hierarchical decision lists, and an extension of the transformation-based learning model. Performance is investigated in detail with respect to the following parameters: (a) target language (English, Spanish, Swedish and Basque); (b) part of speech; (c) sense granularity; (d) inclusion and exclusion of major feature classes; (e) variable context width (further broken down by part-of-speech of keyword); (f) number of training examples; (g) baseline probability of the most likely sense; (h) sense distributional entropy; (i) number of senses per keyword; (j) divergence between training and test data; (k) degree of (artificially introduced) noise in the training data; (l) the effectiveness of an algorithm's confidence rankings; and (m) a full keyword breakdown of the performance of each algorithm. The paper concludes with a brief analysis of similarities, differences, strengths and weaknesses of the algorithms and a hierarchical clustering of these algorithms based on agreement of sense classification behavior. Collectively, the paper constitutes the most comprehensive survey of evaluation measures and tests yet applied to sense disambiguation algorithms, and it does so over a diverse range of supervised algorithms, languages and parameter spaces in a single unified experimental framework.
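
For readers unfamiliar with the algorithm families compared, a bare-bones Naive Bayes sense classifier over bag-of-words contexts looks roughly as follows. This is a sketch under simplifying assumptions (add-one smoothing, invented training data), not any of the paper's exact variants:

```python
# Minimal Naive Bayes word-sense classifier with add-one smoothing.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (sense, context_words) pairs."""
    sense_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify(sense_counts, word_counts, vocab, context):
    total = sum(sense_counts.values())
    def log_posterior(sense):
        logp = math.log(sense_counts[sense] / total)       # prior
        n = sum(word_counts[sense].values())
        for w in context:                                  # smoothed likelihoods
            logp += math.log((word_counts[sense][w] + 1) / (n + len(vocab)))
        return logp
    return max(sense_counts, key=log_posterior)

model = train([("financial", ["loan", "interest"]), ("river", ["water", "shore"])])
print(classify(*model, ["interest", "rate"]))  # financial
```

Parameters such as context width (e) and number of training examples (f) in the list above map directly onto the `context` window and the `examples` list of a sketch like this one.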

117 citations


Journal ArticleDOI
TL;DR: The evaluation of WSD has turned out to be as difficult as designing the systems in the first place, and system evaluation is crucial to explain these results and to show the way forward.
Abstract: Has system performance on Word Sense Disambiguation (WSD) reached a limit? Automatic systems don't perform nearly as well as humans on the task, and from the results of the SENSEVAL exercises, recent improvements in system performance appear negligible or even negative. Still, systems do perform much better than the baselines, so something is being done right. System evaluation is crucial to explain these results and to show the way forward. Indeed, the success of any project in WSD is tied to the evaluation methodology used, and especially to the formalization of the task that the systems perform. The evaluation of WSD has turned out to be as difficult as designing the systems in the first place.

97 citations


Journal ArticleDOI
TL;DR: It is demonstrated that optimization per word-expert leads to an overall significant improvement in the generalization accuracies of the produced WSD systems.
Abstract: Various Machine Learning (ML) approaches have been demonstrated to produce relatively successful Word Sense Disambiguation (WSD) systems. There are still unexplained differences among the performance measurements of different algorithms, hence it is warranted to deepen the investigation into which algorithm has the right ‘bias’ for this task. In this paper, we show that this is not easy to accomplish, due to intricate interactions between information sources, parameter settings, and properties of the training data. We investigate the impact of parameter optimization on generalization accuracy in a memory-based learning approach to English and Dutch WSD. A ‘word-expert’ architecture was adopted, yielding a set of classifiers, each specialized in one single wordform. The experts consist of multiple memory-based learning classifiers, each taking different information sources as input, combined in a voting scheme. We optimized the architectural and parametric settings for each individual word-expert by performing cross-validation experiments on the learning material. The results of these experiments show that the variation of both the algorithmic parameters and the information sources available to the classifiers leads to large fluctuations in accuracy. We demonstrate that optimization per word-expert leads to an overall significant improvement in the generalization accuracies of the produced WSD systems.
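
The word-expert skeleton can be sketched briefly: several memory-based (nearest-neighbour) classifiers, each trained on a different information source, vote on the sense of one wordform. The sketch below invents its feature views and data, and omits the per-expert cross-validated parameter optimization that is the paper's focus:

```python
# Toy word-expert: 1-NN classifiers over different feature views, majority vote.
from collections import Counter

def overlap(a, b):
    return len(set(a) & set(b))

class MemoryBasedClassifier:
    """1-nearest-neighbour over one view (feature extractor) of an instance."""
    def __init__(self, view):
        self.view, self.memory = view, []
    def fit(self, instances, labels):
        self.memory = [(self.view(x), y) for x, y in zip(instances, labels)]
    def predict(self, x):
        feats = self.view(x)
        return max(self.memory, key=lambda m: overlap(m[0], feats))[1]

def word_expert(views, instances, labels):
    experts = [MemoryBasedClassifier(v) for v in views]
    for e in experts:
        e.fit(instances, labels)
    def predict(x):  # majority vote over the member classifiers
        return Counter(e.predict(x) for e in experts).most_common(1)[0][0]
    return predict

local = lambda x: x["local"]  # local context words
pos = lambda x: x["pos"]      # part-of-speech tags around the keyword
train_x = [{"local": ["river", "water"], "pos": ["DT", "NN"]},
           {"local": ["loan", "rate"], "pos": ["JJ", "NN"]}]
predict = word_expert([local, pos], train_x, ["river", "financial"])
print(predict({"local": ["water"], "pos": ["DT", "NN"]}))  # river
```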

84 citations


Journal ArticleDOI
Hang Li1
TL;DR: The authors propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model; their clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza, Lai and Mercer (1992).
Abstract: We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and conducting syntactic disambiguation by using the acquired word classes. We view the clustering problem as that of estimating a class-based probability distribution specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model. Our clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza, Lai and Mercer (1992). We next propose a syntactic disambiguation method which combines the use of automatically constructed word classes and that of a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably against the accuracies obtained by the state-of-the-art disambiguation methods.
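
The MDL principle itself has a compact generic statement, which may help place the method; the formulation below is the standard one, with the paper's specific coding scheme deliberately omitted:

```latex
% Generic MDL objective: choose the class-based model M that minimizes the
% total code length of the model plus the data encoded under it.
\hat{M} = \arg\min_{M} \bigl[\, L(M) + L(D \mid M) \,\bigr],
\qquad L(D \mid M) = -\log P(D \mid M)

% Class-based factorization of word co-occurrence probabilities, in the
% spirit of Brown et al. (1992): c_i denotes the class of word w_i.
P(w_1, w_2) = P(c_1, c_2)\, P(w_1 \mid c_1)\, P(w_2 \mid c_2)
```

Refining the clustering shortens the data description L(D|M) but lengthens the model description L(M), so the minimization trades goodness of fit against thesaurus complexity.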

84 citations


Journal ArticleDOI
TL;DR: This study examines several key issues in system combination for the word sense disambiguation task, ranging from algorithmic structure to parameter estimation, and demonstrates that the combination system obtains a significantly lower error rate than other systems participating in the SENSEVAL-2 exercise.
Abstract: Classifier combination is an effective and broadly useful method of improving system performance. This article investigates in depth a large number of both well-established and novel classifier combination approaches for the word sense disambiguation task, studied over a diverse classifier pool which includes feature-enhanced Naive Bayes, Cosine, Decision List, Transformation-based Learning and MMVC classifiers. Each classifier has access to the same rich feature space, comprising distance-weighted bag-of-lemmas, local ngram context and specific syntactic relations, such as Verb-Object and Noun-Modifier. This study examines several key issues in system combination for the word sense disambiguation task, ranging from algorithmic structure to parameter estimation. Experiments using the standard SENSEVAL-2 lexical-sample data sets in four languages (English, Spanish, Swedish and Basque) demonstrate that the combination system obtains a significantly lower error rate when compared with other systems participating in the SENSEVAL-2 exercise, yielding state-of-the-art performance on these data sets.
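
Among the simpler members of the combination family is score mixing: normalize each classifier's per-sense scores into a distribution and average them, optionally with weights. The sketch below is illustrative only; the classifier outputs are invented placeholders rather than the paper's estimators:

```python
# Weighted mixture of normalized per-sense scores from several classifiers.
def combine(score_dicts, weights=None):
    """score_dicts: one {sense: score} mapping per classifier."""
    weights = weights or [1.0] * len(score_dicts)
    mixed = {}
    for w, scores in zip(weights, score_dicts):
        total = sum(scores.values()) or 1.0
        for sense, s in scores.items():
            mixed[sense] = mixed.get(sense, 0.0) + w * s / total
    return max(mixed, key=mixed.get)

naive_bayes = {"art": 0.7, "bar": 0.3}
cosine = {"art": 0.4, "bar": 0.6}
decision_list = {"art": 0.8, "bar": 0.2}
print(combine([naive_bayes, cosine, decision_list]))  # art
```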

75 citations


Journal ArticleDOI
TL;DR: An empirical definition of robustness based on the notion of performance is proposed and a framework for controlling the parser robustness in the design phase is presented.
Abstract: Robustness has been traditionally stressed as a general desirable property of any computational model and system. The human NL interpretation device exhibits this property as the ability to deal with odd sentences. However, the difficulties in a theoretical explanation of robustness within linguistic modelling suggested the adoption of an empirical notion. In this paper, we propose an empirical definition of robustness based on the notion of performance. Furthermore, a framework for controlling parser robustness in the design phase is presented. The control is achieved via the adoption of two principles: modularisation, typical of software engineering practice, and the availability of domain-adaptable components. The methodology has been adopted for the production of CHAOS, a pool of syntactic modules which has been used in real applications. This pool of modules enables a large-scale validation both of the notion of empirical robustness and of the design methodology, over different corpora and two different languages (English and Italian).

72 citations


Journal ArticleDOI
TL;DR: The ideas described in this paper were implemented in a system that achieves excellent performance on the data provided during the SENSEVAL-2 evaluation exercise, for both English all words and English lexical sample tasks.
Abstract: This paper presents a novel approach for word sense disambiguation. The underlying algorithm has two main components: (1) pattern learning from available sense-tagged corpora (SemCor), from dictionary definitions (WordNet) and from a generated corpus (GenCor); and (2) instance-based learning with automatic feature selection, when training data is available for a particular word. The ideas described in this paper were implemented in a system that achieves excellent performance on the data provided during the SENSEVAL-2 evaluation exercise, for both the English all-words and the English lexical-sample tasks.

Journal ArticleDOI
TL;DR: Results from a psycholinguistic experiment, indicating the most salient sentences for a given text as the ‘gold standard’, show that the proposed algorithm performs better than commonly used machine learning and statistical approaches to summarisation.
Abstract: This paper describes a simple discourse parsing and analysis algorithm that combines a formal underspecification utilising discourse grammar with Information Retrieval (IR) techniques. First, linguistic knowledge based on discourse markers is used to constrain a totally underspecified discourse representation. Then, the remaining underspecification is further specified by the computation of a topicality score for every discourse unit. This computation is done via the vector space model. Finally, the sentences in a prominent position (e.g. the first sentence of a paragraph) are given an adjusted topicality score. The proposed algorithm was evaluated by applying it to a text summarisation task. Results from a psycholinguistic experiment, indicating the most salient sentences for a given text as the ‘gold standard’, show that the algorithm performs better than commonly used machine learning and statistical approaches to summarisation.
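
The core of the topicality computation fits in a short sketch: represent each discourse unit and the whole text as term vectors, score units by cosine similarity, and boost units in prominent positions. Raw term counts and the boost factor below are illustrative assumptions, not the paper's exact weighting:

```python
# Topicality via the vector space model, with a positional boost.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def topicality(units, prominent, boost=1.5):
    """units: token lists per discourse unit; prominent: indices to boost."""
    text_vec = Counter(t for u in units for t in u)
    scores = [cosine(Counter(u), text_vec) for u in units]
    return [s * boost if i in prominent else s
            for i, s in enumerate(scores)]

units = [["summarisation", "is", "hard"],
         ["we", "use", "discourse", "markers"]]
print(topicality(units, prominent={0}))  # paragraph-initial unit gets boosted
```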

Journal ArticleDOI
TL;DR: This paper presents a new IE-rule learning system that deals with these training set problems and describes a set of experiments for testing this capability of the new learning approach.
Abstract: The growing availability of textual sources has led to an increase in the use of automatic knowledge acquisition approaches from textual data, as in Information Extraction (IE). Most IE systems use knowledge explicitly represented as sets of IE rules, usually manually acquired. Recently, however, the acquisition of this knowledge has been tackled by applying a huge variety of Machine Learning (ML) techniques. Within this framework, new problems arise in relation to the way of selecting and annotating positive examples, and sometimes negative ones, in supervised approaches, or the way of organizing unsupervised or semi-supervised approaches. This paper presents a new IE-rule learning system that deals with these training set problems, and describes a set of experiments for testing this capability of the new learning approach.

Journal ArticleDOI
TL;DR: A cascaded method for anaphor and pronoun generation is proposed for handling pro-drop and discourse constraints on pronominalization, and it uses binding theory and centering theory to model local and nonlocal references.
Abstract: We describe a system for contextually appropriate anaphor and pronoun generation for Turkish. It uses binding theory and centering theory to model local and nonlocal references. We describe the rules for Turkish, and their computational treatment. A cascaded method for anaphor and pronoun generation is proposed for handling pro-drop and discourse constraints on pronominalization. The system has been tested as a stand-alone nominal expression generator, and also as a reference planning component of a transfer-based MT system.

Journal ArticleDOI
TL;DR: This paper describes an LVQ-based learning pattern association system that uniquely maps a given Arabic word to its corresponding morphological pattern, and therefore deduces its morphological properties.
Abstract: Most of the morphological properties of derivational Arabic words are encapsulated in their corresponding morphological patterns. The morphological pattern is a template that shows how the word should be decomposed into its constituent morphemes (prefix + stem + suffix), and at the same time, marks the positions of the radicals comprising the root of the word. The number of morphological patterns in Arabic is finite and is well below 1000. Due to these properties, most of the current analysis algorithms concentrate on discovering the morphological pattern of the input word as a major step in recognizing the type and category of the word. Unfortunately, this process is non-deterministic in the sense that the underlying search process may sometimes associate more than one morphological pattern with the given word, all of them satisfying the major lexical constraints. One solution to this problem is to use a collection of connectionist pattern associators that uniquely associate each word with its corresponding morphological pattern. This paper describes an LVQ-based learning pattern association system that uniquely maps a given Arabic word to its corresponding morphological pattern, and therefore deduces its morphological properties. The system consists of a collection of heteroassociative models that are trained using the LVQ algorithm, plus a collection of autoassociative models that have been trained using backpropagation. Experimental results have shown that the system is fairly accurate and very easy to train. The LVQ algorithm was chosen because it is very easy to train and its training time is very small compared to that of backpropagation.
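
The LVQ update itself is compact enough to show. The sketch below implements plain LVQ1 over numeric feature vectors (the paper's actual word encodings and network layout are not reproduced; the data is invented):

```python
# LVQ1: prototypes move toward inputs of their own class, away from others.
import numpy as np

def train_lvq(X, y, prototypes, proto_labels, lr=0.1, epochs=20):
    P = prototypes.copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            i = np.argmin(np.linalg.norm(P - x, axis=1))  # nearest prototype
            step = lr * (x - P[i])
            P[i] += step if proto_labels[i] == label else -step
    return P

def classify(x, P, proto_labels):
    return proto_labels[np.argmin(np.linalg.norm(P - x, axis=1))]

X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
y = ["pattern_A", "pattern_A", "pattern_B", "pattern_B"]
P0 = np.array([[0.5, 0.8], [0.8, 0.5]])
labels = ["pattern_A", "pattern_B"]
P = train_lvq(X, y, P0, labels)
print(classify(np.array([0.05, 0.95]), P, labels))  # pattern_A
```

The single prototype update per input is what keeps LVQ training cheap relative to backpropagation, consistent with the training-time observation above.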

Journal ArticleDOI
TL;DR: Using bootstrapping is proposed to solve the problem of topic analysis: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
Abstract: Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
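
The loop can be sketched in a few lines: a weak first-pass labeller (standing in here for the collocation-network analysis) tags text segments, explicit topic signatures are learned from those tags, and the signatures drive a sharper second pass. All names and data below are invented for illustration:

```python
# Bootstrapping topic analysis: weak labels -> learned signatures -> strong labels.
from collections import Counter, defaultdict

def bootstrap(segments, weak_label):
    """segments: token lists; weak_label: noisy first-pass topic guesser."""
    signatures = defaultdict(Counter)
    for seg in segments:                  # pass 1: label with the weak source
        signatures[weak_label(seg)].update(seg)
    def strong_label(seg):                # pass 2: use the learned signatures
        return max(signatures,
                   key=lambda t: sum(signatures[t][w] for w in seg))
    return strong_label

weak = lambda seg: "sport" if "match" in seg else "economy"
segments = [["match", "goal", "team"], ["market", "shares", "fell"]]
strong = bootstrap(segments, weak)
print(strong(["goal", "team"]))  # sport, though the weak labeller would say economy
```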

Journal ArticleDOI
TL;DR: The automated analysis of natural language data has become a central issue in the design of intelligent information systems, and recent interest has focused on providing approximate analysis techniques, assuming that when perfect analysis is not possible, partial results may still be very useful.
Abstract: The automated analysis of natural language data has become a central issue in the design of intelligent information systems. Processing unconstrained natural language data is still considered an AI-hard task. However, various analysis techniques have been proposed to address specific aspects of natural language. In particular, recent interest has focused on providing approximate analysis techniques, assuming that when perfect analysis is not possible, partial results may still be very useful.

Journal ArticleDOI
TL;DR: This paper explores the effectiveness of index terms more complex than the single words used in conventional information retrieval systems and introduces a method to select effective index terms by using a decision tree.
Abstract: This paper explores the effectiveness of index terms more complex than the single words used in conventional information retrieval systems. Retrieval is done in two phases: in the first, a conventional retrieval method (the Okapi system) is used; in the second, complex index terms such as syntactic relations and single words with part-of-speech information are introduced to rerank the results of the first phase. We evaluated the effectiveness of the different types of index terms through experiments using the TREC-7 test collection and 50 queries. The retrieval effectiveness was improved for 32 out of 50 queries. Based on this investigation, we then introduce a method to select effective index terms by using a decision tree. Further experiments with the same test collection showed that retrieval effectiveness was improved in 25 of the 50 queries.
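
The two-phase scheme reduces to a small reranking step. In the sketch below, placeholder scores stand in for the first-phase (Okapi) ranking, and complex index terms are represented as (head, relation, modifier) triples; all data and the weight `alpha` are invented for illustration:

```python
# Phase 2: rerank first-phase results by overlap on complex index terms.
def rerank(first_pass, doc_terms, query_terms, alpha=0.5):
    """first_pass: {doc_id: score}; *_terms: sets of complex index terms."""
    def final(doc_id):
        return first_pass[doc_id] + alpha * len(doc_terms[doc_id] & query_terms)
    return sorted(first_pass, key=final, reverse=True)

first_pass = {"d1": 2.1, "d2": 1.9}
doc_terms = {"d1": {("retrieval", "obj", "document")},
             "d2": {("retrieval", "obj", "information"), ("index", "mod", "term")}}
query_terms = {("retrieval", "obj", "information")}
print(rerank(first_pass, doc_terms, query_terms))  # ['d2', 'd1']
```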

Journal ArticleDOI
TL;DR: The aim of this paper is to introduce some general reflections on the task of lexical semantic annotation and the adequacy of existing lexical-semantic reference resources, while giving an overall description of the Italian lexical sample task for the SENSEVAL-2 experiment.
Abstract: The aim of our paper is twofold: to introduce some general reflections on the task of lexical semantic annotation and the adequacy of existing lexical-semantic reference resources, and to give an overall description of the Italian lexical sample task for the SENSEVAL-2 experiment. We suggest how the SENSEVAL exercise (and comparison between the two editions of the experiment) can be employed to evaluate the lexical reference resources used for annotation. We conclude with a few general remarks on the gap between the lexicon, a partially decontextualised object, and the corpus, where context plays a significant role.

Journal ArticleDOI
TL;DR: This paper proposes a robust approach to parsing suitable for Information Extraction from texts using finite-state cascades, characterized by the construction of an approximation of the full parse tree that captures all the information relevant for IE purposes, leaving the other relations underspecified.
Abstract: This paper proposes a robust approach to parsing suitable for Information Extraction (IE) from texts using finite-state cascades. The approach is characterized by the construction of an approximation of the full parse tree that captures all the information relevant for IE purposes, leaving the other relations underspecified. Sequences of cascades of finite-state rules deterministically analyze the text, building unambiguous structures. Initially basic chunks are analyzed; then clauses are recognized and nested; finally modifier attachment is performed and the global parse tree is built. The parsing approach allows robust, effective and efficient analysis of real world texts. The grammar organization simplifies changes, insertion of new rules and integration of domain-oriented rules. The approach has been tested for Italian, English, and Russian. A parser based on such an approach has been implemented as part of Pinocchio, an environment for developing and running IE applications.
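
A toy picture of a deterministic cascade (illustrative only, not the Pinocchio grammar): each level below rewrites the output of the previous one, so basic chunks are built first and clauses are recognized over them:

```python
# Three-level finite-state cascade over a part-of-speech tag sequence.
import re

def cascade(tags: str) -> str:
    levels = [
        (r"(DT )?(JJ )*NN", "NP"),  # level 1: basic noun chunks
        (r"VB NP", "VP"),           # level 2: verb plus object chunk
        (r"NP VP", "CL"),           # level 3: clause over earlier results
    ]
    for pattern, label in levels:
        tags = re.sub(pattern, label, tags)  # deterministic leftmost rewriting
    return tags

print(cascade("DT JJ NN VB DT NN"))  # CL
```

Each level commits to its analysis before the next runs, which is what makes the pipeline deterministic and efficient, at the price of leaving harder attachment decisions underspecified.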

Journal ArticleDOI
TL;DR: Gale, as he liked to be called by his friends and family, had extremely broad interests, both professionally and otherwise.
Abstract: Gale, as he liked to be called by his friends and family, had extremely broad interests, both professionally and otherwise. His professional career at Bell Labs included radio astronomy, economics, statistics and computational linguistics.