
Papers published in "Natural Language Engineering" in 2005


Journal Article
TL;DR: Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
Abstract: Parsing unrestricted text is useful for many language technology applications but requires parsing methods that are both robust and efficient. MaltParser is a language-independent system for data-driven dependency parsing that can be used to induce a parser for a new language from a treebank sample in a simple yet flexible manner. Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
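
As a rough illustration of the data-driven, transition-based parsing family that MaltParser belongs to, the sketch below runs an arc-standard parser whose action "classifier" is a hand-filled lookup table standing in for a model induced from a treebank. The feature window, decision table and toy sentence are invented for the example and are not MaltParser's actual API.

```python
# A minimal sketch of arc-standard transition-based dependency parsing, the
# family of methods MaltParser implements.  The "classifier" is a hand-filled
# lookup table standing in for a model induced from a treebank; the feature
# window, decision table and toy sentence are invented for the example.

def arc_standard_parse(n_words, tags, predict):
    """Parse word indices 1..n_words; index 0 is the artificial ROOT."""
    stack, buf, arcs = [0], list(range(1, n_words + 1)), []
    while buf or len(stack) > 1:
        action = predict(stack, buf, tags)
        if action == "SHIFT":
            stack.append(buf.pop(0))
        elif action == "LEFT-ARC":            # second-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        else:                                 # "RIGHT-ARC": top depends on second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs                               # (head, dependent) index pairs

# Stand-in for the induced classifier: the action is keyed on the POS tags of
# the two topmost stack items and the first buffer word.
DECISIONS = {
    (None,   "ROOT", "DT"):  "SHIFT",
    ("ROOT", "DT",   "NN"):  "SHIFT",
    ("DT",   "NN",   "VBZ"): "LEFT-ARC",
    ("ROOT", "NN",   "VBZ"): "SHIFT",
    ("NN",   "VBZ",  None):  "LEFT-ARC",
    ("ROOT", "VBZ",  None):  "RIGHT-ARC",
}

def predict(stack, buf, tags):
    s2 = tags[stack[-2]] if len(stack) > 1 else None
    s1 = tags[stack[-1]]
    b1 = tags[buf[0]] if buf else None
    return DECISIONS[(s2, s1, b1)]

tags = ["ROOT", "DT", "NN", "VBZ"]            # "the cat sleeps"
print(arc_standard_parse(3, tags, predict))   # [(2, 1), (3, 2), (0, 3)]
```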

801 citations


Journal Article
TL;DR: Several Chinese linguistic issues and their implications for treebanking efforts are discussed, together with how these issues are addressed in the annotation guidelines and the engineering strategies used to improve speed while ensuring annotation quality.
Abstract: With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.

640 citations


Journal Article
TL;DR: Parallel text is used to help create syntactic annotation in more languages: the English side of a parallel corpus is annotated, the analysis is projected to the second language, and a stochastic analyzer is trained on the resulting noisy annotations.
Abstract: Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor intensive and time-consuming process, and it is difficult to find linguistically annotated text in sufficient quantities. In this article, we explore using parallel text to help solving the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English.
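
A minimal sketch of the central projection step, under the simplifying assumption of clean one-to-one word alignments (the article's experiments have to cope with much messier links): English dependency arcs are carried over to the target sentence wherever both endpoints are aligned. The sentences, arcs and alignment below are toy examples, not the article's data.

```python
# Illustrative sketch (not the article's code) of the core projection step:
# English dependency arcs are carried over to the target sentence wherever
# both the head and the dependent are aligned.  One-to-one links are assumed;
# the sentences, arcs and alignment below are toy examples.

def project_dependencies(en_arcs, alignment):
    """en_arcs: (head, dep) index pairs; alignment: English index -> target index."""
    return [(alignment[head], alignment[dep])
            for head, dep in en_arcs
            if head in alignment and dep in alignment]

# English "the house is red" projected onto a toy target sentence.
en_arcs = [(2, 1), (3, 2), (3, 4), (0, 3)]    # house<-the, is<-house, is<-red, ROOT<-is
alignment = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}    # monotone one-to-one word links
print(project_dependencies(en_arcs, alignment))
```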

384 citations


Journal Article
TL;DR: A method is presented for detecting and correcting many real-word spelling errors by identifying tokens that are semantically unrelated to their context but are spelling variations of words that would be related to it.
Abstract: Spelling errors that happen to result in a real word in the lexicon cannot be detected by a conventional spelling checker. We present a method for detecting and correcting many such errors by identifying tokens that are semantically unrelated to their context and are spelling variations of words that would be related to the context. Relatedness to context is determined by a measure of semantic distance initially proposed by Jiang and Conrath (1997). We tested the method on an artificial corpus of errors; it achieved recall of 23–50% and precision of 18–25%.
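
The detection logic can be summarised in a few lines: flag a token that is semantically unrelated to its context when one of its spelling variations is related. In the sketch below the relatedness test and the variation generator are toy stand-ins; the article uses the Jiang-Conrath semantic distance and a proper set of spelling variations.

```python
# Sketch of the detection logic only: a token is flagged when it is
# semantically unrelated to its context but one of its spelling variations is.
# The relatedness test and the variation generator are toy stand-ins for the
# Jiang-Conrath distance and a real set of spelling variations.

def detect_real_word_errors(tokens, related, variations, window=3):
    corrections = {}
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if any(related(word, c) for c in context):
            continue                                  # word fits its context
        for candidate in variations(word):
            if any(related(candidate, c) for c in context):
                corrections[i] = candidate            # related spelling variant found
                break
    return corrections

# Toy stand-ins for illustration only.
RELATED = {("dessert", "cake"), ("dessert", "sweet"), ("coffee", "cake")}
related = lambda a, b: (a, b) in RELATED or (b, a) in RELATED
variations = lambda w: {"desert": ["dessert"]}.get(w, [])

tokens = "she served coffee and desert after the cake".split()
print(detect_real_word_errors(tokens, related, variations))   # {4: 'dessert'}
```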

194 citations


Journal Article
TL;DR: The authors investigated sources of human annotator disagreements stemming from the tagging for the English Verb Lexical Sample Task in the Senseval-2 exercise in automatic word sense disambiguation.
Abstract: In this paper we discuss a persistent problem arising from polysemy: namely the difficulty of finding consistent criteria for making fine-grained sense distinctions, either manually or automatically. We investigate sources of human annotator disagreements stemming from the tagging for the English Verb Lexical Sample Task in the Senseval-2 exercise in automatic Word Sense Disambiguation. We also examine errors made by a high-performing maximum entropy Word Sense Disambiguation system we developed. Both sets of errors are at least partially reconciled by a more coarse-grained view of the senses, and we present the groupings we use for quantitative coarse-grained evaluation as well as the process by which they were created. We compare the system’s performance with our human annotator performance in light of both fine-grained and coarse-grained sense distinctions and show that well-defined sense groups can be of value in improving word sense disambiguation by both humans and machines.
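
The effect of coarse-grained scoring is easy to show in code: two fine-grained sense labels that disagree still count as correct when they fall into the same group. The sense inventory and grouping below are invented; the article derives its groups from the Senseval-2 verb senses.

```python
# Sketch of coarse-grained scoring: two sense assignments that disagree at the
# fine-grained level count as correct when the senses fall into the same group.
# The sense labels and grouping below are invented; the article derives its
# groups from the Senseval-2 verb sense inventory.

def accuracy(system, gold, group_of=None):
    match = lambda s, g: s == g if group_of is None else group_of[s] == group_of[g]
    return sum(match(s, g) for s, g in zip(system, gold)) / len(gold)

# Hypothetical fine-grained senses of "call" mapped onto two coarse groups.
GROUPS = {"call.1": "utter", "call.2": "utter", "call.3": "name", "call.4": "name"}

gold   = ["call.1", "call.3", "call.2", "call.4"]
system = ["call.2", "call.3", "call.1", "call.3"]
print(accuracy(system, gold))           # 0.25  (fine-grained)
print(accuracy(system, gold, GROUPS))   # 1.0   (coarse-grained)
```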

157 citations


Journal Article
TL;DR: Results suggest that the transfer approach is one promising solution to the resource bottleneck and allows for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
Abstract: In this article we illustrate and evaluate an approach to create high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The transfer approach has been tested and extensively applied for the creation of the MultiSemCor corpus, an English/Italian parallel corpus created on the basis of the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and word sense annotated with a shared inventory of senses. A number of experiments have been carried out to evaluate the different steps involved in the methodology and the results suggest that the transfer approach is one promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which represents a crucial resource per se. Second, it allows for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
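
A minimal sketch of the transfer step, illustrative only and not the MultiSemCor tooling: sense tags on English tokens are copied to the aligned Italian tokens, again assuming clean one-to-one links. The sense labels and alignment are placeholders.

```python
# Sketch of the sense-transfer step (illustrative, not the MultiSemCor tools):
# sense tags on English tokens are copied to the aligned Italian tokens,
# assuming clean one-to-one links.  Sense labels and alignment are placeholders.

def transfer_senses(en_senses, alignment):
    """en_senses: English index -> sense tag; alignment: English -> Italian index."""
    return {alignment[i]: sense for i, sense in en_senses.items() if i in alignment}

# "He read the book" / "Lui lesse il libro" with toy one-to-one links.
en_senses = {1: "read#v#1", 3: "book#n#1"}
alignment = {0: 0, 1: 1, 2: 2, 3: 3}
print(transfer_senses(en_senses, alignment))   # {1: 'read#v#1', 3: 'book#n#1'}
```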

100 citations


Journal Article
TL;DR: The best results, well above baseline, were achieved with time-delay networks that used features related to the author's syntactic preferences, whereas low-level and vocabulary-based features were not found to be useful.
Abstract: As part of a larger project to develop an aid for writers that would help to eliminate stylistic inconsistencies within a document, we experimented with neural networks to find the points in a text at which its stylistic character changes. Our best results, well above baseline, were achieved with time-delay networks that used features related to the author's syntactic preferences, whereas low-level and vocabulary-based features were not found to be useful. An alternative approach with character bigrams was not successful.
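
As a hedged illustration of the time-delay idea rather than the authors' network: the classifier sees a sliding window of per-sentence feature vectors, so a decision about a style change at sentence t can draw on a few sentences of context on either side. The features and weights below are random placeholders, so the output scores are meaningless; only the windowing structure is the point.

```python
import numpy as np

# Sketch of the time-delay idea only, not the authors' network: the classifier
# sees a sliding window of per-sentence feature vectors, so the decision about
# a style change at sentence t can use a few sentences of context either side.
# Features and weights are random placeholders; only the windowing matters here.

rng = np.random.default_rng(0)

def time_delay_scores(features, w, b, delay=2):
    """features: (n_sentences, n_feats); window spans `delay` sentences each side."""
    n, _ = features.shape
    scores = []
    for t in range(delay, n - delay):
        window = features[t - delay:t + delay + 1].ravel()   # concatenated context
        scores.append(1 / (1 + np.exp(-(w @ window + b))))   # P(style change at t)
    return np.array(scores)

n_sent, n_feats, delay = 12, 5, 2
feats = rng.normal(size=(n_sent, n_feats))        # e.g. syntactic-preference features
w = rng.normal(size=(2 * delay + 1) * n_feats)    # untrained weights, illustration only
print(time_delay_scores(feats, w, b=0.0, delay=delay).round(2))
```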

94 citations


Journal Article
TL;DR: This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition, which uses the Random Indexing vector space methodology, and explains how this approach differs from traditional cooccurrence-based word alignment algorithms.
Abstract: This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach, which uses the Random Indexing vector space methodology, is based on finding correlations between terms based on their distributional characteristics. The approach requires a minimum of preprocessing and linguistic knowledge, and is efficient, fast and scalable. In this paper, we explain how our approach differs from traditional cooccurrence-based word alignment algorithms, and we demonstrate how to extract bilingual lexica using the Random Indexing approach applied to aligned parallel data. The acquired lexica are evaluated by comparing them to manually compiled gold standards, and we report overlap of around 60%. We also discuss methodological problems with evaluating lexical resources of this kind.
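
The sketch below shows one common formulation of Random Indexing over aligned sentence pairs, which may differ in detail from the paper's setup: each pair receives a sparse random index vector, every word accumulates the index vectors of the pairs it occurs in, and translation candidates are ranked by cosine similarity. The dimensionality, sparsity and toy bitext are arbitrary choices.

```python
import numpy as np

# One common formulation of Random Indexing over aligned sentence pairs (the
# details may differ from the paper's setup): each pair receives a sparse
# random index vector, every word sums the index vectors of the pairs it
# occurs in, and translation candidates are ranked by cosine similarity.

rng = np.random.default_rng(1)
DIM, NONZERO = 512, 8                             # arbitrary choices

def index_vector():
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1, 1], size=NONZERO)
    return v

def context_vectors(bitext):
    vecs = {}
    for src_sent, trg_sent in bitext:
        iv = index_vector()                       # one index vector per sentence pair
        for w in src_sent.split() + trg_sent.split():
            vecs[w] = vecs.get(w, np.zeros(DIM)) + iv
    return vecs

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

bitext = [("the house is red", "huset är rött"),
          ("a red car", "en röd bil"),
          ("the house is big", "huset är stort")]
vecs = context_vectors(bitext)
print(round(cos(vecs["house"], vecs["huset"]), 2),   # high: shared contexts
      round(cos(vecs["house"], vecs["bil"]), 2))     # low: no shared contexts
```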

81 citations


Journal Article
TL;DR: This work presents an approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation.
Abstract: Multimodal interfaces are systems that allow input and/or output to be conveyed over multiple channels such as speech, graphics, and gesture. In addition to parsing and understanding separate utterances from different modes such as speech or gesture, multimodal interfaces also need to parse and understand composite multimodal utterances that are distributed over multiple input modes. We present an approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation. In comparison to previous approaches, this approach is significantly more efficient and provides a more general probabilistic framework for multimodal ambiguity resolution. The approach also enables tight-coupling of multimodal understanding with speech recognition. Since the finite-state approach is more lightweight in computational needs, it can be more readily deployed on a broader range of mobile platforms. We provide speech recognition results that demonstrate compensation effects of exploiting gesture information in a directory assistance and messaging task using a multimodal interface.
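
The following is a heavily simplified, hedged sketch of the joint-interpretation idea; the paper uses genuine weighted finite-state transducers, whereas here the speech and gesture streams are just weighted hypothesis lists and the "transducer" is a lookup table of toy commands. Costs behave like negative log probabilities, so lower is better.

```python
# Heavily simplified sketch: the paper uses genuine weighted finite-state
# transducers, whereas here the speech and gesture streams are just weighted
# hypothesis lists and the "transducer" is a lookup table of toy commands.
# Costs behave like negative log probabilities, so lower is better.

SPEECH = [("phone this person", 0.5), ("show this person", 1.5)]    # ASR lattice
GESTURE = [("point:person_3", 0.25), ("point:person_7", 1.25)]      # gesture lattice

GRAMMAR = {                     # which (speech, gesture) pairs compose, and to what
    ("phone this person", "point:person_3"): "CALL(person_3)",
    ("phone this person", "point:person_7"): "CALL(person_7)",
    ("show this person",  "point:person_3"): "DISPLAY(person_3)",
    ("show this person",  "point:person_7"): "DISPLAY(person_7)",
}

def joint_interpretations(speech, gesture, grammar):
    hyps = [(grammar[(s, g)], sc + gc)
            for s, sc in speech for g, gc in gesture if (s, g) in grammar]
    return sorted(hyps, key=lambda h: h[1])          # best joint reading first

print(joint_interpretations(SPEECH, GESTURE, GRAMMAR)[0])   # ('CALL(person_3)', 0.75)
```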

68 citations


Journal Article
TL;DR: An approach to term extraction is outlined that rests on theoretical claims about the structure of words, using the structural properties of compound words to elicit the sets of terms defined by type hierarchies such as hyponymy and meronymy.
Abstract: Advances in language engineering may be dependent on theoretical principles originating from linguistics, since both share a common object of enquiry, natural language structures. We outline an approach to term extraction that rests on theoretical claims about the structure of words. We use the structural properties of compound words to specifically elicit the sets of terms defined by type hierarchies such as hyponymy and meronymy. The theoretical claims revolve around the head-modifier principle, which determines the formation of a major class of compounds. Significantly it has been suggested that the principle operates in languages other than English. To demonstrate the extendibility of our approach beyond English, we present a case study of term extraction in Chinese, a language whose written form is the vehicle of communication for over 1.3 billion language users, and therefore has great significance for the development of language engineering technologies.
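
The head-modifier reading the approach relies on can be illustrated with English noun-noun compounds, where the rightmost noun is the head, so a compound can be read as a hyponym of its head. The toy extractor below shows only that reading; real term extraction, particularly for Chinese, also requires segmentation and filtering.

```python
# Sketch of the head-modifier reading the approach relies on: in an English
# noun-noun compound the rightmost noun is the head, so the compound can be
# read as a hyponym of its head.  The compounds below are toy input; real term
# extraction (especially for Chinese) also needs segmentation and filtering.

def hyponym_pairs(compounds):
    pairs = []
    for compound in compounds:
        parts = compound.split()
        if len(parts) >= 2:
            pairs.append((compound, parts[-1]))      # (term, its head/hypernym)
    return pairs

terms = ["bird flu", "avian bird flu", "flu vaccine", "vaccine"]
for term, head in hyponym_pairs(terms):
    print(f"{term!r} IS-A {head!r}")
```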

56 citations


Journal Article
TL;DR: It is argued that the XML-processing paradigm is ideally suited for automatically preparing the corpus for parsing, and it is demonstrated that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers.
Abstract: This paper reports on a number of experiments which are designed to investigate the extent to which current NLP resources are able to syntactically and semantically analyse biomedical text. We address two tasks: (a) parsing a real corpus with a hand-built wide-coverage grammar, producing both syntactic analyses and logical forms and (b) automatically computing the interpretation of compound nouns where the head is a nominalisation (e.g. hospital arrival means an arrival at hospital, while patient arrival means an arrival of a patient). For the former task we demonstrate that flexible and yet constrained pre-processing techniques are crucial to success: these enable us to use part-of-speech tags to overcome inadequate lexical coverage, and to package up complex technical expressions prior to parsing so that they are blocked from creating misleading amounts of syntactic complexity. We argue that the XML-processing paradigm is ideally suited for automatically preparing the corpus for parsing. For the latter task, we compute interpretations of the compounds by exploiting surface cues and meaning paraphrases, which in turn are extracted from the parsed corpus. This provides an empirical setting in which we can compare the utility of a comparatively deep parser vs. a shallow one, exploring the trade-off between resolving attachment ambiguities on the one hand and generating errors in the parses on the other. We demonstrate that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers.
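
One of the pre-processing steps, packaging up complex technical expressions before parsing, can be sketched as follows: multiword terms found in a lexicon are collapsed into single tokens so they cannot create spurious attachment ambiguity. The term list and sentence are invented, and the real pipeline operates over XML markup rather than plain token lists.

```python
# Sketch of the "package up complex technical expressions" idea: multiword
# terms found in a lexicon are collapsed into single tokens before parsing so
# they cannot introduce spurious attachment ambiguity.  The term list and the
# sentence are invented; the real pipeline operates over XML markup.

TERMS = {("tumour", "necrosis", "factor"), ("cell", "cycle")}

def package_terms(tokens, terms, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):              # prefer the longest match
            if i + n <= len(tokens) and tuple(tokens[i:i + n]) in terms:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

sent = "expression of tumour necrosis factor during the cell cycle".split()
print(package_terms(sent, TERMS))
```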

Journal Article
TL;DR: Tests on a 4.8 million word bitext indicate that while SMT appears to outperform the authors' system for French-English on a number of metrics, for English-French, on all but one automatic evaluation metric, the performance of the EBMT system is superior to the baseline SMT model.
Abstract: In previous work (Gough and Way 2004), we showed that our Example-Based Machine Translation (EBMT) system improved with respect to both coverage and quality when seeded with increasing amounts of training data, so that it significantly outperformed the on-line MT system Logomedia according to a wide variety of automatic evaluation metrics. While it is perhaps unsurprising that system performance is correlated with the amount of training data, we address in this paper the question of whether a large-scale, robust EBMT system such as ours can outperform a Statistical Machine Translation (SMT) system. We obtained a large English-French translation memory from Sun Microsystems from which we randomly extracted a near 4K test set. The remaining data was split into three training sets, of roughly 50K, 100K and 200K sentence-pairs in order to measure the effect of increasing the size of the training data on the performance of the two systems. Our main observation is that contrary to perceived wisdom in the field, there appears to be little substance to the claim that SMT systems are guaranteed to outperform EBMT systems when confronted with ‘enough’ training data. Our tests on a 4.8 million word bitext indicate that while SMT appears to outperform our system for French-English on a number of metrics, for English-French, on all but one automatic evaluation metric, the performance of our EBMT system is superior to the baseline SMT model.

Journal Article
TL;DR: A novel approach to multi-document summarization is presented that explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections, applying Iterative Residual Rescaling (IRR).
Abstract: This paper describes a novel approach to multi-document summarization, which explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections. We place equal emphasis on the processes of theme identification and theme presentation. For the former, we apply Iterative Residual Rescaling (IRR); for the latter, we argue for graphical display elements. IRR is an algorithm designed to account for correlations between words and to construct multi-dimensional topical space indicative of relationships among linguistic objects (documents, phrases, and sentences). Summaries are composed of objects with certain properties, derived by exploiting the many-to-many relationships in such a space. Given their inherent complexity, our multi-faceted summaries benefit from a visualization environment. We discuss some essential features of such an environment.
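
A compact sketch of the rescaling idea behind Iterative Residual Rescaling, with the scaling exponent, the SVD step and the toy term-document matrix chosen for illustration only: after each basis vector is extracted, the remaining document residuals are rescaled by their lengths before the next direction is computed, which helps minority themes surface.

```python
import numpy as np

# Compact sketch of the rescaling idea behind Iterative Residual Rescaling:
# like LSA it extracts a topical subspace, but after each basis vector the
# remaining document residuals are rescaled by their lengths (to a power q),
# which helps minority themes surface.  The exponent, the SVD step and the
# toy term-document matrix are chosen for illustration only.

def irr_basis(docs, k, q=1.5):
    """docs: (n_docs, n_terms) matrix; returns k orthonormal topical directions."""
    residual = docs.astype(float).copy()
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        scaled = (norms ** q)[:, None] * residual        # rescale each residual
        _, _, vt = np.linalg.svd(scaled, full_matrices=False)
        b = vt[0]                                        # dominant direction
        basis.append(b)
        residual = residual - np.outer(residual @ b, b)  # remove that direction
    return np.array(basis)

docs = np.array([[3, 2, 0, 0],      # theme A (majority)
                 [2, 3, 0, 0],
                 [4, 1, 0, 0],
                 [0, 0, 1, 2]])     # theme B (minority)
B = irr_basis(docs, k=2)
print((docs @ B.T).round(2))        # document coordinates in the topical space
```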

Journal Article
TL;DR: An algorithm for clarification dialogue recognition is developed through the analysis of collected data on clarification dialogues, its importance for question answering is examined, and its usefulness is shown by demonstrating how recognizing clarification dialogues can simplify the task of answer retrieval.
Abstract: We examine, in the context of open domain Question Answering systems, the implementation of clarification dialogues: a mechanism for ensuring that question answering systems take user goals into account by allowing users to ask a series of related questions that refine or expand on previous questions with follow-ups. We develop an algorithm for clarification dialogue recognition through the analysis of collected data on clarification dialogues and examine the importance of clarification dialogue recognition for question answering. The algorithm is evaluated and shown to successfully recognize the start and continuation of clarification dialogues in 94% of cases. We then show the usefulness of the algorithm by demonstrating how the recognition of clarification dialogues can simplify the task of answer retrieval.

Journal Article
TL;DR: An architecture for robust parsing of natural language utterances has been developed, using a plausibility-based arbitration procedure to derive fairly rich structural representations, comprising aspects of syntax, semantics and other description levels of language.
Abstract: Based on constraint optimization techniques, an architecture for robust parsing of natural language utterances has been developed. The resulting system is able to combine possibly contradicting evidence from a variety of information sources, using a plausibility-based arbitration procedure to derive fairly rich structural representations, comprising aspects of syntax, semantics and other description levels of language. The results of a series of experiments are reported which demonstrate the high potential for robust behaviour with respect to ungrammaticality, incomplete utterances, and temporal pressure.

Journal Article
TL;DR: This work allows the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by 26.4%, yet is simple enough that it can be learned easily and generated relatively fluently.
Abstract: We address the problem of improving the efficiency of natural language text input under degraded conditions (for instance, on mobile computing devices or by disabled users), by taking advantage of the informational redundancy in natural language. Previous approaches to this problem have been based on the idea of prediction of the text, but these require the user to take overt action to verify or select the system's predictions. We propose taking advantage of the duality between prediction and compression . We allow the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by 26.4%, yet is simple enough that it can be learned easily and generated relatively fluently. We decode the abbreviated text using a statistical generative model of abbreviation, with a residual word error rate of 3.3%. The chief component of this model is an n -gram language model. Because the system's operation is completely independent from the user's, the overhead from cognitive task switching and attending to the system's actions online is eliminated, opening up the possibility that the compression-based method can achieve text input efficiency improvements where the prediction-based methods have not. We report the results of a user study evaluating this method.
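
The decoding idea can be sketched as a noisy-channel beam search: recover the word sequence maximising the product of an n-gram language model score and a channel score for producing the observed abbreviations. The candidate lists, channel scores and bigram probabilities below are toy numbers, not the paper's models.

```python
import math

# Sketch of the decoding idea as a noisy-channel beam search: recover the
# word sequence maximising P(words) * P(abbreviation | words), with P(words)
# from an n-gram (here bigram) language model.  Candidate lists, channel
# scores and bigram probabilities are toy numbers, not the paper's models.

CANDIDATES = {                      # abbrev -> [(word, P(abbrev | word)), ...]
    "t":   [("the", 0.9), ("to", 0.1)],
    "mtg": [("meeting", 0.8), ("mating", 0.2)],
    "s":   [("is", 0.6), ("was", 0.4)],
}
BIGRAM = {("<s>", "the"): 0.2, ("<s>", "to"): 0.1,
          ("the", "meeting"): 0.05, ("to", "meeting"): 0.02,
          ("the", "mating"): 0.001, ("to", "mating"): 0.001,
          ("meeting", "is"): 0.1, ("meeting", "was"): 0.05,
          ("mating", "is"): 0.05, ("mating", "was"): 0.05}

def decode(abbrevs, beam=5):
    beams = [("<s>", 0.0, [])]                        # (last word, log prob, words)
    for a in abbrevs:
        new = []
        for prev, lp, words in beams:
            for w, p_chan in CANDIDATES[a]:
                p_lm = BIGRAM.get((prev, w), 1e-6)
                new.append((w, lp + math.log(p_lm) + math.log(p_chan), words + [w]))
        beams = sorted(new, key=lambda h: -h[1])[:beam]
    return beams[0][2]

print(decode(["t", "mtg", "s"]))                      # ['the', 'meeting', 'is']
```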

Journal Article
TL;DR: The clue alignment approach and the optimization of its parameters using a genetic algorithm are described, showing a significant improvement of about 6% in F-scores over the baseline produced by statistical word alignment.
Abstract: Statistical, linguistic, and heuristic clues can be used for the alignment of words and multi-word units in parallel texts. This article describes the clue alignment approach and the optimization of its parameters using a genetic algorithm. Word alignment clues can come from various sources such as statistical alignment models, co-occurrence tests, string similarity scores and static dictionaries. A genetic algorithm implementing an evolutionary procedure can be used to optimize the parameters necessary for combining available clues. Experiments on English/Swedish bitext show a significant improvement of about 6% in F-scores compared to the baseline produced by statistical word alignment.
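
A toy sketch of the weight-optimisation step: several clue sources score candidate word pairs, the weighted sum is thresholded to produce links, and a small genetic algorithm searches for clue weights that maximise F-score against gold links. The clue scores, gold links, population size and mutation rate are all invented for the example.

```python
import random

# Toy sketch of the weight-optimisation step: several clue sources score
# candidate word pairs, the weighted sum is thresholded into links, and a
# small genetic algorithm searches for clue weights that maximise F-score
# against gold links.  Clue scores, gold links and GA settings are invented.

random.seed(0)

PAIRS = [("house", "huset"), ("house", "är"), ("red", "rött"),
         ("is", "är"), ("is", "rött"), ("red", "är")]
CLUES = {p: [random.random() for _ in range(3)] for p in PAIRS}   # 3 clue sources
GOLD = {("house", "huset"), ("red", "rött"), ("is", "är")}

def f_score(weights, threshold=0.5):
    links = {p for p, clues in CLUES.items()
             if sum(w * c for w, c in zip(weights, clues)) > threshold}
    if not links:
        return 0.0
    prec, rec = len(links & GOLD) / len(links), len(links & GOLD) / len(GOLD)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def evolve(pop_size=20, generations=30, mutation=0.2):
    pop = [[random.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f_score, reverse=True)
        parents = pop[:pop_size // 2]                 # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [random.choice(genes) for genes in zip(a, b)]       # crossover
            child = [min(1, max(0, g + random.gauss(0, mutation))) for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=f_score)

best = evolve()
print([round(w, 2) for w in best], round(f_score(best), 2))
```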

Journal Article
TL;DR: Through an HMM and a constraint relaxation algorithm that deals with data sparseness, PowerNE is able to apply and integrate various internal and external evidences of entity names, resolving the named entity recognition problem effectively.
Abstract: Named entity recognition identifies and classifies entity names in a text document into some predefined categories. It resolves the "who", "where" and "how much" problems in information extraction and leads to the resolution of the "what" and "how" problems in further processing. This paper presents a Hidden Markov Model (HMM) and proposes an HMM-based named entity recognizer implemented as the system PowerNE. Through the HMM and an effective constraint relaxation algorithm to deal with the data sparseness problem, PowerNE is able to effectively apply and integrate various internal and external evidences of entity names. Currently, four evidences are included: (1) a simple deterministic internal feature of the words, such as capitalization and digitalization; (2) an internal semantic feature of the important triggers; (3) an internal gazetteer feature, which determines the appearance of the current word string in the provided gazetteer list; and (4) an external macro context feature, which deals with the name alias phenomena. In this way, the named entity recognition problem is resolved effectively. PowerNE has been benchmarked with the Message Understanding Conferences (MUC) data. The evaluation shows that, using the formal training and test data of the MUC-6 and MUC-7 English named entity tasks, it achieves F-measures of 96.6 and 94.1, respectively. Compared with the best reported machine learning system, it achieves a 1.7 higher F-measure with one quarter of the training data on MUC-6, and a 3.6 higher F-measure with one ninth of the training data on MUC-7. In addition, it performs slightly better than the best reported handcrafted rule-based systems on MUC-6 and MUC-7.
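
As a hedged illustration of the HMM view of named entity recognition, not PowerNE itself: tags such as PER, ORG and O are hidden states, words are observations, and Viterbi decoding selects the best tag sequence. All probabilities below are toy numbers; the actual system additionally folds in the internal and external evidence described in the abstract.

```python
import math

# Hedged illustration of the HMM view of named entity recognition, not PowerNE
# itself: tags such as PER, ORG and O are hidden states, words are observations,
# and Viterbi decoding selects the best tag sequence.  All probabilities are
# toy numbers; the real system also folds in the evidence features above.

TAGS = ["O", "PER", "ORG"]
TRANS = {("<s>", "O"): 0.6, ("<s>", "PER"): 0.2, ("<s>", "ORG"): 0.2,
         ("O", "O"): 0.7, ("O", "PER"): 0.15, ("O", "ORG"): 0.15,
         ("PER", "O"): 0.6, ("PER", "PER"): 0.3, ("PER", "ORG"): 0.1,
         ("ORG", "O"): 0.6, ("ORG", "ORG"): 0.3, ("ORG", "PER"): 0.1}
EMIT = {("O", "joined"): 0.1, ("O", "in"): 0.1, ("O", "1998"): 0.05,
        ("PER", "alice"): 0.3, ("ORG", "acme"): 0.3}

def viterbi(words, smooth=1e-4):
    first = {t: (math.log(TRANS[("<s>", t)]) +
                 math.log(EMIT.get((t, words[0]), smooth)), None) for t in TAGS}
    score = [first]                                   # per word: tag -> (logp, backptr)
    for w in words[1:]:
        prev, col = score[-1], {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p][0] + math.log(TRANS[(p, t)]))
            col[t] = (prev[best][0] + math.log(TRANS[(best, t)]) +
                      math.log(EMIT.get((t, w), smooth)), best)
        score.append(col)
    tags = [max(TAGS, key=lambda t: score[-1][t][0])] # best final tag
    for col in reversed(score[1:]):                   # follow back-pointers
        tags.append(col[tags[-1]][1])
    return list(reversed(tags))

print(viterbi("alice joined acme in 1998".split()))   # ['PER', 'O', 'ORG', 'O', 'O']
```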

Journal Article
TL;DR: It is shown how auxiliary information can be used to constrain the procedure directly by restricting the set of alignments explored during parameter estimation, which enables the integration of bilingual and monolingual knowledge sources while retaining the flexibility of the underlying models.
Abstract: Standard parameter estimation schemes for statistical translation models can struggle to find reasonable settings on some parallel corpora. We show how auxiliary information can be used to constrain the procedure directly by restricting the set of alignments explored during parameter estimation. This enables the integration of bilingual and monolingual knowledge sources while retaining the flexibility of the underlying models. We demonstrate the effectiveness of this approach for incorporating linguistic and domain-specific constraints on various parallel corpora, and consider the importance of using the context of the parallel text to guide the application of such constraints.
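
The constraint idea can be sketched on top of an IBM-Model-1-style E-step: links ruled out by an auxiliary knowledge source (here a tiny seed dictionary) simply receive zero weight during expectation counting. The bitext and dictionary are toy data, and the article's actual models and constraints are considerably richer.

```python
from collections import defaultdict

# Sketch of constraining the alignments explored during estimation: a plain
# IBM-Model-1-style E-step, except that links ruled out by an auxiliary
# knowledge source (here a tiny seed dictionary) get zero weight.  The bitext
# and dictionary are toy data; the real models and constraints are richer.

BITEXT = [("the house", "la maison"), ("the car", "la voiture"),
          ("small house", "petite maison")]
ALLOWED = {("the", "la"), ("house", "maison"), ("car", "voiture"),
           ("small", "petite")}                     # auxiliary constraint

def constrained_em(bitext, allowed, iterations=10):
    t = defaultdict(lambda: 1.0)                    # t(f | e), uniform start
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for e_sent, f_sent in bitext:
            for f in f_sent.split():
                cands = [e for e in e_sent.split() if (e, f) in allowed] \
                        or e_sent.split()           # fall back if nothing is allowed
                z = sum(t[(e, f)] for e in cands)
                for e in cands:
                    counts[(e, f)] += t[(e, f)] / z
                    totals[e] += t[(e, f)] / z
        t = defaultdict(float, {(e, f): c / totals[e] for (e, f), c in counts.items()})
    return t

t = constrained_em(BITEXT, ALLOWED)
print(round(t[("house", "maison")], 2), round(t[("house", "voiture")], 2))   # 1.0 0.0
```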

Journal Article
TL;DR: The implementation of a Chinese paraphraser for a Chinese-Japanese spoken-language translation system is described, which uses a pattern-based approach in which the meaning is retained to the greatest possible extent without deep parsing.
Abstract: One of the key issues in spoken-language translation is how to deal with unrestricted expressions in spontaneous utterances. We have developed a paraphraser for use as part of a translation system, and in this paper we describe the implementation of a Chinese paraphraser for a Chinese-Japanese spoken-language translation system. When an input sentence cannot be translated by the transfer engine, the paraphraser automatically transforms the sentence into alternative expressions until one of these alternatives can be translated by the transfer engine. Two primary issues must be dealt with in paraphrasing: how to determine new expressions, and how to retain the meaning of the input sentence. We use a pattern-based approach in which the meaning is retained to the greatest possible extent without deep parsing. The paraphrase patterns are acquired from a paraphrase corpus and human experience. The paraphrase instances are automatically extracted and then generalized into paraphrase patterns. A total of 1719 paraphrase patterns obtained using this method and an implemented paraphraser were used in a paraphrasing experiment. The results showed that the implemented paraphraser generated 1.7 paraphrases on average for each test sentence and achieved an accuracy of 88%.
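
A toy sketch of the paraphrase-until-translatable loop: when a (stubbed) transfer engine rejects a sentence, paraphrase patterns are applied one after another until some rewrite falls within the engine's coverage. The patterns, coverage set and English examples are stand-ins; the real system works on Chinese with patterns learned from a paraphrase corpus.

```python
import re

# Toy sketch of the paraphrase-until-translatable loop: when the (stubbed)
# transfer engine rejects a sentence, paraphrase patterns are applied one
# after another until some rewrite falls within the engine's coverage.  The
# patterns, coverage set and English examples are stand-ins; the real system
# works on Chinese with patterns learned from a paraphrase corpus.

PATTERNS = [(r"\bcould you please (\w+)", r"please \1"),
            (r"\bI would like to (\w+)", r"I want to \1")]

TRANSLATABLE = {"please book a room", "I want to book a room"}   # engine coverage

def translate_or_none(sentence):
    return f"<translation of: {sentence}>" if sentence in TRANSLATABLE else None

def paraphrase_and_translate(sentence):
    result = translate_or_none(sentence)
    if result:
        return result
    for pattern, replacement in PATTERNS:
        candidate = re.sub(pattern, replacement, sentence)
        if candidate != sentence and (result := translate_or_none(candidate)):
            return result
    return None

print(paraphrase_and_translate("could you please book a room"))
```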

Journal Article
TL;DR: An analytical method for design and performance analysis of language models (LM) is described, and an example interactive software tool based on the technique is demonstrated.
Abstract: An analytical method for design and performance analysis of language models (LM) is described, and an example interactive software tool based on the technique is demonstrated. The LM performance analysis does not require on-line simulation or experimentation with the recognition system in which the LM is to be employed. By exploiting parallels with signal detection theory, a profile of the LM as a function of the design parameters is given in a set of curves analogous to a receiver-operating-characteristic display.