
Papers published in "Natural Language Engineering" in 2005


Journal Article
TL;DR: Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
Abstract: Parsing unrestricted text is useful for many language technology applications but requires parsing methods that are both robust and efficient. MaltParser is a language-independent system for data-driven dependency parsing that can be used to induce a parser for a new language from a treebank sample in a simple yet flexible manner. Experimental evaluation confirms that MaltParser can achieve robust, efficient and accurate parsing for a wide range of languages without language-specific enhancements and with rather limited amounts of training data.
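
As a rough illustration of the data-driven, transition-based parsing family that MaltParser belongs to, the sketch below runs an arc-standard parser whose action "classifier" is a hand-filled lookup table standing in for a model induced from a treebank. The feature window, decision table and toy sentence are invented for the example and are not MaltParser's actual API.

```python
# A minimal sketch of arc-standard transition-based dependency parsing, the
# family of methods MaltParser implements.  The "classifier" is a hand-filled
# lookup table standing in for a model induced from a treebank; the feature
# window, decision table and toy sentence are invented for the example.

def arc_standard_parse(n_words, tags, predict):
    """Parse word indices 1..n_words; index 0 is the artificial ROOT."""
    stack, buf, arcs = [0], list(range(1, n_words + 1)), []
    while buf or len(stack) > 1:
        action = predict(stack, buf, tags)
        if action == "SHIFT":
            stack.append(buf.pop(0))
        elif action == "LEFT-ARC":            # second-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        else:                                 # "RIGHT-ARC": top depends on second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs                               # (head, dependent) index pairs

# Stand-in for the induced classifier: the action is keyed on the POS tags of
# the two topmost stack items and the first buffer word.
DECISIONS = {
    (None,   "ROOT", "DT"):  "SHIFT",
    ("ROOT", "DT",   "NN"):  "SHIFT",
    ("DT",   "NN",   "VBZ"): "LEFT-ARC",
    ("ROOT", "NN",   "VBZ"): "SHIFT",
    ("NN",   "VBZ",  None):  "LEFT-ARC",
    ("ROOT", "VBZ",  None):  "RIGHT-ARC",
}

def predict(stack, buf, tags):
    s2 = tags[stack[-2]] if len(stack) > 1 else None
    s1 = tags[stack[-1]]
    b1 = tags[buf[0]] if buf else None
    return DECISIONS[(s2, s1, b1)]

tags = ["ROOT", "DT", "NN", "VBZ"]            # "the cat sleeps"
print(arc_standard_parse(3, tags, predict))   # [(2, 1), (3, 2), (0, 3)]
```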

801 citations


Journal Article
TL;DR: Several Chinese linguistic issues and their implications for treebanking efforts are discussed, together with how these issues are addressed in the annotation guidelines and the engineering strategies used to improve speed while ensuring annotation quality.
Abstract: With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.

640 citations


Journal Article
TL;DR: Parallel text is used to help create syntactic annotation in more languages: the English side of a parallel corpus is annotated, the analysis is projected to the second language, and a stochastic analyzer is trained on the resulting noisy annotations.
Abstract: Broad coverage, high quality parsers are available for only a handful of languages. A prerequisite for developing broad coverage parsers for more languages is the annotation of text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is a labor intensive and time-consuming process, and it is difficult to find linguistically annotated text in sufficient quantities. In this article, we explore using parallel text to help solving the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English.
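
A minimal sketch of the central projection step, under the simplifying assumption of clean one-to-one word alignments (the article's experiments have to cope with much messier links): English dependency arcs are carried over to the target sentence wherever both endpoints are aligned. The sentences, arcs and alignment below are toy examples, not the article's data.

```python
# Illustrative sketch (not the article's code) of the core projection step:
# English dependency arcs are carried over to the target sentence wherever
# both the head and the dependent are aligned.  One-to-one links are assumed;
# the sentences, arcs and alignment below are toy examples.

def project_dependencies(en_arcs, alignment):
    """en_arcs: (head, dep) index pairs; alignment: English index -> target index."""
    return [(alignment[head], alignment[dep])
            for head, dep in en_arcs
            if head in alignment and dep in alignment]

# English "the house is red" projected onto a toy target sentence.
en_arcs = [(2, 1), (3, 2), (3, 4), (0, 3)]    # house<-the, is<-house, is<-red, ROOT<-is
alignment = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}    # monotone one-to-one word links
print(project_dependencies(en_arcs, alignment))
```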

384 citations


Journal Article
TL;DR: A method is presented for detecting and correcting many real-word spelling errors by identifying tokens that are semantically unrelated to their context but are spelling variations of words that would be related to it.
Abstract: Spelling errors that happen to result in a real word in the lexicon cannot be detected by a conventional spelling checker. We present a method for detecting and correcting many such errors by identifying tokens that are semantically unrelated to their context and are spelling variations of words that would be related to the context. Relatedness to context is determined by a measure of semantic distance initially proposed by Jiang and Conrath (1997). We tested the method on an artificial corpus of errors; it achieved recall of 23–50% and precision of 18–25%.
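
The detection logic can be summarised in a few lines: flag a token that is semantically unrelated to its context when one of its spelling variations is related. In the sketch below the relatedness test and the variation generator are toy stand-ins; the article uses the Jiang-Conrath semantic distance and a proper set of spelling variations.

```python
# Sketch of the detection logic only: a token is flagged when it is
# semantically unrelated to its context but one of its spelling variations is.
# The relatedness test and the variation generator are toy stand-ins for the
# Jiang-Conrath distance and a real set of spelling variations.

def detect_real_word_errors(tokens, related, variations, window=3):
    corrections = {}
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if any(related(word, c) for c in context):
            continue                                  # word fits its context
        for candidate in variations(word):
            if any(related(candidate, c) for c in context):
                corrections[i] = candidate            # related spelling variant found
                break
    return corrections

# Toy stand-ins for illustration only.
RELATED = {("dessert", "cake"), ("dessert", "sweet"), ("coffee", "cake")}
related = lambda a, b: (a, b) in RELATED or (b, a) in RELATED
variations = lambda w: {"desert": ["dessert"]}.get(w, [])

tokens = "she served coffee and desert after the cake".split()
print(detect_real_word_errors(tokens, related, variations))   # {4: 'dessert'}
```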

194 citations


Journal Article
TL;DR: The authors investigated sources of human annotator disagreements stemming from the tagging for the English Verb Lexical Sample Task in the Senseval-2 exercise in automatic word sense disambiguation.
Abstract: In this paper we discuss a persistent problem arising from polysemy: namely the difficulty of finding consistent criteria for making fine-grained sense distinctions, either manually or automatically. We investigate sources of human annotator disagreements stemming from the tagging for the English Verb Lexical Sample Task in the Senseval-2 exercise in automatic Word Sense Disambiguation. We also examine errors made by a high-performing maximum entropy Word Sense Disambiguation system we developed. Both sets of errors are at least partially reconciled by a more coarse-grained view of the senses, and we present the groupings we use for quantitative coarse-grained evaluation as well as the process by which they were created. We compare the system’s performance with our human annotator performance in light of both fine-grained and coarse-grained sense distinctions and show that well-defined sense groups can be of value in improving word sense disambiguation by both humans and machines.
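
The effect of coarse-grained scoring is easy to show in code: two fine-grained sense labels that disagree still count as correct when they fall into the same group. The sense inventory and grouping below are invented; the article derives its groups from the Senseval-2 verb senses.

```python
# Sketch of coarse-grained scoring: two sense assignments that disagree at the
# fine-grained level count as correct when the senses fall into the same group.
# The sense labels and grouping below are invented; the article derives its
# groups from the Senseval-2 verb sense inventory.

def accuracy(system, gold, group_of=None):
    match = lambda s, g: s == g if group_of is None else group_of[s] == group_of[g]
    return sum(match(s, g) for s, g in zip(system, gold)) / len(gold)

# Hypothetical fine-grained senses of "call" mapped onto two coarse groups.
GROUPS = {"call.1": "utter", "call.2": "utter", "call.3": "name", "call.4": "name"}

gold   = ["call.1", "call.3", "call.2", "call.4"]
system = ["call.2", "call.3", "call.1", "call.3"]
print(accuracy(system, gold))           # 0.25  (fine-grained)
print(accuracy(system, gold, GROUPS))   # 1.0   (coarse-grained)
```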

157 citations


Journal Article
TL;DR: Results suggest that the transfer approach is one promising solution to the resource bottleneck and allows for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
Abstract: In this article we illustrate and evaluate an approach to create high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The transfer approach has been tested and extensively applied for the creation of the MultiSemCor corpus, an English/Italian parallel corpus created on the basis of the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and word sense annotated with a shared inventory of senses. A number of experiments have been carried out to evaluate the different steps involved in the methodology and the results suggest that the transfer approach is one promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which represents a crucial resource per se. Second, it allows for the exploitation of existing (mostly English) annotated resources to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
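
A minimal sketch of the transfer step, illustrative only and not the MultiSemCor tooling: sense tags on English tokens are copied to the aligned Italian tokens, again assuming clean one-to-one links. The sense labels and alignment are placeholders.

```python
# Sketch of the sense-transfer step (illustrative, not the MultiSemCor tools):
# sense tags on English tokens are copied to the aligned Italian tokens,
# assuming clean one-to-one links.  Sense labels and alignment are placeholders.

def transfer_senses(en_senses, alignment):
    """en_senses: English index -> sense tag; alignment: English -> Italian index."""
    return {alignment[i]: sense for i, sense in en_senses.items() if i in alignment}

# "He read the book" / "Lui lesse il libro" with toy one-to-one links.
en_senses = {1: "read#v#1", 3: "book#n#1"}
alignment = {0: 0, 1: 1, 2: 2, 3: 3}
print(transfer_senses(en_senses, alignment))   # {1: 'read#v#1', 3: 'book#n#1'}
```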

100 citations


Journal Article
TL;DR: The best results, well above baseline, were achieved with time-delay networks that used features related to the author's syntactic preferences, whereas low-level and vocabulary-based features were not found to be useful.
Abstract: As part of a larger project to develop an aid for writers that would help to eliminate stylistic inconsistencies within a document, we experimented with neural networks to find the points in a text at which its stylistic character changes. Our best results, well above baseline, were achieved with time-delay networks that used features related to the author's syntactic preferences, whereas low-level and vocabulary-based features were not found to be useful. An alternative approach with character bigrams was not successful.
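
As a hedged illustration of the time-delay idea rather than the authors' network: the classifier sees a sliding window of per-sentence feature vectors, so a decision about a style change at sentence t can draw on a few sentences of context on either side. The features and weights below are random placeholders, so the output scores are meaningless; only the windowing structure is the point.

```python
import numpy as np

# Sketch of the time-delay idea only, not the authors' network: the classifier
# sees a sliding window of per-sentence feature vectors, so the decision about
# a style change at sentence t can use a few sentences of context either side.
# Features and weights are random placeholders; only the windowing matters here.

rng = np.random.default_rng(0)

def time_delay_scores(features, w, b, delay=2):
    """features: (n_sentences, n_feats); window spans `delay` sentences each side."""
    n, _ = features.shape
    scores = []
    for t in range(delay, n - delay):
        window = features[t - delay:t + delay + 1].ravel()   # concatenated context
        scores.append(1 / (1 + np.exp(-(w @ window + b))))   # P(style change at t)
    return np.array(scores)

n_sent, n_feats, delay = 12, 5, 2
feats = rng.normal(size=(n_sent, n_feats))        # e.g. syntactic-preference features
w = rng.normal(size=(2 * delay + 1) * n_feats)    # untrained weights, illustration only
print(time_delay_scores(feats, w, b=0.0, delay=delay).round(2))
```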

94 citations


Journal Article
TL;DR: This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition, which uses the Random Indexing vector space methodology, and explains how this approach differs from traditional cooccurrence-based word alignment algorithms.
Abstract: This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach, which uses the Random Indexing vector space methodology, is based on finding correlations between terms based on their distributional characteristics. The approach requires a minimum of preprocessing and linguistic knowledge, and is efficient, fast and scalable. In this paper, we explain how our approach differs from traditional cooccurrence-based word alignment algorithms, and we demonstrate how to extract bilingual lexica using the Random Indexing approach applied to aligned parallel data. The acquired lexica are evaluated by comparing them to manually compiled gold standards, and we report overlap of around 60%. We also discuss methodological problems with evaluating lexical resources of this kind.
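
The sketch below shows one common formulation of Random Indexing over aligned sentence pairs, which may differ in detail from the paper's setup: each pair receives a sparse random index vector, every word accumulates the index vectors of the pairs it occurs in, and translation candidates are ranked by cosine similarity. The dimensionality, sparsity and toy bitext are arbitrary choices.

```python
import numpy as np

# One common formulation of Random Indexing over aligned sentence pairs (the
# details may differ from the paper's setup): each pair receives a sparse
# random index vector, every word sums the index vectors of the pairs it
# occurs in, and translation candidates are ranked by cosine similarity.

rng = np.random.default_rng(1)
DIM, NONZERO = 512, 8                             # arbitrary choices

def index_vector():
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1, 1], size=NONZERO)
    return v

def context_vectors(bitext):
    vecs = {}
    for src_sent, trg_sent in bitext:
        iv = index_vector()                       # one index vector per sentence pair
        for w in src_sent.split() + trg_sent.split():
            vecs[w] = vecs.get(w, np.zeros(DIM)) + iv
    return vecs

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

bitext = [("the house is red", "huset är rött"),
          ("a red car", "en röd bil"),
          ("the house is big", "huset är stort")]
vecs = context_vectors(bitext)
print(round(cos(vecs["house"], vecs["huset"]), 2),   # high: shared contexts
      round(cos(vecs["house"], vecs["bil"]), 2))     # low: no shared contexts
```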

81 citations


Journal Article
TL;DR: This work presents an approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation.
Abstract: Multimodal interfaces are systems that allow input and/or output to be conveyed over multiple channels such as speech, graphics, and gesture. In addition to parsing and understanding separate utterances from different modes such as speech or gesture, multimodal interfaces also need to parse and understand composite multimodal utterances that are distributed over multiple input modes. We present an approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation. In comparison to previous approaches, this approach is significantly more efficient and provides a more general probabilistic framework for multimodal ambiguity resolution. The approach also enables tight-coupling of multimodal understanding with speech recognition. Since the finite-state approach is more lightweight in computational needs, it can be more readily deployed on a broader range of mobile platforms. We provide speech recognition results that demonstrate compensation effects of exploiting gesture information in a directory assistance and messaging task using a multimodal interface.
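
The following is a heavily simplified, hedged sketch of the joint-interpretation idea; the paper uses genuine weighted finite-state transducers, whereas here the speech and gesture streams are just weighted hypothesis lists and the "transducer" is a lookup table of toy commands. Costs behave like negative log probabilities, so lower is better.

```python
# Heavily simplified sketch: the paper uses genuine weighted finite-state
# transducers, whereas here the speech and gesture streams are just weighted
# hypothesis lists and the "transducer" is a lookup table of toy commands.
# Costs behave like negative log probabilities, so lower is better.

SPEECH = [("phone this person", 0.5), ("show this person", 1.5)]    # ASR lattice
GESTURE = [("point:person_3", 0.25), ("point:person_7", 1.25)]      # gesture lattice

GRAMMAR = {                     # which (speech, gesture) pairs compose, and to what
    ("phone this person", "point:person_3"): "CALL(person_3)",
    ("phone this person", "point:person_7"): "CALL(person_7)",
    ("show this person",  "point:person_3"): "DISPLAY(person_3)",
    ("show this person",  "point:person_7"): "DISPLAY(person_7)",
}

def joint_interpretations(speech, gesture, grammar):
    hyps = [(grammar[(s, g)], sc + gc)
            for s, sc in speech for g, gc in gesture if (s, g) in grammar]
    return sorted(hyps, key=lambda h: h[1])          # best joint reading first

print(joint_interpretations(SPEECH, GESTURE, GRAMMAR)[0])   # ('CALL(person_3)', 0.75)
```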

68 citations


Journal Article
TL;DR: An approach to term extraction is outlined that rests on theoretical claims about the structure of words, using the structural properties of compound words to elicit the sets of terms defined by type hierarchies such as hyponymy and meronymy.
Abstract: Advances in language engineering may be dependent on theoretical principles originating from linguistics, since both share a common object of enquiry, natural language structures. We outline an approach to term extraction that rests on theoretical claims about the structure of words. We use the structural properties of compound words to specifically elicit the sets of terms defined by type hierarchies such as hyponymy and meronymy. The theoretical claims revolve around the head-modifier principle, which determines the formation of a major class of compounds. Significantly it has been suggested that the principle operates in languages other than English. To demonstrate the extendibility of our approach beyond English, we present a case study of term extraction in Chinese, a language whose written form is the vehicle of communication for over 1.3 billion language users, and therefore has great significance for the development of language engineering technologies.
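
The head-modifier reading the approach relies on can be illustrated with English noun-noun compounds, where the rightmost noun is the head, so a compound can be read as a hyponym of its head. The toy extractor below shows only that reading; real term extraction, particularly for Chinese, also requires segmentation and filtering.

```python
# Sketch of the head-modifier reading the approach relies on: in an English
# noun-noun compound the rightmost noun is the head, so the compound can be
# read as a hyponym of its head.  The compounds below are toy input; real term
# extraction (especially for Chinese) also needs segmentation and filtering.

def hyponym_pairs(compounds):
    pairs = []
    for compound in compounds:
        parts = compound.split()
        if len(parts) >= 2:
            pairs.append((compound, parts[-1]))      # (term, its head/hypernym)
    return pairs

terms = ["bird flu", "avian bird flu", "flu vaccine", "vaccine"]
for term, head in hyponym_pairs(terms):
    print(f"{term!r} IS-A {head!r}")
```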

56 citations


Journal Article
TL;DR: It is argued that the XML-processing paradigm is ideally suited for automatically preparing the corpus for parsing, and it is demonstrated that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers.
Abstract: This paper reports on a number of experiments which are designed to investigate the extent to which current NLP resources are able to syntactically and semantically analyse biomedical text. We address two tasks: (a) parsing a real corpus with a hand-built wide-coverage grammar, producing both syntactic analyses and logical forms and (b) automatically computing the interpretation of compound nouns where the head is a nominalisation (e.g. hospital arrival means an arrival at hospital, while patient arrival means an arrival of a patient). For the former task we demonstrate that flexible and yet constrained pre-processing techniques are crucial to success: these enable us to use part-of-speech tags to overcome inadequate lexical coverage, and to package up complex technical expressions prior to parsing so that they are blocked from creating misleading amounts of syntactic complexity. We argue that the XML-processing paradigm is ideally suited for automatically preparing the corpus for parsing. For the latter task, we compute interpretations of the compounds by exploiting surface cues and meaning paraphrases, which in turn are extracted from the parsed corpus. This provides an empirical setting in which we can compare the utility of a comparatively deep parser vs. a shallow one, exploring the trade-off between resolving attachment ambiguities on the one hand and generating errors in the parses on the other. We demonstrate that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers.
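
One of the pre-processing steps, packaging up complex technical expressions before parsing, can be sketched as follows: multiword terms found in a lexicon are collapsed into single tokens so they cannot create spurious attachment ambiguity. The term list and sentence are invented, and the real pipeline operates over XML markup rather than plain token lists.

```python
# Sketch of the "package up complex technical expressions" idea: multiword
# terms found in a lexicon are collapsed into single tokens before parsing so
# they cannot introduce spurious attachment ambiguity.  The term list and the
# sentence are invented; the real pipeline operates over XML markup.

TERMS = {("tumour", "necrosis", "factor"), ("cell", "cycle")}

def package_terms(tokens, terms, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):              # prefer the longest match
            if i + n <= len(tokens) and tuple(tokens[i:i + n]) in terms:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

sent = "expression of tumour necrosis factor during the cell cycle".split()
print(package_terms(sent, TERMS))
```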

Journal Article
TL;DR: Tests on a 4.8 million word bitext indicate that while SMT appears to outperform the authors' system for French-English on a number of metrics, for English-French, on all but one automatic evaluation metric, the performance of the EBMT system is superior to the baseline SMT model.
Abstract: In previous work (Gough and Way 2004), we showed that our Example-Based Machine Translation (EBMT) system improved with respect to both coverage and quality when seeded with increasing amounts of training data, so that it significantly outperformed the on-line MT system Logomedia according to a wide variety of automatic evaluation metrics. While it is perhaps unsurprising that system performance is correlated with the amount of training data, we address in this paper the question of whether a large-scale, robust EBMT system such as ours can outperform a Statistical Machine Translation (SMT) system. We obtained a large English-French translation memory from Sun Microsystems from which we randomly extracted a near 4K test set. The remaining data was split into three training sets, of roughly 50K, 100K and 200K sentence-pairs in order to measure the effect of increasing the size of the training data on the performance of the two systems. Our main observation is that contrary to perceived wisdom in the field, there appears to be little substance to the claim that SMT systems are guaranteed to outperform EBMT systems when confronted with ‘enough’ training data. Our tests on a 4.8 million word bitext indicate that while SMT appears to outperform our system for French-English on a number of metrics, for English-French, on all but one automatic evaluation metric, the performance of our EBMT system is superior to the baseline SMT model.

Journal Article
TL;DR: A novel approach to multi-document summarization is presented that explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections, applying Iterative Residual Rescaling (IRR).
Abstract: This paper describes a novel approach to multi-document summarization, which explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections. We place equal emphasis on the processes of theme identification and theme presentation. For the former, we apply Iterative Residual Rescaling (IRR); for the latter, we argue for graphical display elements. IRR is an algorithm designed to account for correlations between words and to construct multi-dimensional topical space indicative of relationships among linguistic objects (documents, phrases, and sentences). Summaries are composed of objects with certain properties, derived by exploiting the many-to-many relationships in such a space. Given their inherent complexity, our multi-faceted summaries benefit from a visualization environment. We discuss some essential features of such an environment.
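
A compact sketch of the rescaling idea behind Iterative Residual Rescaling, with the scaling exponent, the SVD step and the toy term-document matrix chosen for illustration only: after each basis vector is extracted, the remaining document residuals are rescaled by their lengths before the next direction is computed, which helps minority themes surface.

```python
import numpy as np

# Compact sketch of the rescaling idea behind Iterative Residual Rescaling:
# like LSA it extracts a topical subspace, but after each basis vector the
# remaining document residuals are rescaled by their lengths (to a power q),
# which helps minority themes surface.  The exponent, the SVD step and the
# toy term-document matrix are chosen for illustration only.

def irr_basis(docs, k, q=1.5):
    """docs: (n_docs, n_terms) matrix; returns k orthonormal topical directions."""
    residual = docs.astype(float).copy()
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        scaled = (norms ** q)[:, None] * residual        # rescale each residual
        _, _, vt = np.linalg.svd(scaled, full_matrices=False)
        b = vt[0]                                        # dominant direction
        basis.append(b)
        residual = residual - np.outer(residual @ b, b)  # remove that direction
    return np.array(basis)

docs = np.array([[3, 2, 0, 0],      # theme A (majority)
                 [2, 3, 0, 0],
                 [4, 1, 0, 0],
                 [0, 0, 1, 2]])     # theme B (minority)
B = irr_basis(docs, k=2)
print((docs @ B.T).round(2))        # document coordinates in the topical space
```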

Journal Article
TL;DR: An algorithm for clarification dialogue recognition is developed through the analysis of collected data on clarification dialogues, its importance for question answering is examined, and its usefulness is shown by demonstrating how recognizing clarification dialogues can simplify the task of answer retrieval.
Abstract: We examine, in the context of open domain Question Answering systems, the implementation of clarification dialogues: a mechanism for ensuring that question answering systems take user goals into account by allowing users to ask a series of related questions that refine or expand on previous questions with follow-ups. We develop an algorithm for clarification dialogue recognition through the analysis of collected data on clarification dialogues and examine the importance of clarification dialogue recognition for question answering. The algorithm is evaluated and shown to successfully recognize the start and continuation of clarification dialogues in 94% of cases. We then show the usefulness of the algorithm by demonstrating how the recognition of clarification dialogues can simplify the task of answer retrieval.

Journal Article
TL;DR: An architecture for robust parsing of natural language utterances has been developed, using a plausibility-based arbitration procedure to derive fairly rich structural representations, comprising aspects of syntax, semantics and other description levels of language.
Abstract: Based on constraint optimization techniques, an architecture for robust parsing of natural language utterances has been developed. The resulting system is able to combine possibly contradicting evidence from a variety of information sources, using a plausibility-based arbitration procedure to derive fairly rich structural representations, comprising aspects of syntax, semantics and other description levels of language. The results of a series of experiments are reported which demonstrate the high potential for robust behaviour with respect to ungrammaticality, incomplete utterances, and temporal pressure.

Journal Article
TL;DR: This work allows the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by 26.4%, yet is simple enough that it can be learned easily and generated relatively fluently.
Abstract: We address the problem of improving the efficiency of natural language text input under degraded conditions (for instance, on mobile computing devices or by disabled users), by taking advantage of the informational redundancy in natural language. Previous approaches to this problem have been based on the idea of prediction of the text, but these require the user to take overt action to verify or select the system's predictions. We propose taking advantage of the duality between prediction and compression . We allow the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by 26.4%, yet is simple enough that it can be learned easily and generated relatively fluently. We decode the abbreviated text using a statistical generative model of abbreviation, with a residual word error rate of 3.3%. The chief component of this model is an n -gram language model. Because the system's operation is completely independent from the user's, the overhead from cognitive task switching and attending to the system's actions online is eliminated, opening up the possibility that the compression-based method can achieve text input efficiency improvements where the prediction-based methods have not. We report the results of a user study evaluating this method.
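
The decoding idea can be sketched as a noisy-channel beam search: recover the word sequence maximising the product of an n-gram language model score and a channel score for producing the observed abbreviations. The candidate lists, channel scores and bigram probabilities below are toy numbers, not the paper's models.

```python
import math

# Sketch of the decoding idea as a noisy-channel beam search: recover the
# word sequence maximising P(words) * P(abbreviation | words), with P(words)
# from an n-gram (here bigram) language model.  Candidate lists, channel
# scores and bigram probabilities are toy numbers, not the paper's models.

CANDIDATES = {                      # abbrev -> [(word, P(abbrev | word)), ...]
    "t":   [("the", 0.9), ("to", 0.1)],
    "mtg": [("meeting", 0.8), ("mating", 0.2)],
    "s":   [("is", 0.6), ("was", 0.4)],
}
BIGRAM = {("<s>", "the"): 0.2, ("<s>", "to"): 0.1,
          ("the", "meeting"): 0.05, ("to", "meeting"): 0.02,
          ("the", "mating"): 0.001, ("to", "mating"): 0.001,
          ("meeting", "is"): 0.1, ("meeting", "was"): 0.05,
          ("mating", "is"): 0.05, ("mating", "was"): 0.05}

def decode(abbrevs, beam=5):
    beams = [("<s>", 0.0, [])]                        # (last word, log prob, words)
    for a in abbrevs:
        new = []
        for prev, lp, words in beams:
            for w, p_chan in CANDIDATES[a]:
                p_lm = BIGRAM.get((prev, w), 1e-6)
                new.append((w, lp + math.log(p_lm) + math.log(p_chan), words + [w]))
        beams = sorted(new, key=lambda h: -h[1])[:beam]
    return beams[0][2]

print(decode(["t", "mtg", "s"]))                      # ['the', 'meeting', 'is']
```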

Journal Article
TL;DR: The clue alignment approach and the optimization of its parameters using a genetic algorithm are described, showing a significant improvement of about 6% in F-scores over the baseline produced by statistical word alignment.
Abstract: Statistical, linguistic, and heuristic clues can be used for the alignment of words and multi-word units in parallel texts. This article describes the clue alignment approach and the optimization of its parameters using a genetic algorithm. Word alignment clues can come from various sources such as statistical alignment models, co-occurrence tests, string similarity scores and static dictionaries. A genetic algorithm implementing an evolutionary procedure can be used to optimize the parameters necessary for combining available clues. Experiments on English/Swedish bitext show a significant improvement of about 6% in F-scores compared to the baseline produced by statistical word alignment.
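
A toy sketch of the weight-optimisation step: several clue sources score candidate word pairs, the weighted sum is thresholded to produce links, and a small genetic algorithm searches for clue weights that maximise F-score against gold links. The clue scores, gold links, population size and mutation rate are all invented for the example.

```python
import random

# Toy sketch of the weight-optimisation step: several clue sources score
# candidate word pairs, the weighted sum is thresholded into links, and a
# small genetic algorithm searches for clue weights that maximise F-score
# against gold links.  Clue scores, gold links and GA settings are invented.

random.seed(0)

PAIRS = [("house", "huset"), ("house", "är"), ("red", "rött"),
         ("is", "är"), ("is", "rött"), ("red", "är")]
CLUES = {p: [random.random() for _ in range(3)] for p in PAIRS}   # 3 clue sources
GOLD = {("house", "huset"), ("red", "rött"), ("is", "är")}

def f_score(weights, threshold=0.5):
    links = {p for p, clues in CLUES.items()
             if sum(w * c for w, c in zip(weights, clues)) > threshold}
    if not links:
        return 0.0
    prec, rec = len(links & GOLD) / len(links), len(links & GOLD) / len(GOLD)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def evolve(pop_size=20, generations=30, mutation=0.2):
    pop = [[random.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f_score, reverse=True)
        parents = pop[:pop_size // 2]                 # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = [random.choice(genes) for genes in zip(a, b)]       # crossover
            child = [min(1, max(0, g + random.gauss(0, mutation))) for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=f_score)

best = evolve()
print([round(w, 2) for w in best], round(f_score(best), 2))
```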

Journal Article
TL;DR: Through an HMM and a constraint relaxation algorithm that deals with data sparseness, PowerNE is able to apply and integrate various internal and external evidences of entity names, resolving the named entity recognition problem effectively.
Abstract: Named entity recognition identifies and classifies entity names in a text document into some predefined categories. It resolves the "who", "where" and "how much" problems in information extraction and leads to the resolution of the "what" and "how" problems in further processing. This paper presents a Hidden Markov Model (HMM) and proposes an HMM-based named entity recognizer implemented as the system PowerNE. Through the HMM and an effective constraint relaxation algorithm to deal with the data sparseness problem, PowerNE is able to effectively apply and integrate various internal and external evidences of entity names. Currently, four evidences are included: (1) a simple deterministic internal feature of the words, such as capitalization and digitalization; (2) an internal semantic feature of the important triggers; (3) an internal gazetteer feature, which determines the appearance of the current word string in the provided gazetteer list; and (4) an external macro context feature, which deals with the name alias phenomena. In this way, the named entity recognition problem is resolved effectively. PowerNE has been benchmarked with the Message Understanding Conferences (MUC) data. The evaluation shows that, using the formal training and test data of the MUC-6 and MUC-7 English named entity tasks, it achieves F-measures of 96.6 and 94.1, respectively. Compared with the best reported machine learning system, it achieves a 1.7 higher F-measure with one quarter of the training data on MUC-6, and a 3.6 higher F-measure with one ninth of the training data on MUC-7. In addition, it performs slightly better than the best reported handcrafted rule-based systems on MUC-6 and MUC-7.
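
As a hedged illustration of the HMM view of named entity recognition, not PowerNE itself: tags such as PER, ORG and O are hidden states, words are observations, and Viterbi decoding selects the best tag sequence. All probabilities below are toy numbers; the actual system additionally folds in the internal and external evidence described in the abstract.

```python
import math

# Hedged illustration of the HMM view of named entity recognition, not PowerNE
# itself: tags such as PER, ORG and O are hidden states, words are observations,
# and Viterbi decoding selects the best tag sequence.  All probabilities are
# toy numbers; the real system also folds in the evidence features above.

TAGS = ["O", "PER", "ORG"]
TRANS = {("<s>", "O"): 0.6, ("<s>", "PER"): 0.2, ("<s>", "ORG"): 0.2,
         ("O", "O"): 0.7, ("O", "PER"): 0.15, ("O", "ORG"): 0.15,
         ("PER", "O"): 0.6, ("PER", "PER"): 0.3, ("PER", "ORG"): 0.1,
         ("ORG", "O"): 0.6, ("ORG", "ORG"): 0.3, ("ORG", "PER"): 0.1}
EMIT = {("O", "joined"): 0.1, ("O", "in"): 0.1, ("O", "1998"): 0.05,
        ("PER", "alice"): 0.3, ("ORG", "acme"): 0.3}

def viterbi(words, smooth=1e-4):
    first = {t: (math.log(TRANS[("<s>", t)]) +
                 math.log(EMIT.get((t, words[0]), smooth)), None) for t in TAGS}
    score = [first]                                   # per word: tag -> (logp, backptr)
    for w in words[1:]:
        prev, col = score[-1], {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p][0] + math.log(TRANS[(p, t)]))
            col[t] = (prev[best][0] + math.log(TRANS[(best, t)]) +
                      math.log(EMIT.get((t, w), smooth)), best)
        score.append(col)
    tags = [max(TAGS, key=lambda t: score[-1][t][0])] # best final tag
    for col in reversed(score[1:]):                   # follow back-pointers
        tags.append(col[tags[-1]][1])
    return list(reversed(tags))

print(viterbi("alice joined acme in 1998".split()))   # ['PER', 'O', 'ORG', 'O', 'O']
```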

Journal Article
TL;DR: It is shown how auxiliary information can be used to constrain the procedure directly by restricting the set of alignments explored during parameter estimation, which enables the integration of bilingual and monolingual knowledge sources while retaining the flexibility of the underlying models.
Abstract: Standard parameter estimation schemes for statistical translation models can struggle to find reasonable settings on some parallel corpora. We show how auxiliary information can be used to constrain the procedure directly by restricting the set of alignments explored during parameter estimation. This enables the integration of bilingual and monolingual knowledge sources while retaining the flexibility of the underlying models. We demonstrate the effectiveness of this approach for incorporating linguistic and domain-specific constraints on various parallel corpora, and consider the importance of using the context of the parallel text to guide the application of such constraints.
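
The constraint idea can be sketched on top of an IBM-Model-1-style E-step: links ruled out by an auxiliary knowledge source (here a tiny seed dictionary) simply receive zero weight during expectation counting. The bitext and dictionary are toy data, and the article's actual models and constraints are considerably richer.

```python
from collections import defaultdict

# Sketch of constraining the alignments explored during estimation: a plain
# IBM-Model-1-style E-step, except that links ruled out by an auxiliary
# knowledge source (here a tiny seed dictionary) get zero weight.  The bitext
# and dictionary are toy data; the real models and constraints are richer.

BITEXT = [("the house", "la maison"), ("the car", "la voiture"),
          ("small house", "petite maison")]
ALLOWED = {("the", "la"), ("house", "maison"), ("car", "voiture"),
           ("small", "petite")}                     # auxiliary constraint

def constrained_em(bitext, allowed, iterations=10):
    t = defaultdict(lambda: 1.0)                    # t(f | e), uniform start
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for e_sent, f_sent in bitext:
            for f in f_sent.split():
                cands = [e for e in e_sent.split() if (e, f) in allowed] \
                        or e_sent.split()           # fall back if nothing is allowed
                z = sum(t[(e, f)] for e in cands)
                for e in cands:
                    counts[(e, f)] += t[(e, f)] / z
                    totals[e] += t[(e, f)] / z
        t = defaultdict(float, {(e, f): c / totals[e] for (e, f), c in counts.items()})
    return t

t = constrained_em(BITEXT, ALLOWED)
print(round(t[("house", "maison")], 2), round(t[("house", "voiture")], 2))   # 1.0 0.0
```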

Journal Article
TL;DR: The implementation of a Chinese paraphraser for a Chinese-Japanese spoken-language translation system is described, which uses a pattern-based approach in which the meaning is retained to the greatest possible extent without deep parsing.
Abstract: One of the key issues in spoken-language translation is how to deal with unrestricted expressions in spontaneous utterances. We have developed a paraphraser for use as part of a translation system, and in this paper we describe the implementation of a Chinese paraphraser for a Chinese-Japanese spoken-language translation system. When an input sentence cannot be translated by the transfer engine, the paraphraser automatically transforms the sentence into alternative expressions until one of these alternatives can be translated by the transfer engine. Two primary issues must be dealt with in paraphrasing: how to determine new expressions, and how to retain the meaning of the input sentence. We use a pattern-based approach in which the meaning is retained to the greatest possible extent without deep parsing. The paraphrase patterns are acquired from a paraphrase corpus and human experience. The paraphrase instances are automatically extracted and then generalized into paraphrase patterns. A total of 1719 paraphrase patterns obtained using this method and an implemented paraphraser were used in a paraphrasing experiment. The results showed that the implemented paraphraser generated 1.7 paraphrases on average for each test sentence and achieved an accuracy of 88%.
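
A toy sketch of the paraphrase-until-translatable loop: when a (stubbed) transfer engine rejects a sentence, paraphrase patterns are applied one after another until some rewrite falls within the engine's coverage. The patterns, coverage set and English examples are stand-ins; the real system works on Chinese with patterns learned from a paraphrase corpus.

```python
import re

# Toy sketch of the paraphrase-until-translatable loop: when the (stubbed)
# transfer engine rejects a sentence, paraphrase patterns are applied one
# after another until some rewrite falls within the engine's coverage.  The
# patterns, coverage set and English examples are stand-ins; the real system
# works on Chinese with patterns learned from a paraphrase corpus.

PATTERNS = [(r"\bcould you please (\w+)", r"please \1"),
            (r"\bI would like to (\w+)", r"I want to \1")]

TRANSLATABLE = {"please book a room", "I want to book a room"}   # engine coverage

def translate_or_none(sentence):
    return f"<translation of: {sentence}>" if sentence in TRANSLATABLE else None

def paraphrase_and_translate(sentence):
    result = translate_or_none(sentence)
    if result:
        return result
    for pattern, replacement in PATTERNS:
        candidate = re.sub(pattern, replacement, sentence)
        if candidate != sentence and (result := translate_or_none(candidate)):
            return result
    return None

print(paraphrase_and_translate("could you please book a room"))
```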

Journal Article
TL;DR: An analytical method for design and performance analysis of language models (LM) is described, and an example interactive software tool based on the technique is demonstrated.
Abstract: An analytical method for design and performance analysis of language models (LM) is described, and an example interactive software tool based on the technique is demonstrated. The LM performance analysis does not require on-line simulation or experimentation with the recognition system in which the LM is to be employed. By exploiting parallels with signal detection theory, a profile of the LM as a function of the design parameters is given in a set of curves analogous to a receiver-operating-characteristic display.