
Showing papers on "Rule-based machine translation published in 2006"


Journal ArticleDOI
TL;DR: A new (proportional) 2-tuple fuzzy linguistic representation model for computing with words (CW), based on the concept of "symbolic proportion," which allows the initial linguistic information to be described by members of a "continuous" linguistic scale domain without requiring the ordered terms of a linguistic variable to be equidistant.
Abstract: In this paper, we provide a new (proportional) 2-tuple fuzzy linguistic representation model for computing with words (CW), which is based on the concept of "symbolic proportion." This concept motivates us to represent the linguistic information by means of 2-tuples, which are composed of two proportional linguistic terms. For clarity and generality, we first study proportional 2-tuples under ordinal contexts. Then, under linguistic contexts and based on canonical characteristic values (CCVs) of linguistic labels, we define many aggregation operators to handle proportional 2-tuple linguistic information in a computational stage for CW without any loss of information. Our approach for this proportional 2-tuple fuzzy linguistic representation model deals with linguistic labels which do not have to be symmetrically distributed around a medium label and which are free of the traditional requirement of "equal distance" between them. Moreover, this new model not only provides a space to allow a "continuous" interpolation of a sequence of ordered linguistic labels, but also provides an opportunity to describe the initial linguistic information by members of a "continuous" linguistic scale domain, which does not necessarily require the ordered linguistic terms of a linguistic variable to be equidistant. Meanwhile, under the assumption of equal informativeness (defined by a condition based on the concept of CCV), we show that our model reduces to Herrera and Martínez's (translational) 2-tuple fuzzy linguistic representation model.
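
A minimal sketch of the core idea, with notation assumed here rather than taken verbatim from the paper: a proportional 2-tuple pairs two consecutive linguistic labels with complementary proportions, and aggregation can then be carried out through the labels' canonical characteristic values (CCVs),

\[ (\alpha L_i,\ (1-\alpha) L_{i+1}), \qquad \alpha \in [0,1], \]
\[ \mathrm{CCV}\big(\alpha L_i,\ (1-\alpha) L_{i+1}\big) \;=\; \alpha\,\mathrm{CCV}(L_i) + (1-\alpha)\,\mathrm{CCV}(L_{i+1}), \]

so the labels need be neither symmetric around a middle term nor equidistant for the arithmetic to go through.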

467 citations


Proceedings ArticleDOI
08 Jun 2006
TL;DR: This work uses a target language parser to generate parse trees for each sentence on the target side of the bilingual training corpus, matching them with phrase table lattices built for the corresponding source sentence.
Abstract: We present translation results on the shared task "Exploiting Parallel Texts for Statistical Machine Translation" generated by a chart parsing decoder operating on phrase tables augmented and generalized with target language syntactic categories. We use a target language parser to generate parse trees for each sentence on the target side of the bilingual training corpus, matching them with phrase table lattices built for the corresponding source sentence. Considering phrases that correspond to syntactic categories in the parse trees, we develop techniques to augment (declare a syntactically motivated category for a phrase pair) and generalize (form mixed terminal and nonterminal phrases) the phrase table into a synchronous bilingual grammar. We present results on the French-to-English task for this workshop, representing significant improvements over the workshop's baseline system. Our translation system is available open-source under the GNU General Public License.
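
As a rough illustration of the "augment" step described above (the span representation, the fallback label, and the function name are assumptions for this sketch, not the authors' code), a phrase pair can be labelled with the constituent category that exactly covers its target-side span, falling back to a generic label otherwise:

    # Sketch: attach a target-side syntactic category to a phrase pair.
    # parse_spans maps (start, end) spans of the target sentence to
    # constituent labels, e.g. {(0, 2): "NP", (2, 5): "VP", (0, 5): "S"}.
    def augment_phrase_pair(src_phrase, tgt_span, parse_spans, default="X"):
        """Return (category, src_phrase), using the constituent label that
        exactly covers tgt_span, or a generic label if none does."""
        return parse_spans.get(tgt_span, default), src_phrase

    print(augment_phrase_pair("la maison", (0, 2), {(0, 2): "NP", (2, 5): "VP"}))
    # -> ('NP', 'la maison'); an unmatched span would come back as ('X', ...)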

347 citations


Proceedings Article
01 Apr 2006
TL;DR: An approach for extracting relations between entities from biomedical literature based solely on shallow linguistic information is proposed, which outperforms most of the previous methods based on syntactic and semantic information.
Abstract: We propose an approach for extracting relations between entities from biomedical literature based solely on shallow linguistic information. We use a combination of kernel functions to integrate two different information sources: (i) the whole sentence where the relation appears, and (ii) the local contexts around the interacting entities. We performed experiments on extracting gene and protein interactions from two different data sets. The results show that our approach outperforms most of the previous methods based on syntactic and semantic information.
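
A minimal sketch of this kind of kernel combination (the n-gram features, window size, and function names are assumptions, not the paper's exact kernels): one bag-of-n-gram kernel over the whole sentences and one over small windows around the interacting entities, summed into a single kernel.

    from collections import Counter

    def ngram_counts(tokens, n=2):
        """Bag of word n-grams (orders 1..n) for a token sequence."""
        feats = Counter()
        for k in range(1, n + 1):
            for i in range(len(tokens) - k + 1):
                feats[tuple(tokens[i:i + k])] += 1
        return feats

    def dot(c1, c2):
        return sum(v * c2.get(f, 0) for f, v in c1.items())

    def combined_kernel(sent1, ents1, sent2, ents2, window=3):
        """K = K_global (whole sentences) + K_local (windows around entities);
        ents1/ents2 are token positions of the candidate entity mentions."""
        k_global = dot(ngram_counts(sent1), ngram_counts(sent2))
        loc1 = [t for i in ents1 for t in sent1[max(0, i - window):i + window + 1]]
        loc2 = [t for i in ents2 for t in sent2[max(0, i - window):i + window + 1]]
        k_local = dot(ngram_counts(loc1), ngram_counts(loc2))
        return k_global + k_local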

328 citations


Proceedings ArticleDOI
04 Jun 2006
TL;DR: It is shown that WASP performs favorably in terms of both accuracy and coverage compared to existing learning methods requiring a similar amount of supervision, and that it shows better robustness to variations in task complexity and word order.
Abstract: We present a novel statistical approach to semantic parsing, WASP, for constructing a complete, formal meaning representation of a sentence. A semantic parser is learned given a set of sentences annotated with their correct meaning representations. The main innovation of WASP is its use of state-of-the-art statistical machine translation techniques. A word alignment model is used for lexical acquisition, and the parsing model itself can be seen as a syntax-based translation model. We show that WASP performs favorably in terms of both accuracy and coverage compared to existing learning methods requiring a similar amount of supervision, and shows better robustness to variations in task complexity and word order.

306 citations


Proceedings Article
04 Dec 2006
TL;DR: This paper presents a general-purpose inference algorithm for adaptor grammars, making it easy to define and use such models, and illustrates how several existing nonparametric Bayesian models can be expressed within this framework.
Abstract: This paper introduces adaptor grammars, a class of probabilistic models of language that generalize probabilistic context-free grammars (PCFGs). Adaptor grammars augment the probabilistic rules of PCFGs with "adaptors" that can induce dependencies among successive uses. With a particular choice of adaptor, based on the Pitman-Yor process, nonparametric Bayesian models of language using Dirichlet processes and hierarchical Dirichlet processes can be written as simple grammars. We present a general-purpose inference algorithm for adaptor grammars, making it easy to define and use such models, and illustrate how several existing nonparametric Bayesian models can be expressed within this framework.
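
The Pitman-Yor adaptor can be pictured as a cache over previously generated items; a minimal sketch of the reuse-versus-regenerate step (using the usual discount d and concentration a, with a base_sample callback standing in for the underlying PCFG; none of this is the paper's code):

    import random

    def pitman_yor_draw(cache, d, a, base_sample):
        """cache: list of (item, count) pairs for previously generated items.
        Reuses a cached item with probability (count - d)/(n + a), otherwise
        draws a fresh item from the base distribution."""
        n = sum(c for _, c in cache)
        r = random.uniform(0, n + a)
        for i, (item, c) in enumerate(cache):
            r -= c - d
            if r < 0:
                cache[i] = (item, c + 1)
                return item
        item = base_sample()          # prob. (a + d * len(cache)) / (n + a)
        cache.append((item, 1))
        return item

The rich-get-richer behaviour of this cache is what lets frequently reused subtrees behave like memorized units.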

292 citations


Journal ArticleDOI
TL;DR: This article describes in detail an n-gram approach to statistical machine translation that consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions.
Abstract: This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which is at the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).
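
In schematic form (symbols chosen here for illustration, not copied from the article), the decoder selects the target sentence maximizing a log-linear combination of feature functions, where the translation model itself is an n-gram model over the sentence pair's sequence of bilingual tuples (s, t)_k:

\[ \hat{t} \;=\; \arg\max_{t}\; \sum_{m} \lambda_m\, h_m(s, t), \qquad h_{\mathrm{TM}}(s, t) \;=\; \log \prod_{k=1}^{K} p\big((s,t)_k \,\big|\, (s,t)_{k-n+1}, \ldots, (s,t)_{k-1}\big). \]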

285 citations


Proceedings ArticleDOI
22 Oct 2006
TL;DR: This work proposes a generative solution based on a DSL called TCS (Textual Concrete Syntax), which is used to automatically generate tools for model-to-text and text-to-model transformations.
Abstract: Domain modeling promotes the description of various facets of information systems by a coordinated set of domain-specific languages (DSL). Some of them have visual/graphical concrete syntaxes, while others have textual ones. Model Driven Engineering (MDE) helps define the concepts and relations of the domain by way of metamodel elements. For visual languages, it is necessary to establish links between these concepts and relations on one side and visual symbols on the other side. Similarly, with textual languages it is necessary to establish links between metamodel elements and syntactic structures of the textual DSL. To successfully apply MDE in a wide range of domains, we need tools for fast implementation of the expected growing number of DSLs. Regarding the textual syntax of DSLs, we believe that most current proposals for bridging the world of models (MDE) and the world of grammars (Grammarware) are not completely adapted to this need. We propose a generative solution based on a DSL called TCS (Textual Concrete Syntax). Specifications expressed in TCS are used to automatically generate tools for model-to-text and text-to-model transformations. The proposed approach is illustrated by a case study in the definition of a telephony language.

270 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: A novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora by analyzing potentially similar sentence pairs using a signal processing-inspired approach, which enables it to extract useful machine translation training data even from very non-parallel corpora, which contain no parallel sentence pairs.
Abstract: We present a novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora. By analyzing potentially similar sentence pairs using a signal processing-inspired approach, we detect which segments of the source sentence are translated into segments in the target sentence, and which are not. This method enables us to extract useful machine translation training data even from very non-parallel corpora, which contain no parallel sentence pairs. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system.
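
A toy sketch of the signal-processing flavour of the method (the lexicon format, scoring threshold, and filter width are assumptions): score each source word positively if a probabilistic lexicon links it to some word in the target sentence and negatively otherwise, smooth the resulting signal with a small averaging filter, and keep maximal runs where the smoothed signal stays positive.

    def extract_fragments(src_tokens, tgt_tokens, lexicon, width=2):
        """lexicon: dict mapping (src_word, tgt_word) -> translation probability."""
        # +1 where a source word has a plausible translation in the target
        # sentence, -1 where it does not.
        signal = [1.0 if any(lexicon.get((s, t), 0.0) > 0.1 for t in tgt_tokens)
                  else -1.0 for s in src_tokens]
        # Moving average, so isolated hits or misses do not start or break a run.
        smooth = [sum(signal[max(0, i - width):i + width + 1]) /
                  len(signal[max(0, i - width):i + width + 1])
                  for i in range(len(signal))]
        # Maximal runs of source words whose smoothed score stays positive.
        fragments, current = [], []
        for tok, score in zip(src_tokens, smooth):
            if score > 0:
                current.append(tok)
            elif current:
                fragments.append(current)
                current = []
        if current:
            fragments.append(current)
        return fragments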

207 citations


Proceedings Article
01 Apr 2006
TL;DR: A novel method for computing a consensus translation from the outputs of multiple machine translation (MT) systems by voting on a confusion network that produces pairwise word alignments of the original machine translation hypotheses with an enhanced statistical alignment algorithm that explicitly models word reordering.
Abstract: This paper describes a novel method for computing a consensus translation from the outputs of multiple machine translation (MT) systems. The outputs are combined and a possibly new translation hypothesis can be generated. Similarly to the well-established ROVER approach of (Fiscus, 1997) for combining speech recognition hypotheses, the consensus translation is computed by voting on a confusion network. To create the confusion network, we produce pairwise word alignments of the original machine translation hypotheses with an enhanced statistical alignment algorithm that explicitly models word reordering. The context of a whole document of translations rather than a single sentence is taken into account to produce the alignment. The proposed alignment and voting approach was evaluated on several machine translation tasks, including a large vocabulary task. The method was also tested in the framework of multi-source and speech translation. On all tasks and conditions, we achieved significant improvements in translation quality, increasing e.g. the BLEU score by as much as 15% relative.
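
Once the pairwise alignments have been computed (the hard part, not reproduced here), the voting step itself is simple. A sketch assuming the hypotheses have already been aligned into confusion-network columns, with None standing for an empty (epsilon) arc:

    from collections import Counter

    def vote(columns):
        """columns: one list per aligned position, each holding the word
        (or None) contributed by every system hypothesis at that position."""
        consensus = []
        for slot in columns:
            word, _ = Counter(slot).most_common(1)[0]
            if word is not None:          # an epsilon majority drops the slot
                consensus.append(word)
        return consensus

    # Three aligned hypotheses for a toy example:
    cols = [["the", "the", "a"], ["house", "home", "house"], [None, "is", None]]
    print(vote(cols))                     # ['the', 'house']

A weighted variant could scale each hypothesis's vote by a per-system weight rather than counting all systems equally.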

193 citations


Proceedings ArticleDOI
04 Jun 2006
TL;DR: A linear-time algorithm for factoring syntactic re-orderings by binarizing synchronous rules when possible is devised and it is shown that the resulting rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system.
Abstract: Systems based on synchronous grammars and tree transducers promise to improve the quality of statistical machine translation output, but are often very computationally intensive. The complexity is exponential in the size of individual grammar rules due to arbitrary re-orderings between the two languages, and rules extracted from parallel corpora can be quite large. We devise a linear-time algorithm for factoring syntactic re-orderings by binarizing synchronous rules when possible and show that the resulting rule set significantly improves the speed and accuracy of a state-of-the-art syntax-based machine translation system.
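
The heart of such factoring can be illustrated with a shift-reduce-style check on the permutation of a rule's nonterminals (a simplified sketch, not the authors' full linear-time algorithm): push target-side positions onto a stack and merge the top two spans whenever they form a contiguous interval; the rule can be binarized exactly when everything collapses into a single span.

    def binarizable(perm):
        """perm[i] = target-side position of the i-th source-side nonterminal.
        Spans are kept as (lo, hi) intervals over target positions."""
        stack = []
        for p in perm:
            stack.append((p, p))
            # Merge while the two topmost spans form one contiguous interval.
            while len(stack) >= 2:
                (lo2, hi2), (lo1, hi1) = stack[-2], stack[-1]
                lo, hi = min(lo1, lo2), max(hi1, hi2)
                if hi - lo + 1 == (hi1 - lo1 + 1) + (hi2 - lo2 + 1):
                    stack[-2:] = [(lo, hi)]
                else:
                    break
        return len(stack) == 1

    print(binarizable([3, 1, 2]))     # True: can be factored into binary rules
    print(binarizable([2, 4, 1, 3]))  # False: inherently non-binarizable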

162 citations


Patent
12 Dec 2006
TL;DR: In this article, a Hybrid Distributed Network Language Translation (HDNLT) system is described, where a distributed network of human and machine translators communicate electronically and provide for the translation of material in source language.
Abstract: A Hybrid Distributed Network Language Translation (HDNLT) system is disclosed, having a distributed network of human and machine translators that communicate electronically and provide for the translation of material in a source language. Individual translators receive a reputation that reflects their translation competency, reliability and accuracy. An individual translator's reputation is adjusted dynamically with feedback from other translators and/or comparison of their translation results to translations from those with known high reputation and to the final translation results. Additionally, translations are produced statistically, first by breaking the input source text into fragments, then sending each fragment redundantly to a number of translators with varying levels of reputation. The results of these translations are assembled taking into account (giving weight to) the reputation of each translator, the statistical properties of the translation results, the statistical correlation of preferred results to target language fragments, the properties of the particular language and other relevant factors.
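
As a toy illustration of the reputation-weighted assembly described in the claims (the data structures and numbers are hypothetical, not taken from the patent), candidate translations of a fragment can be scored by summing the reputations of the translators who produced them:

    from collections import defaultdict

    def assemble_fragment(candidates):
        """candidates: list of (translation_text, translator_reputation) pairs
        returned for one source-text fragment."""
        scores = defaultdict(float)
        for text, reputation in candidates:
            scores[text] += reputation
        return max(scores, key=scores.get)

    print(assemble_fragment([("the red house", 0.9),
                             ("the house red", 0.3),
                             ("the red house", 0.6)]))   # 'the red house'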

Proceedings ArticleDOI
17 Jul 2006
TL;DR: This work proposes to use a new statistical language model that is based on a continuous representation of the words in the vocabulary, which achieves consistent improvements in the BLEU score on the development and test data.
Abstract: Statistical machine translation systems are based on one or more translation models and a language model of the target language. While many different translation models and phrase extraction algorithms have been proposed, a standard word n-gram back-off language model is used in most systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform the projection and the probability estimation. We consider the translation of European Parliament Speeches. This task is part of an international evaluation organized by the TC-STAR project in 2006. The proposed method achieves consistent improvements in the BLEU score on the development and test data. We also present algorithms to improve the estimation of the language model probabilities when splitting long sentences into shorter chunks.
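
A minimal numpy sketch of a continuous-space language model in this general family (layer sizes, initialization, and the forward pass are assumptions; the paper's actual network and training procedure are not reproduced): the n-1 context words are mapped to continuous vectors, projected through a hidden layer, and a softmax yields the probability of the next word.

    import numpy as np

    rng = np.random.default_rng(0)
    V, dim, hidden, n = 1000, 50, 100, 4        # vocabulary, embedding, hidden, order

    C = rng.normal(0, 0.1, (V, dim))            # continuous word representations
    H = rng.normal(0, 0.1, ((n - 1) * dim, hidden))
    U = rng.normal(0, 0.1, (hidden, V))

    def next_word_probs(context_ids):
        """context_ids: the n-1 preceding word ids."""
        x = np.concatenate([C[i] for i in context_ids])   # projection layer
        h = np.tanh(x @ H)                                # hidden layer
        logits = h @ U
        e = np.exp(logits - logits.max())
        return e / e.sum()                                # softmax over the vocabulary

    p = next_word_probs([12, 7, 430])
    print(p.shape, float(p[42]))                          # P(word 42 | context)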

Proceedings ArticleDOI
Yaser Al-Onaizan1, Kishore Papineni1
17 Jul 2006
TL;DR: A new distortion model is proposed that can be used with existing phrase-based SMT decoders to address n-gram language model limitations and a novel metric to measure word order similarity (or difference) between any pair of languages based on word alignments is proposed.
Abstract: In this paper, we argue that n-gram language models are not sufficient to address word reordering required for Machine Translation. We propose a new distortion model that can be used with existing phrase-based SMT decoders to address those n-gram language model limitations. We present empirical results in Arabic to English Machine Translation that show statistically significant improvements when our proposed model is used. We also propose a novel metric to measure word order similarity (or difference) between any pair of languages based on word alignments.
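
The abstract does not spell the metric out; as one plausible, clearly hypothetical instantiation of an alignment-based word-order measure, a Kendall-tau-style score over aligned word positions can be computed as follows:

    from itertools import combinations

    def word_order_similarity(alignment):
        """alignment: list of (src_pos, tgt_pos) links, one-to-one for simplicity.
        Returns the fraction of concordant position pairs (1.0 = monotone order)."""
        pairs = list(combinations(alignment, 2))
        if not pairs:
            return 1.0
        concordant = sum(1 for (s1, t1), (s2, t2) in pairs
                         if (s1 - s2) * (t1 - t2) > 0)
        return concordant / len(pairs)

    print(word_order_similarity([(0, 0), (1, 1), (2, 2)]))   # 1.0, monotone
    print(word_order_similarity([(0, 2), (1, 1), (2, 0)]))   # 0.0, fully inverted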

Proceedings ArticleDOI
17 Jul 2006
TL;DR: A semi-supervised approach to training for statistical machine translation that alternates the traditional Expectation Maximization step that is applied on a large training corpus with a discriminative step aimed at increasing word-alignment quality on a small, manually word-aligned sub-corpus is introduced.
Abstract: We introduce a semi-supervised approach to training for statistical machine translation that alternates the traditional Expectation Maximization step that is applied on a large training corpus with a discriminative step aimed at increasing word-alignment quality on a small, manually word-aligned sub-corpus. We show that our algorithm leads not only to improved alignments but also to machine translation outputs of higher quality.

Proceedings ArticleDOI
04 Nov 2006
TL;DR: Quantitative results combined with interview data show that lexical entrainment was disrupted in machine translation-mediated communication because echoing breaks down under asymmetries in the machine translations, and that the process of shortening referring expressions is also disrupted.
Abstract: Even though multilingual communities that use machine translation to overcome language barriers are increasing, we still lack a complete understanding of how machine translation affects communication. In this study, eight pairs from three different language communities--China, Korea, and Japan--worked on referential tasks in their shared second language (English) and in their native languages using a machine translation embedded chat system. Drawing upon prior research, we predicted differences in conversational efficiency and content, and in the shortening of referring expressions over trials. Quantitative results combined with interview data show that lexical entrainment was disrupted in machine translation-mediated communication because echoing is disrupted by asymmetries in machine translations. In addition, the process of shortening referring expressions is also disrupted because the translations do not translate the same terms consistently throughout the conversation. To support natural referring behavior in machine translation-mediated communication, we need to resolve asymmetries and inconsistencies caused by machine translations.

Journal ArticleDOI
TL;DR: The ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages, uses a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations.
Abstract: In this paper, we describe the ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages (Japanese and Chinese). There are three main modules of our S2ST system: large-vocabulary continuous speech recognition, machine text-to-text (T2T) translation, and text-to-speech synthesis. All of them are multilingual and are designed using state-of-the-art technologies developed at ATR. A corpus-based statistical machine learning framework forms the basis of our system design. We use a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations. Recent evaluation of the overall system showed that speech-to-speech translation quality is high, being at the level of a person having a Test of English for International Communication (TOEIC) score of 750 out of the perfect score of 990.

Journal ArticleDOI
TL;DR: A new fusion approach for multi-granularity linguistic information is presented for managing information assessed in different linguistic term sets with different granularity and/or semantics.

Journal ArticleDOI
TL;DR: In four experiments, translators or bilinguals read sentences for repetition or for translation, and a pattern of results provides support for horizontal theories of translation.


Patent
04 Dec 2006
TL;DR: In this paper, the authors present modular speech-to-speech translation systems and methods that provide adaptable platforms to enable verbal communication between speakers of different languages within the context of specific domains.
Abstract: The present invention discloses modular speech-to-speech translation systems and methods that provide adaptable platforms to enable verbal communication between speakers of different languages within the context of specific domains. The components of the preferred embodiments of the present invention include: (1) speech recognition; (2) machine translation; (3) an N-best merging module; (4) verification; and (5) text-to-speech. The speech recognition module is structured to provide N-best selections and multi-stream processing, where multiple speech recognition engines may be active at any one time. The N-best lists from the one or more speech recognition engines may be handled either separately or collectively to improve both recognition and translation results. A merge module is responsible for integrating the N-best outputs of the translation engines along with confidence/translation scores to create a ranked list of recognition-translation pairs.

Proceedings ArticleDOI
22 Jul 2006
TL;DR: ParaEval is presented, an automatic evaluation framework that uses paraphrases to improve the quality of machine translation evaluations and correlates significantly better than BLEU with human assessment in measurements for both fluency and adequacy.
Abstract: In this paper, we present ParaEval, an automatic evaluation framework that uses paraphrases to improve the quality of machine translation evaluations. Previous work has focused on fixed n-gram evaluation metrics coupled with lexical identity matching. ParaEval addresses three important issues: support for paraphrase/synonym matching, recall measurement, and correlation with human judgments. We show that ParaEval correlates significantly better than BLEU with human assessment in measurements for both fluency and adequacy.
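
A toy sketch of the two-tier matching idea (the word-level paraphrase table and the greedy strategy are assumptions, not ParaEval's actual resources or algorithm): count reference words matched either directly or through a paraphrase table, giving a recall-oriented score.

    from collections import Counter

    def paraphrase_recall(candidate, reference, paraphrases):
        """paraphrases: dict mapping a word to a set of acceptable substitutes."""
        cand = Counter(candidate)
        matched = 0
        for word in reference:
            options = {word} | paraphrases.get(word, set())
            hit = next((w for w in options if cand[w] > 0), None)
            if hit is not None:
                cand[hit] -= 1          # consume the candidate word once
                matched += 1
        return matched / max(len(reference), 1)

    table = {"home": {"house"}, "large": {"big"}}
    print(paraphrase_recall("the big house".split(),
                            "the large home".split(), table))   # 1.0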

Proceedings Article
01 May 2006
TL;DR: This work investigates new possibilities for improving the quality of statistical machine translation (SMT) by applying word reorderings of the source language sentences based on Part-of-Speech tags, proposing two types of reordering depending on the language pair and the translation direction: local reorderings of nouns and adjectives for translation from and into Spanish, and long-range reorderings of verbs for translation into German.
Abstract: In this work, we investigate new possibilities for improving the quality of statistical machine translation (SMT) by applying word reorderings of the source language sentences based on Part-of-Speech tags. Results are presented on the European Parliament corpus containing about 700k sentences and 15M running words. In order to investigate sparse training data scenarios, we also report results obtained on about 1% of the original corpus. The source languages are Spanish and English and the target languages are Spanish, English and German. We propose two types of reorderings depending on the language pair and the translation direction: local reorderings of nouns and adjectives for translation from and into Spanish, and long-range reorderings of verbs for translation into German. For our best translation system, we achieve up to 2% relative reduction of WER and up to 7% relative increase of BLEU score. Improvements can be seen both on the reordered sentences and on the rest of the test corpus. Local reorderings are especially important for the translation systems trained on the small corpus, whereas long-range reorderings are more effective for the larger corpus.
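
As a concrete flavour of the local reorderings used for Spanish (the tag names and the single noun-adjective rule are simplifying assumptions; the paper's reorderings are defined over the POS sequences of its actual tagger), a sketch that swaps a noun followed by an adjective into English-like adjective-noun order:

    def reorder_noun_adjective(tokens, tags):
        """Swap NOUN ADJ -> ADJ NOUN so that Spanish source order better
        matches English target order, e.g. 'casa blanca' -> 'blanca casa'."""
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tags[i] == "NOUN" and tags[i + 1] == "ADJ":
                out += [tokens[i + 1], tokens[i]]
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(reorder_noun_adjective(["la", "casa", "blanca"],
                                 ["DET", "NOUN", "ADJ"]))   # ['la', 'blanca', 'casa']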

Proceedings ArticleDOI
17 Jul 2006
TL;DR: This paper studies the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation and presents and evaluates different methods for combining preprocessing schemes resulting in improved translation quality.
Abstract: Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes resulting in improved translation quality.

Patent
08 May 2006
TL;DR: In this article, a system and method for translating data from a source language to a target language is provided wherein machine generated target translation of a source sentence is compared to a database of human generated target sentences.
Abstract: A system and method for translating data from a source language to a target language is provided wherein a machine-generated target translation of a source sentence is compared to a database of human-generated target sentences. If a matching human-generated target sentence is found, the human-generated target sentence may be used instead of the machine-generated sentence, since the human-generated target sentence is more likely to be a well-formed sentence than the machine-generated sentence. The system and method does not rely on a translation memory containing pairs of sentences in both source and target languages, and minimizes the reliance on a human translator to correct a translation generated by machine translation.

01 Jan 2006
TL;DR: With such an automatically-generated dictionary, the Example-Based Machine Translation system covers more of its input on unseen texts than the same system does when provided with a manually-created general-purpose dictionary and other knowledge sources.
Abstract: An Example-Based Machine Translation system is supplied with a sentence-aligned bilingual corpus, but no other knowledge sources. Using the knowledge implicit in the corpus, it generates a bilingual word-for-word dictionary for alignment during translation. With such an automatically-generated dictionary, the system covers (with equivalent quality) more of its input on unseen texts than the same system does when provided with a manually-created general-purpose dictionary and other knowledge sources.
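
One simple way to picture the automatic dictionary construction (the co-occurrence counting and one-best choice are generic assumptions, not the system's actual method): count how often each source word co-occurs with each target word across aligned sentence pairs and keep the most frequent pairing.

    from collections import defaultdict, Counter

    def build_dictionary(sentence_pairs):
        """sentence_pairs: iterable of (src_tokens, tgt_tokens) pairs taken
        from a sentence-aligned bilingual corpus."""
        cooc = defaultdict(Counter)
        for src, tgt in sentence_pairs:
            for s in set(src):
                cooc[s].update(set(tgt))
        # Word-for-word dictionary: the most frequently co-occurring target word.
        return {s: counts.most_common(1)[0][0] for s, counts in cooc.items()}

    pairs = [("das haus".split(), "the house".split()),
             ("das auto".split(), "the car".split())]
    print(build_dictionary(pairs))

A real system would need to discount frequent function words (in this toy, "the" co-occurs with everything); the snippet only shows where the dictionary comes from.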

01 Jan 2006
TL;DR: A modest investment of time on the order of two person-weeks adding linguistic knowledge reduces the required example text by a factor of six or more, while retaining comparable translation quality, which makes EBMT more attractive for so-called "low-density" languages for which little data is available.
Abstract: Example-Based Machine Translation (EBMT) using partial exact matching against a database of translation examples has proven quite successful, but requires a large amount of pre-translated text in order to achieve broad coverage of unrestricted text. By adding linguistically tagged entries to the example base and permitting recursive matches that replace the matched text with the associated tag, substantial reductions in the required amount of pre-translated text can be achieved. A modest investment of time on the order of two person-weeks adding linguistic knowledge reduces the required example text by a factor of six or more, while retaining comparable translation quality. This reduction makes EBMT more attractive for so-called "low-density" languages for which little data is available.

Patent
10 Oct 2006
TL;DR: In this article, a method and computer system for analyzing sentences of various languages and constructing a language-independent semantic structure are provided, and exhaustive linguistic descriptions are created, and lexical, morphological, syntactic, and semantic analyses for one or more sentences of a natural or artificial language are performed.
Abstract: A method and computer system for analyzing sentences of various languages and constructing a language-independent semantic structure are provided. On the basis of comprehensive knowledge about languages and semantics, exhaustive linguistic descriptions are created, and lexical, morphological, syntactic, and semantic analyses for one or more sentences of a natural or artificial language are performed. A computer system is also provided to implement, analyze and store various linguistic structures and to perform lexical, morphological, syntactic, and semantic analyses. As a result, a generalized data structure, such as a semantic structure, is generated and used to describe the meaning of one or more sentences in language-independent form, applicable to automated abstracting, machine translation, control systems, Internet information retrieval, etc.

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work proposes to use attribute grammars for recognizing normal events and detecting abnormal events in a video using an extension of the Earley parser that handles attributes and concurrent event threads.
Abstract: We propose to use attribute grammars for recognizing normal events and detecting abnormal events in a video. Attribute grammars can describe constraints on features (attributes) in addition to the syntactic structure of the input. Events are recognized using an extension of the Earley parser that handles attributes and concurrent event threads. Abnormal events are detected when the input does not follow the syntax of the grammar or the attributes do not satisfy the constraints in the attribute grammar to some degree. We demonstrate the effectiveness of our method for the task of recognizing normal events and detecting anomalies in a parking lot.

Proceedings Article
01 Apr 2006
TL;DR: A backoff model for phrase-based machine translation that translates unseen word forms in foreign-language text by hierarchical morphological abstractions at the word and the phrase level is proposed.
Abstract: We propose a backoff model for phrase-based machine translation that translates unseen word forms in foreign-language text by hierarchical morphological abstractions at the word and the phrase level. The model is evaluated on the Europarl corpus for German-English and Finnish-English translation and shows improvements over state-of-the-art phrase-based models.
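
The backoff idea can be sketched as trying progressively more abstract forms of an unseen source word until one is found in the phrase table (the abstraction ladder and the toy stemmer are assumptions; the paper applies hierarchical morphological abstractions at both the word and the phrase level):

    def translate_with_backoff(word, phrase_table, stem, lemma):
        """Try the full form first, then a stemmed form, then a lemma;
        finally pass the word through untranslated."""
        for form in (word, stem(word), lemma(word)):
            if form in phrase_table:
                return phrase_table[form]
        return word

    # Toy German example with trivial 'morphology':
    table = {"haus": "house"}
    print(translate_with_backoff("hauses", table,
                                 stem=lambda w: w[:-2] if w.endswith("es") else w,
                                 lemma=lambda w: w))    # 'house'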

Proceedings ArticleDOI
22 Jul 2006
TL;DR: Statistical machine reordering (SMR) consists in using the powerful techniques developed for statistical machine translation to translate the source language into a reordered source language (S'), which allows for an improved translation into the target language (T).
Abstract: Reordering is currently one of the most important problems in statistical machine translation systems. This paper presents a novel strategy for dealing with it: statistical machine reordering (SMR). It consists in using the powerful techniques developed for statistical machine translation (SMT) to translate the source language (S) into a reordered source language (S'), which allows for an improved translation into the target language (T). The SMT task changes from S2T to S'2T which leads to a monotonized word alignment and shorter translation units. In addition, the use of classes in SMR helps to infer new word reorderings. Experiments are reported in the EsEn WMT06 tasks and the ZhEn IWSLT05 task and show significant improvement in translation quality.
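
The S-to-S' idea can be pictured through how training data for the reordering step could be derived (a sketch assuming word alignments are available; the paper's class-based modelling is not shown): rearrange each source sentence into the order of its aligned target words, then train an ordinary SMT system to map original source sentences to these reordered ones.

    def monotonize_source(src_tokens, alignment):
        """alignment: list of (src_pos, tgt_pos) links. Returns the source words
        rearranged into target order; unaligned words are appended at the end."""
        reordered = [src_tokens[s] for s, _ in sorted(alignment, key=lambda a: a[1])]
        aligned_src = {s for s, _ in alignment}
        unaligned = [w for i, w in enumerate(src_tokens) if i not in aligned_src]
        return reordered + unaligned

    # "casa blanca" (house white) aligned to English "white house":
    print(monotonize_source(["casa", "blanca"], [(0, 1), (1, 0)]))
    # -> ['blanca', 'casa'], i.e. the S' side of an S -> S' training pair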