
Showing papers on "Rule-based machine translation published in 2003"


Book
01 Jan 2003
TL;DR: This book provides the first comprehensive introduction to Grammatical Evolution, a novel approach to Genetic Programming that adopts principles from molecular biology in a simple and useful manner, coupled with the use of grammars to specify legal structures in a search.
Abstract: From the Publisher: Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language provides the first comprehensive introduction to Grammatical Evolution, a novel approach to Genetic Programming that adopts principles from molecular biology in a simple and useful manner, coupled with the use of grammars to specify legal structures in a search. Grammatical Evolution's rich modularity gives a unique flexibility, making it possible to use alternative search strategies - whether evolutionary, deterministic, or some other approach - and to radically change its behavior by merely changing the grammar supplied. This approach to Genetic Programming represents a powerful new weapon in the Machine Learning toolkit that can be applied to a diverse set of problem domains.
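
To make the core mechanism concrete, here is a minimal Python sketch of Grammatical Evolution's genotype-to-phenotype mapping, in which integer codons select grammar productions modulo the number of alternatives. The toy grammar and codon values are invented for illustration and are not taken from the book.

```python
# Minimal sketch of Grammatical Evolution's genotype-to-phenotype mapping.
# The grammar and the codon values below are illustrative toys.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"], ["*"]],
    "<var>":  [["x"], ["1"]],
}

def derive(codons, start="<expr>", max_steps=100):
    """Expand the leftmost nonterminal repeatedly, choosing each production
    as codon % (number of alternatives) and wrapping around the genome."""
    seq, i = [start], 0
    for _ in range(max_steps):
        nt = next((s for s in seq if s in GRAMMAR), None)
        if nt is None:                    # fully terminal: derivation done
            return " ".join(seq)
        alts = GRAMMAR[nt]
        choice = alts[codons[i % len(codons)] % len(alts)]
        i += 1
        pos = seq.index(nt)
        seq = seq[:pos] + list(choice) + seq[pos + 1:]
    return None                           # step limit hit: invalid individual

print(derive([6, 3, 12, 4, 9, 2]))        # -> "x - x" with these codons
```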

621 citations


Patent
14 Nov 2003
TL;DR: A system and method for translation of electronic communications automatically selects and deploys specialized dictionaries based upon context recognition and other factors; it can be used to translate electronic mail, instant messages, chat, SMS messages, electronic text and word processing files, Internet web pages, Internet search results, and other textual communications for a variety of device types, including wireless devices.
Abstract: A system and method for translation of electronic communications automatically selects and deploys specialized dictionaries based upon context recognition and other factors. Software tools can be employed for continual dictionary enhancement. The invention can accept speech and text inputs and can be used to translate electronic mail, instant messages, chat, SMS messages, electronic text and word processing files, Internet web pages, Internet search results, and other textual communications for a variety of device types, including wireless devices. In one embodiment, language pairs are automatically determined in real-time.

225 citations


Dissertation
01 Feb 2003
TL;DR: The Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts and has been shown to be applicable in the comparison of UK 2001 general election manifestos, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis.
Abstract: Matrix: A statistical method and software tool for linguistic analysis through corpus comparison A thesis submitted to Lancaster University for the degree of Ph.D. in Computer Science Paul Edward Rayson, B.Sc. September 2002 This thesis reports the development of a new kind of method and tool (Matrix) for advancing the statistical analysis of electronic corpora of linguistic data. First, we describe the standard corpus linguistic methodology, which is hypothesis-driven. The standard research process model is ‘question – build – annotate – retrieve – interpret’, in other words, identifying the research question (and the linguistic features) early in the study. In recent years corpora have been increasingly annotated with linguistic information. From our survey, we find that no tools are available which are data-driven on annotated corpora, in other words, tools which assist in finding candidate research questions. However, Matrix is such a tool. It allows the macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) as to which linguistic features should be investigated further. By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of UK 2001 general election manifestos of the Labour and Liberal Democratic parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. Currently, it has been tested on restricted levels of annotation and only on English language data.
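
The keywords procedure that Matrix extends is conventionally computed with a log-likelihood statistic comparing an item's frequency in two corpora, and the same statistic applies unchanged to POS tags or semantic tags. Below is a minimal sketch assuming the standard G2 formulation from corpus linguistics; the counts are invented.

```python
import math

def log_likelihood(freq1, total1, freq2, total2):
    """Log-likelihood (G2) keyness of an item seen freq1 times in a corpus
    of total1 tokens versus freq2 times in a corpus of total2 tokens."""
    e1 = total1 * (freq1 + freq2) / (total1 + total2)  # expected count, corpus 1
    e2 = total2 * (freq1 + freq2) / (total1 + total2)  # expected count, corpus 2
    g2 = 0.0
    if freq1:
        g2 += freq1 * math.log(freq1 / e1)
    if freq2:
        g2 += freq2 * math.log(freq2 / e2)
    return 2 * g2

# Toy comparison: a grammatical category tagged 120 times in a 10,000-token
# corpus versus 60 times in a 12,000-token corpus.
print(round(log_likelihood(120, 10_000, 60, 12_000), 2))
```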

219 citations


Journal ArticleDOI
TL;DR: An IRS based on fuzzy multi-granular linguistic information and a method to process the multi-granular linguistic information are proposed, and the system accepts Boolean queries whose terms can be simultaneously weighted by means of ordinal linguistic values according to three semantics.

137 citations


Proceedings ArticleDOI
18 Jun 2003
TL;DR: Stochastic grammars are extended by adding event parameters, state checks, and sensitivity to an internal scene model to recognize a person performing the Towers of Hanoi task from a video sequence by analyzing object interaction events.
Abstract: Video-based recognition and prediction of a temporally extended activity can benefit from a detailed description of high-level expectations about the activity. Stochastic grammars allow for an efficient representation of such expectations and are well-suited for the specification of temporally well-ordered activities. In this paper, we extend stochastic grammars by adding event parameters, state checks, and sensitivity to an internal scene model. We present an implemented system that uses human-specified grammars to recognize a person performing the Towers of Hanoi task from a video sequence by analyzing object interaction events. Experimental results from several videos show robust recognition of the full task and its constituent sub-tasks even though no appearance models of the objects in the video are provided. These experiments include videos of the task performed with different shaped objects and with distracting and extraneous interactions.

129 citations


Journal ArticleDOI
TL;DR: In this article, the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process were investigated, and the results showed that the Web-based translation models can surpass commercial MT systems in CLIR tasks.
Abstract: Although more and more language pairs are covered by machine translation (MT) services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag of words. The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
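
One straightforward way to embed such a translation model in a bag-of-words retrieval model is to expand each query term into its most probable target-language translations, weighted by translation probability. The sketch below assumes a hypothetical probabilistic lexicon of the kind trained from Web-mined parallel text; the entries, weights, and scoring are invented for illustration.

```python
# Hypothetical lexicon p(target_word | source_word), as might be trained
# from Web-mined parallel text; all values here are invented.
LEXICON = {
    "maison": {"house": 0.7, "home": 0.25, "mansion": 0.05},
    "blanche": {"white": 0.9, "blank": 0.1},
}

def translate_query(source_terms, top_k=2):
    """Expand each source term into its top-k weighted translations."""
    weighted = {}
    for s in source_terms:
        ranked = sorted(LEXICON.get(s, {}).items(), key=lambda kv: -kv[1])
        for t, p in ranked[:top_k]:
            weighted[t] = weighted.get(t, 0.0) + p
    return weighted

def score(doc_tokens, weighted_query):
    """Bag-of-words score: translation-probability-weighted term frequency."""
    return sum(weighted_query.get(tok, 0.0) for tok in doc_tokens)

q = translate_query(["maison", "blanche"])
print(q, score("the white house is a big house".split(), q))
```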

128 citations


Proceedings ArticleDOI
16 Sep 2003
TL;DR: The results show that the system is comparable to existing SLU systems which rely on either handcrafted semantic grammar rules or statistical models trained on fully-annotated training corpora, but it has greatly reduced build cost.
Abstract: The paper presents a purely data-driven spoken language understanding (SLU) system. It consists of three major components, a speech recognizer, a semantic parser, and a dialog act decoder. A novel feature of the system is that the understanding components are trained directly from data without using explicit semantic grammar rules or fully-annotated corpus data. Despite this, the system is nevertheless able to capture hierarchical structure in user utterances and handle long range dependencies. Experiments have been conducted on the ATIS corpus and 16.1% and 12.6% utterance understanding error rates were obtained for spoken input using the ATIS-3 1993 and 1994 test sets. These results show that our system is comparable to existing SLU systems which rely on either handcrafted semantic grammar rules or statistical models trained on fully-annotated training corpora, but it has greatly reduced build cost.

127 citations


Proceedings ArticleDOI
07 Jul 2003
TL;DR: A dedicated noun phrase translation subsystem is built that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features.
Abstract: We define noun phrase translation as a subtask of machine translation This enables us to build a dedicated noun phrase translation subsystem that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features We achieved 655% translation accuracy in a German-English translation task vs 532% with IBM Model 4

117 citations


Journal ArticleDOI
TL;DR: The results are important in the sense that using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
Abstract: This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined this model with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56%, according to the MUC evaluation criteria. Our results are important in the sense that using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
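
A standard way to combine two such knowledge sources is linear interpolation of the component model scores; the sketch below uses a hypothetical weight, and the paper's actual combination scheme may differ.

```python
def interpolate(p_lexical, p_morphological, lam=0.6):
    """Linear interpolation of a lexical and a morphological model estimate.
    The weight lam is a hypothetical tuning value."""
    return lam * p_lexical + (1 - lam) * p_morphological

# Combined P(sentence boundary | context) from the two component models:
print(interpolate(p_lexical=0.30, p_morphological=0.55))  # 0.6*0.30 + 0.4*0.55 = 0.40
```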

111 citations


Proceedings ArticleDOI
19 Nov 2003
TL;DR: The authors formulate the semantic parsing problem as a classification problem using support vector machines and use a hand-labeled training set and a set of features drawn from earlier work together with some feature enhancements.
Abstract: There is an ever-growing need to add structure in the form of semantic markup to the huge amounts of unstructured text data now available. We present the technique of shallow semantic parsing, the process of assigning a simple WHO did WHAT to WHOM, etc., structure to sentences in text, as a useful tool in achieving this goal. We formulate the semantic parsing problem as a classification problem using support vector machines. Using a hand-labeled training set and a set of features drawn from earlier work together with some feature enhancements, we demonstrate a system that performs better than all other published results on shallow semantic parsing.
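
The classification formulation can be sketched with off-the-shelf tools; the snippet below uses scikit-learn with a toy feature set (head word, POS, position, parse path) standing in for the paper's richer features, and invented training examples.

```python
# Shallow semantic parsing as classification: toy sketch with scikit-learn.
# Features and labels are invented stand-ins for the paper's feature set.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = [
    ({"head": "john", "pos": "NNP", "position": "before", "path": "NP-S-VP"}, "WHO"),
    ({"head": "ball", "pos": "NN",  "position": "after",  "path": "NP-VP"},  "WHAT"),
    ({"head": "mary", "pos": "NNP", "position": "after",  "path": "NP-PP-VP"}, "WHOM"),
]
X, y = zip(*train)

model = make_pipeline(DictVectorizer(), LinearSVC())  # one-hot features + linear SVM
model.fit(list(X), list(y))

# Classify a new candidate argument by its features:
print(model.predict([{"head": "sue", "pos": "NNP",
                      "position": "before", "path": "NP-S-VP"}]))
```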

92 citations


Journal ArticleDOI
TL;DR: It is proved that the P systems with context-free rules are computationally universal, able to generate all computable array languages.
Abstract: We consider array languages (sets of pictures consisting of symbols placed in the lattice points of the 2D grid) and the possibility to handle them with P systems. After proving binary normal forms for array matrix grammars (which, even in the case when no appearance checking is used, are known to generate the array languages of arbitrary array grammars), we prove that the P systems with context-free rules (with three membranes and no control on the communication or the use of rules) are computationally universal, able to generate all computable array languages. Some open problems are also formulated.

Proceedings ArticleDOI
26 Oct 2003
TL;DR: A decoder for statistical machine translation is described which allows controlled reordering of the words generated in the target language, and the effect of the length of the reordering window on the search space and the translation quality is analyzed.
Abstract: We describe a decoder for statistical machine translation which allows controlled reordering of the words generated in the target language. After a general discussion of the structure of a decoder, a particular implementation is discussed which allows for word-to-word and phrase-to-phrase translation. Word reordering is used to improve the translation quality. We analyze the effect of the length of this reordering window on the search space and the translation quality. Results for Chinese-to-English and Arabic-to-English translation tasks are presented.
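
The reordering window can be made concrete by checking which source-position visit orders it admits. The sketch below assumes one common formalization, that the next position translated must lie within a fixed distance of the leftmost still-uncovered position; the paper's exact constraint may differ.

```python
from itertools import permutations
from math import factorial

def within_window(order, window):
    """True if every chosen position is less than `window` places to the
    right of the leftmost not-yet-covered source position."""
    covered = set()
    for pos in order:
        leftmost = min(p for p in range(len(order)) if p not in covered)
        if pos - leftmost >= window:
            return False
        covered.add(pos)
    return True

n, window = 4, 2
allowed = [o for o in permutations(range(n)) if within_window(o, window)]
print(f"{len(allowed)} of {factorial(n)} visit orders survive a window of {window}")
```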

Proceedings ArticleDOI
Christoph Tillmann1
11 Jul 2003
TL;DR: A phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models and has been successfully tested on a Chinese-English and an Arabic-English translation task.
Abstract: In this paper, we describe a phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models. The units of translation are blocks -- pairs of phrases. During decoding, we use a block unigram model and a word-based trigram language model. During training, the blocks are learned from source interval projections using an underlying high-precision word alignment. The system performance is significantly increased by applying a novel block extension algorithm using an additional high-recall word alignment. The blocks are further filtered using unigram-count selection criteria. The system has been successfully tested on a Chinese-English and an Arabic-English translation task.

Journal ArticleDOI
TL;DR: In summary, the formal aspect of CW is more systematically established and more deeply dealt with, while some new problems also emerge.
Abstract: Computing with words (CW), as a methodology, means computing and reasoning by the use of words in place of numbers or symbols, which may conform more to humans' perception when describing real-world problems. In this paper, as a continuation of a previous paper, we aim to develop and deepen a formal aspect of CW. According to the previous paper, the basic point of departure is that CW treats certain formal modes of computation with strings of fuzzy subsets instead of symbols as their inputs. Specifically, 1) we elaborate on CW via Turing machine (TM) models, showing the time complexity is at least exponential if the inputs are strings of words; 2) a negative result, that (6) does not hold, is verified, which indicates that the extension principle for CW via TMs needs to be re-examined; 3) we discuss CW via context-free grammars and regular grammars, and the extension principles for CW via these formal grammars are set up; 4) some equivalences between fuzzy pushdown automata (respectively, fuzzy finite-state automata) and fuzzy context-free grammars (respectively, fuzzy regular grammars) are demonstrated in the sense that the inputs are instead strings of words; 5) some instances are described in detail. In summary, the formal aspect of CW is more systematically established and more deeply dealt with, while some new problems also emerge.
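
For item 4), the behavior of a fuzzy finite-state automaton on a string is standardly computed by max-min composition: the acceptance degree is the maximum over paths of the minimum transition degree along the path. A toy sketch with invented membership degrees:

```python
# Toy fuzzy finite-state automaton: (state, symbol) -> {next_state: degree}.
TRANS = {
    ("q0", "a"): {"q0": 0.9, "q1": 0.4},
    ("q0", "b"): {"q1": 0.6},
    ("q1", "b"): {"q1": 0.8},
}
FINAL = {"q1": 1.0}   # membership degree of each final state

def accept_degree(word, start="q0"):
    """Max-min acceptance degree of `word`."""
    degrees = {start: 1.0}
    for sym in word:
        nxt = {}
        for state, d in degrees.items():
            for s2, t in TRANS.get((state, sym), {}).items():
                nxt[s2] = max(nxt.get(s2, 0.0), min(d, t))
        degrees = nxt
    return max((min(d, FINAL.get(s, 0.0)) for s, d in degrees.items()),
               default=0.0)

print(accept_degree("ab"))   # 0.6 for this toy automaton
```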

Journal ArticleDOI
TL;DR: M-PIRO is a system that generates descriptions of museum objects tailored to the user; its architecture is significantly more modular than that of its predecessor ILEX.
Abstract: The authors describe a system that generates descriptions of museum objects tailored to the user. The texts presented to adults, children, and experts differ in several ways, from the choice of words used to the complexity of the sentence forms. M-PIRO can currently generate text in three languages: English, Greek, and Italian. The grammar resources are as language-independent as possible. M-PIRO's system architecture is significantly more modular than that of its predecessor ILEX. In particular, the linguistic resources, database, and user-modeling subsystems are now separate from the systems that perform the natural language generation and speech synthesis.

Proceedings ArticleDOI
12 Jul 2003
TL;DR: The results of a feasibility study on the ability of memory-based machine translation and word-to-word compositional machine translation to translate Japanese and English noun-noun compounds are described.
Abstract: The translation of compound nouns is a major issue in machine translation due to their frequency of occurrence and high productivity. Various shallow methods have been proposed to translate compound nouns, notable amongst which are memory-based machine translation and word-to-word compositional machine translation. This paper describes the results of a feasibility study on the ability of these methods to translate Japanese and English noun-noun compounds.
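
Word-to-word compositional translation of a noun-noun compound reduces to per-noun dictionary lookup plus target-side ordering templates. The mini-dictionary and the two English templates below are invented assumptions, not the paper's resources; a real system would also rank the candidates, e.g. with a target language model.

```python
# Word-to-word compositional translation of a noun-noun compound (toy).
DICT = {"情報": ["information"], "検索": ["retrieval", "search"]}

def translate_compound(n1, n2):
    """Generate candidate English renderings of the compound n1 + n2."""
    candidates = []
    for t1 in DICT.get(n1, []):
        for t2 in DICT.get(n2, []):
            candidates.append(f"{t1} {t2}")     # "N1 N2" template
            candidates.append(f"{t2} of {t1}")  # "N2 of N1" template
    return candidates

print(translate_compound("情報", "検索"))  # 情報検索 ~ "information retrieval"
```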

Proceedings ArticleDOI
12 Apr 2003
TL;DR: A fully fledged translation system is used to ensure the quality of the proposed extensions of the interactive machine translation system, and word hypotheses graphs are used as an efficient search space representation to achieve fast response times.
Abstract: The goal of interactive machine translation is to improve the productivity of human translators. An interactive machine translation system operates as follows: the automatic system proposes a translation. Now, the human user has two options: to accept the suggestion or to correct it. During the post-editing process, the human user is assisted by the interactive system in the following way: the system suggests an extension of the current translation prefix. Then, the user either accepts this extension (completely or partially) or ignores it. The two most important factors of such an interactive system are the quality of the proposed extensions and the response time. Here, we will use a fully fledged translation system to ensure the quality of the proposed extensions. To achieve fast response times, we will use word hypotheses graphs as an efficient search space representation. We will show results of our approach on the Verbmobil task and on the Canadian Hansards task.
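
The interactive step, proposing an extension of the validated prefix, can be sketched as a best-path search in a word hypotheses graph from the node the prefix leads to. The toy graph and log-probabilities below are invented; real word graphs come from the full translation system.

```python
# Toy word hypotheses graph: node -> [(word, next_node, logprob)].
GRAPH = {
    0: [("the", 1, -0.1), ("a", 1, -0.9)],
    1: [("house", 2, -0.3), ("home", 2, -0.5)],
    2: [("is", 3, -0.2)],
    3: [],   # final node
}

def best_completion(node):
    """Highest-scoring word sequence from `node` to a final node."""
    if not GRAPH[node]:
        return 0.0, []
    options = []
    for word, nxt, lp in GRAPH[node]:
        score, words = best_completion(nxt)
        options.append((lp + score, [word] + words))
    return max(options, key=lambda o: o[0])

def suggest(prefix):
    """Follow the validated prefix through the graph, then extend it."""
    node = 0
    for word in prefix:
        node = next(n for w, n, _ in GRAPH[node] if w == word)
    return best_completion(node)[1]

print(suggest(["the"]))   # -> ['house', 'is']
```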

Book ChapterDOI
01 Jan 2003
TL;DR: A new example-based method of machine translation in which the examples need not be direct translations, allowing the use of currently available sentence-aligned corpora as data.
Abstract: This paper introduces a new example-based method of machine translation in which the examples need not be direct translations. The system will weed out strange examples during translation, allowing the use of currently available sentence-aligned corpora as data. Rule-based modules are used where appropriate. A prototype Japanese-to-English system has been implemented that allows multiple users to share corpora.

01 Jan 2003
TL;DR: This workshop incrementally adds new features representing syntactic knowledge that deal with specific problems of the underlying baseline, and extends previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.
Abstract: In recent evaluations of machine translation systems, statistical systems have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors. This can be attributed to the fact that the syntactic well-formedness is only influenced by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. In the workshop we start with a very strong baseline – the alignment template statistical machine translation system that obtained the best results in the 2002 and 2003 DARPA MT evaluations. This model is based on a log-linear modeling framework, which allows for the easy integration of many different knowledge sources (i.e. feature functions) into an overall model and for discriminative training of the feature function combination weights. During the workshop, we incrementally add new features representing syntactic knowledge that deal with specific problems of the underlying baseline. We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features are derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We also plan to investigate features based on projection of parse trees from one language onto strings of another, a useful technique when parses are available for only one of the two languages. We extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic. We work with the Chinese-English data from the recent evaluations, as large amounts of sentence-aligned training corpora, as well as multiple reference translations, are available. This will also allow us to compare results with the various systems participating in the evaluations. In addition, an annotated Chinese-English parallel tree-bank is available. We evaluate the improvement of our system using the BLEU metric. Using the additional feature functions developed during the workshop, the BLEU score improved from 31.6% for the baseline MT system to 33.2% using rescoring of a 1000-best list.
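
The log-linear framework and the 1000-best rescoring step reduce to scoring each hypothesis as a weighted sum of feature function values and re-ranking. A minimal sketch with invented hypotheses, feature values, and weights:

```python
# Log-linear rescoring of an n-best list: score(e) = sum_i lambda_i * h_i(e).
# Hypotheses, feature values, and weights are invented for illustration.
WEIGHTS = {"lm": 0.5, "tm": 0.4, "syntax": 0.1}

nbest = [
    ("the house is white", {"lm": -4.1, "tm": -3.0, "syntax": -0.2}),
    ("white the house is", {"lm": -6.5, "tm": -2.8, "syntax": -1.7}),
]

def loglinear_score(features):
    return sum(WEIGHTS[name] * value for name, value in features.items())

best = max(nbest, key=lambda hyp: loglinear_score(hyp[1]))
print(best[0])   # the weighted feature sum prefers the grammatical hypothesis
```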

01 Jan 2003
TL;DR: A dedicated noun phrase translation subsystem is built that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features and shows overall improvement in translation quality.
Abstract: We define noun phrase translation as a subtask of statistical machine translation. This enables us to build a dedicated noun phrase translation subsystem that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features. We integrate such a system into a state-of-the-art statistical machine translation system with novel methods and show overall improvement in translation quality. We also carry out empirical linguistic studies on noun phrase translatability and the sources of translation errors.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: A novel two-step fuzzy translation technique for cross-lingual spelling variants that was evaluated empirically using five source languages and English as a target language and performed better, sometimes considerably better, than fuzzy matching alone.
Abstract: We will present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first stage, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second stage, the intermediate forms obtained in the first stage are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The target word list contained 189 000 English words with the correct equivalents for the source words among them. The source words were translated using the two-step fuzzy translation technique, and the results were compared with those of plain fuzzy matching based translation. The combined technique performed better, sometimes considerably better, than fuzzy matching alone.
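
The two stages can be sketched directly: apply learned transformation rules to the source word, then fuzzy-match the intermediate form against the target word list. The rules below are invented examples, and difflib's similarity ratio stands in for the paper's fuzzy matching method.

```python
import difflib

# Stage 1: transformation rules making source spellings more target-like.
# These example rules for a French-like source are invented.
RULES = [("ie", "y"), ("ique", "ic"), ("eur", "or")]

def transform(word):
    for src, dst in RULES:
        word = word.replace(src, dst)
    return word

# Stage 2: fuzzy matching of the intermediate form against target words.
TARGET_WORDS = ["physics", "fizz", "chemistry", "chemical"]

def fuzzy_translate(source_word, k=1):
    intermediate = transform(source_word)
    ranked = sorted(
        TARGET_WORDS,
        key=lambda t: difflib.SequenceMatcher(None, intermediate, t).ratio(),
        reverse=True,
    )
    return intermediate, ranked[:k]

print(fuzzy_translate("chimie"))   # ('chimy', ['chemistry'])
```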

Proceedings ArticleDOI
03 Nov 2003
TL;DR: Although the parser was originally developed for conversational English and made many mistakes in parsing sentences from the biochemical domain, it nevertheless achieved better overall performance than a co-occurrence-only method.
Abstract: Many natural language processing approaches at various complexity levels have been reported for extracting biochemical interactions from MEDLINE. While some algorithms using simple template matching are unable to deal with the complex syntactic structures, others exploiting sophisticated parsing techniques are hindered by greater computational cost. This study investigates link grammar parsing for extracting biochemical interactions. Link grammar parsing can handle many syntactic structures and is computationally relatively efficient. We experimented on a sample MEDLINE corpus. Although the parser was originally developed for conversational English and made many mistakes in parsing sentences from the biochemical domain, it nevertheless achieved better overall performance than a co-occurrence-only method. Customizing the parser for the biomedical domain is expected to improve its performance further.

Book ChapterDOI
01 Jan 2003
TL;DR: It is shown that Example-Based Machine Translation, as long as it is linguistically principled, significantly overlaps with other linguistically principled approaches to Machine Translation.
Abstract: We maintain that the essential feature that characterizes a Machine Translation approach and sets it apart from other approaches is the kind of knowledge it uses. From this perspective, we argue that Example-Based Machine Translation is sometimes characterized in terms of nonessential features. We show that Example-Based Machine Translation, as long as it is linguistically principled, significantly overlaps with other linguistically principled approaches to Machine Translation. We make a proposal for translation knowledge bases that make such an overlap explicit. We relate our proposal to translation by analogy, which stands out as an inherently example-based technique.

Journal ArticleDOI
01 Oct 2003
TL;DR: The basic techniques of agile parsing in TXL are introduced and several industry-proven techniques for exploiting agile parsing in software source analysis and transformation are discussed.
Abstract: Syntactic analysis forms a foundation of many source analysis and reverse engineering tools. However, a single standard grammar is not always appropriate for all source analysis and manipulation tasks. Small custom modifications to the grammar can make the programs used to implement these tasks simpler, clearer and more efficient. This leads to a new paradigm for programming these tools: agile parsing. In agile parsing the effective grammar used by a particular tool is a combination of two parts: the standard base grammar for the input language, and a set of explicit grammar overrides that modify the parse to support the task at hand. This paper introduces the basic techniques of agile parsing in TXL and discusses several industry proven techniques for exploiting agile parsing in software source analysis and transformation.
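
The effective-grammar idea can be sketched outside TXL as a base grammar with task-specific productions merged over it (in TXL itself this is expressed with explicit grammar overrides). The production names below are illustrative.

```python
# Sketch of agile parsing's effective grammar: base grammar + overrides.
BASE_GRAMMAR = {
    "statement": ["assignment", "if_statement", "call"],
    "call": ["IDENTIFIER ( arguments )"],
}

# Task-specific override: give logging calls their own nonterminal so the
# analysis tool can match them directly. Productions are illustrative.
OVERRIDES = {
    "call": ["log_call", "IDENTIFIER ( arguments )"],
    "log_call": ["'log' . IDENTIFIER ( arguments )"],
}

def effective_grammar(base, overrides):
    """Overridden nonterminals replace base definitions; new ones are added."""
    grammar = dict(base)
    grammar.update(overrides)
    return grammar

for nt, productions in effective_grammar(BASE_GRAMMAR, OVERRIDES).items():
    print(nt, "->", " | ".join(productions))
```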

Book ChapterDOI
21 Aug 2003
TL;DR: The monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection show that document translation-based retrieval is slightly better than query translation-based retrieval on this collection.
Abstract: This paper describes monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection. The paper compares query translation-based multilingual retrieval with document translation-based multilingual retrieval where the documents are translated into the query language by translating the document words individually using machine translation systems or statistical translation lexicons derived from parallel texts. The multilingual retrieval results show that document translation-based retrieval is slightly better than the query translation-based retrieval on the CLEF 2003 test collection. Furthermore, combining query translation and document translation in multilingual retrieval achieves even better performance.

Proceedings ArticleDOI
26 Oct 2003
TL;DR: An integrated phrase segmentation/alignment algorithm (ISA) for statistical machine translation that yields phrase-to-phrase translations with significantly higher precision than the baseline system where phrase translations are extracted from the HMM word alignment.
Abstract: We present an integrated phrase segmentation/alignment algorithm (ISA) for statistical machine translation. Without the need of building an initial word-to-word alignment or initially segmenting the monolingual text into phrases as other methods do, this algorithm segments the sentences into phrases and finds their alignments simultaneously. For each sentence pair, ISA builds a two-dimensional matrix to represent the sentence pair, where the value of each cell corresponds to the point-wise mutual information (MI) between the source and target words. Based on the similarities of MI values among cells, we identify the aligned phrase pairs. Once all the phrase pairs are found, we know both how to segment one sentence into phrases and also the alignments between the source and target sentences. We use monolingual bigram language models to estimate the joint probabilities of the identified phrase pairs. The joint probabilities are then normalized to conditional probabilities, which are used by the decoder. Despite its simplicity, this approach yields phrase-to-phrase translations with significantly higher precision than our baseline system where phrase translations are extracted from the HMM word alignment. When we combine the phrase-to-phrase translations generated by this algorithm with the baseline system, the improvement on translation quality is even larger.
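
The MI matrix at the core of ISA can be sketched from sentence-level co-occurrence counts. The three-pair corpus below is a toy; real systems estimate these statistics from large parallel corpora.

```python
import math
from collections import Counter

# Toy sentence-aligned corpus (source tokens, target tokens).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "auto"], ["the", "car"]),
    (["ein", "haus"], ["a", "house"]),
]

src_c, tgt_c, pair_c = Counter(), Counter(), Counter()
for src, tgt in corpus:
    src_c.update(set(src))
    tgt_c.update(set(tgt))
    pair_c.update((s, t) for s in set(src) for t in set(tgt))
n = len(corpus)

def pmi(s, t):
    """Point-wise mutual information from sentence co-occurrence counts."""
    if not pair_c[(s, t)]:
        return float("-inf")
    return math.log((pair_c[(s, t)] / n) / ((src_c[s] / n) * (tgt_c[t] / n)))

# The MI matrix for the first sentence pair: one row per source word.
src, tgt = corpus[0]
for s in src:
    print(s, [round(pmi(s, t), 2) for t in tgt])
```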

Journal ArticleDOI
TL;DR: The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario, and automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system.
Abstract: We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches---Statistical MT (SMT) and Example-based MT (EBMT)---under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that achieved with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combines the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

Book ChapterDOI
16 Feb 2003
TL;DR: An approach based on static processing of stem allomorphs and the method of analysis known as "analysis through generation" is proposed for morphological analysis of inflective languages.
Abstract: Development of morphological analysis systems for inflective languages is a tedious and laborious task. We suggest an approach for the development of such systems that permits spending less time and effort. It is based on static processing of stem allomorphs and the method of analysis known as "analysis through generation." These features allow the use of morphological models oriented to generation, instead of developing special analysis models. Normally, generation models are presented in traditional grammars and correspond very well to the intuition of speakers. Systems based on this approach were developed for Russian and Spanish.
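
The "analysis through generation" idea can be sketched as running a generation-oriented paradigm offline to enumerate all surface forms, then analyzing by table lookup at run time. The one-paradigm Spanish-style lexicon below is a toy; the real systems also handle stem allomorphy.

```python
# "Analysis through generation": generate all surface forms offline from a
# generation-oriented model, then analyze by lookup. Toy one-paradigm lexicon.
LEXICON = {"habl": "ar-verb"}
PARADIGMS = {
    "ar-verb": {"o": "1sg.pres", "as": "2sg.pres", "a": "3sg.pres",
                "amos": "1pl.pres", "an": "3pl.pres"},
}

# Offline step: enumerate every surface form with its analysis.
analyses = {}
for stem, paradigm in LEXICON.items():
    for ending, tag in PARADIGMS[paradigm].items():
        analyses.setdefault(stem + ending, []).append((stem, tag))

def analyze(surface):
    """Run-time analysis is a plain lookup in the generated table."""
    return analyses.get(surface, [])

print(analyze("hablamos"))   # [('habl', '1pl.pres')]
```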

Proceedings Article
01 Jan 2003
TL;DR: This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers through the algebra of finite-state transducers (FSTs).
Abstract: Spoken language systems, ranging from interactive voice response (IVR) to mixed-initiative conversational systems, make use of a wide range of recognition grammars and vocabularies. The recognition grammars are either static (created at design time) or dynamic (dependent on database lookup at run time). This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers. By casting the problem in the algebra of finite-state transducers (FSTs) we can use the composition operator for fast and efficient compilation and splicing of dynamic recognition grammars within the context of a larger precompiled static grammar.
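
Composition, the operator used here for splicing dynamic grammars into a static one, can be sketched for epsilon-free transducers: if T1 maps a:b and T2 maps b:c, then T1 composed with T2 maps a:c. The two one-arc transducers below are toys.

```python
# Composition of two epsilon-free finite-state transducers (toy sketch).
# Transducer format: (start, final_states, {(state, in_sym): [(out_sym, next)]}).
T1 = ("s0", {"s1"}, {("s0", "flight"): [("FLIGHT", "s1")]})
T2 = ("t0", {"t1"}, {("t0", "FLIGHT"): [("book_flight", "t1")]})

def compose(t1, t2):
    (s1, f1, d1), (s2, f2, d2) = t1, t2
    start = (s1, s2)
    finals, delta = set(), {}
    stack, seen = [start], {start}
    while stack:
        q1, q2 = stack.pop()
        if q1 in f1 and q2 in f2:
            finals.add((q1, q2))
        for (p1, a), arcs in d1.items():
            if p1 != q1:
                continue
            for b, n1 in arcs:                       # T1 arc a:b
                for c, n2 in d2.get((q2, b), []):    # T2 arc b:c
                    delta.setdefault(((q1, q2), a), []).append((c, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return start, finals, delta

start, finals, delta = compose(T1, T2)
print(delta)   # {(('s0','t0'), 'flight'): [('book_flight', ('s1','t1'))]}
```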

Proceedings Article
01 Jan 2003
TL;DR: It is shown that, although domain actions are domain specific, the approach scales up to large domains without an explosion of domain actions and can be coded with high inter-coder reliability across research sites.
Abstract: We describe a coding scheme for machine translation of spoken task-oriented dialogue. The coding scheme covers two levels of speaker intention: domain-independent speech acts and domain-dependent domain actions. Our database contains over 14,000 tagged sentences in English, Italian, and German. We argue that domain actions, and not speech acts, are the relevant discourse unit for improving translation quality. We also show that, although domain actions are domain specific, the approach scales up to large domains without an explosion of domain actions and can be coded with high inter-coder reliability across research sites. Furthermore, although the number of domain actions is on the order of ten times the number of speech acts, sparseness is not a problem for the training of classifiers for identifying the domain action. We describe our work on developing high-accuracy speech act and domain action classifiers, which form the core of the source language analysis module of our NESPOLE machine translation system.