
Showing papers on "Rule-based machine translation published in 2003"


Book
01 Jan 2003
TL;DR: This book provides the first comprehensive introduction to Grammatical Evolution, a novel approach to Genetic Programming that adopts principles from molecular biology in a simple and useful manner, coupled with the use of grammars to specify legal structures in a search.
Abstract: From the Publisher: Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language provides the first comprehensive introduction to Grammatical Evolution, a novel approach to Genetic Programming that adopts principles from molecular biology in a simple and useful manner, coupled with the use of grammars to specify legal structures in a search. Grammatical Evolution's rich modularity gives a unique flexibility, making it possible to use alternative search strategies - whether evolutionary, deterministic, or some other approach - and to radically change its behavior by merely changing the grammar supplied. This approach to Genetic Programming represents a powerful new weapon in the Machine Learning toolkit that can be applied to a diverse set of problem domains.
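
To make the core mechanism concrete, here is a minimal Python sketch of Grammatical Evolution's genotype-to-phenotype mapping, in which integer codons select grammar productions modulo the number of alternatives. The toy grammar and codon values are invented for illustration and are not taken from the book.

```python
# Minimal sketch of Grammatical Evolution's genotype-to-phenotype mapping.
# The grammar and the codon values below are illustrative toys.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"], ["*"]],
    "<var>":  [["x"], ["1"]],
}

def derive(codons, start="<expr>", max_steps=100):
    """Expand the leftmost nonterminal repeatedly, choosing each production
    as codon % (number of alternatives) and wrapping around the genome."""
    seq, i = [start], 0
    for _ in range(max_steps):
        nt = next((s for s in seq if s in GRAMMAR), None)
        if nt is None:                    # fully terminal: derivation done
            return " ".join(seq)
        alts = GRAMMAR[nt]
        choice = alts[codons[i % len(codons)] % len(alts)]
        i += 1
        pos = seq.index(nt)
        seq = seq[:pos] + list(choice) + seq[pos + 1:]
    return None                           # step limit hit: invalid individual

print(derive([6, 3, 12, 4, 9, 2]))        # -> "x - x" with these codons
```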

621 citations


Patent
14 Nov 2003
TL;DR: A system and method for translation of electronic communications automatically selects and deploys specialized dictionaries based upon context recognition and other factors; it can be used to translate electronic mail, instant messages, chat, SMS messages, electronic text and word processing files, Internet web pages, Internet search results, and other textual communications for a variety of device types, including wireless devices.
Abstract: A system and method for translation of electronic communications automatically selects and deploys specialized dictionaries based upon context recognition and other factors. Software tools can be employed for continual dictionary enhancement. The invention can accept speech and text inputs and can be used to translate electronic mail, instant messages, chat, SMS messages, electronic text and word processing files, Internet web pages, Internet search results, and other textual communications for a variety of device types, including wireless devices. In one embodiment, language pairs are automatically determined in real-time.

225 citations


Dissertation
01 Feb 2003
TL;DR: The Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts and has been shown to be applicable in the comparison of UK 2001 general election manifestos, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis.
Abstract: Matrix: A statistical method and software tool for linguistic analysis through corpus comparison A thesis submitted to Lancaster University for the degree of Ph.D. in Computer Science Paul Edward Rayson, B.Sc. September 2002 This thesis reports the development of a new kind of method and tool (Matrix) for advancing the statistical analysis of electronic corpora of linguistic data. First, we describe the standard corpus linguistic methodology, which is hypothesis-driven. The standard research process model is ‘question – build – annotate – retrieve – interpret’, in other words, identifying the research question (and the linguistic features) early in the study. In recent years corpora have been increasingly annotated with linguistic information. From our survey, we find that no tools are available which are data-driven on annotated corpora, in other words, tools which assist in finding candidate research questions. However, Matrix is such a tool. It allows the macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) as to which linguistic features should be investigated further. By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of UK 2001 general election manifestos of the Labour and Liberal Democratic parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. Currently, it has been tested on restricted levels of annotation and only on English language data.
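
The keywords procedure that Matrix extends is conventionally computed with a log-likelihood statistic comparing an item's frequency in two corpora, and the same statistic applies unchanged to POS tags or semantic tags. Below is a minimal sketch assuming the standard G2 formulation from corpus linguistics; the counts are invented.

```python
import math

def log_likelihood(freq1, total1, freq2, total2):
    """Log-likelihood (G2) keyness of an item seen freq1 times in a corpus
    of total1 tokens versus freq2 times in a corpus of total2 tokens."""
    e1 = total1 * (freq1 + freq2) / (total1 + total2)  # expected count, corpus 1
    e2 = total2 * (freq1 + freq2) / (total1 + total2)  # expected count, corpus 2
    g2 = 0.0
    if freq1:
        g2 += freq1 * math.log(freq1 / e1)
    if freq2:
        g2 += freq2 * math.log(freq2 / e2)
    return 2 * g2

# Toy comparison: a grammatical category tagged 120 times in a 10,000-token
# corpus versus 60 times in a 12,000-token corpus.
print(round(log_likelihood(120, 10_000, 60, 12_000), 2))
```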

219 citations


Journal ArticleDOI
TL;DR: An IRS based on fuzzy multi-granular linguistic information and a method to process the multi-granular linguistic information are proposed, and the system accepts Boolean queries whose terms can be simultaneously weighted by means of ordinal linguistic values according to three semantics.

137 citations


Proceedings ArticleDOI
18 Jun 2003
TL;DR: Stochastic grammars are extended by adding event parameters, state checks, and sensitivity to an internal scene model to recognize a person performing the Towers of Hanoi task from a video sequence by analyzing object interaction events.
Abstract: Video-based recognition and prediction of a temporally extended activity can benefit from a detailed description of high-level expectations about the activity. Stochastic grammars allow for an efficient representation of such expectations and are well-suited for the specification of temporally well-ordered activities. In this paper, we extend stochastic grammars by adding event parameters, state checks, and sensitivity to an internal scene model. We present an implemented system that uses human-specified grammars to recognize a person performing the Towers of Hanoi task from a video sequence by analyzing object interaction events. Experimental results from several videos show robust recognition of the full task and its constituent sub-tasks even though no appearance models of the objects in the video are provided. These experiments include videos of the task performed with different shaped objects and with distracting and extraneous interactions.

129 citations


Journal ArticleDOI
TL;DR: In this article, the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process were investigated, and the results showed that the Web-based translation models can surpass commercial MT systems in CLIR tasks.
Abstract: Although more and more language pairs are covered by machine translation (MT) services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag of words. The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
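
One straightforward way to embed such a translation model in a bag-of-words retrieval model is to expand each query term into its most probable target-language translations, weighted by translation probability. The sketch below assumes a hypothetical probabilistic lexicon of the kind trained from Web-mined parallel text; the entries, weights, and scoring are invented for illustration.

```python
# Hypothetical lexicon p(target_word | source_word), as might be trained
# from Web-mined parallel text; all values here are invented.
LEXICON = {
    "maison": {"house": 0.7, "home": 0.25, "mansion": 0.05},
    "blanche": {"white": 0.9, "blank": 0.1},
}

def translate_query(source_terms, top_k=2):
    """Expand each source term into its top-k weighted translations."""
    weighted = {}
    for s in source_terms:
        ranked = sorted(LEXICON.get(s, {}).items(), key=lambda kv: -kv[1])
        for t, p in ranked[:top_k]:
            weighted[t] = weighted.get(t, 0.0) + p
    return weighted

def score(doc_tokens, weighted_query):
    """Bag-of-words score: translation-probability-weighted term frequency."""
    return sum(weighted_query.get(tok, 0.0) for tok in doc_tokens)

q = translate_query(["maison", "blanche"])
print(q, score("the white house is a big house".split(), q))
```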

128 citations


Proceedings ArticleDOI
16 Sep 2003
TL;DR: The results show that the system is comparable to existing SLU systems which rely on either handcrafted semantic grammar rules or statistical models trained on fully-annotated training corpora, but it has greatly reduced build cost.
Abstract: The paper presents a purely data-driven spoken language understanding (SLU) system. It consists of three major components, a speech recognizer, a semantic parser, and a dialog act decoder. A novel feature of the system is that the understanding components are trained directly from data without using explicit semantic grammar rules or fully-annotated corpus data. Despite this, the system is nevertheless able to capture hierarchical structure in user utterances and handle long range dependencies. Experiments have been conducted on the ATIS corpus and 16.1% and 12.6% utterance understanding error rates were obtained for spoken input using the ATIS-3 1993 and 1994 test sets. These results show that our system is comparable to existing SLU systems which rely on either handcrafted semantic grammar rules or statistical models trained on fully-annotated training corpora, but it has greatly reduced build cost.

127 citations


Proceedings ArticleDOI
07 Jul 2003
TL;DR: A dedicated noun phrase translation subsystem is built that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features.
Abstract: We define noun phrase translation as a subtask of machine translation This enables us to build a dedicated noun phrase translation subsystem that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features We achieved 655% translation accuracy in a German-English translation task vs 532% with IBM Model 4

117 citations


Journal ArticleDOI
TL;DR: The results are important in the sense that using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
Abstract: This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined this model with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56%, according to the MUC evaluation criteria. Our results are important in the sense that using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
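
A standard way to combine two such knowledge sources is linear interpolation of the component model scores; the sketch below uses a hypothetical weight, and the paper's actual combination scheme may differ.

```python
def interpolate(p_lexical, p_morphological, lam=0.6):
    """Linear interpolation of a lexical and a morphological model estimate.
    The weight lam is a hypothetical tuning value."""
    return lam * p_lexical + (1 - lam) * p_morphological

# Combined P(sentence boundary | context) from the two component models:
print(interpolate(p_lexical=0.30, p_morphological=0.55))  # 0.6*0.30 + 0.4*0.55 = 0.40
```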

111 citations


Proceedings ArticleDOI
19 Nov 2003
TL;DR: The authors formulate the semantic parsing problem as a classification problem using support vector machines and use a hand-labeled training set and a set of features drawn from earlier work together with some feature enhancements.
Abstract: There is an ever-growing need to add structure in the form of semantic markup to the huge amounts of unstructured text data now available. We present the technique of shallow semantic parsing, the process of assigning a simple WHO did WHAT to WHOM, etc., structure to sentences in text, as a useful tool in achieving this goal. We formulate the semantic parsing problem as a classification problem using support vector machines. Using a hand-labeled training set and a set of features drawn from earlier work together with some feature enhancements, we demonstrate a system that performs better than all other published results on shallow semantic parsing.
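
The classification formulation can be sketched with off-the-shelf tools; the snippet below uses scikit-learn with a toy feature set (head word, POS, position, parse path) standing in for the paper's richer features, and invented training examples.

```python
# Shallow semantic parsing as classification: toy sketch with scikit-learn.
# Features and labels are invented stand-ins for the paper's feature set.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = [
    ({"head": "john", "pos": "NNP", "position": "before", "path": "NP-S-VP"}, "WHO"),
    ({"head": "ball", "pos": "NN",  "position": "after",  "path": "NP-VP"},  "WHAT"),
    ({"head": "mary", "pos": "NNP", "position": "after",  "path": "NP-PP-VP"}, "WHOM"),
]
X, y = zip(*train)

model = make_pipeline(DictVectorizer(), LinearSVC())  # one-hot features + linear SVM
model.fit(list(X), list(y))

# Classify a new candidate argument by its features:
print(model.predict([{"head": "sue", "pos": "NNP",
                      "position": "before", "path": "NP-S-VP"}]))
```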

92 citations


Journal ArticleDOI
TL;DR: It is proved that the P systems with context-free rules are computationally universal, able to generate all computable array languages.
Abstract: We consider array languages (sets of pictures consisting of symbols placed in the lattice points of the 2D grid) and the possibility to handle them with P systems. After proving binary normal forms for array matrix grammars (which, even in the case when no appearance checking is used, are known to generate the array languages of arbitrary array grammars), we prove that the P systems with context-free rules (with three membranes and no control on the communication or the use of rules) are computationally universal, able to generate all computable array languages. Some open problems are also formulated.

Proceedings ArticleDOI
26 Oct 2003
TL;DR: A decoder for statistical machine translation is described which allows controlled reordering of the words generated in the target language, and the effect of the length of the reordering window on the search space and the translation quality is analyzed.
Abstract: We describe a decoder for statistical machine translation which allows controlled reordering of the words generated in the target language. After a general discussion of the structure of a decoder, a particular implementation is discussed which allows for word-to-word and phrase-to-phrase translation. Word reordering is used to improve the translation quality. We analyze the effect of the length of this reordering window on the search space and the translation quality. Results for Chinese-to-English and Arabic-to-English translation tasks are presented.
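
The reordering window can be made concrete by checking which source-position visit orders it admits. The sketch below assumes one common formalization, that the next position translated must lie within a fixed distance of the leftmost still-uncovered position; the paper's exact constraint may differ.

```python
from itertools import permutations
from math import factorial

def within_window(order, window):
    """True if every chosen position is less than `window` places to the
    right of the leftmost not-yet-covered source position."""
    covered = set()
    for pos in order:
        leftmost = min(p for p in range(len(order)) if p not in covered)
        if pos - leftmost >= window:
            return False
        covered.add(pos)
    return True

n, window = 4, 2
allowed = [o for o in permutations(range(n)) if within_window(o, window)]
print(f"{len(allowed)} of {factorial(n)} visit orders survive a window of {window}")
```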

Proceedings ArticleDOI
Christoph Tillmann1
11 Jul 2003
TL;DR: A phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models and has been successfully tested on a Chinese-English and an Arabic-English translation task.
Abstract: In this paper, we describe a phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models. The units of translation are blocks -- pairs of phrases. During decoding, we use a block unigram model and a word-based trigram language model. During training, the blocks are learned from source interval projections using an underlying high-precision word alignment. The system performance is significantly increased by applying a novel block extension algorithm using an additional high-recall word alignment. The blocks are further filtered using unigram-count selection criteria. The system has been successfully tested on a Chinese-English and an Arabic-English translation task.

Journal ArticleDOI
TL;DR: In summary, the formal aspect of CW is more systematically established and more deeply dealt with, while some new problems also emerge.
Abstract: Computing with words (CW), as a methodology, means computing and reasoning by the use of words in place of numbers or symbols, which may conform more to humans' perception when describing real-world problems. In this paper, as a continuation of a previous paper, we aim to develop and deepen a formal aspect of CW. According to the previous paper, the basic point of departure is that CW treats certain formal modes of computation with strings of fuzzy subsets instead of symbols as their inputs. Specifically, 1) we elaborate on CW via Turing machine (TM) models, showing the time complexity is at least exponential if the inputs are strings of words; 2) a negative result, that (6) does not hold, is verified, which indicates that the extension principle for CW via TMs needs to be re-examined; 3) we discuss CW via context-free grammars and regular grammars, and the extension principles for CW via these formal grammars are set up; 4) some equivalences between fuzzy pushdown automata (respectively, fuzzy finite-state automata) and fuzzy context-free grammars (respectively, fuzzy regular grammars) are demonstrated in the sense that the inputs are instead strings of words; 5) some instances are described in detail. In summary, the formal aspect of CW is more systematically established and more deeply dealt with, while some new problems also emerge.
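
For item 4), the behavior of a fuzzy finite-state automaton on a string is standardly computed by max-min composition: the acceptance degree is the maximum over paths of the minimum transition degree along the path. A toy sketch with invented membership degrees:

```python
# Toy fuzzy finite-state automaton: (state, symbol) -> {next_state: degree}.
TRANS = {
    ("q0", "a"): {"q0": 0.9, "q1": 0.4},
    ("q0", "b"): {"q1": 0.6},
    ("q1", "b"): {"q1": 0.8},
}
FINAL = {"q1": 1.0}   # membership degree of each final state

def accept_degree(word, start="q0"):
    """Max-min acceptance degree of `word`."""
    degrees = {start: 1.0}
    for sym in word:
        nxt = {}
        for state, d in degrees.items():
            for s2, t in TRANS.get((state, sym), {}).items():
                nxt[s2] = max(nxt.get(s2, 0.0), min(d, t))
        degrees = nxt
    return max((min(d, FINAL.get(s, 0.0)) for s, d in degrees.items()),
               default=0.0)

print(accept_degree("ab"))   # 0.6 for this toy automaton
```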

Journal ArticleDOI
TL;DR: M-PIRO is a system that generates descriptions of museum objects tailored to the user; its architecture is significantly more modular than that of its predecessor ILEX.
Abstract: The authors describe a system that generates descriptions of museum objects tailored to the user. The texts presented to adults, children, and experts differ in several ways, from the choice of words used to the complexity of the sentence forms. M-PIRO can currently generate text in three languages: English, Greek, and Italian. The grammar resources are as language-independent as possible. M-PIRO's system architecture is significantly more modular than that of its predecessor ILEX. In particular, the linguistic resources, database, and user-modeling subsystems are now separate from the systems that perform the natural language generation and speech synthesis.

Proceedings ArticleDOI
12 Jul 2003
TL;DR: The results of a feasibility study on the ability of memory-based machine translation and word-to-word compositional machine translation to translate Japanese and English noun-noun compounds are described.
Abstract: The translation of compound nouns is a major issue in machine translation due to their frequency of occurrence and high productivity. Various shallow methods have been proposed to translate compound nouns, notable amongst which are memory-based machine translation and word-to-word compositional machine translation. This paper describes the results of a feasibility study on the ability of these methods to translate Japanese and English noun-noun compounds.
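
Word-to-word compositional translation of a noun-noun compound reduces to per-noun dictionary lookup plus target-side ordering templates. The mini-dictionary and the two English templates below are invented assumptions, not the paper's resources; a real system would also rank the candidates, e.g. with a target language model.

```python
# Word-to-word compositional translation of a noun-noun compound (toy).
DICT = {"情報": ["information"], "検索": ["retrieval", "search"]}

def translate_compound(n1, n2):
    """Generate candidate English renderings of the compound n1 + n2."""
    candidates = []
    for t1 in DICT.get(n1, []):
        for t2 in DICT.get(n2, []):
            candidates.append(f"{t1} {t2}")     # "N1 N2" template
            candidates.append(f"{t2} of {t1}")  # "N2 of N1" template
    return candidates

print(translate_compound("情報", "検索"))  # 情報検索 ~ "information retrieval"
```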

Proceedings ArticleDOI
12 Apr 2003
TL;DR: A fully fledged translation system is used to ensure the quality of the proposed extensions of the interactive machine translation system, and word hypotheses graphs are used as an efficient search space representation to achieve fast response times.
Abstract: The goal of interactive machine translation is to improve the productivity of human translators. An interactive machine translation system operates as follows: the automatic system proposes a translation. Now, the human user has two options: to accept the suggestion or to correct it. During the post-editing process, the human user is assisted by the interactive system in the following way: the system suggests an extension of the current translation prefix. Then, the user either accepts this extension (completely or partially) or ignores it. The two most important factors of such an interactive system are the quality of the proposed extensions and the response time. Here, we will use a fully fledged translation system to ensure the quality of the proposed extensions. To achieve fast response times, we will use word hypotheses graphs as an efficient search space representation. We will show results of our approach on the Verbmobil task and on the Canadian Hansards task.
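
The interactive step, proposing an extension of the validated prefix, can be sketched as a best-path search in a word hypotheses graph from the node the prefix leads to. The toy graph and log-probabilities below are invented; real word graphs come from the full translation system.

```python
# Toy word hypotheses graph: node -> [(word, next_node, logprob)].
GRAPH = {
    0: [("the", 1, -0.1), ("a", 1, -0.9)],
    1: [("house", 2, -0.3), ("home", 2, -0.5)],
    2: [("is", 3, -0.2)],
    3: [],   # final node
}

def best_completion(node):
    """Highest-scoring word sequence from `node` to a final node."""
    if not GRAPH[node]:
        return 0.0, []
    options = []
    for word, nxt, lp in GRAPH[node]:
        score, words = best_completion(nxt)
        options.append((lp + score, [word] + words))
    return max(options, key=lambda o: o[0])

def suggest(prefix):
    """Follow the validated prefix through the graph, then extend it."""
    node = 0
    for word in prefix:
        node = next(n for w, n, _ in GRAPH[node] if w == word)
    return best_completion(node)[1]

print(suggest(["the"]))   # -> ['house', 'is']
```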

Book ChapterDOI
01 Jan 2003
TL;DR: A new example-based method of machine translation in which the examples need not be direct translations, allowing the use of currently available sentence-aligned corpora as data.
Abstract: This paper introduces a new example-based method of machine translation in which the examples need not be direct translations. The system will weed out strange examples during translation, allowing the use of currently available sentence-aligned corpora as data. Rule-based modules are used where appropriate. A prototype Japanese-to-English system has been implemented that allows multiple users to share corpora.

01 Jan 2003
TL;DR: This workshop incrementally adds new features representing syntactic knowledge that deal with specific problems of the underlying baseline, and extends previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.
Abstract: In recent evaluations of machine translation systems, statistical systems have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors. This can be attributed to the fact that the syntactic well-formedness is only influenced by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. In the workshop we start with a very strong baseline – the alignment template statistical machine translation system that obtained the best results in the 2002 and 2003 DARPA MT evaluations. This model is based on a log-linear modeling framework, which allows for the easy integration of many different knowledge sources (i.e. feature functions) into an overall model and for discriminative training of the feature function combination weights. During the workshop, we incrementally add new features representing syntactic knowledge that deal with specific problems of the underlying baseline. We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features are derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We also plan to investigate features based on projection of parse trees from one language onto strings of another, a useful technique when parses are available for only one of the two languages. We extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic. We work with the Chinese-English data from the recent evaluations, as large amounts of sentence-aligned training corpora, as well as multiple reference translations, are available. This will also allow us to compare results with the various systems participating in the evaluations. In addition, an annotated Chinese-English parallel tree-bank is available. We evaluate the improvement of our system using the BLEU metric. Using the additional feature functions developed during the workshop, the BLEU score improved from 31.6% for the baseline MT system to 33.2% using rescoring of a 1000-best list.
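
The log-linear framework and the 1000-best rescoring step reduce to scoring each hypothesis as a weighted sum of feature function values and re-ranking. A minimal sketch with invented hypotheses, feature values, and weights:

```python
# Log-linear rescoring of an n-best list: score(e) = sum_i lambda_i * h_i(e).
# Hypotheses, feature values, and weights are invented for illustration.
WEIGHTS = {"lm": 0.5, "tm": 0.4, "syntax": 0.1}

nbest = [
    ("the house is white", {"lm": -4.1, "tm": -3.0, "syntax": -0.2}),
    ("white the house is", {"lm": -6.5, "tm": -2.8, "syntax": -1.7}),
]

def loglinear_score(features):
    return sum(WEIGHTS[name] * value for name, value in features.items())

best = max(nbest, key=lambda hyp: loglinear_score(hyp[1]))
print(best[0])   # the weighted feature sum prefers the grammatical hypothesis
```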

01 Jan 2003
TL;DR: A dedicated noun phrase translation subsystem is built that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features and shows overall improvement in translation quality.
Abstract: We define noun phrase translation as a subtask of statistical machine translation. This enables us to build a dedicated noun phrase translation subsystem that improves over the currently best general statistical machine translation methods by incorporating special modeling and special features. We integrate such a system into a state-of-the-art statistical machine translation system with novel methods and show overall improvement in translation quality. We also carry out empirical linguistic studies on noun phrase translatability and the sources of translation errors.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: A novel two-step fuzzy translation technique for cross-lingual spelling variants that was evaluated empirically using five source languages and English as a target language and performed better, sometimes considerably better, than fuzzy matching alone.
Abstract: We will present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first stage, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second stage, the intermediate forms obtained in the first stage are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The target word list contained 189 000 English words with the correct equivalents for the source words among them. The source words were translated using the two-step fuzzy translation technique, and the results were compared with those of plain fuzzy matching based translation. The combined technique performed better, sometimes considerably better, than fuzzy matching alone.
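
The two stages can be sketched directly: apply learned transformation rules to the source word, then fuzzy-match the intermediate form against the target word list. The rules below are invented examples, and difflib's similarity ratio stands in for the paper's fuzzy matching method.

```python
import difflib

# Stage 1: transformation rules making source spellings more target-like.
# These example rules for a French-like source are invented.
RULES = [("ie", "y"), ("ique", "ic"), ("eur", "or")]

def transform(word):
    for src, dst in RULES:
        word = word.replace(src, dst)
    return word

# Stage 2: fuzzy matching of the intermediate form against target words.
TARGET_WORDS = ["physics", "fizz", "chemistry", "chemical"]

def fuzzy_translate(source_word, k=1):
    intermediate = transform(source_word)
    ranked = sorted(
        TARGET_WORDS,
        key=lambda t: difflib.SequenceMatcher(None, intermediate, t).ratio(),
        reverse=True,
    )
    return intermediate, ranked[:k]

print(fuzzy_translate("chimie"))   # ('chimy', ['chemistry'])
```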

Proceedings ArticleDOI
03 Nov 2003
TL;DR: Although the parser was originally developed for conversational English and made many mistakes in parsing sentences from the biochemical domain, it nevertheless achieved better overall performance than a co-occurrence-only method.
Abstract: Many natural language processing approaches at various complexity levels have been reported for extracting biochemical interactions from MEDLINE. While some algorithms using simple template matching are unable to deal with the complex syntactic structures, others exploiting sophisticated parsing techniques are hindered by greater computational cost. This study investigates link grammar parsing for extracting biochemical interactions. Link grammar parsing can handle many syntactic structures and is computationally relatively efficient. We experimented on a sample MEDLINE corpus. Although the parser was originally developed for conversational English and made many mistakes in parsing sentences from the biochemical domain, it nevertheless achieved better overall performance than a co-occurrence-only method. Customizing the parser for the biomedical domain is expected to improve its performance further.

Book ChapterDOI
01 Jan 2003
TL;DR: It is shown that Example-Based Machine Translation, as long as it is linguistically principled, significantly overlaps with other linguistically principled approaches to Machine Translation.
Abstract: We maintain that the essential feature that characterizes a Machine Translation approach and sets it apart from other approaches is the kind of knowledge it uses. From this perspective, we argue that Example-Based Machine Translation is sometimes characterized in terms of nonessential features. We show that Example-Based Machine Translation, as long as it is linguistically principled, significantly overlaps with other linguistically principled approaches to Machine Translation. We make a proposal for translation knowledge bases that make such an overlap explicit. We relate our proposal to translation by analogy, which stands out as an inherently example-based technique.

Journal ArticleDOI
01 Oct 2003
TL;DR: The basic techniques of agile parsing in TXL are introduced and several industry-proven techniques for exploiting agile parsing in software source analysis and transformation are discussed.
Abstract: Syntactic analysis forms a foundation of many source analysis and reverse engineering tools. However, a single standard grammar is not always appropriate for all source analysis and manipulation tasks. Small custom modifications to the grammar can make the programs used to implement these tasks simpler, clearer and more efficient. This leads to a new paradigm for programming these tools: agile parsing. In agile parsing the effective grammar used by a particular tool is a combination of two parts: the standard base grammar for the input language, and a set of explicit grammar overrides that modify the parse to support the task at hand. This paper introduces the basic techniques of agile parsing in TXL and discusses several industry proven techniques for exploiting agile parsing in software source analysis and transformation.
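
The effective-grammar idea can be sketched outside TXL as a base grammar with task-specific productions merged over it (in TXL itself this is expressed with explicit grammar overrides). The production names below are illustrative.

```python
# Sketch of agile parsing's effective grammar: base grammar + overrides.
BASE_GRAMMAR = {
    "statement": ["assignment", "if_statement", "call"],
    "call": ["IDENTIFIER ( arguments )"],
}

# Task-specific override: give logging calls their own nonterminal so the
# analysis tool can match them directly. Productions are illustrative.
OVERRIDES = {
    "call": ["log_call", "IDENTIFIER ( arguments )"],
    "log_call": ["'log' . IDENTIFIER ( arguments )"],
}

def effective_grammar(base, overrides):
    """Overridden nonterminals replace base definitions; new ones are added."""
    grammar = dict(base)
    grammar.update(overrides)
    return grammar

for nt, productions in effective_grammar(BASE_GRAMMAR, OVERRIDES).items():
    print(nt, "->", " | ".join(productions))
```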

Book ChapterDOI
21 Aug 2003
TL;DR: The monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection show that document translation-based retrieval is slightly better than query translation-based retrieval on this collection.
Abstract: This paper describes monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection. The paper compares query translation-based multilingual retrieval with document translation-based multilingual retrieval where the documents are translated into the query language by translating the document words individually using machine translation systems or statistical translation lexicons derived from parallel texts. The multilingual retrieval results show that document translation-based retrieval is slightly better than the query translation-based retrieval on the CLEF 2003 test collection. Furthermore, combining query translation and document translation in multilingual retrieval achieves even better performance.

Proceedings ArticleDOI
26 Oct 2003
TL;DR: An integrated phrase segmentation/alignment algorithm (ISA) for statistical machine translation that yields phrase-to-phrase translations with significantly higher precision than the baseline system where phrase translations are extracted from the HMM word alignment.
Abstract: We present an integrated phrase segmentation/alignment algorithm (ISA) for statistical machine translation. Without the need of building an initial word-to-word alignment or initially segmenting the monolingual text into phrases as other methods do, this algorithm segments the sentences into phrases and finds their alignments simultaneously. For each sentence pair, ISA builds a two-dimensional matrix to represent the sentence pair, where the value of each cell corresponds to the point-wise mutual information (MI) between the source and target words. Based on the similarities of MI values among cells, we identify the aligned phrase pairs. Once all the phrase pairs are found, we know both how to segment one sentence into phrases and also the alignments between the source and target sentences. We use monolingual bigram language models to estimate the joint probabilities of the identified phrase pairs. The joint probabilities are then normalized to conditional probabilities, which are used by the decoder. Despite its simplicity, this approach yields phrase-to-phrase translations with significantly higher precision than our baseline system where phrase translations are extracted from the HMM word alignment. When we combine the phrase-to-phrase translations generated by this algorithm with the baseline system, the improvement on translation quality is even larger.
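
The MI matrix at the core of ISA can be sketched from sentence-level co-occurrence counts. The three-pair corpus below is a toy; real systems estimate these statistics from large parallel corpora.

```python
import math
from collections import Counter

# Toy sentence-aligned corpus (source tokens, target tokens).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "auto"], ["the", "car"]),
    (["ein", "haus"], ["a", "house"]),
]

src_c, tgt_c, pair_c = Counter(), Counter(), Counter()
for src, tgt in corpus:
    src_c.update(set(src))
    tgt_c.update(set(tgt))
    pair_c.update((s, t) for s in set(src) for t in set(tgt))
n = len(corpus)

def pmi(s, t):
    """Point-wise mutual information from sentence co-occurrence counts."""
    if not pair_c[(s, t)]:
        return float("-inf")
    return math.log((pair_c[(s, t)] / n) / ((src_c[s] / n) * (tgt_c[t] / n)))

# The MI matrix for the first sentence pair: one row per source word.
src, tgt = corpus[0]
for s in src:
    print(s, [round(pmi(s, t), 2) for t in tgt])
```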

Journal ArticleDOI
TL;DR: The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario, and automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system.
Abstract: We describe an experiment designed to evaluate the capabilities of our trainable transfer-based (Xfer) machine translation approach, as applied to the task of Hindi-to-English translation, and trained under an extremely limited data scenario. We compare the performance of the Xfer approach with two corpus-based approaches---Statistical MT (SMT) and Example-based MT (EBMT)---under the limited data scenario. The results indicate that the Xfer system significantly outperforms both EBMT and SMT in this scenario. Results also indicate that automatically learned transfer rules are effective in improving translation performance, compared with a baseline word-to-word translation version of the system. Xfer system performance with a limited number of manually written transfer rules is, however, still better than that achieved with the current automatically inferred rules. Furthermore, a "multi-engine" version of our system that combines the output of the Xfer and SMT systems and optimizes translation selection outperformed both individual systems.

Book ChapterDOI
16 Feb 2003
TL;DR: An approach based on static processing of stem allomorphs and the method of analysis known as "analysis through generation" is proposed for morphological analysis of inflective languages.
Abstract: Development of morphological analysis systems for inflective languages is a tedious and laborious task. We suggest an approach for the development of such systems that permits spending less time and effort. It is based on static processing of stem allomorphs and the method of analysis known as "analysis through generation." These features allow the use of morphological models oriented to generation, instead of developing special analysis models. Normally, generation models are presented in traditional grammars and correspond very well to the intuition of speakers. Systems based on this approach were developed for Russian and Spanish.
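
The "analysis through generation" idea can be sketched as running a generation-oriented paradigm offline to enumerate all surface forms, then analyzing by table lookup at run time. The one-paradigm Spanish-style lexicon below is a toy; the real systems also handle stem allomorphy.

```python
# "Analysis through generation": generate all surface forms offline from a
# generation-oriented model, then analyze by lookup. Toy one-paradigm lexicon.
LEXICON = {"habl": "ar-verb"}
PARADIGMS = {
    "ar-verb": {"o": "1sg.pres", "as": "2sg.pres", "a": "3sg.pres",
                "amos": "1pl.pres", "an": "3pl.pres"},
}

# Offline step: enumerate every surface form with its analysis.
analyses = {}
for stem, paradigm in LEXICON.items():
    for ending, tag in PARADIGMS[paradigm].items():
        analyses.setdefault(stem + ending, []).append((stem, tag))

def analyze(surface):
    """Run-time analysis is a plain lookup in the generated table."""
    return analyses.get(surface, [])

print(analyze("hablamos"))   # [('habl', '1pl.pres')]
```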

Proceedings Article
01 Jan 2003
TL;DR: This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers through the algebra of finite-state transducers (FSTs).
Abstract: Spoken language systems, ranging from interactive voice response (IVR) to mixed-initiative conversational systems, make use of a wide range of recognition grammars and vocabularies. The recognition grammars are either static (created at design time) or dynamic (dependent on database lookup at run time). This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers. By casting the problem in the algebra of finite-state transducers (FSTs) we can use the composition operator for fast and efficient compilation and splicing of dynamic recognition grammars within the context of a larger precompiled static grammar.
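
Composition, the operator used here for splicing dynamic grammars into a static one, can be sketched for epsilon-free transducers: if T1 maps a:b and T2 maps b:c, then T1 composed with T2 maps a:c. The two one-arc transducers below are toys.

```python
# Composition of two epsilon-free finite-state transducers (toy sketch).
# Transducer format: (start, final_states, {(state, in_sym): [(out_sym, next)]}).
T1 = ("s0", {"s1"}, {("s0", "flight"): [("FLIGHT", "s1")]})
T2 = ("t0", {"t1"}, {("t0", "FLIGHT"): [("book_flight", "t1")]})

def compose(t1, t2):
    (s1, f1, d1), (s2, f2, d2) = t1, t2
    start = (s1, s2)
    finals, delta = set(), {}
    stack, seen = [start], {start}
    while stack:
        q1, q2 = stack.pop()
        if q1 in f1 and q2 in f2:
            finals.add((q1, q2))
        for (p1, a), arcs in d1.items():
            if p1 != q1:
                continue
            for b, n1 in arcs:                       # T1 arc a:b
                for c, n2 in d2.get((q2, b), []):    # T2 arc b:c
                    delta.setdefault(((q1, q2), a), []).append((c, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return start, finals, delta

start, finals, delta = compose(T1, T2)
print(delta)   # {(('s0','t0'), 'flight'): [('book_flight', ('s1','t1'))]}
```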

Proceedings Article
01 Jan 2003
TL;DR: It is shown that, although domain actions are domain specific, the approach scales up to large domains without an explosion of domain actions and can be coded with high inter-coder reliability across research sites.
Abstract: We describe a coding scheme for machine translation of spoken task-oriented dialogue. The coding scheme covers two levels of speaker intention: domain-independent speech acts and domain-dependent domain actions. Our database contains over 14,000 tagged sentences in English, Italian, and German. We argue that domain actions, and not speech acts, are the relevant discourse unit for improving translation quality. We also show that, although domain actions are domain specific, the approach scales up to large domains without an explosion of domain actions and can be coded with high inter-coder reliability across research sites. Furthermore, although the number of domain actions is on the order of ten times the number of speech acts, sparseness is not a problem for the training of classifiers for identifying the domain action. We describe our work on developing high-accuracy speech act and domain action classifiers, which form the core of the source language analysis module of our NESPOLE machine translation system.