scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

On the integration of speech recognition and statistical machine translation.

04 Sep 2005-pp 3177-3180
TL;DR: It is shown that acoustic recognition scores of the recognized words in the lattices positively and significantly affect the translation quality and a fully integrated speech translation model is built.
Abstract: This paper focuses on the interface between speech recognition and machine translation in a speech translation system. Based on a thorough theoretical framework, we exploit word lattices of automatic speech recognition hypotheses as input to our translation system which is based on weighted finite-state transducers. We show that acoustic recognition scores of the recognized words in the lattices positively and significantly affect the translation quality. In experiments, we have found consistent improvements on three different corpora compared with translations of single best recognized results. In addition we build and evaluate a fully integrated speech translation model.

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI
25 Jun 2007
TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.
Abstract: We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.

6,008 citations


Cites background from "On the integration of speech recogn..."

  • ..., 2005], word lattices [Matusov et al., 2005; Mathias and Byrne, 2006], and confusion networks [Bertoldi and Federico, 2005]....

    [...]

ReportDOI
01 Feb 2008
TL;DR: It is argued that word lattice decoding provides a compelling model for translation of text genres, as well, and that prior work in translating lattices using finite state techniques can be naturally extended to more expressive synchronous context-free grammarbased models.
Abstract: Word lattice decoding has proven useful in spoken language translation; we argue that it provides a compelling model for translation of text genres, as well. We show that prior work in translating lattices using finite state techniques can be naturally extended to more expressive synchronous context-free grammarbased models. Additionally, we resolve a significant complication that non-linear word lattice inputs introduce in reordering models. Our experiments evaluating the approach demonstrate substantial gains for ChineseEnglish and Arabic-English translation.

193 citations


Cites methods from "On the integration of speech recogn..."

  • ...Matusov et al. (2005) decodes monotonically and then uses a finite state reordering model on the single-best translation, along the lines of Bangalore and Riccardi (2000)....

    [...]

  • ...The ‘noisier channel’ model of machine translation has been widely used in spoken language translation as an alternative to selecting the single-best hypothesis from an ASR system and translating it (Ney, 1999; Casacuberta et al., 2004; Zhang et al., 2005; Saleem et al., 2005; Matusov et al., 2005; Bertoldi et al., 2007; Mathias, 2007)....

    [...]

  • ...…translation has been widely used in spoken language translation as an alternative to selecting the single-best hypothesis from an ASR system and translating it (Ney, 1999; Casacuberta et al., 2004; Zhang et al., 2005; Saleem et al., 2005; Matusov et al., 2005; Bertoldi et al., 2007; Mathias, 2007)....

    [...]

Posted Content
TL;DR: This article presented a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another, and achieved state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task.
Abstract: We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention architecture that has previously been used for speech recognition and show that it can be repurposed for this more complex task, illustrating the power of attention-based models. A single model trained end-to-end obtains state-of-the-art performance on the Fisher Callhome Spanish-English speech translation task, outperforming a cascade of independently trained sequence-to-sequence speech recognition and machine translation models by 1.8 BLEU points on the Fisher test set. In addition, we find that making use of the training data in both languages by multi-task training sequence-to-sequence speech translation and recognition models with a shared encoder network can improve performance by a further 1.4 BLEU points.

163 citations

01 Jan 2013
TL;DR: The Fisher and Callhome Spanish-English Speech Translation Corpus is introduced, supplementing existing LDC audio and transcripts with ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and English translations obtained on Amazon’s Mechanical Turk.
Abstract: Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For SpanishEnglish translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callhome Spanish-English Speech Translation Corpus, supplementing existing LDC audio and transcripts with (a) ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and (b) English translations obtained on Amazon’s Mechanical Turk. The result is a four-way parallel dataset of Spanish audio, transcriptions, ASR lattices, and English translations of approximately 163 hours of speech,1 with defined training, development, and held-out test sets. There are four translations for the Fisher evaluation data (dev and test), and a single reference for the Fisher training data and all Callhome data. We conduct baseline machine translation experiments using models trained on the provided training data, and validate the dataset by corroborating a number of known results in the field, including the utility of in-domain (information, conversational) training data, increased performance translating lattices (instead of recognizer 1-best output), and the relationship between word error rate and BLEU score.

135 citations


Cites methods from "On the integration of speech recogn..."

  • ...Previous in speech-to-text translation have demonstrated success in translating ASR n-best lists [7] and confusion networks1 [8], and lattices [9, 10]....

    [...]

Proceedings ArticleDOI
01 Jun 2018
TL;DR: The authors explore multitask models for neural translation of speech, augmenting them in order to reflect two intuitive notions: transitivity and invertibility, and show that the application of these notions on jointly trained models improves performance.
Abstract: We explore multitask models for neural translation of speech, augmenting them in order to reflect two intuitive notions. First, we introduce a model where the second task decoder receives information from the decoder of the first task, since higher-level intermediate representations should provide useful information. Second, we apply regularization that encourages transitivity and invertibility. We show that the application of these notions on jointly trained models improves performance on the tasks of low-resource speech transcription and translation. It also leads to better performance when using attention information for word discovery over unsegmented input.

124 citations

References
More filters
Proceedings ArticleDOI
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

21,126 citations

Journal ArticleDOI
TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
Abstract: We present and compare various methods for computing word alignments using statistical or heuristic models. We consider the five alignment models presented in Brown, Della Pietra, Della Pietra, and Mercer (1993), the hidden Markov alignment model, smoothing techniques, and refinements. These statistical models are compared with two heuristic models based on the Dice coefficient. We present different methods for combining word alignments to perform a symmetrization of directed statistical alignment models. As evaluation criterion, we use the quality of the resulting Viterbi alignment compared to a manually produced reference alignment. We evaluate the models on the German-English Verbmobil task and the French-English Hansards task. We perform a detailed analysis of various design decisions of our statistical alignment system and evaluate these on training corpora of various sizes. An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models. In the Appendix, we present an efficient training algorithm for the alignment models presented.

4,402 citations


"On the integration of speech recogn..." refers background in this paper

  • ...We attribute this to the undertrained translation model (the error rates are already rather high for the correct Spanish input)....

    [...]

  • ...On the LC-STAR task, the Spanish speech recognition system had an accuracy of less than 70 % (see Table 1) due to limited training data, low sampling rate and large speaker variability....

    [...]

  • ...When we use exactly the same Spanish word lattices and translate to Catalan, we reach an improvement of over 12 % relative in WER over the translation of the single best input (Table 5)....

    [...]

  • ...Here, the translation model is more robust since the difference in structure and word order between Spanish and Catalan is much smaller than between Spanish and English....

    [...]

  • ...It is significantly er than the BTEC task and has evolved from one of the uropean-funded speech translation projects. he third task is a Spanish-English and Spanish-Catalan ation task....

    [...]

Proceedings ArticleDOI
24 Mar 2002
TL;DR: NIST commissioned NIST to develop an MT evaluation facility based on the IBM work, which is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
Abstract: Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive and time-consuming and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM described an automatic MT evaluation technique that can provide immediate feedback and guidance in MT research. Their idea, which they call an "evaluation understudy", compares MT output with expert reference translations in terms of the statistics of short sequences of words (word N-grams). The more of these N-grams that a translation shares with the reference translations, the better the translation is judged to be. The idea is elegant in its simplicity. But far more important, IBM showed a strong correlation between these automatically generated scores and human judgments of translation quality. As a result, DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work. This utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.

1,734 citations

Journal ArticleDOI
TL;DR: WFSTs provide a common and natural representation for hidden Markov models (HMMs), context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs, and general transducer operations combine these representations flexibly and efficiently.
Abstract: We survey the use of weighted finite-state transducers (WFSTs) in speech recognition. We show that WFSTs provide a common and natural representation for hidden Markov models (HMMs), context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs. Furthermore, general transducer operations combine these representations flexibly and efficiently. Weighted determinization and minimization algorithms optimize their time and space requirements, and a weight pushing algorithm distributes the weights along the paths of a weighted transducer optimally for speech recognition. As an example, we describe a North American Business News (NAB) recognition system built using these techniques that combines the HMMs, full cross-word triphones, a lexicon of 40 000 words, and a large trigram grammar into a single weighted transducer that is only somewhat larger than the trigram word grammar and that runs NAB in real-time on a very simple decoder. In another example, we show that the same techniques can be used to optimize lattices for second-pass recognition. In a third example, we show how general automata operations can be used to assemble lattices from different recognizers to improve recognition performance.

908 citations

01 Jan 2000
TL;DR: The use of weighted finite-state transducers (WFSTs) in speech recognition has been extensively studied in this article, where the authors show that WFSTs provide a common and natural representation for hidden Markov models (HMMs), context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs.
Abstract: We survey the use of weighted finite-state transducers (WFSTs) in speech recognition. We show that WFSTs provide a common and natural representation for hidden Markov models (HMMs), context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs. Furthermore, general transducer operations combine these representations flexibly and efficiently. Weighted determinization and minimization algorithms optimize their time and space requirements, and a weight pushing algorithm distributes the weights along the paths of a weighted transducer optimally for speech recognition. As an example, we describe a North American Business News (NAB) recognition system built using these techniques that combines the HMMs, full cross-word triphones, a lexicon of 40 000 words, and a large trigram grammar into a single weighted transducer that is only somewhat larger than the trigram word grammar and that runs NAB in real-time on a very simple decoder. In another example, we show that the same techniques can be used to optimize lattices for second-pass recognition. In a third example, we show how general automata operations can be used to assemble lattices from different recognizers to improve recognition performance.

871 citations