
Showing papers by Kevin Duh published in 2019


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This work interprets the drop in general-domain performance during continued training as catastrophic forgetting of general-domain knowledge and adapts Elastic Weight Consolidation (EWC)—a machine learning method for learning a new task without forgetting previous tasks—to mitigate it.
Abstract: Continued training is an effective method for domain adaptation in neural machine translation. However, in-domain gains from adaptation come at the expense of general-domain performance. In this work, we interpret the drop in general-domain performance as catastrophic forgetting of general-domain knowledge. To mitigate it, we adapt Elastic Weight Consolidation (EWC)—a machine learning method for learning a new task without forgetting previous tasks. Our method retains the majority of general-domain performance lost in continued training without degrading in-domain performance, outperforming the previous state-of-the-art. We also explore the full range of general-domain performance available when some in-domain degradation is acceptable.

128 citations
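To make the mechanism concrete, here is a minimal sketch of the EWC idea described above, written in PyTorch. The diagonal-Fisher importance estimate and the weighting hyperparameter `lam` are standard EWC choices assumed for illustration, not necessarily the paper's exact setup.

```python
# Minimal EWC sketch (assumption, not the authors' implementation): penalize
# moving parameters away from their general-domain values, weighted by an
# estimate of each parameter's importance (diagonal Fisher information).
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    """Estimate per-parameter importance on general-domain data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, general_params, fisher, lam=1.0):
    """Penalize movement of important parameters away from general-domain values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - general_params[n]) ** 2).sum()
    return lam * penalty

# During continued training on in-domain data, the total loss would roughly be:
#   loss = in_domain_loss + ewc_penalty(model, general_params, fisher)
```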


Proceedings ArticleDOI
01 Jul 2019
TL;DR: This work proposes an attention-based model that treats AMR parsing as sequence-to-graph transduction; the parser is aligner-free and can be effectively trained with limited amounts of labeled AMR data.
Abstract: We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).

107 citations


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This article introduces a curriculum learning approach to adapt generic NMT models to a specific domain, where samples are grouped by their similarities to the domain of interest and each group is fed to the training algorithm with a particular schedule.
Abstract: We introduce a curriculum learning approach to adapt generic neural machine translation models to a specific domain. Samples are grouped by their similarities to the domain of interest and each group is fed to the training algorithm with a particular schedule. This approach is simple to implement on top of any neural framework or architecture, and consistently outperforms both unadapted and adapted baselines in experiments with two distinct domains and two language pairs.

78 citations
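As a rough illustration of the approach described above, the sketch below groups training samples by a domain-similarity score and feeds progressively larger subsets to training. The scoring function and the particular schedule are assumptions for illustration, not the paper's exact recipe.

```python
# Curriculum sketch (assumed details): order samples by similarity to the
# target domain, split them into groups, and widen the training pool over time.
import numpy as np

def build_curriculum(samples, similarity_scores, num_groups=4):
    """Split training samples into groups ordered by similarity to the domain."""
    order = np.argsort(similarity_scores)[::-1]        # most similar first
    groups = np.array_split(order, num_groups)
    return [[samples[i] for i in g] for g in groups]

def schedule(groups, num_phases=4):
    """Each phase trains on the most-similar groups seen so far (one possible schedule)."""
    for phase in range(num_phases):
        active = [s for g in groups[: phase + 1] for s in g]
        yield phase, active

samples = ["sent pair 1", "sent pair 2", "sent pair 3", "sent pair 4"]
scores = [0.9, 0.1, 0.7, 0.4]   # e.g., a Moore-Lewis style domain-similarity score
for phase, data in schedule(build_curriculum(samples, scores, num_groups=2), num_phases=2):
    print(phase, data)
```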


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This article proposes an attention-based neural transducer that incrementally builds a meaning representation via a sequence of semantic relations and can be effectively trained without relying on a pre-trained aligner.
Abstract: We unify different broad-coverage semantic parsing tasks into a transduction parsing paradigm, and propose an attention-based neural transducer that incrementally builds meaning representation via a sequence of semantic relations. By leveraging multiple attention mechanisms, the neural transducer can be effectively trained without relying on a pre-trained aligner. Experiments separately conducted on three broad-coverage semantic parsing tasks – AMR, SDP and UCCA – demonstrate that our attention-based neural transducer improves the state of the art on both AMR and UCCA, and is competitive with the state of the art on SDP.

72 citations
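One building block such a transducer could use for predicting semantic relations is a biaffine scorer over node representations. The NumPy sketch below is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch: score candidate head attachments between predicted
# nodes with a biaffine function and pick the highest-scoring head per node.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 5, 16
H = rng.standard_normal((num_nodes, dim))     # representations of predicted nodes

U = rng.standard_normal((dim, dim))           # biaffine weight matrix
b = rng.standard_normal(num_nodes)            # per-head bias

# edge_scores[i, j] = score of node j attaching to head i
edge_scores = H @ U @ H.T + b[:, None]
heads = edge_scores.argmax(axis=0)            # greedy head choice per node
print(heads)
```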


Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, the authors propose a multilingual end-to-end speech translation (ST) model, in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture.
Abstract: In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture. While multilingual models have been shown to be useful for automatic speech recognition (ASR) and machine translation (MT), this is the first time they are applied to the end-to-end ST problem. We show the effectiveness of multilingual end-to-end ST in two scenarios: one-to-many and many-to-many translations with publicly available data. We experimentally confirm that multilingual end-to-end ST models significantly outperform bilingual ones in both scenarios. The generalization of multilingual training is also evaluated in a transfer learning scenario to a very low-resource language pair. All of our codes and the database are publicly available to encourage further research in this emergent multilingual ST topic (available at https://github.com/espnet/espnet).

56 citations
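A common way to realize a universal sequence-to-sequence model for many target languages is to mark the desired output language with a special tag. The snippet below illustrates that idea; it is an assumption about the setup, not necessarily the exact mechanism used in this paper.

```python
# Sketch: prepend a target-language tag so a single sequence-to-sequence model
# can serve one-to-many and many-to-many translation directions.
def add_language_tag(target_tokens, target_lang):
    return [f"<2{target_lang}>"] + target_tokens

print(add_language_tag(["guten", "tag"], "de"))   # ['<2de>', 'guten', 'tag']
```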


Posted Content
TL;DR: This work introduces a curriculum learning approach to adapt generic neural machine translation models to a specific domain; the approach consistently outperforms both unadapted and adapted baselines in experiments with two distinct domains and two language pairs.
Abstract: We introduce a curriculum learning approach to adapt generic neural machine translation models to a specific domain. Samples are grouped by their similarities to the domain of interest and each group is fed to the training algorithm with a particular schedule. This approach is simple to implement on top of any neural framework or architecture, and consistently outperforms both unadapted and adapted baselines in experiments with two distinct domains and two language pairs.

52 citations


01 Aug 2019
TL;DR: In this paper, the authors conduct a systematic exploration of different numbers of BPE merge operations to understand how this choice interacts with the model architecture, the strategy to build vocabularies, and the language pair.
Abstract: Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the choice of number of merge operations is generally made by following existing recipes. In this paper, we conduct a systematic exploration on different numbers of BPE merge operations to understand how it interacts with the model architecture, the strategy to build vocabularies and the language pair. Our exploration could provide guidance for selecting proper BPE configurations in the future. Most prominently: we show that for LSTM-based architectures, it is necessary to experiment with a wide range of different BPE operations as there is no typical optimal BPE configuration, whereas for Transformer architectures, smaller BPE size tends to be a typically optimal choice. We urge the community to make prudent choices with subword merge operations, as our experiments indicate that a sub-optimal BPE configuration alone could easily reduce the system performance by 3-4 BLEU points.

47 citations
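A sweep over merge operations can be scripted directly with the subword-nmt package; the sketch below shows one way to do it. The file names and candidate values are placeholders, and each setting would then be used to train and evaluate a separate NMT system.

```python
# Sketch of sweeping the number of BPE merge operations; the paper's point is
# that the best setting depends on the architecture and language pair, so it
# is worth searching rather than copying an existing recipe.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

for merge_ops in [1000, 8000, 32000]:
    with open("train.src", encoding="utf-8") as fin, \
         open(f"codes.{merge_ops}", "w", encoding="utf-8") as fout:
        learn_bpe(fin, fout, merge_ops)          # learn merges from training text
    with open(f"codes.{merge_ops}", encoding="utf-8") as codes:
        bpe = BPE(codes)
    print(merge_ops, bpe.process_line("an example sentence to segment"))
    # ...train an NMT system per setting and compare BLEU on a held-out set
```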


Posted Content
TL;DR: An attention-based neural transducer is proposed that incrementally builds a meaning representation via a sequence of semantic relations; it improves the state of the art on both AMR and UCCA and is competitive with the state of the art on SDP.
Abstract: We unify different broad-coverage semantic parsing tasks under a transduction paradigm, and propose an attention-based neural framework that incrementally builds a meaning representation via a sequence of semantic relations. By leveraging multiple attention mechanisms, the transducer can be effectively trained without relying on a pre-trained aligner. Experiments conducted on three separate broad-coverage semantic parsing tasks – AMR, SDP and UCCA – demonstrate that our attention-based neural transducer improves the state of the art on both AMR and UCCA, and is competitive with the state of the art on SDP.

29 citations


Posted Content
TL;DR: It is experimentally confirmed that multilingual end-to-end ST models significantly outperform bilingual ones in both scenarios, and the generalization of multilingual training is also evaluated in a transfer learning scenario with a very low-resource language pair.
Abstract: In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture. While multilingual models have been shown to be useful for automatic speech recognition (ASR) and machine translation (MT), this is the first time they are applied to the end-to-end ST problem. We show the effectiveness of multilingual end-to-end ST in two scenarios: one-to-many and many-to-many translations with publicly available data. We experimentally confirm that multilingual end-to-end ST models significantly outperform bilingual ones in both scenarios. The generalization of multilingual training is also evaluated in a transfer learning scenario to a very low-resource language pair. All of our codes and the database are publicly available to encourage further research in this emergent multilingual ST topic.

26 citations


01 Aug 2019
TL;DR: A method is introduced for automatically predicting whether translated segments are fluently inadequate, predicting fluency with grammaticality scores and adequacy by augmenting sentence BLEU with a novel Bag-of-Vectors Sentence Similarity (BVSS).
Abstract: With the impressive fluency of modern machine translation output, systems may produce output that is fluent but not adequate (fluently inadequate). We seek to identify these errors and quantify their frequency in MT output of varying quality. To that end, we introduce a method for automatically predicting whether translated segments are fluently inadequate by predicting fluency using grammaticality scores and predicting adequacy by augmenting sentence BLEU with a novel Bag-of-Vectors Sentence Similarity (BVSS). We then apply this technique to analyze the outputs of statistical and neural systems for six language pairs with different levels of translation quality. We find that neural models are consistently more prone to this type of error than traditional statistical models. However, improving the overall quality of the MT system, such as through domain adaptation, reduces these errors.

22 citations
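One plausible reading of a bag-of-vectors sentence similarity is a cosine between averaged word vectors of the hypothesis and the reference. The sketch below follows that interpretation; it is an assumption, not the paper's exact formulation.

```python
# Rough bag-of-vectors similarity sketch: average the word vectors of each
# sentence and compare the two averages with cosine similarity.
import numpy as np

def bag_of_vectors(tokens, embeddings, dim=100):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def bvss(hyp_tokens, ref_tokens, embeddings):
    h = bag_of_vectors(hyp_tokens, embeddings)
    r = bag_of_vectors(ref_tokens, embeddings)
    denom = np.linalg.norm(h) * np.linalg.norm(r)
    return float(h @ r / denom) if denom else 0.0

# A segment could then be flagged "fluently inadequate" when a fluency score
# (e.g., from a grammaticality model) is high but the adequacy score is low.
```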


Posted Content
TL;DR: This work tests the common hypothesis that SLKD addresses a capacity deficiency in students by "simplifying" noisy data points, finds it unlikely in this case, and proposes an alternative hypothesis under the lens of data augmentation and regularization.
Abstract: Sequence-level knowledge distillation (SLKD) is a model compression technique that leverages large, accurate teacher models to train smaller, under-parameterized student models. Why does pre-processing MT data with SLKD help us train smaller models? We test the common hypothesis that SLKD addresses a capacity deficiency in students by "simplifying" noisy data points and find it unlikely in our case. Models trained on concatenations of original and "simplified" datasets generalize just as well as baseline SLKD. We then propose an alternative hypothesis under the lens of data augmentation and regularization. We try various augmentation strategies and observe that dropout regularization can become unnecessary. Our methods achieve BLEU gains of 0.7-1.2 on TED Talks.
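The concatenation experiment mentioned above amounts to simple corpus-level data augmentation: training the student on the original references together with the teacher's translations of the same source sentences. A minimal sketch follows; the file names are placeholders.

```python
# Sketch of the concatenation setup tested in the paper: join the original
# parallel data with the teacher-distilled version into one training file.
def concat_corpora(original_path, distilled_path, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for path in (original_path, distilled_path):
            with open(path, encoding="utf-8") as f:
                out.writelines(f)

concat_corpora("train.orig.tsv", "train.distilled.tsv", "train.augmented.tsv")
```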

Journal ArticleDOI
TL;DR: This work defines the membership inference problem for sequence generation, provides an open dataset based on state-of-the-art machine translation models, and reports initial results on whether these models leak private information against several kinds of membership inference attacks.
Abstract: Data privacy is an important issue for "machine learning as a service" providers. We focus on the problem of membership inference attacks: given a data sample and black-box access to a model's API, determine whether the sample existed in the model's training data. Our contribution is an investigation of this problem in the context of sequence-to-sequence models, which are important in applications such as machine translation and video captioning. We define the membership inference problem for sequence generation, provide an open dataset based on state-of-the-art machine translation models, and report initial results on whether these models leak private information against several kinds of membership inference attacks.
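As a point of reference, one very simple membership-inference baseline thresholds a per-sample score, such as the similarity between the model's output and the reference. The sketch below illustrates that idea; it is an assumption, not one of the paper's attack models.

```python
# Threshold-attack sketch: flag a sample as "seen in training" when the model
# reproduces its reference unusually well, i.e., above a chosen score threshold.
def threshold_attack(scores, threshold=0.6):
    """scores: per-sample similarity between model output and reference (0-1)."""
    return [score >= threshold for score in scores]

print(threshold_attack([0.95, 0.31, 0.72], threshold=0.6))  # [True, False, True]
```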

01 Aug 2019
TL;DR: A robust document representation is proposed that combines N-best translations and a novel bag-of-phrases output from various ASR/MT systems; results demonstrate that this richer document representation can consistently overcome low translation accuracy for CLIR in low-resource settings.
Abstract: The goal of cross-lingual information retrieval (CLIR) is to find relevant documents written in languages different from that of the query. Robustness to translation errors is one of the main challenges for CLIR, especially in low-resource settings where there is limited training data for building machine translation (MT) systems or bilingual dictionaries. If the test collection contains speech documents, additional errors from automatic speech recognition (ASR) make translation even more difficult. We propose a robust document representation that combines N-best translations and a novel bag-of-phrases output from various ASR/MT systems. We perform a comprehensive empirical analysis on three challenging collections; they consist of Somali, Swahili, and Tagalog speech/text documents to be retrieved by English queries. By comparing various ASR/MT systems with different error profiles, our results demonstrate that a richer document representation can consistently overcome issues in low translation accuracy for CLIR in low-resource settings.
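A rough sketch of the document-representation idea: pool terms from several systems' N-best translations together with a bag of phrases into a single term bag to be indexed. The pooling details below (lowercasing, uniform weighting) are assumptions for illustration.

```python
# Sketch: combine N-best translations from multiple ASR/MT systems with a
# bag of phrases into one term bag that a retrieval engine could index.
from collections import Counter

def combined_representation(nbest_translations, phrase_bag):
    terms = Counter()
    for hyp in nbest_translations:                  # N-best hypotheses from several systems
        terms.update(hyp.lower().split())
    terms.update(p.lower() for p in phrase_bag)     # phrases kept as single units
    return terms

doc = combined_representation(
    ["the market opened today", "market opens today"],
    ["stock market", "opening bell"],
)
print(doc.most_common(3))
```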

Posted Content
24 May 2019
TL;DR: The authors conduct a systematic exploration of different Byte-Pair Encoding (BPE) merge operations to understand how it interacts with the model architecture, the strategy to build vocabularies and the language pair.
Abstract: Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the choice of number of merge operations is generally made by following existing recipes. In this paper, we conduct a systematic exploration of different BPE merge operations to understand how it interacts with the model architecture, the strategy to build vocabularies and the language pair. Our exploration could provide guidance for selecting proper BPE configurations in the future. Most prominently: we show that for LSTM-based architectures, it is necessary to experiment with a wide range of different BPE operations as there is no typical optimal BPE configuration, whereas for Transformer architectures, smaller BPE size tends to be a typically optimal choice. We urge the community to make prudent choices with subword merge operations, as our experiments indicate that a sub-optimal BPE configuration alone could easily reduce the system performance by 3-4 BLEU points.

Journal ArticleDOI
TL;DR: This work proposes to tune the meta-parameters of a whole large vocabulary speech recognition system using the evolution strategy with a multi-objective Pareto optimization and makes use of parallel computation on cloud computers.
Abstract: State-of-the-art large vocabulary speech recognition systems consist of several components, including hidden Markov models and deep neural networks. To realize the highest recognition performance, numerous meta-parameters specifying the designs and training setups of these components must be optimized. A prominent obstacle in system development is the laborious effort required by human experts in tuning these meta-parameters. To automate the process, we propose to tune the meta-parameters of a whole large vocabulary speech recognition system using the evolution strategy with a multi-objective Pareto optimization. As a result of the evolution, the system is optimized for both low word error rate and compact model size. Since the approach requires repeated training and evaluation of recognition systems, which demands large amounts of computation, we make use of parallel computation on cloud computers. Experimental results show the effectiveness of the proposed approach by automatically discovering appropriate configurations for large vocabulary speech recognition systems.
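The multi-objective selection step can be illustrated with a simple Pareto filter over word error rate and model size. The sketch below is only that selection step, not the full evolution strategy used in the paper, and the example numbers are invented.

```python
# Pareto-front sketch: keep configurations not dominated on both objectives
# (lower word error rate and smaller model size are both better).
def pareto_front(configs):
    """configs: list of (name, wer, size); lower is better for both."""
    front = []
    for name, wer, size in configs:
        dominated = any(w <= wer and s <= size and (w < wer or s < size)
                        for _, w, s in configs)
        if not dominated:
            front.append((name, wer, size))
    return front

print(pareto_front([("A", 10.2, 50e6), ("B", 9.8, 80e6), ("C", 10.5, 90e6)]))
# [('A', 10.2, 50000000.0), ('B', 9.8, 80000000.0)] -- C is dominated by A
```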

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This work presents the HABLex dataset, designed to test methods for bilingual lexicon integration into neural machine translation, and presents two simple baselines - constrained decoding and continued training - and an improvement to continued training to address overfitting.
Abstract: Bilingual lexicons are valuable resources used by professional human translators. While these resources can be easily incorporated in statistical machine translation, it is unclear how to best do so in the neural framework. In this work, we present the HABLex dataset, designed to test methods for bilingual lexicon integration into neural machine translation. Our data consists of human generated alignments of words and phrases in machine translation test sets in three language pairs (Russian-English, Chinese-English, and Korean-English), resulting in clean bilingual lexicons which are well matched to the reference. We also present two simple baselines - constrained decoding and continued training - and an improvement to continued training to address overfitting.

Posted Content
TL;DR: The authors propose an attention-based model that treats AMR parsing as sequence-to-graph transduction, which can be effectively trained with a limited amount of labeled AMR data.
Abstract: We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).

Proceedings ArticleDOI
01 Aug 2019
TL;DR: The goal was to evaluate the performance of baseline systems on both the official noisy test set as well as news data, in order to ensure that performance gains in the latter did not come at the expense of general-domain performance.
Abstract: We describe the JHU submissions to the French–English, Japanese–English, and English–Japanese Robustness Task at WMT 2019. Our goal was to evaluate the performance of baseline systems on both the official noisy test set and news data, in order to ensure that performance gains in the latter did not come at the expense of general-domain performance. To this end, we built straightforward 6-layer Transformer models and experimented with a handful of variables including subword processing (FR→EN) and a handful of hyperparameter settings (JA↔EN). As expected, our systems performed reasonably.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: Interestingly, word embeddings provided no consistent benefit, and ensembling struggled to outperform the best component submodel, which suggests the various architectures are learning redundant information; future work may focus on encouraging decorrelated learning.
Abstract: Our submission to the MADAR shared task on Arabic dialect identification employed a language modeling technique called Prediction by Partial Matching, an ensemble of neural architectures, and sources of additional data for training word embeddings and auxiliary language models. We found several of these techniques provided small boosts in performance, though a simple character-level language model was a strong baseline, and a lower-order LM achieved best performance on Subtask 2. Interestingly, word embeddings provided no consistent benefit, and ensembling struggled to outperform the best component submodel. This suggests the various architectures are learning redundant information, and future work may focus on encouraging decorrelated learning.


Posted Content
TL;DR: This paper conducts a systematic exploration of different BPE merge operations to understand how it interacts with the model architecture, the strategy to build vocabularies and the language pair, and could provide guidance for selecting proper BPE configurations in the future.
Abstract: Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the choice of number of merge operations is generally made by following existing recipes. In this paper, we conduct a systematic exploration on different numbers of BPE merge operations to understand how it interacts with the model architecture, the strategy to build vocabularies and the language pair. Our exploration could provide guidance for selecting proper BPE configurations in the future. Most prominently: we show that for LSTM-based architectures, it is necessary to experiment with a wide range of different BPE operations as there is no typical optimal BPE configuration, whereas for Transformer architectures, smaller BPE size tends to be a typically optimal choice. We urge the community to make prudent choices with subword merge operations, as our experiments indicate that a sub-optimal BPE configuration alone could easily reduce the system performance by 3-4 BLEU points.

Proceedings ArticleDOI
01 Jun 2019
TL;DR: This work explores under which conditions it is beneficial to perform dialect identification for Arabic neural machine translation versus using a general system for all dialects.
Abstract: When translating diglossic languages such as Arabic, situations may arise where we would like to translate a text but do not know which dialect it is. A traditional approach to this problem is to design dialect identification systems and dialect-specific machine translation systems. However, under the recent paradigm of neural machine translation, shared multi-dialectal systems have become a natural alternative. Here we explore under which conditions it is beneficial to perform dialect identification for Arabic neural machine translation versus using a general system for all dialects.

Posted Content
TL;DR: This work investigates expansions based on word embeddings, DBpedia concept linking, and hypernyms, and shows that they outperform existing state-of-the-art methods on the cross-language question re-ranking shared task.
Abstract: Community question-answering (CQA) platforms have become very popular forums for asking and answering questions daily. While these forums are rich repositories of community knowledge, they present challenges for finding relevant answers and similar questions, due to the open-ended nature of informal discussions. Further, if the platform allows questions and answers in multiple languages, we are faced with the additional challenge of matching cross-lingual information. In this work, we focus on the cross-language question re-ranking shared task, which aims to find existing questions that may be written in different languages. Our contribution is an exploration of query expansion techniques for this problem. We investigate expansions based on word embeddings, DBpedia concept linking, and hypernyms, and show that they outperform existing state-of-the-art methods.
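The word-embedding-based expansion, one of the three strategies listed above, can be sketched as adding each query term's nearest neighbours in an embedding space. The embedding source and the neighbour count below are assumptions for illustration.

```python
# Sketch of embedding-based query expansion: for each query token, add its k
# most similar vocabulary words (by cosine similarity) to the query.
import numpy as np

def expand_query(query_tokens, embeddings, vocab, k=2):
    """embeddings: dict word -> vector; vocab: candidate expansion words."""
    expanded = list(query_tokens)
    for tok in query_tokens:
        if tok not in embeddings:
            continue
        q = embeddings[tok]
        sims = {w: float(q @ embeddings[w] /
                         (np.linalg.norm(q) * np.linalg.norm(embeddings[w]) + 1e-9))
                for w in vocab if w != tok and w in embeddings}
        expanded += [w for w, _ in sorted(sims.items(), key=lambda x: -x[1])[:k]]
    return expanded
```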

Posted Content
TL;DR: In this article, the problem of membership inference for sequence-to-sequence models is investigated in the context of machine translation and video captioning, and an open dataset based on state-of-the-art machine translation models is provided.
Abstract: Data privacy is an important issue for "machine learning as a service" providers. We focus on the problem of membership inference attacks: given a data sample and black-box access to a model's API, determine whether the sample existed in the model's training data. Our contribution is an investigation of this problem in the context of sequence-to-sequence models, which are important in applications such as machine translation and video captioning. We define the membership inference problem for sequence generation, provide an open dataset based on state-of-the-art machine translation models, and report initial results on whether these models leak private information against several kinds of membership inference attacks.

Proceedings ArticleDOI
22 Feb 2019
TL;DR: This work presents a hands-on activity in which students build and evaluate their own MT systems using curated parallel texts; students gain intuition about why early MT research took this approach, where it fails, and what features of language make MT a challenging problem even today.
Abstract: The first step in the research process is developing an understanding of the problem at hand. Novices may be interested in learning about machine translation (MT), but often lack experience and intuition about the task of translation (either by human or machine) and its challenges. The goal of this work is to allow students to interactively discover why MT is an open problem, and encourage them to ask questions, propose solutions, and test intuitions. We present a hands-on activity in which students build and evaluate their own MT systems using curated parallel texts. By having students hand-engineer MT system rules in a simple user interface, which they can then run on real data, they gain intuition about why early MT research took this approach, where it fails, and what features of language make MT a challenging problem even today. Developing translation rules typically strikes novices as an obvious approach that should succeed, but the idea quickly struggles in the face of natural language complexity. This interactive, intuition-building exercise can be augmented by a discussion of state-of-the-art MT techniques and challenges, focusing on areas or aspects of linguistic complexity that the students found difficult. We envision this lesson plan being used in the framework of a larger AI or natural language processing course (where only a small amount of time can be dedicated to MT) or as a standalone activity. We describe and release the tool that supports this lesson, as well as accompanying data.
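A toy version of the rules students write in this kind of activity is a word-for-word substitution table. The example below is invented for illustration: it works on a simple sentence and breaks down quickly on word order, agreement, and ambiguity.

```python
# Toy rule-based "MT system": word-for-word dictionary substitution.
RULES = {"the": "la", "cat": "gata", "sleeps": "duerme"}

def rule_based_translate(sentence):
    return " ".join(RULES.get(tok, tok) for tok in sentence.lower().split())

print(rule_based_translate("The cat sleeps"))             # "la gata duerme"
print(rule_based_translate("Time flies like an arrow"))   # unknown words pass through untranslated
```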

01 Aug 2019
TL;DR: This article proposes a character-aware decoder to capture lower-level patterns of morphology when translating into morphologically rich languages by augmenting both the softmax and embedding layers of an attention-based encoder-decoder model with convolutional neural networks that operate on the spelling of a word.
Abstract: Neural machine translation (NMT) systems operate primarily on words (or sub-words), ignoring lower-level patterns of morphology. We present a character-aware decoder designed to capture such patterns when translating into morphologically rich languages. We achieve character-awareness by augmenting both the softmax and embedding layers of an attention-based encoder-decoder model with convolutional neural networks that operate on the spelling of a word. To investigate performance on a wide variety of morphological phenomena, we translate English into 14 typologically diverse target languages using the TED multi-target dataset. In this low-resource setting, the character-aware decoder provides consistent improvements with BLEU score gains of up to +3.05. In addition, we analyze the relationship between the gains obtained and properties of the target language and find evidence that our model does indeed exploit morphological patterns.
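A minimal PyTorch sketch of a character-aware word representation of this flavour follows: a convolution over character embeddings, max-pooled into a word vector that could augment a decoder's embedding and softmax layers. The hyperparameters are illustrative, not the paper's.

```python
# Char-CNN word embedding sketch: embed characters, convolve, max-pool.
import torch
import torch.nn as nn

class CharCNNWordEmbedding(nn.Module):
    def __init__(self, num_chars=100, char_dim=32, word_dim=64, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size=kernel, padding=1)

    def forward(self, char_ids):                           # (batch, max_word_len)
        x = self.char_emb(char_ids).transpose(1, 2)        # (batch, char_dim, len)
        return torch.relu(self.conv(x)).max(dim=2).values  # (batch, word_dim)

emb = CharCNNWordEmbedding()
print(emb(torch.randint(0, 100, (4, 12))).shape)           # torch.Size([4, 64])
```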