
Showing papers by "Kevin Duh published in 2018"


Posted Content
TL;DR: This work presents a large-scale dataset, ReCoRD, for machine reading comprehension requiring commonsense reasoning, and demonstrates that the performance of state-of-the-art MRC systems falls far behind human performance.
Abstract: We present a large-scale dataset, ReCoRD, for machine reading comprehension requiring commonsense reasoning. Experiments on this dataset demonstrate that the performance of state-of-the-art MRC systems falls far behind human performance. ReCoRD represents a challenge for future research to bridge the gap between human and machine commonsense reading comprehension. ReCoRD is available at this http URL.

252 citations


Proceedings ArticleDOI
01 Jul 2018
TL;DR: This paper proposes a stochastic answer network (SAN) that simulates multi-step reasoning in machine reading comprehension and achieves results competitive with the state of the art on several reading comprehension tasks.
Abstract: We propose a simple yet robust stochastic answer network (SAN) that simulates multi-step reasoning in machine reading comprehension. Compared to previous work such as ReasoNet which used reinforcement learning to determine the number of steps, the unique feature is the use of a kind of stochastic prediction dropout on the answer module (final layer) of the neural network during the training. We show that this simple trick improves robustness and achieves results competitive to the state-of-the-art on the Stanford Question Answering Dataset (SQuAD), the Adversarial SQuAD, and the Microsoft MAchine Reading COmprehension Dataset (MS MARCO).

150 citations
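The "stochastic prediction dropout" trick described above can be illustrated with a short sketch: each reasoning step's answer module emits a prediction, entire steps are randomly dropped during training, and the surviving step predictions are averaged. The function name, shapes, and drop rate below are illustrative assumptions, not the authors' code.

```python
import torch

def average_step_predictions(step_logits, drop_prob=0.4, training=True):
    """Average per-step answer distributions, randomly dropping entire steps
    during training (stochastic prediction dropout); at inference all steps
    are averaged. step_logits: list of [batch, num_classes] tensors."""
    probs = [torch.softmax(logits, dim=-1) for logits in step_logits]
    if training:
        kept = [p for p in probs if torch.rand(1).item() > drop_prob]
        if not kept:                  # always keep at least one step
            kept = [probs[-1]]
        return torch.stack(kept).mean(dim=0)
    return torch.stack(probs).mean(dim=0)
```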


Posted Content
TL;DR: A probabilistic view of curriculum learning is adopted, which lets us flexibly evaluate the impact of curricula design, and an extensive exploration on a German-English translation task shows it is possible to improve convergence time at no loss in translation quality.
Abstract: Machine translation systems based on deep neural networks are expensive to train. Curriculum learning aims to address this issue by choosing the order in which samples are presented during training to help train better models faster. We adopt a probabilistic view of curriculum learning, which lets us flexibly evaluate the impact of curricula design, and perform an extensive exploration on a German-English translation task. Results show that it is possible to improve convergence time at no loss in translation quality. However, results are highly sensitive to the choice of sample difficulty criteria, curriculum schedule and other hyperparameters.

87 citations
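As a rough illustration of a curriculum over sample difficulty, the sketch below draws minibatches from a pool of training examples that gradually grows from easiest to hardest. The phase schedule and the difficulty function are assumptions for illustration; the paper explores a range of difficulty criteria and schedules.

```python
import random

def curriculum_batches(samples, difficulty, num_phases=5,
                       batches_per_phase=1000, batch_size=64):
    """Yield minibatches under a simple curriculum: in phase p, sample
    uniformly from the easiest (p+1)/num_phases fraction of the data.
    `difficulty` maps an example to a scalar (e.g. sentence length or
    source-word rarity); higher means harder."""
    ordered = sorted(samples, key=difficulty)
    for phase in range(num_phases):
        cutoff = max(batch_size, len(ordered) * (phase + 1) // num_phases)
        pool = ordered[:cutoff]
        for _ in range(batches_per_phase):
            yield random.sample(pool, min(batch_size, len(pool)))
```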


Proceedings ArticleDOI
20 Jul 2018
TL;DR: This work adds an auxiliary term to the training objective during continued training that minimizes the cross entropy between the in-domain model's output word distribution and that of the out-of-domain model, to prevent the model's output from differing too much from the original out-of-domain model.
Abstract: Supervised domain adaptation—where a large generic corpus and a smaller in-domain corpus are both available for training—is a challenge for neural machine translation (NMT). Standard practice is to train a generic model and use it to initialize a second model, then continue training the second model on in-domain data to produce an in-domain model. We add an auxiliary term to the training objective during continued training that minimizes the cross entropy between the in-domain model’s output word distribution and that of the out-of-domain model to prevent the model’s output from differing too much from the original out-of-domain model. We perform experiments on EMEA (descriptions of medicines) and TED (rehearsed presentations), initialized from a general domain (WMT) model. Our method shows improvements over standard continued training by up to 1.5 BLEU.

66 citations
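The auxiliary objective described above can be sketched as the usual cross entropy against the gold words plus the cross entropy between the frozen out-of-domain model's output distribution and the in-domain model's distribution. Tensor shapes and the weight `lam` below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def regularized_loss(in_logits, out_logits, gold_ids, lam=0.1, pad_id=0):
    """Continued-training loss with an auxiliary term. Shapes:
    [batch, seq_len, vocab] for logits, [batch, seq_len] for gold_ids."""
    vocab = in_logits.size(-1)
    nll = F.cross_entropy(in_logits.view(-1, vocab), gold_ids.view(-1),
                          ignore_index=pad_id)
    # Cross entropy H(p_out, p_in) = -sum_w p_out(w) * log p_in(w)
    p_out = F.softmax(out_logits, dim=-1).detach()   # out-of-domain model is not updated
    log_p_in = F.log_softmax(in_logits, dim=-1)
    aux = -(p_out * log_p_in).sum(dim=-1).mean()
    return nll + lam * aux
```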


Posted Content
TL;DR: A stochastic answer network (SAN) is proposed to explore multi-step inference strategies in Natural Language Inference and achieves state-of-the-art results on three benchmarks.
Abstract: We propose a stochastic answer network (SAN) to explore multi-step inference strategies in Natural Language Inference. Rather than directly predicting the results given the inputs, the model maintains a state and iteratively refines its predictions. Our experiments show that SAN achieves state-of-the-art results on three benchmarks: the Stanford Natural Language Inference (SNLI) dataset, the Multi-Genre Natural Language Inference (MultiNLI) dataset, and the Quora Question Pairs dataset.

51 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: A large-scale dataset derived from Wikipedia is introduced to support CLIR research in 25 languages and a simple yet effective neural learning-to-rank model is presented that shares representations across languages and reduces the data requirement.
Abstract: Cross-lingual information retrieval (CLIR) is a document retrieval task where the documents are written in a language different from that of the user’s query. This is a challenging problem for data-driven approaches due to the general lack of labeled training data. We introduce a large-scale dataset derived from Wikipedia to support CLIR research in 25 languages. Further, we present a simple yet effective neural learning-to-rank model that shares representations across languages and reduces the data requirement. This model can exploit training data in, for example, Japanese-English CLIR to improve the results of Swahili-English CLIR.

49 citations
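A minimal sketch of a learning-to-rank model that shares representations across languages: one embedding table and one scoring function serve every query/document language pair, so relevance judgments from a high-resource pair (e.g. Japanese-English) update the same parameters used for a low-resource pair (e.g. Swahili-English). The specific architecture below (mean-pooled embeddings plus a bilinear scorer) is an assumption for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

class SharedCLIRRanker(nn.Module):
    """Shared embedding and scorer for all languages in a CLIR collection."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.score = nn.Bilinear(dim, dim, 1)

    def encode(self, token_ids):
        # Mean-pool word embeddings as a crude text representation.
        mask = (token_ids != 0).float().unsqueeze(-1)
        emb = self.embed(token_ids) * mask
        return emb.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, query_ids, doc_ids):
        return self.score(self.encode(query_ids), self.encode(doc_ids)).squeeze(-1)

# Training would pair each query with a relevant and an irrelevant document, e.g.
# loss = torch.relu(1 - model(q, d_pos) + model(q, d_neg)).mean()
```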


Proceedings ArticleDOI
01 Oct 2018
TL;DR: In this paper, the authors analyzed the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and considered each component's contribution to, and capacity for, domain adaptation.
Abstract: To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component’s contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, compared to a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.

33 citations
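Freezing a single component during continued training amounts to excluding its parameters from optimization. A toy sketch follows; the module names stand in for a real NMT model and are not the authors' code.

```python
import torch
import torch.nn as nn

# Toy stand-in for an NMT model with separately named components.
model = nn.ModuleDict({
    "src_embed": nn.Embedding(1000, 64),
    "encoder":   nn.GRU(64, 64, batch_first=True),
    "decoder":   nn.GRU(64, 64, batch_first=True),
    "tgt_embed": nn.Embedding(1000, 64),
})

frozen = "encoder"  # try freezing each component in turn
for name, param in model.named_parameters():
    param.requires_grad = not name.startswith(frozen + ".")

# Only the still-trainable parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```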


Proceedings ArticleDOI
16 Apr 2018
TL;DR: An expansion of video data from the IARPA Janus program, the Janus Multimedia dataset, is introduced, adding labels for voice to the already-existing face labels; experiments on it demonstrate the power of audiovisual fusion.
Abstract: Currently, datasets that support audio-visual recognition of people in videos are scarce and limited. In this paper, we introduce an expansion of video data from the IARPA Janus program to support this research area. We refer to the expanded set, which adds labels for voice to the already-existing face labels, as the Janus Multimedia dataset. We first describe the speaker labeling process, which involved a combination of automatic and manual criteria. We then discuss two evaluation settings for this data. In the core condition, the voice and face of the labeled individual are present in every video. In the full condition, no such guarantee is made. The power of audiovisual fusion is then shown using these publicly-available videos and labels, showing significant improvement over only recognizing voice or face alone. In addition to this work, several other possible paths for future research with this dataset are discussed.

27 citations


Proceedings ArticleDOI
01 Jan 2018
TL;DR: A form of decompositional semantic analysis designed to allow systems to target varying levels of structural complexity (shallow to deep analysis), an evaluation metric to measure the similarity between system output and reference semantic analysis, and an end-to-end model with a novel annotating mechanism that supports intra-sentential coreference are presented.
Abstract: We introduce the task of cross-lingual decompositional semantic parsing: mapping content provided in a source language into a decompositional semantic analysis based on a target language. We present: (1) a form of decompositional semantic analysis designed to allow systems to target varying levels of structural complexity (shallow to deep analysis), (2) an evaluation metric to measure the similarity between system output and reference semantic analysis, (3) an end-to-end model with a novel annotating mechanism that supports intra-sentential coreference, and (4) an evaluation dataset on which our model outperforms strong baselines by at least 1.75 F1 score.

24 citations


Posted Content
TL;DR: An extension of the Stochastic Answer Network (SAN), one of the state-of-the-art machine reading comprehension models, to be able to judge whether a question is unanswerable or not, is presented.
Abstract: This paper presents an extension of the Stochastic Answer Network (SAN), one of the state-of-the-art machine reading comprehension models, to be able to judge whether a question is unanswerable or not. The extended SAN contains two components: a span detector and a binary classifier for judging whether the question is unanswerable, and both components are jointly optimized. Experiments show that SAN achieves results competitive with the state of the art on the Stanford Question Answering Dataset (SQuAD) 2.0. To facilitate research in this field, we release our code: this https URL.

24 citations
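The two jointly optimized components (span detector and answerability classifier) suggest a combined objective along the lines sketched below; the shapes and weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(start_logits, end_logits, answerable_logit,
               gold_start, gold_end, gold_answerable, lam=1.0):
    """Span loss plus binary answerability loss. Shapes: start/end logits
    [batch, passage_len], answerable_logit [batch], gold_start/gold_end
    [batch] token indices, gold_answerable [batch] in {0, 1}."""
    span = F.cross_entropy(start_logits, gold_start) + \
           F.cross_entropy(end_logits, gold_end)
    answerable = F.binary_cross_entropy_with_logits(
        answerable_logit, gold_answerable.float())
    return span + lam * answerable
```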


Proceedings ArticleDOI
01 Oct 2018
TL;DR: The efforts of the Johns Hopkins University to develop neural machine translation systems for the shared task for news translation organized around the Conference for Machine Translation (WMT) 2018 are reported on.
Abstract: We report on the efforts of the Johns Hopkins University to develop neural machine translation systems for the shared task for news translation organized around the Conference for Machine Translation (WMT) 2018. We developed systems for German–English, English–German, and Russian–English. Our novel contributions are iterative back-translation and fine-tuning on test sets from prior years.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: It is found that word embeddings utilizing subword information consistently outperform standard word embeddings on a word similarity task and as initialization of the source word embeddings in a low-resource NMT system.
Abstract: Neural machine translation has achieved impressive results in the last few years, but its success has been limited to settings with large amounts of parallel data. One way to improve NMT for lower-resource settings is to initialize a word-based NMT model with pretrained word embeddings. However, rare words still suffer from lower quality word embeddings when trained with standard word-level objectives. We introduce word embeddings that utilize morphological resources, and compare to purely unsupervised alternatives. We work with Arabic, a morphologically rich language with available linguistic resources, and perform Ar-to-En MT experiments on a small corpus of TED subtitles. We find that word embeddings utilizing subword information consistently outperform standard word embeddings on a word similarity task and as initialization of the source word embeddings in a low-resource NMT system.
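Initializing a word-based NMT model with pretrained (subword-aware or otherwise) embeddings boils down to copying the available vectors into the source embedding matrix; a minimal sketch follows, where the function name and argument conventions are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

def init_source_embeddings(vocab, pretrained, dim=300):
    """Build an NMT source embedding layer from pretrained word vectors.
    `vocab` is a list of words; `pretrained` maps word -> np.ndarray of size
    `dim`; words missing from `pretrained` keep a random initialization."""
    weight = np.random.normal(0, 0.01, (len(vocab), dim)).astype("float32")
    for i, word in enumerate(vocab):
        if word in pretrained:
            weight[i] = pretrained[word]
    # freeze=False lets the embeddings keep training with the rest of the model.
    return nn.Embedding.from_pretrained(torch.from_numpy(weight), freeze=False)
```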

Proceedings ArticleDOI
01 Apr 2018
TL;DR: This paper proposes a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context (both document- and sentence-level information) than prior work, with further improvements gained by utilizing adaptive classification thresholds.
Abstract: Fine-grained entity typing is the task of assigning fine-grained semantic types to entity mentions. We propose a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context – both document and sentence level information – than prior work. We find that additional context improves performance, with further improvements gained by utilizing adaptive classification thresholds. Experiments show that our approach without reliance on hand-crafted features achieves the state-of-the-art results on three benchmark datasets.
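One common way to realize "adaptive classification thresholds" in multi-label entity typing is to tune a separate decision threshold per type on development data; the grid-search sketch below illustrates that idea and is not necessarily the paper's exact procedure.

```python
import numpy as np

def tune_adaptive_thresholds(dev_scores, dev_labels,
                             grid=np.linspace(0.05, 0.95, 19)):
    """Pick a per-type decision threshold by maximizing F1 on dev data.
    dev_scores and dev_labels are [num_examples, num_types] arrays of
    predicted probabilities and 0/1 gold labels."""
    num_types = dev_scores.shape[1]
    thresholds = np.full(num_types, 0.5)
    for t in range(num_types):
        best_f1 = -1.0
        for thr in grid:
            pred = dev_scores[:, t] >= thr
            gold = dev_labels[:, t] == 1
            tp = np.sum(pred & gold)
            prec = tp / max(pred.sum(), 1)
            rec = tp / max(gold.sum(), 1)
            f1 = 2 * prec * rec / max(prec + rec, 1e-9)
            if f1 > best_f1:
                best_f1, thresholds[t] = f1, thr
    return thresholds
```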

Posted Content
TL;DR: This work argues for a reconsideration of the charCNN, based on cross-lingual improvements on low-resource data, and finds that in most cases, using both BPE and a charCNN performs best, while in Hebrew, using a charCNN over words is best.
Abstract: Neural Machine Translation (NMT) in low-resource settings and of morphologically rich languages is made difficult in part by data sparsity of vocabulary words. Several methods have been used to help reduce this sparsity, notably Byte-Pair Encoding (BPE) and a character-based CNN layer (charCNN). However, the charCNN has largely been neglected, possibly because it has only been compared to BPE rather than combined with it. We argue for a reconsideration of the charCNN, based on cross-lingual improvements on low-resource data. We translate from 8 languages into English, using a multi-way parallel collection of TED transcripts. We find that in most cases, using both BPE and a charCNN performs best, while in Hebrew, using a charCNN over words is best.
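A charCNN layer of the kind discussed above composes a token representation from its characters with convolutions over the spelling followed by max-pooling; the hyperparameters in this sketch are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    """Compose a token embedding from its spelling with character-level
    convolutions and max-pooling over character positions."""
    def __init__(self, num_chars, char_dim=32, filters=64, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, filters, k, padding=k // 2) for k in kernel_sizes)
        self.out_dim = filters * len(kernel_sizes)

    def forward(self, char_ids):
        # char_ids: [num_tokens, max_word_len] -> [num_tokens, out_dim]
        x = self.char_embed(char_ids).transpose(1, 2)    # [tokens, char_dim, len]
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)
```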

Proceedings ArticleDOI
TL;DR: It is found that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed.
Abstract: To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component's contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, compared to a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.

Posted Content
TL;DR: This work achieves character-awareness by augmenting both the softmax and embedding layers of an attention-based encoder-decoder network with convolutional neural networks that operate on the spelling of a word (or subword).
Abstract: Standard neural machine translation (NMT) systems operate primarily on words, ignoring lower-level patterns of morphology. We present a character-aware decoder for NMT, designed to capture such patterns, that can simultaneously work with both word-level and subword-level sequences. We achieve character-awareness by augmenting both the softmax and embedding layers of an attention-based encoder-decoder network with convolutional neural networks that operate on the spelling of a word (or subword). While character-aware embeddings have been successfully used on the source side, we find that mixing character-aware embeddings with standard embeddings is crucial on the target side. Furthermore, we show that a simple approximate softmax layer can be used for large target-side vocabularies that would otherwise require prohibitively large memory. We experiment on the TED multi-target dataset, translating English into 14 typologically diverse languages. We find that in this low-resource setting, the character-aware decoder provides consistent improvements over word-level and subword-level counterparts, with BLEU score gains of up to +3.37.
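The abstract notes that mixing character-aware embeddings with standard embeddings is crucial on the target side. One plausible way to do such mixing is a learned gate, sketched below; the gating combination and the `char_encoder` interface are assumptions for illustration, since the abstract does not spell out the mixing function.

```python
import torch
import torch.nn as nn

class MixedTargetEmbedding(nn.Module):
    """Mix a standard lookup embedding with a character-composed embedding
    for each target (sub)word. `char_encoder` is any module mapping the
    character ids of each token to a vector of size `dim` (e.g. a small CNN
    over spellings)."""
    def __init__(self, vocab_size, dim, char_encoder):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.char_encoder = char_encoder
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, word_ids, char_ids):
        w = self.word_embed(word_ids)              # [batch, len, dim]
        c = self.char_encoder(char_ids)            # [batch, len, dim] (assumed)
        g = torch.sigmoid(self.gate(torch.cat([w, c], dim=-1)))
        return g * w + (1 - g) * c                 # gated mixture
```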

Posted Content
TL;DR: This work achieves character-awareness by augmenting both the softmax and embedding layers of an attention-based encoder-decoder model with convolutional neural networks that operate on the spelling of a word with evidence that the model does indeed exploit morphological patterns.
Abstract: Neural machine translation (NMT) systems operate primarily on words (or sub-words), ignoring lower-level patterns of morphology. We present a character-aware decoder designed to capture such patterns when translating into morphologically rich languages. We achieve character-awareness by augmenting both the softmax and embedding layers of an attention-based encoder-decoder model with convolutional neural networks that operate on the spelling of a word. To investigate performance on a wide variety of morphological phenomena, we translate English into 14 typologically diverse target languages using the TED multi-target dataset. In this low-resource setting, the character-aware decoder provides consistent improvements with BLEU score gains of up to $+3.05$. In addition, we analyze the relationship between the gains obtained and properties of the target language and find evidence that our model does indeed exploit morphological patterns.

Posted Content
TL;DR: It is suggested that pre-trained embeddings can be helpful if properly incorporated into NMT, especially when parallel data is limited or additional in-domain monolingual data is readily available.
Abstract: Using pre-trained word embeddings as input layer is a common practice in many natural language processing (NLP) tasks, but it is largely neglected for neural machine translation (NMT). In this paper, we conducted a systematic analysis on the effect of using pre-trained source-side monolingual word embedding in NMT. We compared several strategies, such as fixing or updating the embeddings during NMT training on varying amounts of data, and we also proposed a novel strategy called dual-embedding that blends the fixing and updating strategies. Our results suggest that pre-trained embeddings can be helpful if properly incorporated into NMT, especially when parallel data is limited or additional in-domain monolingual data is readily available.
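The "dual-embedding" strategy that blends fixing and updating can be sketched as two copies of the pretrained matrix, one frozen and one fine-tuned, whose outputs are combined; the summation below is an illustrative choice, and the paper's exact blending may differ.

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """One fixed copy of the pretrained embeddings plus one trainable copy,
    combined by summation."""
    def __init__(self, pretrained_weight):
        super().__init__()
        self.fixed = nn.Embedding.from_pretrained(pretrained_weight, freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained_weight.clone(), freeze=False)

    def forward(self, token_ids):
        return self.fixed(token_ids) + self.tuned(token_ids)
```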

Posted Content
TL;DR: This work presents a meaning representation designed to allow systems to target varying levels of structural complexity, an evaluation metric to measure the similarity between system output and reference meaning representations, an end-to-end model with a novel copy mechanism that supports intra-sentential coreference, and an evaluation dataset.
Abstract: We introduce the task of cross-lingual semantic parsing: mapping content provided in a source language into a meaning representation based on a target language. We present: (1) a meaning representation designed to allow systems to target varying levels of structural complexity (shallow to deep analysis), (2) an evaluation metric to measure the similarity between system output and reference meaning representations, (3) an end-to-end model with a novel copy mechanism that supports intrasentential coreference, and (4) an evaluation dataset where experiments show our model outperforms strong baselines by at least 1.18 F1 score.

Posted Content
TL;DR: In this paper, a training method is proposed that enforces the local region of each hidden state of a neural model to generate only target tokens with the same semantic structure tag, yielding better generalization in both high- and low-resource settings.
Abstract: Cross-lingual information extraction (CLIE) is an important and challenging task, especially in low resource scenarios. To tackle this challenge, we propose a training method, called Halo, which enforces the local region of each hidden state of a neural model to only generate target tokens with the same semantic structure tag. This simple but powerful technique enables a neural model to learn semantics-aware representations that are robust to noise, without introducing any extra parameter, thus yielding better generalization in both high and low resource settings.
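One reading of Halo's constraint (that the local region around each hidden state should only generate tokens carrying the gold token's semantic structure tag) is a noise-based penalty like the sketch below; the perturbation scheme and the penalty form are assumptions for illustration, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def halo_style_penalty(hidden, output_layer, tag_of_token, gold_tokens,
                       sigma=0.1, num_samples=4):
    """Penalize probability mass that states sampled near each hidden state
    assign to tokens whose tag differs from the gold token's tag.
    hidden: [batch, len, dim]; output_layer: maps dim -> vocab logits;
    tag_of_token: [vocab] tag id per vocabulary token; gold_tokens: [batch, len]."""
    gold_tags = tag_of_token[gold_tokens]                          # [batch, len]
    same_tag = tag_of_token.view(1, 1, -1) == gold_tags.unsqueeze(-1)
    penalty = hidden.new_zeros(())
    for _ in range(num_samples):
        noisy = hidden + sigma * torch.randn_like(hidden)          # local perturbation
        probs = F.softmax(output_layer(noisy), dim=-1)             # [batch, len, vocab]
        wrong_tag_mass = (probs * (~same_tag).float()).sum(dim=-1)
        penalty = penalty + wrong_tag_mass.mean()
    return penalty / num_samples
```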

Proceedings ArticleDOI
01 May 2018
TL;DR: In this paper, a training method is proposed that enforces the local region of each hidden state of a neural model to generate only target tokens with the same semantic structure tag, yielding better generalization in both high- and low-resource settings.
Abstract: Cross-lingual information extraction (CLIE) is an important and challenging task, especially in low resource scenarios. To tackle this challenge, we propose a training method, called Halo, which enforces the local region of each hidden state of a neural model to only generate target tokens with the same semantic structure tag. This simple but powerful technique enables a neural model to learn semantics-aware representations that are robust to noise, without introducing any extra parameter, thus yielding better generalization in both high and low resource settings.

Posted Content
TL;DR: This article proposes a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context (both document- and sentence-level information) than prior work, and finds that additional context improves performance, with further improvements gained by utilizing adaptive classification thresholds.
Abstract: Fine-grained entity typing is the task of assigning fine-grained semantic types to entity mentions. We propose a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context -- both document and sentence level information -- than prior work. We find that additional context improves performance, with further improvements gained by utilizing adaptive classification thresholds. Experiments show that our approach without reliance on hand-crafted features achieves the state-of-the-art results on three benchmark datasets.