Showing papers by "Luke Zettlemoyer published in 2019"

PDF

Open Access

Posted Content•

RoBERTa: A Robustly Optimized BERT Pretraining Approach

[...]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov - Show less +6 more

26 Jul 2019-arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

...read moreread less

Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

...read moreread less

13,994 citations

Posted Content•

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.

[...]

Michael Lewis¹, Yinhan Liu¹, Naman Goyal¹, Marjan Ghazvininejad¹, Abdelrahman Mohamed¹, Omer Levy², Veselin Stoyanov¹, Luke Zettlemoyer¹ - Show less +4 more•Institutions (2)

Facebook¹, University of Washington²

29 Oct 2019-arXiv: Computation and Language

TL;DR: BART as mentioned in this paper is a denoising autoencoder for pretraining sequence-to-sequence models, which is trained by corrupting text with an arbitrary noising function, and then learning a model to reconstruct the original text.

...read moreread less

Abstract: We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.

...read moreread less

1,008 citations

Posted Content•

SpanBERT: Improving Pre-training by Representing and Predicting Spans

[...]

Mandar Joshi¹, Danqi Chen², Yinhan Liu³, Daniel S. Weld¹, Luke Zettlemoyer¹, Omer Levy³ - Show less +2 more•Institutions (3)

University of Washington¹, Princeton University², Facebook³

24 Jul 2019-arXiv: Computation and Language

TL;DR: SpanBERT as discussed by the authors extends BERT by masking contiguous random spans, rather than random tokens, and training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.

...read moreread less

Abstract: We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6\% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.

...read moreread less

722 citations

Posted Content•

Unsupervised Cross-lingual Representation Learning at Scale.

[...]

Alexis Conneau¹, Kartikay Khandelwal², Naman Goyal¹, Vishrav Chaudhary¹, Guillaume Wenzek¹, Francisco Guzmán³, Edouard Grave¹, Myle Ott¹, Luke Zettlemoyer¹, Veselin Stoyanov¹ - Show less +6 more•Institutions (3)

Facebook¹, Microsoft², Johns Hopkins University³

05 Nov 2019-arXiv: Computation and Language

TL;DR: This paper showed that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks and proposed a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.

...read moreread less

Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.

...read moreread less

669 citations

Proceedings Article•DOI•

Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases

[...]

Christopher Clark¹, Mark Yatskar², Luke Zettlemoyer³•Institutions (3)

Allen Institute for Artificial Intelligence¹, University of Washington², Princeton University³

01 Sep 2019

TL;DR: This paper trains a naive model that makes predictions exclusively based on dataset biases, and a robust model as part of an ensemble with the naive one in order to encourage it to focus on other patterns in the data that are more likely to generalize.

...read moreread less

Abstract: State-of-the-art models often make use of superficial patterns in the data that do not generalize well to out-of-domain or adversarial settings. For example, textual entailment models often learn that particular key words imply entailment, irrespective of context, and visual question answering models learn to predict prototypical answers, without considering evidence in the image. In this paper, we show that if we have prior knowledge of such biases, we can train a model to be more robust to domain shift. Our method has two stages: we (1) train a naive model that makes predictions exclusively based on dataset biases, and (2) train a robust model as part of an ensemble with the naive one in order to encourage it to focus on other patterns in the data that are more likely to generalize. Experiments on five datasets with out-of-domain test sets show significantly improved robustness in all settings, including a 12 point gain on a changing priors visual question answering dataset and a 9 point gain on an adversarial question answering test set.

...read moreread less

314 citations

Proceedings Article•DOI•

Evaluating Gender Bias in Machine Translation

[...]

Gabriel Stanovsky¹, Noah A. Smith¹, Luke Zettlemoyer¹•Institutions (1)

University of Washington¹

01 Jul 2019

TL;DR: An automatic gender bias evaluation method for eight target languages with grammatical gender, based on morphological analysis is devised, which shows that four popular industrial MT systems and two recent state-of-the-art academic MT models are significantly prone to gender-biased translation errors for all tested target languages.

...read moreread less

Abstract: We present the first challenge set and evaluation protocol for the analysis of gender bias in machine translation (MT). Our approach uses two recent coreference resolution datasets composed of English sentences which cast participants into non-stereotypical gender roles (e.g., “The doctor asked the nurse to help her in the operation”). We devise an automatic gender bias evaluation method for eight target languages with grammatical gender, based on morphological analysis (e.g., the use of female inflection for the word “doctor”). Our analyses show that four popular industrial MT systems and two recent state-of-the-art academic MT models are significantly prone to gender-biased translation errors for all tested target languages. Our data and code are publicly available at https://github.com/gabrielStanovsky/mt_gender.

...read moreread less

280 citations

Proceedings Article•DOI•

Mask-Predict: Parallel Decoding of Conditional Masked Language Models.

[...]

Marjan Ghazvininejad¹, Omer Levy¹, Yinhan Liu¹, Luke Zettlemoyer¹•Institutions (1)

Facebook¹

01 Apr 2019

TL;DR: The authors use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation, which allows for efficient iterative decoding.

...read moreread less

Abstract: Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applying this strategy for a constant number of iterations, our model improves state-of-the-art performance levels for non-autoregressive and parallel decoding translation models by over 4 BLEU on average. It is also able to reach within about 1 BLEU point of a typical left-to-right transformer model, while decoding significantly faster.

...read moreread less

273 citations

Posted Content•

Sparse Networks from Scratch: Faster Training without Losing Performance

[...]

Tim Dettmers, Luke Zettlemoyer¹•Institutions (1)

University of Washington¹

10 Jul 2019-arXiv: Learning

TL;DR: This work develops sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently and shows that the benefits of momentum redistribution and growth increase with the depth and size of the network.

...read moreread less

Abstract: We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving dense performance levels. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8%, 15%, and 6% compared to other sparse algorithms. Furthermore, we show that sparse momentum reliably reproduces dense performance levels while providing up to 5.61x faster training. In our analysis, ablations show that the benefits of momentum redistribution and growth increase with the depth and size of the network. Additionally, we find that sparse momentum is insensitive to the choice of its hyperparameters suggesting that sparse momentum is robust and easy to use.

...read moreread less

252 citations

Posted Content•

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

[...]

Mohit Shridhar¹, Jesse Thomason¹, Daniel Gordon¹, Yonatan Bisk², Winson Han², Roozbeh Mottaghi², Luke Zettlemoyer¹, Dieter Fox³ - Show less +4 more•Institutions (3)

University of Washington¹, Allen Institute for Artificial Intelligence², Nvidia³

03 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

...read moreread less

Abstract: We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

...read moreread less

239 citations

Proceedings Article•DOI•

BERT for Coreference Resolution: Baselines and Analysis.

[...]

Mandar Joshi¹, Omer Levy², Luke Zettlemoyer¹, Daniel S. Weld¹•Institutions (2)

University of Washington¹, Stanford University²

01 Aug 2019

TL;DR: This paper applied BERT to coreference resolution, achieving a new state-of-the-art on the GAP (+11.5 F1) and OntoNotes (+3.9 F 1) benchmarks.

...read moreread less

Abstract: We apply BERT to coreference resolution, achieving a new state of the art on the GAP (+11.5 F1) and OntoNotes (+3.9 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities (e.g., President and CEO), but that there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing. We will release all code and trained models upon publication.

...read moreread less

233 citations

Proceedings Article•DOI•

A discrete hard EM approach for weakly supervised question answering

[...]

Sewon Min¹, Danqi Chen, Hannaneh Hajishirzi¹, Hannaneh Hajishirzi², Luke Zettlemoyer¹, Luke Zettlemoyer³ - Show less +2 more•Institutions (3)

University of Washington¹, Allen Institute for Artificial Intelligence², Facebook³

11 Sep 2019

TL;DR: This paper develops a hard EM learning scheme that computes gradients relative to the most likely solution at each update and significantly outperforms previous methods on six QA tasks, including absolute gains of 2–10%, and achieves the state-of-the-art on five of them.

...read moreread less

Abstract: Many question answering (QA) tasks only provide weak supervision for how the answer should be computed. For example, TriviaQA answers are entities that can be mentioned multiple times in supporting documents, while DROP answers can be computed by deriving many different equations from numbers in the reference text. In this paper, we show it is possible to convert such tasks into discrete latent variable learning problems with a precomputed, task-specific set of possible solutions (e.g. different mentions or equations) that contains one correct option. We then develop a hard EM learning scheme that computes gradients relative to the most likely solution at each update. Despite its simplicity, we show that this approach significantly outperforms previous methods on six QA tasks, including absolute gains of 2–10%, and achieves the state-of-the-art on five of them. Using hard updates instead of maximizing marginal likelihood is key to these results as it encourages the model to find the one correct answer, which we show through detailed qualitative analysis.

...read moreread less

Posted Content•

Transformers with convolutional context for ASR

[...]

Abdelrahman Mohamed¹, Dmytro Okhonko¹, Luke Zettlemoyer¹•Institutions (1)

Facebook¹

26 Apr 2019-arXiv: Computation and Language

TL;DR: This paper proposes replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations that provide subsequent transformer blocks with relative positional information needed for discovering long-range relationships between local concepts.

...read moreread less

Abstract: The recent success of transformer networks for neural machine translation and other NLP tasks has led to a surge in research work trying to apply it for speech recognition. Recent efforts studied key research questions around ways of combining positional embedding with speech features, and stability of optimization for large scale learning of transformer networks. In this paper, we propose replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations. These contextual representations provide subsequent transformer blocks with relative positional information needed for discovering long-range relationships between local concepts. The proposed system has favorable optimization characteristics where our reported results are produced with fixed learning rate of 1.0 and no warmup steps. The proposed model achieves a competitive 4.7% and 12.9% WER on the Librispeech ``test clean'' and ``test other'' subsets when no extra LM text is provided.

...read moreread less

Posted Content•

Cloze-driven Pretraining of Self-attention Networks

[...]

Alexei Baevski¹, Sergey Edunov¹, Yinhan Liu¹, Luke Zettlemoyer¹, Michael Auli¹ - Show less +1 more•Institutions (1)

Facebook¹

19 Mar 2019-arXiv: Computation and Language

TL;DR: A new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems, including cloze-style word reconstruction task, and a detailed analysis of a number of factors that contribute to effective pretraining.

...read moreread less

Abstract: We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with the concurrently introduced BERT model. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.

...read moreread less

Proceedings Article•DOI•

Compositional Questions Do Not Necessitate Multi-hop Reasoning.

[...]

Sewon Min¹, Eric Wallace², Sameer Singh³, Matt Gardner¹, Hannaneh Hajishirzi¹, Luke Zettlemoyer¹ - Show less +2 more•Institutions (3)

University of Washington¹, University of California, Berkeley², University of California, Irvine³

01 Jul 2019

TL;DR: This work introduces a single-hop BERT-based RC model that achieves 67 F1—comparable to state-of-the-art multi-hop models and designs an evaluation setting where humans are not shown all of the necessary paragraphs for the intendedmulti-hop reasoning but can still answer over 80% of questions.

...read moreread less

Abstract: Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1—comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.

...read moreread less

Proceedings Article•DOI•

Cloze-driven Pretraining of Self-attention Networks.

[...]

Alexei Baevski¹, Sergey Edunov¹, Yinhan Liu¹, Luke Zettlemoyer¹, Michael Auli¹ - Show less +1 more•Institutions (1)

Facebook¹

19 Mar 2019

TL;DR: This paper propose a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text, and demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with BERT.

...read moreread less

Abstract: We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with BERT. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.

...read moreread less

Posted Content•

BERT for Coreference Resolution: Baselines and Analysis

[...]

Mandar Joshi, Omer Levy, Daniel S. Weld, Luke Zettlemoyer

24 Aug 2019-arXiv: Computation and Language

TL;DR: A qualitative analysis of model predictions indicates that, compared to ELMo and Bert-base, BERT-large is particularly better at distinguishing between related but distinct entities, but that there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing.

...read moreread less

Abstract: We apply BERT to coreference resolution, achieving strong improvements on the OntoNotes (+3.9 F1) and GAP (+11.5 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities (e.g., President and CEO). However, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing. Our code and models are publicly available.

...read moreread less

Posted Content•

Multi-hop Reading Comprehension through Question Decomposition and Rescoring

[...]

Sewon Min¹, Victor Zhong¹, Luke Zettlemoyer¹, Hannaneh Hajishirzi¹•Institutions (1)

University of Washington¹

07 Jun 2019-arXiv: Computation and Language

TL;DR: This paper recast sub-question generation as a span prediction problem and showed that their method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-QA.

...read moreread less

Abstract: Multi-hop Reading Comprehension (RC) requires reasoning and aggregation across several paragraphs. We propose a system for multi-hop RC that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models. Since annotations for such decomposition are expensive, we recast sub-question generation as a span prediction problem and show that our method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-questions. We also introduce a new global rescoring approach that considers each decomposition (i.e. the sub-questions and their answers) to select the best final answer, greatly improving overall performance. Our experiments on HotpotQA show that this approach achieves the state-of-the-art results, while providing explainable evidence for its decision making in the form of sub-questions.

...read moreread less

Proceedings Article•DOI•

Multi-hop Reading Comprehension through Question Decomposition and Rescoring

[...]

Sewon Min¹, Victor Zhong¹, Luke Zettlemoyer¹, Hannaneh Hajishirzi²•Institutions (2)

University of Washington¹, Allen Institute for Artificial Intelligence²

01 Jul 2019

TL;DR: A system that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models is proposed and a new global rescoring approach is introduced that considers each decomposition to select the best final answer, greatly improving overall performance.

...read moreread less

Abstract: Multi-hop Reading Comprehension (RC) requires reasoning and aggregation across several paragraphs. We propose a system for multi-hop RC that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models. Since annotations for such decomposition are expensive, we recast subquestion generation as a span prediction problem and show that our method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-questions. We also introduce a new global rescoring approach that considers each decomposition (i.e. the sub-questions and their answers) to select the best final answer, greatly improving overall performance. Our experiments on HotpotQA show that this approach achieves the state-of-the-art results, while providing explainable evidence for its decision making in the form of sub-questions.

...read moreread less

Vision-and-Dialog Navigation

[...]

Jesse Thomason, Michael Murray, Maya Cakmak, Luke Zettlemoyer

10 Jul 2019

TL;DR: In this paper, the authors introduce Cooperative Vision-and-Dialog Navigation (CVDN), a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments.

...read moreread less

Abstract: Robots navigating in human environments should use language to ask for assistance and be able to understand human responses. To study this challenge, we introduce Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments. The Navigator asks questions to their partner, the Oracle, who has privileged access to the best next steps the Navigator should take according to a shortest path planner. To train agents that search an environment for a goal location, we define the Navigation from Dialog History task. An agent, given a target object and a dialog history between humans cooperating to find that object, must infer navigation actions towards the goal in unexplored environments. We establish an initial, multi-modal sequence-to-sequence model and demonstrate that looking farther back in the dialog history improves performance. Sourcecode and a live interface demo can be found at https://cvdn.dev/

...read moreread less

Posted Content•

Knowledge Guided Text Retrieval and Reading for Open Domain Question Answering

[...]

Sewon Min¹, Danqi Chen¹, Luke Zettlemoyer¹, Hannaneh Hajishirzi²•Institutions (2)

University of Washington¹, Princeton University²

10 Nov 2019-arXiv: Computation and Language

TL;DR: This work introduces an approach for open-domain question answering (QA) that retrieves and reads a passage graph, where vertices are passages of text and edges represent relationships that are derived from an external knowledge base or co-occurrence in the same article.

...read moreread less

Abstract: We introduce an approach for open-domain question answering (QA) that retrieves and reads a passage graph, where vertices are passages of text and edges represent relationships that are derived from an external knowledge base or co-occurrence in the same article. Our goals are to boost coverage by using knowledge-guided retrieval to find more relevant passages than text-matching methods, and to improve accuracy by allowing for better knowledge-guided fusion of information across related passages. Our graph retrieval method expands a set of seed keyword-retrieved passages by traversing the graph structure of the knowledge base. Our reader extends a BERT-based architecture and updates passage representations by propagating information from related passages and their relations, instead of reading each passage in isolation. Experiments on three open-domain QA datasets, WebQuestions, Natural Questions and TriviaQA, show improved performance over non-graph baselines by 2-11% absolute. Our approach also matches or exceeds the state-of-the-art in every case, without using an expensive end-to-end training regime.

...read moreread less

Posted Content•

Vision-and-Dialog Navigation

[...]

Jesse Thomason, Michael Murray, Maya Cakmak, Luke Zettlemoyer¹•Institutions (1)

University of Washington¹

10 Jul 2019-arXiv: Computation and Language

TL;DR: This work introduces Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments and establishes an initial, multi-modal sequence-to-sequence model.

...read moreread less

Posted Content•

Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases.

[...]

Christopher Clark¹, Mark Yatskar², Luke Zettlemoyer³•Institutions (3)

Allen Institute for Artificial Intelligence¹, University of Washington², Princeton University³

09 Sep 2019-arXiv: Computation and Language

TL;DR: This article showed that if we have prior knowledge of such biases, we can train a model to be more robust to domain shift by training a naive model that makes predictions exclusively based on dataset biases, and then training a robust model as part of an ensemble with the naive one in order to encourage it to focus on other patterns in the data that are more likely to generalize.

...read moreread less

Posted Content•

Zero-shot Entity Linking with Dense Entity Retrieval.

[...]

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, Luke Zettlemoyer - Show less +1 more

10 Nov 2019-arXiv: Computation and Language

TL;DR: This paper proposed a two-stage approach for zero-shot entity-linking, where each entity is defined by a short textual description, and the model must read these descriptions together with the mention context to make the final linking decisions.

...read moreread less

Abstract: We consider the zero-shot entity-linking challenge where each entity is defined by a short textual description, and the model must read these descriptions together with the mention context to make the final linking decisions. In this setting, retrieving entity candidates can be particularly challenging, since many of the common linking cues such as entity alias tables and link popularity are not available. In this paper, we introduce a simple and effective two-stage approach for zero-shot linking, based on fine-tuned BERT architectures. In the first stage, we do retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then examined more carefully with a cross-encoder, that concatenates the mention and entity text. Our approach achieves a nearly 6 point absolute gain on a recently introduced zero-shot entity linking benchmark, driven largely by improvements over previous IR-based candidate retrieval. We also show that it performs well in the non-zero-shot setting, obtaining the state-of-the-art result on TACKBP-2010. The code and pre-trained models are available at this https URL.

...read moreread less

Posted Content•

Compositional Questions Do Not Necessitate Multi-hop Reasoning

[...]

Sewon Min¹, Eric Wallace², Sameer Singh³, Matt Gardner¹, Hannaneh Hajishirzi¹, Luke Zettlemoyer¹ - Show less +2 more•Institutions (3)

University of Washington¹, University of California, Berkeley², University of California, Irvine³

07 Jun 2019-arXiv: Computation and Language

TL;DR: This article proposed a single-hop BERT-based reading comprehension model for HotpotQA and achieved state-of-the-art performance on 67% of the questions on the Hotpot QA dataset.

...read moreread less

Abstract: Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1---comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.

...read moreread less

Posted Content•

Generalization through Memorization: Nearest Neighbor Language Models

[...]

Urvashi Khandelwal¹, Omer Levy¹, Dan Jurafsky², Luke Zettlemoyer², Michael Lewis² - Show less +1 more•Institutions (2)

Stanford University¹, Facebook²

01 Nov 2019-arXiv: Computation and Language

TL;DR: It is suggested that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.

...read moreread less

Abstract: We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this augmentation to a strong Wikitext-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79 - a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.

...read moreread less

Proceedings Article•DOI•

Iterative Search for Weakly Supervised Semantic Parsing

[...]

Pradeep Dasigi¹, Matt Gardner², Shikhar Murty³, Luke Zettlemoyer², Eduard Hovy⁴ - Show less +1 more•Institutions (4)

Allen Institute for Artificial Intelligence¹, University of Washington², Indian Institutes of Technology³, Carnegie Mellon University⁴

01 Jun 2019

TL;DR: A novel iterative training algorithm is proposed that alternates between searching for consistent logical forms and maximizing the marginal likelihood of the retrieved ones, thus dealing with the problem of spuriousness.

...read moreread less

Abstract: Training semantic parsers from question-answer pairs typically involves searching over an exponentially large space of logical forms, and an unguided search can easily be misled by spurious logical forms that coincidentally evaluate to the correct answer. We propose a novel iterative training algorithm that alternates between searching for consistent logical forms and maximizing the marginal likelihood of the retrieved ones. This training scheme lets us iteratively train models that provide guidance to subsequent ones to search for logical forms of increasing complexity, thus dealing with the problem of spuriousness. We evaluate these techniques on two hard datasets: WikiTableQuestions (WTQ) and Cornell Natural Language Visual Reasoning (NLVR), and show that our training algorithm outperforms the previous best systems, on WTQ in a comparable setting, and on NLVR with significantly less supervision.

...read moreread less

Proceedings Article•DOI•

JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation

[...]

Rajas Agashe, Srinivasan Iyer¹, Luke Zettlemoyer¹•Institutions (1)

University of Washington¹

01 Oct 2019

TL;DR: To study code generation conditioned on a long context history, JuICe is presented, a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments that provides refined human-curated data, open-domain code, and an order of magnitude more training data.

...read moreread less

Abstract: Interactive programming with interleaved code snippet cells and natural language markdown is recently gaining popularity in the form of Jupyter notebooks, which accelerate prototyping and collaboration. To study code generation conditioned on a long context history, we present JuICe, a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data. Using JuICe, we train models for two tasks: (1) generation of the API call sequence in a code cell, and (2) full code cell generation, both conditioned on the NL-Code history up to a particular code cell. Experiments using current baseline code generation models show that both context and distant supervision aid in generation, and that the dataset is challenging for current systems.

...read moreread less

Proceedings Article•DOI•

pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference.

[...]

Mandar Joshi¹, Eunsol Choi¹, Omer Levy², Daniel S. Weld¹, Luke Zettlemoyer¹ - Show less +1 more•Institutions (2)

University of Washington¹, Facebook²

01 Jun 2019

TL;DR: The authors propose to learn word embeddings by maximizing the pointwise mutual information (PMI) with the contexts in which the two words co-occur and add these representations to the cross-sentence attention layer of existing inference models.

...read moreread less

Abstract: Reasoning about implied relationships (e.g. paraphrastic, common sense, encyclopedic) between pairs of words is crucial for many cross-sentence inference problems. This paper proposes new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships. Our pairwise embeddings are computed as a compositional function of each word’s representation, which is learned by maximizing the pointwise mutual information (PMI) with the contexts in which the the two words co-occur. We add these representations to the cross-sentence attention layer of existing inference models (e.g. BiDAF for QA, ESIM for NLI), instead of extending or replacing existing word embeddings. Experiments show a gain of 2.7% on the recently released SQuAD 2.0 and 1.3% on MultiNLI. Our representations also aid in better generalization with gains of around 6-7% on adversarial SQuAD datasets, and 8.8% on the adversarial entailment test set by Glockner et al. (2018).

...read moreread less

Proceedings Article•DOI•

Learning Programmatic Idioms for Scalable Semantic Parsing

[...]

Srinivasan Iyer¹, Alvin Cheung¹, Luke Zettlemoyer²•Institutions (2)

University of Washington¹, Facebook²

01 Nov 2019

TL;DR: An iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and train semantic parsers to apply these idioms during decoding.

...read moreread less

Abstract: Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state of the art (SOTA) semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and train semantic parsers to apply these idioms during decoding. Applying idiom-based decoding on a recent context-dependent semantic parsing task improves the SOTA by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5× larger, to further move up the SOTA by an additional 2.3% BLEU and 0.9% exact match. Finally, idioms also significantly improve accuracy of semantic parsing to SQL on the ATIS-SQL dataset, when training data is limited.

...read moreread less