Showing papers in "arXiv: Computation and Language in 2018"

Posted Content•

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin¹, Ming-Wei Chang¹, Kenton Lee¹, Kristina Toutanova¹•Institutions (1)

11 Oct 2018-arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

29,480 citations

Posted Content•

Deep contextualized word representations

Matthew E. Peters¹, Mark Neumann¹, Mohit Iyyer², Matt Gardner¹, Christopher Clark¹, Kenton Lee³, Luke Zettlemoyer⁴•Institutions (4)

Allen Institute for Artificial Intelligence¹, University of Massachusetts Amherst², Google³, University of Washington⁴

15 Feb 2018-arXiv: Computation and Language

TL;DR: This article introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).

Abstract: We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

1,696 citations

Posted Content•

Universal Sentence Encoder

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Lyn Untalan Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

29 Mar 2018-arXiv: Computation and Language

TL;DR: Transfer learning using sentence embeddings tends to outperform word-level transfer, and performs surprisingly well with minimal amounts of supervised training data for a transfer task.

Abstract: We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines that do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.

1,259 citations

Posted Content•

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Pete Warden

09 Apr 2018-arXiv: Computation and Language

TL;DR: Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems, and suggests a methodology for reproducible and comparable accuracy metrics for this task.

Abstract: Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Discusses why this task is an interesting challenge, and why it requires a specialized dataset that is different from conventional datasets used for automatic speech recognition of full sentences. Suggests a methodology for reproducible and comparable accuracy metrics for this task. Describes how the data was collected and verified, what it contains, previous versions and properties. Concludes by reporting baseline results of models trained on this dataset.

972 citations

Posted Content•

Federated Learning for Mobile Keyboard Prediction

Andrew Straiton Hard, Chloe Kiddon, Daniel Ramage¹, Francoise Beaufays, Hubert Eichner, Kanishka Rao, Rajiv Mathews, Sean Augenstein•Institutions (1)

Google¹

08 Nov 2018-arXiv: Computation and Language

TL;DR: The federated algorithm, which enables training on a higher-quality dataset for this use case, is shown to achieve better prediction recall, demonstrating the feasibility and benefit of training language models on client devices without exporting sensitive user data to servers.

Abstract: We train a recurrent neural network language model using a distributed, on-device learning framework called federated learning for the purpose of next-word prediction in a virtual keyboard for smartphones. Server-based training using stochastic gradient descent is compared with training on client devices using the Federated Averaging algorithm. The federated algorithm, which enables training on a higher-quality dataset for this use case, is shown to achieve better prediction recall. This work demonstrates the feasibility and benefit of training language models on client devices without exporting sensitive user data to servers. The federated learning environment gives users greater control over the use of their data and simplifies the task of incorporating privacy by default with distributed training and aggregation across a population of client devices.
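
The aggregation step of Federated Averaging named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each client returns a flat parameter vector after local training and that the server weights each client by its number of local examples, as Federated Averaging is usually described. The function name `federated_average` and the flat-list representation are hypothetical.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Combine client parameter vectors into one server-side vector.

    Assumption: each client's contribution is weighted by the number of
    training examples it holds locally.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one example count per client, and at least one client")
    total_examples = sum(client_sizes)
    if total_examples <= 0:
        raise ValueError("clients must hold at least one example in total")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, n_examples in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must report vectors of the same length")
        for i, w in enumerate(weights):
            # Each coordinate is a weighted mean over clients.
            averaged[i] += w * n_examples / total_examples
    return averaged


if __name__ == "__main__":
    # Two hypothetical clients: the second holds three times as much data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only the averaged parameters reach the server in this sketch; the raw client data never leaves the `client_weights` producers, which is the property the abstract highlights.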

955 citations

Posted Content•

A Call for Clarity in Reporting BLEU Scores.

Matt Post

23 Apr 2018-arXiv: Computation and Language

TL;DR: The author finds differences as high as 1.8 between commonly used BLEU configurations, caused mainly by different tokenization and normalization schemes applied to the reference, and suggests that machine translation researchers settle upon the BLEU scheme used by WMT, which does not allow for user-supplied reference processing.

Abstract: The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to facilitate this.
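
To see why tokenization acts as a hidden parameter, the sketch below computes clipped unigram precision (one ingredient of BLEU, not the full metric) for the same hypothesis and reference under two toy tokenizers. Both tokenizers and the example sentences are assumptions made for illustration; neither is the WMT scheme, and this is not how SacreBLEU is implemented.

```python
import re
from collections import Counter
from typing import Callable, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp_tokens: List[str], ref_tokens: List[str], n: int = 1) -> float:
    """Modified n-gram precision: hypothesis counts are clipped by reference counts."""
    hyp = ngram_counts(hyp_tokens, n)
    ref = ngram_counts(ref_tokens, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def whitespace_tokenize(text: str) -> List[str]:
    """Toy tokenizer 1: split on whitespace only."""
    return text.split()


def punct_tokenize(text: str) -> List[str]:
    """Toy tokenizer 2: split off every punctuation character."""
    return re.findall(r"\w+|[^\w\s]", text)


def score(tokenize: Callable[[str], List[str]], hyp: str, ref: str) -> float:
    return clipped_precision(tokenize(hyp), tokenize(ref), n=1)


if __name__ == "__main__":
    hypothesis = "It is a state-of-the-art system."
    reference = "It's a state-of-the-art system."
    print(score(whitespace_tokenize, hypothesis, reference))
    print(score(punct_tokenize, hypothesis, reference))
```

The two calls score the same sentence pair differently purely because of preprocessing, which is the kind of unreported variation the abstract argues makes scores from different papers incomparable.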

867 citations

Posted Content•

Know What You Don't Know: Unanswerable Questions for SQuAD

Pranav Rajpurkar¹, Robin Jia¹, Percy Liang¹•Institutions (1)

Stanford University¹

11 Jun 2018-arXiv: Computation and Language

TL;DR: SQuADRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

Abstract: Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.
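
The abstain-or-answer decision that SQuAD 2.0 requires can be sketched with a simple score threshold. This is one common way to add abstention to an extractive system and is an assumption here, not something the abstract specifies; the function name, the score arguments, and the default threshold are all hypothetical.

```python
def answer_or_abstain(best_span: str,
                      span_score: float,
                      no_answer_score: float,
                      threshold: float = 0.0) -> str:
    """Return the extracted span, or an empty string to abstain.

    Assumption: the model produces one score for its best span and one
    score for "no answer"; it abstains when the no-answer score beats the
    span score by more than a threshold tuned on development data.
    """
    if no_answer_score - span_score > threshold:
        return ""  # abstain: no answer is supported by the paragraph
    return best_span


if __name__ == "__main__":
    print(repr(answer_or_abstain("Stanford University", span_score=7.2, no_answer_score=1.5)))
    print(repr(answer_or_abstain("in 1066", span_score=2.0, no_answer_score=6.0)))
```

Raising `threshold` makes the system answer more often, and lowering it makes the system abstain more often, so the value trades accuracy on answerable questions against accuracy on unanswerable ones.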

793 citations

Posted Content•

AllenNLP: A Deep Semantic Natural Language Processing Platform

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, Luke Zettlemoyer

20 Mar 2018-arXiv: Computation and Language

TL;DR: AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily and provides a flexible data API that handles intelligent batching and padding, and a modular and extensible experiment framework that makes doing good science easy.

Abstract: This paper describes AllenNLP, a platform for research on deep learning methods in natural language understanding. AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily. It is built on top of PyTorch, allowing for dynamic computation graphs, and provides (1) a flexible data API that handles intelligent batching and padding, (2) high-level abstractions for common operations in working with text, and (3) a modular and extensible experiment framework that makes doing good science easy. It also includes reference implementations of high quality approaches for both core semantic problems (e.g. semantic role labeling (Palmer et al., 2005)) and language understanding applications (e.g. machine comprehension (Rajpurkar et al., 2016)). AllenNLP is an ongoing open-source effort maintained by engineers and researchers at the Allen Institute for Artificial Intelligence.

767 citations

Posted Content•

FEVER: a large-scale dataset for Fact Extraction and VERification

James Thorne¹, Andreas Vlachos², Christos Christodoulopoulos³, Arpit Mittal³•Institutions (3)

University of Sheffield¹, University of Cambridge², Amazon.com³

14 Mar 2018-arXiv: Computation and Language

TL;DR: This paper introduces a new publicly available dataset for verification against textual sources, FEVER, which consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.

Abstract: In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
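
For readers unfamiliar with the agreement statistic quoted above, the following sketch computes Fleiss' kappa from a table of per-claim label counts. The standard formula is used; the toy rating tables in the usage section are invented for illustration and are not FEVER data.

```python
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Proportion of all assignments that went to each category.
    p_cat: List[float] = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance; kappa is undefined if this equals 1.
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Columns: Supported, Refuted, NotEnoughInfo; three annotators per claim.
    print(fleiss_kappa([[3, 0, 0], [0, 3, 0]]))
    print(fleiss_kappa([[2, 1, 0], [0, 3, 0]]))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which puts the reported 0.6841 in context.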

671 citations

Posted Content•

MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling

Paweł Budzianowski¹, Tsung-Hsien Wen¹, Bo-Hsiang Tseng¹, Iñigo Casanueva¹, Stefan Ultes¹, Osman Ramadan, Milica Gasic¹•Institutions (1)

University of Cambridge¹

29 Sep 2018-arXiv: Computation and Language

TL;DR: The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully-labeled collection of human-human written conversations spanning multiple domains and topics.

Abstract: Even though machine learning has become the major scene in dialogue research community, the real breakthrough has been blocked by the scale of data available. To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The contribution of this work apart from the open-sourced dataset labelled with dialogue belief states and dialogue actions is two-fold: firstly, a detailed description of the data collection procedure along with a summary of data structure and analysis is provided. The proposed data-collection pipeline is entirely based on crowd-sourcing without the need of hiring professional annotators; secondly, a set of benchmark results of belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.

623 citations

Posted Content•

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

28 Aug 2018-arXiv: Computation and Language

Abstract: Presented on August 28, 2018 at 12:15 p.m. in the Pettit Microelectronics Research Center, Room 102 A/B.

Posted Content•

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang¹, Peng Qi², Saizheng Zhang³, Yoshua Bengio³, William W. Cohen⁴, Ruslan Salakhutdinov¹, Christopher D. Manning²•Institutions (4)

Carnegie Mellon University¹, Stanford University², Université de Montréal³, Google⁴

25 Sep 2018-arXiv: Computation and Language

TL;DR: It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Abstract: Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Posted Content•

Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces

Alice Coucke, Alaa Saade, Adrien Ball, Theodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, Joseph Dureau

25 May 2018-arXiv: Computation and Language

TL;DR: Presents the machine learning architecture of the Snips Voice Platform, a software solution that performs Spoken Language Understanding on microprocessors typical of IoT devices; its embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected.

Abstract: This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real-time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy.

Posted Content•

Achieving Human Parity on Automatic Chinese to English News Translation

15 Mar 2018-arXiv: Computation and Language

TL;DR: It is found that Microsoft's latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations.

Abstract: Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft's machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations. We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.

Posted Content•

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Mikel Artetxe¹, Holger Schwenk²•Institutions (2)

University of the Basque Country¹, Facebook²

26 Dec 2018-arXiv: Computation and Language

TL;DR: This article used a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.

Abstract: We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder and the multilingual test set are available at this https URL

Posted Content•

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Shashi Narayan, Shay B. Cohen, Mirella Lapata

27 Aug 2018-arXiv: Computation and Language

TL;DR: A novel abstractive model is proposed which is conditioned on the article’s topics and based entirely on convolutional neural networks, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans.

Abstract: We introduce extreme summarization, a new single-document summarization task which does not favor extractive strategies and calls for an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question "What is the article about?". We collect a real-world, large-scale dataset for this task by harvesting online articles from the British Broadcasting Corporation (BBC). We propose a novel abstractive model which is conditioned on the article's topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures long-range dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans.

Posted Content•

ESPnet: End-to-End Speech Processing Toolkit

Shinji Watanabe¹, Takaaki Hori², Shigeki Karita, Tomoki Hayashi³, Jiro Nishitoba, Yuya Unno, Nelson Yalta⁴, Jahn Heymann⁵, Matthew Wiesner¹, Nanxin Chen¹, Adithya Renduchintala¹, Tsubasa Ochiai⁶•Institutions (6)

Johns Hopkins University¹, Mitsubishi Electric², Nagoya University³, Waseda University⁴, University of Paderborn⁵, Doshisha University⁶

30 Mar 2018-arXiv: Computation and Language

TL;DR: Explains the main architecture of the ESPnet software platform, several important functionalities that differentiate it from other open source ASR toolkits, and experimental results on major ASR benchmarks.

Abstract: This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.

Posted Content•

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Jason Phang¹, Thibault Févry¹, Samuel R. Bowman¹•Institutions (1)

New York University¹

02 Nov 2018-arXiv: Computation and Language

TL;DR: Supplementing language model-style pretraining with further training on data-rich supervised tasks, such as natural language inference, yields additional performance improvements on the GLUE benchmark and reduced variance across random restarts.

Abstract: Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language model-style pretraining with further training on data-rich supervised tasks, such as natural language inference, we obtain additional performance improvements on the GLUE benchmark. Applying supplementary training on BERT (Devlin et al., 2018), we attain a GLUE score of 81.8, the state of the art (as of 02/24/2019) and a 1.4 point improvement over BERT. We also observe reduced variance across random restarts in this setting. Our approach yields similar improvements when applied to ELMo (Peters et al., 2018a) and Radford et al. (2018)'s model. In addition, the benefits of supplementary training are particularly pronounced in data-constrained regimes, as we show in experiments with artificially limited training data.

Posted Content•

Universal Transformers

Mostafa Dehghani¹, Stephan Gouws², Oriol Vinyals², Jakob Uszkoreit², Łukasz Kaiser²•Institutions (2)

University of Amsterdam¹, Google²

10 Jul 2018-arXiv: Computation and Language

TL;DR: The authors propose the Universal Transformer, which employs a self-attention mechanism in every recurrent step to combine information from different parts of a sequence, and further employs an adaptive computation time (ACT) mechanism to dynamically adjust the number of times the representation of each position in a sequence is revised.

Abstract: Self-attentive feed-forward sequence models have been shown to achieve impressive results on sequence modeling tasks, thereby presenting a compelling alternative to recurrent neural networks (RNNs) which have remained the de-facto standard architecture for many sequence modeling problems to date. Despite these successes, however, feed-forward sequence models like the Transformer fail to generalize in many tasks that recurrent models handle with ease (e.g. copying when the string lengths exceed those observed at training time). Moreover, and in contrast to RNNs, the Transformer model is not computationally universal, limiting its theoretical expressivity. In this paper we propose the Universal Transformer which addresses these practical and theoretical shortcomings and we show that it leads to improved performance on several tasks. Instead of recurring over the individual symbols of sequences like RNNs, the Universal Transformer repeatedly revises its representations of all symbols in the sequence with each recurrent step. In order to combine information from different parts of a sequence, it employs a self-attention mechanism in every recurrent step. Assuming sufficient memory, its recurrence makes the Universal Transformer computationally universal. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised. Beyond saving computation, we show that ACT can improve the accuracy of the model. Our experiments show that on various algorithmic tasks and a diverse set of large-scale language understanding tasks the Universal Transformer generalizes significantly better and outperforms both a vanilla Transformer and an LSTM in machine translation, and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task.

Posted Content•

Scaling Neural Machine Translation

Myle Ott, Sergey Edunov, David Grangier, Michael Auli

01 Jun 2018-arXiv: Computation and Language

TL;DR: This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation, and reports a new state of the art on WMT'14 English-German that improves further when training on the much larger Paracrawl dataset.

Abstract: Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14 English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 85 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset. On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.

Posted Content•

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Yuxuan Wang¹, Daisy Stanton¹, Yu Zhang¹, RJ Skerry-Ryan¹, Eric Battenberg², Joel Shor¹, Ying Xiao¹, Fei Ren, Ye Jia¹, Rif A. Saurous¹•Institutions (2)

Google¹, Baidu²

23 Mar 2018-arXiv: Computation and Language

TL;DR: Proposes "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system; when trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Abstract: In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Posted Content•

Neural Approaches to Conversational AI

Jianfeng Gao¹, Michel Galley¹, Lihong Li²•Institutions (2)

Microsoft¹, Google²

21 Sep 2018-arXiv: Computation and Language

TL;DR: In this article, the authors present a survey of state-of-the-art neural approaches to conversational AI, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.

Abstract: The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.

Posted Content•

Annotation Artifacts in Natural Language Inference Data

Suchin Gururangan¹, Swabha Swayamdipta², Omer Levy¹, Roy Schwartz¹, Roy Schwartz³, Samuel R. Bowman⁴, Noah A. Smith¹•Institutions (4)

University of Washington¹, Carnegie Mellon University², Allen Institute for Artificial Intelligence³, Google⁴

06 Mar 2018-arXiv: Computation and Language

TL;DR: It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.

Abstract: Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.
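
The hypothesis-only setup described in this abstract can be illustrated with a minimal sketch. It uses a small bag-of-words naive Bayes classifier as a stand-in, which is an assumption and not the text categorization model used by the authors. The function names and the toy sentences are hypothetical and exist only to show that the classifier never sees a premise.

```python
import math
from collections import Counter, defaultdict


def train_hypothesis_only(pairs):
    """Fit word counts per label from (hypothesis, label) pairs.

    The premise is deliberately absent from the input.
    """
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for hypothesis, label in pairs:
        label_counts[label] += 1
        for word in hypothesis.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def predict(model, hypothesis):
    """Return the label with the highest naive Bayes log score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        # Add-one smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in hypothesis.lower().split():
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up hypotheses; negation words co-occur with "contradiction".
toy_pairs = [
    ("nobody is outside", "contradiction"),
    ("the man is not sleeping", "contradiction"),
    ("a person is outdoors", "entailment"),
    ("someone is moving", "entailment"),
    ("the man is waiting for a friend", "neutral"),
    ("the woman is going to win", "neutral"),
]
model = train_hypothesis_only(toy_pairs)
prediction = predict(model, "nobody is sleeping")
```

On this toy data the negation word alone is enough to push the prediction towards the contradiction class, which is the kind of clue the abstract refers to.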

Posted Content•

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu¹, Rui Zhang, Kai Yang¹, Michihiro Yasunaga¹, Dongxu Wang¹, Zifan Li¹, James Ma¹, Irene Li¹, Qingning Yao¹, Shanelle Roman¹, Zilin Zhang¹, Dragomir R. Radev¹•Institutions (1)

Yale University¹

24 Sep 2018-arXiv: Computation and Language

TL;DR: This work defines a new complex and cross-domain semantic parsing and text-to-SQL task in which different complex SQL queries and databases appear in the train and test sets; experiments with various state-of-the-art models show that Spider presents a strong challenge for future research.

Abstract: We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and the exact same programs in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 12.4% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task are publicly available at this https URL
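
The database split mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released tooling: the `db_id` field name, the `split_by_database` helper, and the toy examples are all hypothetical. The point is only that every database lands wholly in train or wholly in test, so a model is evaluated on schemas it has not seen.

```python
import random
from collections import defaultdict


def split_by_database(examples, test_fraction=0.2, seed=0):
    """Split examples so that no database ID appears in both train and test."""
    by_db = defaultdict(list)
    for example in examples:
        by_db[example["db_id"]].append(example)

    # Shuffle database IDs (not individual examples) reproducibly.
    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)

    n_test = max(1, round(len(db_ids) * test_fraction))
    test = [ex for db in db_ids[:n_test] for ex in by_db[db]]
    train = [ex for db in db_ids[n_test:] for ex in by_db[db]]
    return train, test


# Toy, made-up examples over four hypothetical databases.
examples = [
    {"db_id": "concerts", "question": "How many singers are there?",
     "query": "SELECT count(*) FROM singer"},
    {"db_id": "concerts", "question": "List all stadium names.",
     "query": "SELECT name FROM stadium"},
    {"db_id": "pets", "question": "How many pets are there?",
     "query": "SELECT count(*) FROM pets"},
    {"db_id": "flights", "question": "List all airline names.",
     "query": "SELECT name FROM airlines"},
    {"db_id": "library", "question": "How many books are there?",
     "query": "SELECT count(*) FROM book"},
]
train, test = split_by_database(examples, test_fraction=0.25, seed=1)
```

Splitting on database IDs rather than on individual questions is what separates this setting from a random example-level split, where the same schema could appear on both sides.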

Posted Content•

Fine-tuned Language Models for Text Classification.

Jeremy Howard, Sebastian Ruder

18 Jan 2018-arXiv: Computation and Language

TL;DR: Fine-tuned Language Models (FitLaM) is proposed, an effective transfer learning method that can be applied to any task in NLP, and techniques that are key for fine-tuning a state-of-the-art language model are introduced.

Abstract: Transfer learning has revolutionized computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Fine-tuned Language Models (FitLaM), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a state-of-the-art language model. Our method significantly outperforms the state-of-the-art on five text classification tasks, reducing the error by 18-24% on the majority of datasets. We open-source our pretrained models and code to enable adoption by the community.

Posted Content•

Learning Word Vectors for 157 Languages

Edouard Grave¹, Piotr Bojanowski¹, Prakhar Gupta², Armand Joulin¹, Tomas Mikolov¹•Institutions (2)

Facebook¹, École Polytechnique Fédérale de Lausanne²

19 Feb 2018-arXiv: Computation and Language

TL;DR: This paper describes how high quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the common crawl project, and introduces three new word analogy datasets to evaluate these word vectors.

Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.

Posted Content•

A Survey on Deep Learning for Named Entity Recognition

Jing Li¹, Aixin Sun¹, Jianglei Han¹, Chenliang Li²•Institutions (2)

Nanyang Technological University¹, Wuhan University²

22 Dec 2018-arXiv: Computation and Language

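TL;DR: This paper provides a comprehensive review of existing deep learning techniques for NER, introduces NER resources including tagged NER corpora and off-the-shelf NER tools, and systematically categorizes existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder.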

Abstract: Named entity recognition (NER) is the task to identify mentions of rigid designators from text belonging to predefined semantic types such as person, location, organization etc. NER always serves as the foundation for many natural language applications such as question answering, text summarization, and machine translation. Early NER systems got a huge success in achieving good performance with the cost of human engineering in designing domain-specific features and rules. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.

Posted Content•

Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset

Hannah Rashkin¹, Eric Michael Smith², Margaret Li², Y-Lan Boureau²•Institutions (2)

University of Washington¹, Facebook²

01 Nov 2018-arXiv: Computation and Language

TL;DR: This work proposes a new benchmark for empathetic dialogue generation and EmpatheticDialogues, a novel dataset of 25k conversations grounded in emotional situations, and presents empirical comparisons of dialogue model adaptations for empathetic responding, leveraging existing models or datasets without requiring lengthy re-training of the full model.

Abstract: One challenge for dialogue agents is recognizing feelings in the conversation partner and replying accordingly, a key communicative skill. While it is straightforward for humans to recognize and acknowledge others' feelings in a conversation, this is a significant challenge for AI systems due to the paucity of suitable publicly-available datasets for training and evaluation. This work proposes a new benchmark for empathetic dialogue generation and EmpatheticDialogues, a novel dataset of 25k conversations grounded in emotional situations. Our experiments indicate that dialogue models that use our dataset are perceived to be more empathetic by human evaluators, compared to models merely trained on large-scale Internet conversation data. We also present empirical comparisons of dialogue model adaptations for empathetic responding, leveraging existing models or datasets without requiring lengthy re-training of the full model.

Posted Content•

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

Arman Cohan¹, Franck Dernoncourt², Doo Soon Kim³, Trung Bui³, Seokhwan Kim³, Walter Chang³, Nazli Goharian¹•Institutions (3)

Georgetown University¹, Massachusetts Institute of Technology², Adobe Systems³

16 Apr 2018-arXiv: Computation and Language

TL;DR: This work proposes the first model for abstractive summarization of single, longer-form documents (e.g., research papers), consisting of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary.

Abstract: Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.

Posted Content•

Deep Learning for Sentiment Analysis : A Survey

Lei Zhang¹, Shuai Wang², Bing Liu²•Institutions (2)

LinkedIn¹, University of Illinois at Urbana–Champaign²

24 Jan 2018-arXiv: Computation and Language

TL;DR: An overview of deep learning is given and a comprehensive survey of its current applications in sentiment analysis is provided.

Abstract: Deep learning has emerged as a powerful machine learning technique that learns multiple layers of representations or features of the data and produces state-of-the-art prediction results. Along with the success of deep learning in many other application domains, deep learning is also popularly used in sentiment analysis in recent years. This paper first gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis.
