Showing papers in "arXiv: Computation and Language in 2016"

Posted Content•

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

26 Sep 2016-arXiv: Computation and Language

TL;DR: This paper presents GNMT, Google's Neural Machine Translation system, which attempts to address many of the weaknesses of conventional phrase-based translation systems. Its sub-word "wordpiece" units provide a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.

Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
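
The length normalization and coverage penalty mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the formulas, so the functional forms below (the offset of 5, the exponent `alpha`, the weight `beta`) are assumptions based on a reading of the full paper, and the `eps` clamp is added here only to keep the logarithm defined. All function names are hypothetical.

```python
import math

def length_penalty(length, alpha=0.2):
    # Assumed form: ((5 + |Y|) / (5 + 1)) ** alpha; equals 1.0 for length 1.
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attention, beta=0.2, eps=1e-9):
    # attention[j][i] is the attention weight of target step j on source word i.
    # Source words that receive little total attention are penalized.
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        mass = sum(step[i] for step in attention)
        total += math.log(min(max(mass, eps), 1.0))  # eps clamp is an assumption
    return beta * total

def rescore(log_prob, length, attention, alpha=0.2, beta=0.2):
    # Hypothesis score = length-normalized log-probability + coverage penalty.
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)

covers_all = [[1.0, 0.0], [0.0, 1.0]]   # each source word attended once
skips_one = [[1.0, 0.0], [1.0, 0.0]]    # second source word never attended
print(rescore(-2.0, 2, covers_all) > rescore(-2.0, 2, skips_one))
```

With equal log-probabilities, the hypothesis that attends to every source word receives the higher score, which is the behaviour the abstract describes.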

5,737 citations

Posted Content•

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar¹, Jian Zhang¹, Konstantin Lopyrev¹, Percy Liang¹•Institutions (1)

Stanford University¹

16 Jun 2016-arXiv: Computation and Language

TL;DR: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.

Abstract: We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at this https URL
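
Because answers are text spans, an F1 score like the one quoted above is usually computed over overlapping tokens between the predicted and reference spans. The sketch below assumes that token-overlap definition and uses plain lowercasing and whitespace splitting; any further normalization (punctuation, articles) is left out, and the function name is hypothetical.

```python
from collections import Counter

def token_f1(prediction, reference):
    # Compare the two answer spans as bags of lowercase tokens.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty answers match exactly; otherwise there is no overlap.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in Paris France", "Paris"))
```

A prediction that contains the reference span plus extra words keeps full recall but loses precision, so it earns partial credit rather than zero.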

4,336 citations

Posted Content•

Enriching Word Vectors with Subword Information

Piotr Bojanowski¹, Edouard Grave¹, Armand Joulin¹, Tomas Mikolov¹•Institutions (1)

Facebook¹

15 Jul 2016-arXiv: Computation and Language

TL;DR: A new approach based on the skipgram model, where each word is represented as a bag of character n-grams, with words being represented as the sum of these representations, which achieves state-of-the-art performance on word similarity and analogy tasks.

Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated to each character $n$-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
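
The bag-of-character-n-grams representation can be sketched in a few lines. The boundary markers `<` and `>` and the n-gram range of 3 to 6 are assumptions not stated in the abstract, and `ngram_vectors` is a hypothetical lookup table standing in for trained parameters.

```python
def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in boundary markers so prefixes and suffixes are distinct.
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

def word_vector(word, ngram_vectors, dim):
    # A word vector is the sum of the vectors of its character n-grams.
    # Unknown n-grams contribute nothing, so unseen words still get a vector.
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[k] += value
    return vec

print(char_ngrams("where", 3, 3))
```

Because the vector is assembled from n-grams rather than looked up by whole word, a word absent from the training data still receives a representation whenever some of its n-grams are known.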

2,425 citations

Posted Content•

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

Ramesh Nallapati¹, Bowen Zhou¹, Cicero Nogueira dos Santos¹, Caglar Gulcehre², Bing Xiang¹•Institutions (2)

IBM¹, Université de Montréal²

19 Feb 2016-arXiv: Computation and Language

TL;DR: This paper proposed several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time.

Abstract: In this work, we model abstractive text summarization using Attentional Encoder-Decoder Recurrent Neural Networks, and show that they achieve state-of-the-art performance on two different corpora. We propose several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time. Our work shows that many of our proposed models contribute to further improvement in performance. We also propose a new dataset consisting of multi-sentence summaries, and establish performance benchmarks for further research.

1,141 citations

Posted Content•

Exploring the limits of language modeling

Rafal Jozefowicz¹, Oriol Vinyals¹, Mike Schuster¹, Noam Shazeer¹, Yonghui Wu¹•Institutions (1)

Google¹

07 Feb 2016-arXiv: Computation and Language

TL;DR: This work explores recent advances in Recurrent Neural Networks for large scale Language Modeling, and extends current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language.

Abstract: In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language. We perform an exhaustive study on techniques such as character Convolutional Neural Networks or Long-Short Term Memory, on the One Billion Word Benchmark. Our best single model significantly improves state-of-the-art perplexity from 51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20), while an ensemble of models sets a new record by improving perplexity from 41.0 down to 23.7. We also release these models for the NLP and ML community to study and improve upon.
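
For readers unfamiliar with the perplexity figures quoted above, the metric is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming natural-log token probabilities are already available from some model (the function name is hypothetical):

```python
import math

def perplexity(token_log_probs):
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that gives every token probability 0.5 is as uncertain as a
# fair coin flip per token.
print(perplexity([math.log(0.5)] * 4))
```

Lower is better: a drop from 51.3 to 30.0 means the model is, on average, choosing among far fewer equally likely next words.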

1,100 citations

Posted Content•

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

Tolga Bolukbasi¹, Kai-Wei Chang², James Zou², Venkatesh Saligrama², Adam Tauman Kalai²•Institutions (2)

Boston University¹, Microsoft²

21 Jul 2016-arXiv: Computation and Language

TL;DR: This work empirically demonstrates that its algorithms significantly reduce gender bias in embeddings while preserving their useful properties, such as the ability to cluster related concepts and to solve analogy tasks.

Abstract: The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving their useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
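
The geometric idea, that bias lies along a direction which can be removed from gender-neutral words, can be sketched as a simple projection. This is a simplified illustration, not the authors' full procedure: it estimates the direction from a single word pair, which is an assumption made here for brevity, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def bias_direction(vec_a, vec_b):
    # Simplifying assumption: one definitional pair (e.g. "she" and "he")
    # defines the direction via its normalized difference vector.
    return normalize([a - b for a, b in zip(vec_a, vec_b)])

def neutralize(word_vec, direction):
    # Subtract the component of the word vector that lies along the direction.
    scale = dot(word_vec, direction)
    return [w - scale * d for w, d in zip(word_vec, direction)]

g = bias_direction([1.0, 0.0], [-1.0, 0.0])
print(neutralize([0.5, 2.0], g))
```

After neutralizing, the word vector is orthogonal to the estimated direction, so it is equidistant from both ends of the pair along that axis while its other components are untouched.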

1,074 citations

Posted Content•

ConceptNet 5.5: An Open Multilingual Graph of General Knowledge

Robert Speer, Joshua Chin¹, Catherine Havasi•Institutions (1)

Union College¹

12 Dec 2016-arXiv: Computation and Language

TL;DR: ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. It is designed to represent the general knowledge involved in understanding language, so that natural language applications can better understand the meanings behind the words people use.

Abstract: Machine learning about language can be improved by supplying it with specific knowledge and sources of external information. We present here a new version of the linked open data resource ConceptNet that is particularly well suited to be used with modern NLP techniques such as word embeddings. ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use. When ConceptNet is combined with word embeddings acquired from distributional semantics (such as word2vec), it provides applications with understanding that they would not acquire from distributional semantics alone, nor from narrower resources such as WordNet or DBPedia. We demonstrate this with state-of-the-art results on intrinsic evaluations of word relatedness that translate into improvements on applications of word vectors, including solving SAT-style analogies.

964 citations

Posted Content•

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Melvin Johnson¹, Mike Schuster¹, Quoc V. Le¹, Maxim Krikun¹, Yonghui Wu¹, Zhifeng Chen¹, Nikhil Thorat¹, Fernanda B. Viégas¹, Martin Wattenberg¹, Greg S. Corrado¹, Macduff Hughes¹, Jeffrey Dean¹•Institutions (1)

Google¹

14 Nov 2016-arXiv: Computation and Language

TL;DR: The authors propose to add an artificial token at the beginning of the input sentence to specify the required target language, which improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant.

Abstract: We propose a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English$\rightarrow$French and surpasses state-of-the-art results for English$\rightarrow$German. Similarly, a single multilingual model surpasses state-of-the-art results for French$\rightarrow$English and German$\rightarrow$English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.
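
The only change to the data pipeline is the artificial token prepended to each source sentence. A minimal sketch of that preprocessing step follows; the `<2xx>` token spelling and the function name are assumptions made for illustration, since the abstract does not specify the token format.

```python
def add_target_token(source_sentence, target_lang):
    # Prepend an artificial token naming the desired target language.
    # The "<2xx>" spelling is an assumed convention, not taken from the abstract.
    return f"<2{target_lang}> {source_sentence}"

# The same English sentence can be routed to different target languages
# without touching the model architecture.
print(add_target_token("Hello, how are you?", "es"))
print(add_target_token("Hello, how are you?", "de"))
```

Since the source language is never marked, the same model can in principle be asked for a language pair it never saw during training, which is what makes the zero-shot setting possible.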

947 citations

Posted Content•

Language Modeling with Gated Convolutional Networks

Yann N. Dauphin¹, Angela Fan¹, Michael Auli¹, David Grangier¹•Institutions (1)

Facebook¹

23 Dec 2016-arXiv: Computation and Language

TL;DR: The authors proposed a finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens and achieved state-of-the-art results on the WikiText-103 benchmark.

Abstract: The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art on the WikiText-103 benchmark, even though it features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.
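
A gated convolution over a finite context can be sketched with scalar channels. The gating form used here, a linear convolution multiplied elementwise by the sigmoid of a second convolution, is an assumption about the mechanism the abstract refers to, and the function names and kernels are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def causal_conv(seq, kernel):
    # Left-pad so that output[t] depends only on seq[t - k + 1 .. t].
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(seq))]

def gated_conv_layer(seq, linear_kernel, gate_kernel):
    # Assumed gating: linear path multiplied by a sigmoid gate path.
    linear = causal_conv(seq, linear_kernel)
    gate = causal_conv(seq, gate_kernel)
    return [a * sigmoid(b) for a, b in zip(linear, gate)]

print(gated_conv_layer([1.0, 2.0, 3.0], [0.0, 1.0], [0.0, 0.0]))
```

Every output position is computed independently of the others, which is why such layers parallelize over tokens, and the left padding keeps the model from seeing future tokens.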

880 citations

Posted Content•

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues

Iulian Vlad Serban¹, Alessandro Sordoni, Ryan Lowe², Laurent Charlin³, Joelle Pineau², Aaron Courville¹, Yoshua Bengio¹•Institutions (3)

Université de Montréal¹, McGill University², HEC Montréal³

19 May 2016-arXiv: Computation and Language

TL;DR: This work proposes a neural network-based generative architecture with latent stochastic variables that span a variable number of time steps. Experiments show that it improves upon recently proposed models and that the latent variables facilitate the generation of long outputs and help maintain the context.

Abstract: Sequential data often possesses a hierarchical structure with complex dependencies between subsequences, such as found between the utterances in a dialogue. In an effort to model this kind of generative process, we propose a neural network-based generative architecture, with latent stochastic variables that span a variable number of time steps. We apply the proposed model to the task of dialogue response generation and compare it with recent neural network architectures. We evaluate the model performance through automatic evaluation metrics and by carrying out a human evaluation. The experiments demonstrate that our model improves upon recently proposed models and that the latent variables facilitate the generation of long outputs and maintain the context.

853 citations

Posted Content•

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Chia-Wei Liu¹, Ryan Lowe¹, Iulian Vlad Serban², Michael Noseworthy³, Laurent Charlin¹, Joelle Pineau¹•Institutions (3)

McGill University¹, Université de Montréal², Massachusetts Institute of Technology³

25 Mar 2016-arXiv: Computation and Language

TL;DR: The authors investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available, and provide qualitative and quantitative results highlighting specific weaknesses in existing metrics and provide recommendations for future development of better automatic evaluation metrics.

Abstract: We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Posted Content•

FastText.zip: Compressing text classification models

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, Tomas Mikolov

04 Nov 2016-arXiv: Computation and Language

TL;DR: This work proposes a method built upon product quantization to store the word embeddings, which produces a text classifier, derived from the fastText approach, which at test time requires only a fraction of the memory compared to the original one, without noticeably sacrificing the quality in terms of classification accuracy.

Abstract: We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy.
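
Product quantization stores each embedding as a handful of small integer codes instead of a full float vector. The sketch below shows only the encode and decode steps with fixed toy codebooks; in practice the codebooks would be learned (for example with k-means), and the paper's adaptations against quantization artefacts are not reproduced here. All names are hypothetical.

```python
def split(vec, num_parts):
    # Cut the vector into equally sized sub-vectors (length must divide evenly).
    size = len(vec) // num_parts
    return [vec[i * size:(i + 1) * size] for i in range(num_parts)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode(vec, codebooks):
    # For each sub-vector, keep only the index of its nearest centroid.
    codes = []
    for sub, book in zip(split(vec, len(codebooks)), codebooks):
        distances = [squared_distance(sub, centroid) for centroid in book]
        codes.append(distances.index(min(distances)))
    return codes

def decode(codes, codebooks):
    # Rebuild an approximate vector by concatenating the chosen centroids.
    approx = []
    for code, book in zip(codes, codebooks):
        approx.extend(book[code])
    return approx

# Toy codebooks: two sub-spaces with two centroids each (illustrative values).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [5.0, 5.0]]]
codes = encode([0.9, 1.1, 4.0, 6.0], codebooks)
print(codes, decode(codes, codebooks))
```

Here a four-float vector shrinks to two small integers; the memory saving grows with the embedding dimension, at the cost of the approximation error visible in the decoded vector.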

Posted Content•

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang

28 Nov 2016-arXiv: Computation and Language

TL;DR: This new dataset is aimed to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering, and is the most comprehensive real-world dataset of its kind in both quantity and quality.

Abstract: We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would, (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguish MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset make it attractive for benchmarking machine reading comprehension and question-answering models.

Posted Content•

A Persona-Based Neural Conversation Model

Jiwei Li¹, Michel Galley², Chris Brockett³, Georgios P. Spithourakis⁴, Jianfeng Gao³, Bill Dolan³•Institutions (4)

Stanford University¹, Carnegie Mellon University², Microsoft³, National Technical University of Athens⁴

19 Mar 2016-arXiv: Computation and Language

TL;DR: This work presents persona-based models for handling the issue of speaker consistency in neural response generation that yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models.

Abstract: We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gains in speaker consistency as measured by human judges.

Posted Content•

SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents

Ramesh Nallapati¹, Feifei Zhai¹, Bowen Zhou¹•Institutions (1)

IBM¹

14 Nov 2016-arXiv: Computation and Language

TL;DR: SummaRuNNer is a recurrent neural network (RNN) based sequence model for extractive summarization of documents that achieves performance better than or comparable to the state of the art.

Abstract: We present SummaRuNNer, a Recurrent Neural Network (RNN) based sequence model for extractive summarization of documents and show that it achieves performance better than or comparable to state-of-the-art. Our model has the additional advantage of being very interpretable, since it allows visualization of its predictions broken up by abstract features such as information content, salience and novelty. Another novel contribution of our work is abstractive training of our extractive model that can train on human generated reference summaries alone, eliminating the need for sentence-level extractive labels.

Posted Content•

Learning End-to-End Goal-Oriented Dialog

Antoine Bordes¹, Y-Lan Boureau², Jason Weston¹•Institutions (2)

Facebook¹, New York University²

24 May 2016-arXiv: Computation and Language

TL;DR: In this article, an end-to-end dialog system based on memory networks is proposed for goal-oriented reservation systems, which can reach promising, yet imperfect, performance and learn to perform non-trivial operations.

Abstract: Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.

Posted Content•

Neural Summarization by Extracting Sentences and Words

Jianpeng Cheng¹, Mirella Lapata•Institutions (1)

University of Edinburgh¹

23 Mar 2016-arXiv: Computation and Language

TL;DR: This work develops a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor that allows for different classes of summarization models which can extract sentences or words.

Abstract: Traditional approaches to extractive summarization rely heavily on human-engineered features. In this work we propose a data-driven approach based on neural networks and continuous sentence features. We develop a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor. This architecture allows us to develop different classes of summarization models which can extract sentences or words. We train our models on large scale corpora containing hundreds of thousands of document-summary pairs. Experimental results on two summarization datasets demonstrate that our models obtain results comparable to the state of the art without any access to linguistic annotation.

Posted Content•

Achieving Human Parity in Conversational Speech Recognition

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Michael L. Seltzer, Andreas Stolcke, Dong Yu, Geoffrey Zweig

17 Oct 2016-arXiv: Computation and Language

TL;DR: This paper measures the human error rate on the widely used NIST 2000 test set and finds that the authors' latest automated speech recognition system has reached human parity, establishing a new state of the art and edging past the human benchmark.

Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination.
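
The error rates quoted above are presumably word error rates, which is an assumption since the abstract only says "error rate". Under that assumption, the metric is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. A minimal sketch with a hypothetical function name, assuming a non-empty reference:

```python
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```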

Posted Content•

Pointer Sentinel Mixture Models

Stephen Merity¹, Caiming Xiong¹, James Bradbury¹, Richard Socher¹•Institutions (1)

Salesforce.com¹

26 Sep 2016-arXiv: Computation and Language

TL;DR: The authors introduced the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word using a standard softmax classifier.

Abstract: Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
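
The choice between copying a context word and using the softmax vocabulary can be sketched as a single mixture distribution. The formulation below, in which a sentinel competes with the context positions in one softmax and its share becomes the weight on the vocabulary distribution, is an assumption about how the mixture is formed; the names and scores are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_sentinel_mixture(vocab_probs, context_words, pointer_scores, sentinel_score):
    # One softmax over context positions plus the sentinel (last entry).
    attention = softmax(list(pointer_scores) + [sentinel_score])
    gate = attention[-1]  # probability mass handed to the vocabulary softmax
    mixed = {word: gate * p for word, p in vocab_probs.items()}
    # Remaining mass goes directly to the words at the attended positions.
    for word, weight in zip(context_words, attention[:-1]):
        mixed[word] = mixed.get(word, 0.0) + weight
    return mixed

vocab_probs = {"the": 0.7, "cat": 0.3}
mixed = pointer_sentinel_mixture(vocab_probs, ["Fed", "the"], [0.0, 0.0], 0.0)
print(mixed)
```

The out-of-vocabulary word "Fed" receives probability purely through the pointer, which is how such a model can predict rare words that appeared in the recent context.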

Posted Content•

Dual Learning for Machine Translation

Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, Wei-Ying Ma

01 Nov 2016-arXiv: Computation and Language

TL;DR: The authors propose a dual-learning mechanism that enables an NMT system to learn automatically from unlabeled data through a dual-learning game. It is inspired by the observation that any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop and generate informative feedback signals to train the translation models, even without the involvement of a human labeler.

Abstract: While neural machine translation (NMT) has made good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation dual-NMT. Experiments show that dual-NMT works very well on English$\leftrightarrow$French translation; especially, by learning from monolingual data (with 10% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the French-to-English translation task.

Posted Content•

Neural Machine Translation in Linear Time

[...]

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, Koray Kavukcuoglu

31 Oct 2016-arXiv: Computation and Language

TL;DR: The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks; the latent alignment structure contained in the representations reflects the expected alignment between the tokens.

Abstract: We present a novel neural network for processing sequences. The ByteNet is a one-dimensional convolutional neural network that is composed of two parts, one to encode the source sequence and the other to decode the target sequence. The two network parts are connected by stacking the decoder on top of the encoder and preserving the temporal resolution of the sequences. To address the differing lengths of the source and the target, we introduce an efficient mechanism by which the decoder is dynamically unfolded over the representation of the encoder. The ByteNet uses dilation in the convolutional layers to increase its receptive field. The resulting network has two core properties: it runs in time that is linear in the length of the sequences and it sidesteps the need for excessive memorization. The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks. The ByteNet also achieves state-of-the-art performance on character-to-character machine translation on the English-to-German WMT translation task, surpassing comparable neural translation models that are based on recurrent networks with attentional pooling and run in quadratic time. We find that the latent alignment structure contained in the representations reflects the expected alignment between the tokens.
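
The effect of dilation on the receptive field mentioned in this abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: the function name, the kernel size of 3, and the doubling dilation schedule are assumptions chosen only to show how stacked stride-1 convolutions widen their view of the sequence.

```python
def receptive_field(kernel_size, dilations):
    """Span (in positions) seen by one output of stacked stride-1 1-D convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d positions.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Five layers without dilation versus five layers with doubling dilation
# (both schedules are illustrative assumptions, not the paper's configuration).
undilated = receptive_field(3, [1, 1, 1, 1, 1])
doubling = receptive_field(3, [1, 2, 4, 8, 16])
print(undilated, doubling)
```

With the same number of layers, the dilated stack covers a span that grows exponentially with depth rather than linearly, while the cost per layer stays the same.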

Posted Content•

Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

[...]

Suyoun Kim¹, Takaaki Hori¹, Shinji Watanabe¹•Institutions (1)

Mitsubishi Electric Research Laboratories¹

21 Sep 2016-arXiv: Computation and Language

TL;DR: A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.

Abstract: Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve the performance over another end-to-end approach, the Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target character without any conditional independence assumptions. However, we observed that the performance of the attention has shown poor results in noisy condition and is hard to learn in the initial training stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases due to the lack of left-to-right constraints as used in CTC. This paper presents a novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. An experiment on the WSJ and CHiME-4 tasks demonstrates its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER).
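
One common way to combine two objectives in a multi-task setup is a weighted interpolation of their losses. The sketch below assumes that form; the function name, the `ctc_weight` parameter, and its default value are hypothetical and are not stated in the abstract.

```python
def joint_loss(ctc_loss, attention_loss, ctc_weight=0.2):
    """Interpolate a CTC loss and an attention loss with a single weight.

    ctc_weight is an assumed tunable hyper-parameter in [0, 1]:
    0 gives the pure attention objective, 1 gives the pure CTC objective.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Illustrative numbers only; real losses would come from the two output branches
# of a shared encoder.
print(joint_loss(10.0, 5.0, ctc_weight=0.2))
```

The CTC term acts as a regulariser that pushes the shared encoder toward monotonic, left-to-right alignments, which is the constraint the abstract says the attention model lacks.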

Posted Content•

Understanding Neural Networks through Representation Erasure

[...]

Jiwei Li, Will S. Monroe, Dan Jurafsky

24 Dec 2016-arXiv: Computation and Language

TL;DR: This paper proposes a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words.

Abstract: While neural networks have been successfully applied to many natural language processing tasks, they come at the cost of interpretability. In this paper, we propose a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words. We present several approaches to analyzing the effects of such erasure, from computing the relative difference in evaluation metrics, to using reinforcement learning to erase the minimum set of input words in order to flip a neural model's decision. In a comprehensive analysis of multiple NLP tasks, including linguistic feature classification, sentence-level sentiment analysis, and document level sentiment aspect prediction, we show that the proposed methodology not only offers clear explanations about neural model decisions, but also provides a way to conduct error analysis on neural models.
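
The "relative difference in evaluation metrics" idea can be sketched on a toy scorer. Everything below is a hypothetical illustration: the scoring function, its weights, and the choice to erase a dimension by setting it to zero are assumptions, not the paper's models or exact procedure.

```python
def relative_importance(score_fn, vector, index):
    """Relative drop in score when one input dimension is erased (set to zero).

    Assumes score_fn(vector) is non-zero.
    """
    full = score_fn(vector)
    erased = list(vector)
    erased[index] = 0.0
    return (full - score_fn(erased)) / full


def toy_score(vector):
    """Stand-in for a model's confidence: a fixed weighted sum (assumed weights)."""
    weights = [0.5, 2.0, 0.1]
    return sum(w * x for w, x in zip(weights, vector))


vector = [1.0, 1.0, 1.0]
for i in range(len(vector)):
    print(i, round(relative_importance(toy_score, vector, i), 3))
```

Dimensions whose erasure causes the largest relative drop are the ones the scorer depends on most, which is the kind of signal the methodology uses for interpretation and error analysis.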

Posted Content•

Deep Reinforcement Learning for Dialogue Generation

[...]

Jiwei Li, Will S. Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky

05 Jun 2016-arXiv: Computation and Language

TL;DR: The authors apply deep reinforcement learning to model future reward in chatbot dialogue, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (nonrepetitive turns), coherence, and ease of answering.

Abstract: Recent neural models of dialogue generation offer great promise for generating responses for conversational agents, but tend to be shortsighted, predicting utterances one at a time while ignoring their influence on future outcomes. Modeling the future direction of a dialogue is crucial to generating coherent, interesting dialogues, a need which led traditional NLP models of dialogue to draw on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to model future reward in chatbot dialogue. The model simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering (related to forward-looking function). We evaluate our model on diversity, length as well as with human judges, showing that the proposed algorithm generates more interactive responses and manages to foster a more sustained conversation in dialogue simulation. This work marks a first step towards learning a neural conversational model based on the long-term success of dialogues.

Posted Content•

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

[...]

Jey Han Lau¹, Timothy Baldwin²•Institutions (2)

King's College London¹, University of Melbourne²

19 Jul 2016-arXiv: Computation and Language

TL;DR: It is found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings.

Abstract: Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and two state-of-the-art document embedding methodologies. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general purpose applications, and release source code to induce document embeddings using our trained doc2vec models.

Posted Content•

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

[...]

Danqi Chen, Jason Bolton¹, Christopher D. Manning¹•Institutions (1)

Stanford University¹

09 Jun 2016-arXiv: Computation and Language

TL;DR: The authors conduct a thorough examination of the CNN/Daily Mail reading comprehension task and show that simple, carefully designed systems can obtain accuracies of 73.6% and 76.6% on the two datasets, exceeding current state-of-the-art results by 7-10% and approaching what they believe is the ceiling for performance on this task.

Abstract: Enabling a computer to understand a document so that it can answer comprehension questions is a central, yet unsolved goal of NLP. A key factor impeding its solution by machine learned systems is the limited availability of human-annotated data. Hermann et al. (2015) seek to solve this problem by creating over a million training examples by pairing CNN and Daily Mail news articles with their summarized bullet points, and show that a neural network can then be trained to give good performance on this task. In this paper, we conduct a thorough examination of this new reading comprehension task. Our primary aim is to understand what depth of language understanding is required to do well on this task. We approach this from one side by doing a careful hand-analysis of a small subset of the problems and from the other by showing that simple, carefully designed systems can obtain accuracies of 73.6% and 76.6% on these two datasets, exceeding current state-of-the-art results by 7-10% and approaching what we believe is the ceiling for performance on this task.

Posted Content•

Language to Logical Form with Neural Attention

[...]

Li Dong¹, Mirella Lapata•Institutions (1)

University of Edinburgh¹

06 Jan 2016-arXiv: Computation and Language

TL;DR: This paper presents a general method based on an attention-enhanced encoder-decoder model that encodes input utterances into vector representations and generates their logical forms by conditioning the output sequences or trees on the encoding vectors.

Abstract: Semantic parsing aims at mapping natural language to machine interpretable meaning representations. Traditional approaches rely on high-quality lexicons, manually-built templates, and linguistic features which are either domain- or representation-specific. In this paper we present a general method based on an attention-enhanced encoder-decoder model. We encode input utterances into vector representations, and generate their logical forms by conditioning the output sequences or trees on the encoding vectors. Experimental results on four datasets show that our approach performs competitively without using hand-engineered features and is easy to adapt across domains and meaning representations.

Posted Content•

Bag of Tricks for Efficient Text Classification

[...]

Armand Joulin¹, Edouard Grave², Piotr Bojanowski¹, Tomas Mikolov¹•Institutions (2)

Facebook¹, Columbia University²

06 Jul 2016-arXiv: Computation and Language

TL;DR: This paper explores a simple and efficient baseline for text classification and shows that the fastText classifier is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.

Abstract: This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

Posted Content•

Bidirectional Attention Flow for Machine Comprehension

[...]

Minjoon Seo¹, Aniruddha Kembhavi², Ali Farhadi¹, Hannaneh Hajishirzi¹•Institutions (2)

University of Washington¹, Allen Institute for Artificial Intelligence²

05 Nov 2016-arXiv: Computation and Language

TL;DR: The Bi-Directional Attention Flow (BIDAF) network is a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.

Abstract: Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.

Posted Content•

Text Matching as Image Recognition

[...]

Liang Pang¹, Yanyan Lan¹, Jiafeng Guo¹, Jun Xu¹, Shengxian Wan¹, Xueqi Cheng¹•Institutions (1)

Chinese Academy of Sciences¹

20 Feb 2016-arXiv: Computation and Language

TL;DR: Text matching is modeled as image recognition: a matching matrix of word similarities is viewed as an image, and a convolutional neural network captures rich matching patterns in a layer-by-layer way, successfully identifying salient signals such as n-gram and n-term matchings.

Abstract: Matching two texts is a fundamental problem in many natural language processing tasks. An effective way is to extract meaningful matching patterns from words, phrases, and sentences to produce the matching score. Inspired by the success of convolutional neural network in image recognition, where neurons can capture many complicated patterns based on the extracted elementary visual patterns such as oriented edges and corners, we propose to model text matching as the problem of image recognition. Firstly, a matching matrix whose entries represent the similarities between words is constructed and viewed as an image. Then a convolutional neural network is utilized to capture rich matching patterns in a layer-by-layer way. We show that by resembling the compositional hierarchies of patterns in image recognition, our model can successfully identify salient signals such as n-gram and n-term matchings. Experimental results demonstrate its superiority against the baselines.
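
The matching matrix described in this abstract can be sketched with the simplest possible similarity, an exact-match indicator. The function name and the indicator similarity are assumptions made for illustration; the abstract only says that entries represent similarities between words, so other similarity functions would fit equally well.

```python
def matching_matrix(words_a, words_b):
    """Build a |words_a| x |words_b| matrix of word-pair similarities.

    Similarity here is an exact-match indicator (1.0 or 0.0), one simple
    assumed choice among many possible similarity functions.
    """
    return [[1.0 if a == b else 0.0 for b in words_b] for a in words_a]


text_a = ["the", "cat", "sat"]
text_b = ["the", "cat", "ran"]
for row in matching_matrix(text_a, text_b):
    print(row)
```

A run of ones along a diagonal corresponds to a shared n-gram, which is the kind of local pattern a convolutional filter sliding over this "image" can pick up.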
