scispace - formally typeset
Search or ask a question

Showing papers by "Quoc V. Le published in 2015"


Posted Content
Oriol Vinyals1, Quoc V. Le
TL;DR: A simple approach to conversational modeling which uses the recently proposed sequence to sequence framework, and is able to extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles.
Abstract: Conversational modeling is an important task in natural language understanding and machine intelligence Although previous approaches exist, they are often restricted to specific domains (eg, booking an airline ticket) and require hand-crafted rules In this paper, we present a simple approach for this task which uses the recently proposed sequence to sequence framework Our model converses by predicting the next sentence given the previous sentence or sentences in a conversation The strength of our model is that it can be trained end-to-end and thus requires much fewer hand-crafted rules We find that this straightforward model can generate simple conversations given a large conversational training dataset Our preliminary results suggest that, despite optimizing the wrong objective function, the model is able to converse well It is able extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles On a domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via conversations On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning As expected, we also find that the lack of consistency is a common failure mode of our model

1,774 citations


Proceedings ArticleDOI
01 Jul 2015
TL;DR: This paper proposed and implemented an effective technique to address the problem of out-of-vocabulary (OOV) word translation in NMT, which trains an NMT system on data that is augmented by the output of a word alignment algorithm, and then uses this information in a post-processing step that translates every OOV word using a dictionary.
Abstract: Neural Machine Translation (NMT) is a new approach to machine translation that has shown promising results that are comparable to traditional approaches. A significant weakness in conventional NMT systems is their inability to correctly translate very rare words: end-to-end NMTs tend to have relatively small vocabularies with a single unk symbol that represents every possible out-of-vocabulary (OOV) word. In this paper, we propose and implement an effective technique to address this problem. We train an NMT system on data that is augmented by the output of a word alignment algorithm, allowing the NMT system to emit, for each OOV word in the target sentence, the position of its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates every OOV word using a dictionary. Our experiments on the WMT’14 English to French translation task show that this method provides a substantial improvement of up to 2.8 BLEU points over an equivalent NMT system that does not use this technique. With 37.5 BLEU points, our NMT system is the first to surpass the best result achieved on a WMT’14 contest task.

728 citations


Posted Content
Andrew M. Dai1, Quoc V. Le1
TL;DR: This paper used unlabeled data to improve sequence learning with recurrent networks, which can be used as a "pre-training" step for a later supervised sequence learning algorithm, so that the parameters obtained from the unsupervised step can be a starting point for other supervised training models.
Abstract: We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a "pretraining" step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the unsupervised step can be used as a starting point for other supervised training models. In our experiments, we find that long short term memory recurrent networks after being pretrained with the two approaches are more stable and generalize better. With pretraining, we are able to train long short term memory recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups.

711 citations


Proceedings Article
Andrew M. Dai1, Quoc V. Le1
07 Dec 2015
TL;DR: Two approaches to use unlabeled data to improve Sequence Learning with recurrent networks are presented and it is found that long short term memory recurrent networks after pretrained with the two approaches become more stable to train and generalize better.
Abstract: We present two approaches to use unlabeled data to improve Sequence Learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a language model in NLP. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a "pretraining" algorithm for a later supervised sequence learning algorithm. In other words, the parameters obtained from the pretraining step can then be used as a starting point for other supervised training models. In our experiments, we find that long short term memory recurrent networks after pretrained with the two approaches become more stable to train and generalize better. With pretraining, we were able to achieve strong performance in many classification tasks, such as text classification with IMDB, DBpedia or image recognition in CIFAR-10.

688 citations


Posted Content
TL;DR: This paper proposes a simpler solution that use recurrent neural networks composed of rectified linear units that is comparable to LSTM on four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem.
Abstract: Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that use recurrent neural networks composed of rectified linear units. Key to our solution is the use of the identity matrix or its scaled version to initialize the recurrent weight matrix. We find that our solution is comparable to LSTM on our four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem.

655 citations


Posted Content
TL;DR: This paper explores the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which it is found surprisingly effective when training these very deep architectures.
Abstract: Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a low-overhead and easy-to-implement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fully-connected 20-layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72% relative reduction in error rate over a carefully-tuned baseline on a challenging question-answering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures.

433 citations


Posted Content
TL;DR: A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
Abstract: We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

427 citations


Posted Content
TL;DR: This work observes that the Paragraph Vector method performs significantly better than other methods, and proposes a simple improvement to enhance embedding quality, and shows that much like word embeddings, vector operations on Paragraph Vectors can perform useful semantic results.
Abstract: Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that much like word embeddings, vector operations on Paragraph Vectors can perform useful semantic results.

325 citations


Posted Content
TL;DR: Neural Programmer as discussed by the authors is an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations, which can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations.
Abstract: Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.

224 citations


Posted Content
Minh-Thang Luong1, Quoc V. Le1, Ilya Sutskever1, Oriol Vinyals1, Lukasz Kaiser1 
TL;DR: This paper explored three multi-task learning settings for sequence to sequence models: (a) the oneto-many setting, where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting, useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation.
Abstract: Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the oneto-many setting - where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting - useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting - where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation. Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.

134 citations


Quoc V. Le1
01 Jan 2015
TL;DR: This tutorial explains how to parallelize the training of neural networks, which is an important topic because parallelizing neural networks has played an important role in the current deep learning movement.
Abstract: The flexibility of neural networks is a very powerful property. In many cases, these changes lead to great improvements in accuracy compared to basic models that we discussed in the previous tutorial. In the last part of the tutorial, I will also explain how to parallelize the training of neural networks. This is also an important topic because parallelizing neural networks has played an important role in the current deep learning movement.


Patent
Quoc V. Le1, Minh-Thang Luong1, Ilya Sutskever1, Oriol Vinyals1, Wojciech Zaremba1 
23 Oct 2015
TL;DR: This article proposed a method of training a neural network translation system to track the source in source sentences of unknown words in target sentences, in a source language and a target language, respectively.
Abstract: Method, system and apparatus, including computer programs encoded on computer storage media, for neural translation systems with rare word processing. A method of training a neural network translation system to track the source in source sentences of unknown words in target sentences, in a source language and a target language, respectively.

Patent
Quoc V. Le1
30 Jan 2015


Posted Content
TL;DR: In this article, an encoder recurrent neural network (RNN) is used to compute features at the same frame rate as the input, and a transducer RNN that operates over blocks of input steps.
Abstract: Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a new model that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, our method computes the next-step distribution conditioned on the partial input sequence observed and the partial sequence generated. It accomplishes this goal using an encoder recurrent neural network (RNN) that computes features at the same frame rate as the input, and a transducer RNN that operates over blocks of input steps. The transducer RNN extends the sequence produced so far using a local sequence-to-sequence model. During training, our method uses alignment information to generate supervised targets for each block. Approximate alignment is easily available for tasks such as speech recognition, action recognition in videos, etc. During inference (decoding), beam search is used to find the most likely output sequence for an input sequence. This decoding is performed online - at the end of each block, the best candidates from the previous block are extended through the local sequence-to-sequence model. On TIMIT, our online method achieves 19.8% phone error rate (PER). For comparison with published sequence-to-sequence methods, we used a bidirectional encoder and achieved 18.7% PER compared to 17.6% from the best reported sequence-to-sequence model. Importantly, unlike sequence-to-sequence our model is minimally impacted by the length of the input. On artificially created longer utterances, it achieves 20.9% with a unidirectional model, compared to 20% from the best bidirectional sequence-to-sequence models.