
Showing papers by Ilya Sutskever published in 2013


Proceedings Article
Tomas Mikolov1, Ilya Sutskever1, Kai Chen1, Greg S. Corrado1, Jeffrey Dean1 
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

24,012 citations
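
As a concrete illustration of the negative-sampling objective mentioned above, here is a minimal numpy sketch of one stochastic update for a single (center word, context word) pair. The vocabulary size, embedding dimensionality, learning rate, and the uniform draw of negative words are placeholder assumptions; the paper itself samples negatives from a smoothed unigram distribution.

import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 300, 5                    # vocab size, embedding dim, negatives per pair
W_in = rng.normal(scale=0.01, size=(V, d))  # input (center-word) vectors
W_out = np.zeros((V, d))                    # output (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(center, context, lr=0.025):
    """One SGD step on -[log sigmoid(u_ctx . v_c) + sum_i log sigmoid(-u_neg_i . v_c)]."""
    negatives = rng.integers(0, V, size=k)   # placeholder; the paper uses a smoothed unigram distribution
    v_c = W_in[center]
    targets = np.concatenate(([context], negatives))
    labels = np.array([1.0] + [0.0] * k)     # 1 = observed context word, 0 = noise word
    scores = sigmoid(W_out[targets] @ v_c)
    err = scores - labels                    # gradient of the per-pair logistic losses
    grad_in = err @ W_out[targets]
    W_out[targets] -= lr * np.outer(err, v_c)
    W_in[center] -= lr * grad_in

negative_sampling_step(center=42, context=7)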


Posted Content
Tomas Mikolov1, Ilya Sutskever1, Kai Chen1, Greg S. Corrado1, Jeffrey Dean1 
TL;DR: In this paper, several extensions to the continuous Skip-gram model are presented that improve both the quality of the learned vector representations, which capture a large number of precise syntactic and semantic word relationships, and the training speed.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

11,343 citations
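
The phrase-finding step summarized above scores candidate bigrams so that words which co-occur unusually often can be merged into single tokens such as "new_york". A minimal sketch of that bigram score, where the discount delta, the toy corpus, and the threshold are assumed illustrative values:

from collections import Counter

corpus = "new york is larger than toronto but new york is not a country".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def phrase_score(a, b, delta=1.0):
    # score(a, b) = (count(a b) - delta) / (count(a) * count(b)); delta discounts rare pairs
    return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

# Bigrams whose score clears a threshold would be merged into single tokens like "new_york".
threshold = 0.1
phrases = {ab for ab in bigrams if phrase_score(*ab) > threshold}
print(phrases)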


Proceedings Article
16 Jun 2013
TL;DR: It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
Abstract: Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

4,121 citations
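
A hedged sketch of the general recipe described above: stochastic gradient descent with Nesterov-style momentum whose coefficient is slowly increased toward a maximum value. The ramp used below and the toy quadratic objective are illustrative assumptions, not the paper's exact schedule or experiments.

import numpy as np

def nesterov_sgd(grad_fn, theta, lr=0.01, mu_max=0.99, steps=1000):
    v = np.zeros_like(theta)
    for t in range(steps):
        # Slowly increase the momentum coefficient from 0.5 toward mu_max.
        mu = min(mu_max, 1.0 - 0.5 / (1.0 + t / 250.0))
        # Nesterov momentum: evaluate the gradient at the look-ahead point.
        g = grad_fn(theta + mu * v)
        v = mu * v - lr * g
        theta = theta + v
    return theta

# Toy usage: minimize a poorly conditioned quadratic 0.5 * theta^T A theta.
A = np.diag([1.0, 100.0])
theta = nesterov_sgd(lambda th: A @ th, theta=np.array([1.0, 1.0]))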


Posted Content
TL;DR: This method translates missing word and phrase entries by learning language structures from large monolingual data and a mapping between languages from small bilingual data; it uses distributed representations of words and learns a linear mapping between the vector spaces of the two languages.
Abstract: Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.

1,564 citations
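
The linear mapping between vector spaces described above can be estimated by ordinary least squares from a small seed dictionary of word pairs. A minimal sketch under the assumption of pre-trained source and target embeddings (random placeholders here so the example is self-contained):

import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 300, 300, 5000
X = rng.normal(size=(n_pairs, d_src))   # source-language vectors of the seed dictionary
Z = rng.normal(size=(n_pairs, d_tgt))   # corresponding target-language vectors

# Solve min_W sum_i ||x_i W - z_i||^2 by least squares.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x_vec, target_matrix, topk=5):
    """Map a source vector through W and return indices of the nearest target words."""
    z_hat = x_vec @ W
    sims = target_matrix @ z_hat / (np.linalg.norm(target_matrix, axis=1)
                                    * np.linalg.norm(z_hat) + 1e-12)
    return np.argsort(-sims)[:topk]      # candidate translations, precision@5 style

print(translate(X[0], Z))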


Posted Content
TL;DR: This article showed that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent, which suggests that it is the space, rather than individual units, that contains the semantic information in the high layers of neural networks.
Abstract: Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. This suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

1,313 citations
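
The imperceptible perturbations described above are found by optimizing the input to increase the network's prediction error. The sketch below shows the idea with simple gradient ascent on a linear softmax classifier standing in for a deep network (the paper uses a box-constrained optimizer on an actual deep model); all sizes and weights are placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 10, 784
W = rng.normal(scale=0.1, size=(n_classes, dim))
b = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient_of_loss(x, y):
    """d/dx of the cross-entropy loss of the linear classifier at (x, y)."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def adversarial_perturbation(x, y, step=0.05, max_iters=200):
    x_adv = x.copy()
    for _ in range(max_iters):
        if np.argmax(W @ x_adv + b) != y:                    # prediction has flipped
            break
        x_adv += step * input_gradient_of_loss(x_adv, y)     # ascend the loss
    return x_adv - x                                         # the (small) perturbation

x = rng.normal(size=dim)
y = int(np.argmax(W @ x + b))
delta = adversarial_perturbation(x, y)
print(np.linalg.norm(delta))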


Dissertation
01 Jan 2013
TL;DR: A new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs is described; it is more powerful than similar models while being less difficult to train. A random parameter initialization scheme is also described that allows gradient descent with momentum to train RNNs on problems with long-term dependencies.
Abstract: Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results. We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.

353 citations
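
One ingredient of the initialization theme in the thesis is a careful random initialization of the recurrent weight matrix. The sketch below shows a generic sparse-initialization-plus-spectral-radius-rescaling scheme of the kind used in this line of work; the fan-in of 15 and target radius of 1.1 are assumed illustrative constants, not necessarily the thesis's exact prescription.

import numpy as np

def init_recurrent(n_hidden, fan_in=15, spectral_radius=1.1, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((n_hidden, n_hidden))
    for i in range(n_hidden):
        # Each hidden unit receives only a small number of nonzero recurrent weights.
        idx = rng.choice(n_hidden, size=fan_in, replace=False)
        W[i, idx] = rng.normal(scale=1.0, size=fan_in)
    # Rescale so the largest eigenvalue magnitude is close to the target radius,
    # keeping recurrent signals from exploding or vanishing too quickly.
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / radius)

W_hh = init_recurrent(n_hidden=100)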


Posted Content
TL;DR: The Deep Mixture of Experts introduced in this paper is a stacked model with multiple sets of gating and experts; it exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
Abstract: Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific ("what") experts at the second layer. In addition, we see that the different combinations are in use when the model is applied to a dataset of speech monophones. These results demonstrate effective use of all expert combinations.

185 citations
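
A compact sketch of the stacked gating-and-experts structure described above, with two mixture layers so that each input is routed through a combination of experts at every layer. The layer sizes, tanh experts, and softmax gating are illustrative assumptions; training is omitted.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class MoELayer:
    def __init__(self, d_in, d_out, n_experts):
        self.experts = [rng.normal(scale=0.1, size=(d_out, d_in))
                        for _ in range(n_experts)]
        self.gate = rng.normal(scale=0.1, size=(n_experts, d_in))

    def __call__(self, x):
        g = softmax(self.gate @ x)                        # mixture weights for this input
        outs = np.stack([np.tanh(E @ x) for E in self.experts])
        return g @ outs                                   # gated combination of expert outputs

# Stacking two MoE layers gives n_experts_1 * n_experts_2 effective expert paths.
layer1 = MoELayer(d_in=784, d_out=128, n_experts=4)
layer2 = MoELayer(d_in=128, d_out=10, n_experts=4)
x = rng.normal(size=784)
y = layer2(layer1(x))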


Patent
23 Dec 2013
TL;DR: In this paper, a parallel convolutional neural network (CNN) is implemented by a plurality of CNNs each on a respective processing node, and each CNN has a multiplicity of layers.
Abstract: A parallel convolutional neural network is provided. The CNN is implemented by a plurality of convolutional neural networks each on a respective processing node. Each CNN has a plurality of layers. A subset of the layers are interconnected between processing nodes such that activations are fed forward across nodes. The remaining subset is not so interconnected.

66 citations
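
A rough sketch of the interconnection pattern in the claim above: two model-parallel towers in which most layers use only their own tower's activations and one designated layer receives the activations of both towers (the cross-node feed-forward). Dense layers stand in for the convolutional layers, and all sizes are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def dense(d_in, d_out):
    return rng.normal(scale=0.1, size=(d_out, d_in))

towers = 2
W1 = [dense(256, 128) for _ in range(towers)]            # per-tower layer, not interconnected
W2 = [dense(128 * towers, 64) for _ in range(towers)]    # interconnected layer (sees both towers)
W3 = [dense(64, 32) for _ in range(towers)]              # per-tower layer again

x = rng.normal(size=256)
h1 = [np.tanh(W @ x) for W in W1]                        # each node computes locally
h1_all = np.concatenate(h1)                              # activations fed forward across nodes
h2 = [np.tanh(W @ h1_all) for W in W2]                   # only this layer crosses nodes
h3 = [np.tanh(W3[i] @ h2[i]) for i in range(towers)]     # back to node-local computation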


Patent
30 Aug 2013
TL;DR: In this article, a switch is linked to feature detectors in at least some of the layers of the neural network and randomly selectively disables each of the feature detectors according to a preconfigured probability.
Abstract: A system for training a neural network. A switch is linked to feature detectors in at least some of the layers of the neural network. For each training case, the switch randomly selectively disables each of the feature detectors in accordance with a preconfigured probability. The weights from each training case are then normalized for applying the neural network to test data.

41 citations
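
The randomized disabling described in this claim corresponds to what is now known as dropout. Below is a minimal sketch of the training-time masking and the compensating scaling applied when the network is used on test data; the drop probability and layer sizes are placeholders, and the activation scaling shown is the usual equivalent of normalizing the weights.

import numpy as np

rng = np.random.default_rng(0)

def feature_layer(x, W, p_drop=0.5, train=True):
    h = np.maximum(0.0, W @ x)                    # feature-detector activations
    if train:
        mask = rng.random(h.shape) >= p_drop      # randomly disable each detector
        return h * mask
    return h * (1.0 - p_drop)                     # scale at test time to compensate

W = rng.normal(scale=0.1, size=(128, 784))
x = rng.normal(size=784)
h_train = feature_layer(x, W, train=True)
h_test = feature_layer(x, W, train=False)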


Proceedings Article
16 Jun 2013
TL;DR: kNCA is presented, which generalizes NCA by learning distance metrics that are appropriate for kNN with arbitrary k; kNCA often improves classification accuracy over state-of-the-art methods, produces qualitative differences in the embeddings as k is varied, and is more robust with respect to label noise.
Abstract: Neighborhood Components Analysis (NCA) is a popular method for learning a distance metric to be used within a k-nearest neighbors (kNN) classifier. A key assumption built into the model is that each point stochastically selects a single neighbor, which makes the model well-justified only for kNN with k = 1. However, kNN classifiers with k > 1 are more robust and usually preferred in practice. Here we present kNCA, which generalizes NCA by learning distance metrics that are appropriate for kNN with arbitrary k. The main technical contribution is showing how to efficiently compute and optimize the expected accuracy of a kNN classifier. We apply similar ideas in an unsupervised setting to yield kSNE and kt-SNE, generalizations of Stochastic Neighbor Embedding (SNE, t-SNE) that operate on neighborhoods of size k, which provide an axis of control over embeddings that allow for more homogeneous and interpretable regions. Empirically, we show that kNCA often improves classification accuracy over state of the art methods, produces qualitative differences in the embeddings as k is varied, and is more robust with respect to label noise.

37 citations
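
For context, the sketch below computes the objective of the k = 1 special case (standard NCA) that kNCA generalizes: under a learned linear map, each point stochastically selects a single neighbor with probability decreasing in transformed distance, and the objective is the expected leave-one-out accuracy. The random data and identity metric are placeholders; the paper's contribution is the efficient extension of this expectation to arbitrary k.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # points (placeholder data)
y = rng.integers(0, 3, size=50)         # labels
A = np.eye(5)                           # metric parameters that would be learned

def nca_objective(A, X, y):
    Z = X @ A.T
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                           # a point cannot pick itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)                      # soft neighbor probabilities
    same = (y[:, None] == y[None, :])
    return (P * same).sum(axis=1).mean()   # expected 1-NN leave-one-out accuracy

print(nca_objective(A, X, y))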


Patent
20 Aug 2013
TL;DR: In this patent, an image processing module performs a color-space deformation on each pixel of an existing training image and then associates the original classification with the color-space-deformed training image; the technique may be applied to increase the size of a training set for training a neural network.
Abstract: A system and method for generating training images. An existing training image is associated with a classification. The system includes an image processing module that performs color-space deformation on each pixel of the existing training image and then associates the classification to the color-space deformed training image. The technique may be applied to increase the size of a training set for training a neural network.
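
A hedged sketch of one common instance of such a per-pixel color-space deformation: shifting every pixel along the principal components of the image's RGB values by a random amount, while keeping the original label. The PCA-based variant, the noise scale, and the toy image are assumptions for illustration, not necessarily the patent's exact formulation.

import numpy as np

rng = np.random.default_rng(0)

def color_deform(image, sigma=0.1):
    """image: H x W x 3 float array in [0, 1]; returns a color-deformed copy."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)                 # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)
    alphas = rng.normal(scale=sigma, size=3)           # random strength per component
    shift = eigvecs @ (alphas * eigvals)               # same color shift for every pixel
    return np.clip(image + shift, 0.0, 1.0)

image = rng.random((32, 32, 3))
augmented = color_deform(image)        # keeps the same classification label as the original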

Patent
18 Sep 2013
TL;DR: In this paper, a parallel convolutional neural network (CNN) is implemented by a plurality of CNNs each on a respective processing node, and each CNN has a multiplicity of layers.
Abstract: A parallel convolutional neural network is provided. The CNN is implemented by a plurality of convolutional neural networks each on a respective processing node. Each CNN has a plurality of layers. A subset of the layers are interconnected between processing nodes such that activations are fed forward across nodes. The remaining subset is not so interconnected.

Patent
23 Dec 2013
TL;DR: This patent describes a system for training a neural network in which a switch randomly and selectively disables each of the feature detectors according to a preconfigured probability.
Abstract: The invention relates to a system for training a neural network. A switch is associated with feature detectors in at least some of the layers of the neural network. For each training case, the switch randomly and selectively disables each of the feature detectors according to a preconfigured probability. The weights from each training case are then normalized for applying the neural network to test data.

Patent
23 Dec 2013
TL;DR: In this patent, a parallel convolutional neural network is implemented by a plurality of convolutional neural networks, each on a respective processing node; a subset of the layers is interconnected between the processing nodes so that activations are fed forward across the nodes.
Abstract: The present invention relates to a parallel convolutional neural network. The CNN is implemented by means of a plurality of convolutional neural networks, each located on a respective processing node. Each CNN has a plurality of layers. A subset of the layers is interconnected between the processing nodes so that activations are fed forward across the nodes. The remaining subset is not interconnected in this way.