
Showing papers by Ilya Sutskever published in 2013


Proceedings Article
Tomas Mikolov1, Ilya Sutskever1, Kai Chen1, Greg S. Corrado1, Jeffrey Dean1 
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

24,012 citations
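
As a concrete illustration of the negative-sampling objective mentioned above, here is a minimal numpy sketch of one stochastic update for a single (center word, context word) pair. The vocabulary size, embedding dimensionality, learning rate, and the uniform draw of negative words are placeholder assumptions; the paper itself samples negatives from a smoothed unigram distribution.

import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 300, 5                    # vocab size, embedding dim, negatives per pair
W_in = rng.normal(scale=0.01, size=(V, d))  # input (center-word) vectors
W_out = np.zeros((V, d))                    # output (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(center, context, lr=0.025):
    """One SGD step on -[log sigmoid(u_ctx . v_c) + sum_i log sigmoid(-u_neg_i . v_c)]."""
    negatives = rng.integers(0, V, size=k)   # placeholder; the paper uses a smoothed unigram distribution
    v_c = W_in[center]
    targets = np.concatenate(([context], negatives))
    labels = np.array([1.0] + [0.0] * k)     # 1 = observed context word, 0 = noise word
    scores = sigmoid(W_out[targets] @ v_c)
    err = scores - labels                    # gradient of the per-pair logistic losses
    grad_in = err @ W_out[targets]
    W_out[targets] -= lr * np.outer(err, v_c)
    W_in[center] -= lr * grad_in

negative_sampling_step(center=42, context=7)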


Posted Content
Tomas Mikolov1, Ilya Sutskever1, Kai Chen1, Greg S. Corrado1, Jeffrey Dean1 
TL;DR: In this paper, several extensions to the continuous Skip-gram model are presented that improve both the quality of the learned vector representations, which capture a large number of precise syntactic and semantic word relationships, and the training speed.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

11,343 citations
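
The phrase-finding step summarized above scores candidate bigrams so that words which co-occur unusually often can be merged into single tokens such as "new_york". A minimal sketch of that bigram score, where the discount delta, the toy corpus, and the threshold are assumed illustrative values:

from collections import Counter

corpus = "new york is larger than toronto but new york is not a country".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def phrase_score(a, b, delta=1.0):
    # score(a, b) = (count(a b) - delta) / (count(a) * count(b)); delta discounts rare pairs
    return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

# Bigrams whose score clears a threshold would be merged into single tokens like "new_york".
threshold = 0.1
phrases = {ab for ab in bigrams if phrase_score(*ab) > threshold}
print(phrases)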


Proceedings Article
16 Jun 2013
TL;DR: It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
Abstract: Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

4,121 citations
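
A hedged sketch of the general recipe described above: stochastic gradient descent with Nesterov-style momentum whose coefficient is slowly increased toward a maximum value. The ramp used below and the toy quadratic objective are illustrative assumptions, not the paper's exact schedule or experiments.

import numpy as np

def nesterov_sgd(grad_fn, theta, lr=0.01, mu_max=0.99, steps=1000):
    v = np.zeros_like(theta)
    for t in range(steps):
        # Slowly increase the momentum coefficient from 0.5 toward mu_max.
        mu = min(mu_max, 1.0 - 0.5 / (1.0 + t / 250.0))
        # Nesterov momentum: evaluate the gradient at the look-ahead point.
        g = grad_fn(theta + mu * v)
        v = mu * v - lr * g
        theta = theta + v
    return theta

# Toy usage: minimize a poorly conditioned quadratic 0.5 * theta^T A theta.
A = np.diag([1.0, 100.0])
theta = nesterov_sgd(lambda th: A @ th, theta=np.array([1.0, 1.0]))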


Posted Content
TL;DR: This method translates missing word and phrase entries by learning language structures from large monolingual data and a mapping between languages from small bilingual data; it uses distributed representations of words and learns a linear mapping between the vector spaces of the two languages.
Abstract: Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.

1,564 citations
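
The linear mapping between vector spaces described above can be estimated by ordinary least squares from a small seed dictionary of word pairs. A minimal sketch under the assumption of pre-trained source and target embeddings (random placeholders here so the example is self-contained):

import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 300, 300, 5000
X = rng.normal(size=(n_pairs, d_src))   # source-language vectors of the seed dictionary
Z = rng.normal(size=(n_pairs, d_tgt))   # corresponding target-language vectors

# Solve min_W sum_i ||x_i W - z_i||^2 by least squares.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x_vec, target_matrix, topk=5):
    """Map a source vector through W and return indices of the nearest target words."""
    z_hat = x_vec @ W
    sims = target_matrix @ z_hat / (np.linalg.norm(target_matrix, axis=1)
                                    * np.linalg.norm(z_hat) + 1e-12)
    return np.argsort(-sims)[:topk]      # candidate translations, precision@5 style

print(translate(X[0], Z))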


Posted Content
TL;DR: This article showed that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent, which suggests that it is the space, rather than individual units, that contains the semantic information in the high layers of neural networks.
Abstract: Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. This suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

1,313 citations
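
The imperceptible perturbations described above are found by optimizing the input to increase the network's prediction error. The sketch below shows the idea with simple gradient ascent on a linear softmax classifier standing in for a deep network (the paper uses a box-constrained optimizer on an actual deep model); all sizes and weights are placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 10, 784
W = rng.normal(scale=0.1, size=(n_classes, dim))
b = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient_of_loss(x, y):
    """d/dx of the cross-entropy loss of the linear classifier at (x, y)."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def adversarial_perturbation(x, y, step=0.05, max_iters=200):
    x_adv = x.copy()
    for _ in range(max_iters):
        if np.argmax(W @ x_adv + b) != y:                    # prediction has flipped
            break
        x_adv += step * input_gradient_of_loss(x_adv, y)     # ascend the loss
    return x_adv - x                                         # the (small) perturbation

x = rng.normal(size=dim)
y = int(np.argmax(W @ x + b))
delta = adversarial_perturbation(x, y)
print(np.linalg.norm(delta))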


Dissertation
01 Jan 2013
TL;DR: A new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs is described; it is more powerful than similar models while being less difficult to train. A random parameter initialization scheme is also described that allows gradient descent with momentum to train RNNs on problems with long-term dependencies.
Abstract: Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results. We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.

353 citations
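
One ingredient of the initialization theme in the thesis is a careful random initialization of the recurrent weight matrix. The sketch below shows a generic sparse-initialization-plus-spectral-radius-rescaling scheme of the kind used in this line of work; the fan-in of 15 and target radius of 1.1 are assumed illustrative constants, not necessarily the thesis's exact prescription.

import numpy as np

def init_recurrent(n_hidden, fan_in=15, spectral_radius=1.1, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((n_hidden, n_hidden))
    for i in range(n_hidden):
        # Each hidden unit receives only a small number of nonzero recurrent weights.
        idx = rng.choice(n_hidden, size=fan_in, replace=False)
        W[i, idx] = rng.normal(scale=1.0, size=fan_in)
    # Rescale so the largest eigenvalue magnitude is close to the target radius,
    # keeping recurrent signals from exploding or vanishing too quickly.
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / radius)

W_hh = init_recurrent(n_hidden=100)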


Posted Content
TL;DR: The Deep Mixture of Experts introduced in this paper is a stacked model with multiple sets of gating and experts; it exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
Abstract: Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific ("what") experts at the second layer. In addition, we see that the different combinations are in use when the model is applied to a dataset of speech monophones. These results demonstrate effective use of all expert combinations.

185 citations
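
A compact sketch of the stacked gating-and-experts structure described above, with two mixture layers so that each input is routed through a combination of experts at every layer. The layer sizes, tanh experts, and softmax gating are illustrative assumptions; training is omitted.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class MoELayer:
    def __init__(self, d_in, d_out, n_experts):
        self.experts = [rng.normal(scale=0.1, size=(d_out, d_in))
                        for _ in range(n_experts)]
        self.gate = rng.normal(scale=0.1, size=(n_experts, d_in))

    def __call__(self, x):
        g = softmax(self.gate @ x)                        # mixture weights for this input
        outs = np.stack([np.tanh(E @ x) for E in self.experts])
        return g @ outs                                   # gated combination of expert outputs

# Stacking two MoE layers gives n_experts_1 * n_experts_2 effective expert paths.
layer1 = MoELayer(d_in=784, d_out=128, n_experts=4)
layer2 = MoELayer(d_in=128, d_out=10, n_experts=4)
x = rng.normal(size=784)
y = layer2(layer1(x))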


Patent
23 Dec 2013
TL;DR: In this paper, a parallel convolutional neural network (CNN) is implemented by a plurality of CNNs each on a respective processing node, and each CNN has a multiplicity of layers.
Abstract: A parallel convolutional neural network is provided. The CNN is implemented by a plurality of convolutional neural networks each on a respective processing node. Each CNN has a plurality of layers. A subset of the layers are interconnected between processing nodes such that activations are fed forward across nodes. The remaining subset is not so interconnected.

66 citations
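
A rough sketch of the interconnection pattern in the claim above: two model-parallel towers in which most layers use only their own tower's activations and one designated layer receives the activations of both towers (the cross-node feed-forward). Dense layers stand in for the convolutional layers, and all sizes are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def dense(d_in, d_out):
    return rng.normal(scale=0.1, size=(d_out, d_in))

towers = 2
W1 = [dense(256, 128) for _ in range(towers)]            # per-tower layer, not interconnected
W2 = [dense(128 * towers, 64) for _ in range(towers)]    # interconnected layer (sees both towers)
W3 = [dense(64, 32) for _ in range(towers)]              # per-tower layer again

x = rng.normal(size=256)
h1 = [np.tanh(W @ x) for W in W1]                        # each node computes locally
h1_all = np.concatenate(h1)                              # activations fed forward across nodes
h2 = [np.tanh(W @ h1_all) for W in W2]                   # only this layer crosses nodes
h3 = [np.tanh(W3[i] @ h2[i]) for i in range(towers)]     # back to node-local computation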


Patent
30 Aug 2013
TL;DR: In this article, a switch is linked to feature detectors in at least some of the layers of the neural network and randomly selectively disables each of the feature detectors according to a preconfigured probability.
Abstract: A system for training a neural network. A switch is linked to feature detectors in at least some of the layers of the neural network. For each training case, the switch randomly selectively disables each of the feature detectors in accordance with a preconfigured probability. The weights from each training case are then normalized for applying the neural network to test data.

41 citations
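
The randomized disabling described in this claim corresponds to what is now known as dropout. Below is a minimal sketch of the training-time masking and the compensating scaling applied when the network is used on test data; the drop probability and layer sizes are placeholders, and the activation scaling shown is the usual equivalent of normalizing the weights.

import numpy as np

rng = np.random.default_rng(0)

def feature_layer(x, W, p_drop=0.5, train=True):
    h = np.maximum(0.0, W @ x)                    # feature-detector activations
    if train:
        mask = rng.random(h.shape) >= p_drop      # randomly disable each detector
        return h * mask
    return h * (1.0 - p_drop)                     # scale at test time to compensate

W = rng.normal(scale=0.1, size=(128, 784))
x = rng.normal(size=784)
h_train = feature_layer(x, W, train=True)
h_test = feature_layer(x, W, train=False)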


Proceedings Article
16 Jun 2013
TL;DR: kNCA is presented, which generalizes NCA by learning distance metrics that are appropriate for kNN with arbitrary k; kNCA often improves classification accuracy over state-of-the-art methods, produces qualitative differences in the embeddings as k is varied, and is more robust with respect to label noise.
Abstract: Neighborhood Components Analysis (NCA) is a popular method for learning a distance metric to be used within a k-nearest neighbors (kNN) classifier. A key assumption built into the model is that each point stochastically selects a single neighbor, which makes the model well-justified only for kNN with k = 1. However, kNN classifiers with k > 1 are more robust and usually preferred in practice. Here we present kNCA, which generalizes NCA by learning distance metrics that are appropriate for kNN with arbitrary k. The main technical contribution is showing how to efficiently compute and optimize the expected accuracy of a kNN classifier. We apply similar ideas in an unsupervised setting to yield kSNE and kt-SNE, generalizations of Stochastic Neighbor Embedding (SNE, t-SNE) that operate on neighborhoods of size k, which provide an axis of control over embeddings that allow for more homogeneous and interpretable regions. Empirically, we show that kNCA often improves classification accuracy over state of the art methods, produces qualitative differences in the embeddings as k is varied, and is more robust with respect to label noise.

37 citations
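
For context, the sketch below computes the objective of the k = 1 special case (standard NCA) that kNCA generalizes: under a learned linear map, each point stochastically selects a single neighbor with probability decreasing in transformed distance, and the objective is the expected leave-one-out accuracy. The random data and identity metric are placeholders; the paper's contribution is the efficient extension of this expectation to arbitrary k.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # points (placeholder data)
y = rng.integers(0, 3, size=50)         # labels
A = np.eye(5)                           # metric parameters that would be learned

def nca_objective(A, X, y):
    Z = X @ A.T
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                           # a point cannot pick itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)                      # soft neighbor probabilities
    same = (y[:, None] == y[None, :])
    return (P * same).sum(axis=1).mean()   # expected 1-NN leave-one-out accuracy

print(nca_objective(A, X, y))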


Patent
20 Aug 2013
TL;DR: In this patent, an image processing module performs a color-space deformation on each pixel of an existing training image and then associates the original classification with the color-space-deformed training image; the technique may be applied to increase the size of a training set for training a neural network.
Abstract: A system and method for generating training images. An existing training image is associated with a classification. The system includes an image processing module that performs color-space deformation on each pixel of the existing training image and then associates the classification to the color-space deformed training image. The technique may be applied to increase the size of a training set for training a neural network.
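
A hedged sketch of one common instance of such a per-pixel color-space deformation: shifting every pixel along the principal components of the image's RGB values by a random amount, while keeping the original label. The PCA-based variant, the noise scale, and the toy image are assumptions for illustration, not necessarily the patent's exact formulation.

import numpy as np

rng = np.random.default_rng(0)

def color_deform(image, sigma=0.1):
    """image: H x W x 3 float array in [0, 1]; returns a color-deformed copy."""
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)                 # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)
    alphas = rng.normal(scale=sigma, size=3)           # random strength per component
    shift = eigvecs @ (alphas * eigvals)               # same color shift for every pixel
    return np.clip(image + shift, 0.0, 1.0)

image = rng.random((32, 32, 3))
augmented = color_deform(image)        # keeps the same classification label as the original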

Patent
18 Sep 2013
TL;DR: In this paper, a parallel convolutional neural network (CNN) is implemented by a plurality of CNNs each on a respective processing node, and each CNN has a multiplicity of layers.
Abstract: A parallel convolutional neural network is provided. The CNN is implemented by a plurality of convolutional neural networks each on a respective processing node. Each CNN has a plurality of layers. A subset of the layers are interconnected between processing nodes such that activations are fed forward across nodes. The remaining subset is not so interconnected.

Patent
23 Dec 2013
TL;DR: This patent describes a system for training a neural network in which a switch randomly and selectively disables each of the feature detectors according to a preconfigured probability.
Abstract: The invention relates to a system for training a neural network. A switch is associated with feature detectors in at least some of the layers of the neural network. For each training case, the switch randomly and selectively disables each of the feature detectors according to a preconfigured probability. The weights from each training case are then normalized for applying the neural network to test data.

Patent
23 Dec 2013
TL;DR: In this patent, a parallel convolutional neural network is implemented by a plurality of convolutional neural networks, each on a respective processing node; a subset of the layers is interconnected between the processing nodes so that activations are fed forward across the nodes.
Abstract: The present invention relates to a parallel convolutional neural network. The CNN is implemented by means of a plurality of convolutional neural networks, each located on a respective processing node. Each CNN has a plurality of layers. A subset of the layers is interconnected between the processing nodes so that activations are fed forward across the nodes. The remaining subset is not interconnected in this way.