Showing papers by "Ilya Sutskever published in 2012"


Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network achieved state-of-the-art performance on ImageNet classification as discussed by the authors; the network consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
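To make the architecture described in the abstract concrete, here is a minimal sketch in PyTorch. The abstract only specifies the layer counts, the pooling, the non-saturating (ReLU) units, dropout in the fully-connected layers, and the 1000-way softmax; the filter counts, kernel sizes, and the 227x227 input resolution below follow the commonly cited AlexNet configuration and should be read as illustrative assumptions, not as the paper's exact hyperparameters.

import torch.nn as nn

# Illustrative AlexNet-style network: five conv layers (some followed by
# max-pooling), ReLU non-linearities, dropout in the fully-connected layers,
# and a final 1000-way classifier (the softmax is applied by the loss).
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),      # conv1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),    # conv2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),   # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),   # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),   # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),                                                # 256 * 6 * 6 for 227x227 inputs
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # fc6
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),          # fc7
    nn.Linear(4096, 1000),                                       # fc8; 1000-way softmax via the loss
)

Training with nn.CrossEntropyLoss applies the 1000-way softmax implicitly; the paper's two-GPU split and local response normalization are omitted from this sketch.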

73,978 citations


Posted Content
TL;DR: The authors randomly omit half of the feature detectors on each training case to prevent complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.
Abstract: When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
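A minimal sketch of the idea, assuming a hidden-activation array h and the common "inverted dropout" formulation (which rescales at training time; the paper instead describes using all units with halved weights at test time, which is equivalent in expectation):

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Randomly zero out each unit with probability p_drop on every training case."""
    if not train:
        return h                        # at test time all units are used
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)    # rescale so the expected activation is unchanged

Because a fresh mask is drawn for every training case, no feature detector can rely on the presence of particular other detectors, which is the co-adaptation the abstract refers to.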

6,899 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter describes the basic HF approach and examines well-known performance-improving techniques such as preconditioning, which have been beneficial for neural network training, as well as others of a more heuristic nature which are harder to justify but have been found to work well in practice.
Abstract: In this chapter we will first describe the basic HF approach, and then examine well-known performance-improving techniques such as preconditioning which we have found to be beneficial for neural network training, as well as others of a more heuristic nature which are harder to justify, but which we have found to work well in practice. We will also provide practical tips for creating efficient and bug-free implementations and discuss various pitfalls which may arise when designing and using an HF-type approach in a particular application.
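The chapter itself is not reproduced here, but two ingredients at the core of any HF-type method can be sketched briefly: a Hessian-vector product obtained for roughly the cost of an extra gradient evaluation, and a conjugate-gradient inner loop that uses only such products. The sketch below is a bare-bones illustration under those assumptions; it omits the damping, preconditioning, and other refinements the chapter discusses.

import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-6):
    """Finite-difference approximation of H @ v from two gradient evaluations."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

def conjugate_gradient(hv_fn, b, max_iters=50, tol=1e-10):
    """Approximately solve H x = b using only Hessian-vector products
    (the inner loop of a Hessian-free / truncated-Newton optimizer)."""
    x = np.zeros_like(b)
    r = b - hv_fn(x)
    d = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        Hd = hv_fn(d)
        alpha = rs_old / (d @ Hd)
        x += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return x

In an actual optimizer, grad_fn would compute the network's gradient, b would be the negative gradient at the current parameters, and the returned x would serve as the update direction.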

239 citations


Proceedings Article
26 Jun 2012
TL;DR: Curvature Propagation is developed, a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph; it can be repeatedly applied to give increasingly precise unbiased estimates of any or all of the entries of the Hessian.
Abstract: In this work we develop Curvature Propagation (CP), a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph. At the cost of roughly two gradient evaluations, CP can give a rank-1 approximation of the whole Hessian, and can be repeatedly applied to give increasingly precise unbiased estimates of any or all of the entries of the Hessian. Of particular interest is the diagonal of the Hessian, for which no general approach is known to exist that is both efficient and accurate. We show in experiments that CP turns out to work well in practice, giving very accurate estimates of the Hessian of neural networks, for example, with a relatively small amount of work. We also apply CP to Score Matching, where a diagonal of a Hessian plays an integral role in the Score Matching objective, and where it is usually computed exactly using inefficient algorithms which do not scale to larger and more complex models.
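The CP recursion itself is not given in the abstract, so the sketch below instead shows a standard, related construction, a Hutchinson-style estimator with random plus/minus-one (Rademacher) vectors, to illustrate what an unbiased stochastic estimate of the Hessian diagonal built from cheap Hessian-vector products looks like. It is not the authors' Curvature Propagation algorithm.

import numpy as np

rng = np.random.default_rng(0)

def hessian_diagonal_estimate(hv_fn, dim, num_samples=100):
    """Unbiased estimate of diag(H) from Hessian-vector products with random
    +/-1 vectors: E[v * (H v)] = diag(H) when the entries of v are independent."""
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        estimate += v * hv_fn(v)
    return estimate / num_samples

# Sanity check on a quadratic f(x) = 0.5 * x^T A x, whose Hessian is simply A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
print(hessian_diagonal_estimate(lambda v: A @ v, dim=2))  # approximately [3, 2]

Like this estimator, CP averages cheap unbiased estimates that become increasingly precise with more samples; CP builds them from the function's computational graph rather than from black-box Hessian-vector products.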

50 citations


Posted Content
TL;DR: Curvature propagation (CP) as discussed by the authors is a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph, which can give a rank-1 approximation of the whole Hessian, and can be repeatedly applied to give increasingly precise unbiased estimates of any or all of the entries of the Hessian.
Abstract: In this work we develop Curvature Propagation (CP), a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph. At the cost of roughly two gradient evaluations, CP can give a rank-1 approximation of the whole Hessian, and can be repeatedly applied to give increasingly precise unbiased estimates of any or all of the entries of the Hessian. Of particular interest is the diagonal of the Hessian, for which no general approach is known to exist that is both efficient and accurate. We show in experiments that CP turns out to work well in practice, giving very accurate estimates of the Hessian of neural networks, for example, with a relatively small amount of work. We also apply CP to Score Matching, where a diagonal of a Hessian plays an integral role in the Score Matching objective, and where it is usually computed exactly using inefficient algorithms which do not scale to larger and more complex models.

48 citations


Proceedings Article
03 Dec 2012
TL;DR: It is shown that a dynamic programming algorithm can be used to implement exact sparsity in the RBM's hidden units, and that derivatives can be passed through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers.
Abstract: The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is that, given an input, the posterior distribution over hidden variables is factorizable and can be easily computed and sampled from. Sparsity and competition in the hidden representation is beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are typically not added, as the resulting posterior over the hidden units seemingly becomes intractable. In this paper we show that a dynamic programming algorithm can be used to implement exact sparsity in the RBM's hidden units. We also show how to pass derivatives through the resulting posterior marginals, which makes it possible to fine-tune a pre-trained neural network with sparse hidden layers.
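The paper's exact algorithm is not reproduced here, but the following is a minimal sketch, under assumptions of my own, of the kind of cardinality dynamic program the abstract describes: given per-unit log-potentials a (for an RBM, the hidden-unit inputs W x + c), it computes posterior marginals for binary units constrained to have exactly k units active. Variable names and the exactly-k formulation are illustrative, and a practical implementation would work in log space for numerical stability.

import numpy as np

def exact_k_marginals(a, k):
    """Posterior marginals p(h_j = 1) for binary units with log-potentials a,
    constrained so that exactly k units are on: p(h) proportional to exp(a . h)
    subject to sum(h) = k.  Dynamic program over (unit index, units active)."""
    n = len(a)
    w = np.exp(a)

    # forward[i][m]: unnormalized mass of configurations of units 0..i-1 with m on
    forward = np.zeros((n + 1, k + 1))
    forward[0][0] = 1.0
    for i in range(n):
        for m in range(k + 1):
            forward[i + 1][m] = forward[i][m]
            if m > 0:
                forward[i + 1][m] += w[i] * forward[i][m - 1]

    # backward[i][m]: unnormalized mass of configurations of units i..n-1 with m on
    backward = np.zeros((n + 1, k + 1))
    backward[n][0] = 1.0
    for i in range(n - 1, -1, -1):
        for m in range(k + 1):
            backward[i][m] = backward[i + 1][m]
            if m > 0:
                backward[i][m] += w[i] * backward[i + 1][m - 1]

    Z = forward[n][k]
    marginals = np.zeros(n)
    for j in range(n):
        # unit j is on; split the remaining k-1 active units before/after it
        total = sum(forward[j][m] * backward[j + 1][k - 1 - m] for m in range(k))
        marginals[j] = w[j] * total / Z
    return marginals

Because each marginal is a smooth function of the potentials, derivatives can be passed through it, which is what makes fine-tuning a network with such sparse hidden layers possible, as the abstract describes.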

40 citations