Showing papers by "Geoffrey E. Hinton" published in 2019

Posted Content•

Rafael Muller¹, Simon Kornblith², Geoffrey E. Hinton²•Institutions (2)

06 Jun 2019-arXiv: Learning

TL;DR: It is shown empirically that, in addition to improving generalization, label smoothing improves model calibration, which can significantly improve beam search, and that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective.

Abstract: The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
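
The soft targets described in this abstract (a weighted average of the hard targets and the uniform distribution over labels) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `smooth_labels` and `cross_entropy` and the smoothing weight `alpha=0.1` are assumptions chosen for the example.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Soft targets: (1 - alpha) * one_hot + alpha * uniform (alpha is an assumed value)."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and the target distributions."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

# Two examples with hard labels 0 and 2, three classes.
targets = smooth_labels(np.array([0, 2]), num_classes=3, alpha=0.1)
logits = np.array([[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]])
loss = cross_entropy(logits, targets)
```

Each row of `targets` still sums to one, but the correct class receives slightly less than full probability mass, which is what discourages over-confident predictions.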

971 citations

Proceedings Article•

Simon Kornblith¹, Mohammad Norouzi¹, Honglak Lee², Geoffrey E. Hinton¹•Institutions (2)

Google¹, University of Michigan²

24 May 2019

TL;DR: The authors introduce a similarity index that measures the relationship between representational similarity matrices and, unlike CCA, remains meaningful when representations have higher dimension than the number of data points; the index is equivalent to centered kernel alignment (CKA) and is closely connected to CCA.

Abstract: Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.
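
As a rough illustration of the similarity index mentioned above, the sketch below computes CKA with a linear kernel. The abstract does not spell out the formula, so the linear-kernel form used here (squared Frobenius norm of the cross-covariance, normalized by the norms of the two Gram-like matrices) is an assumption based on the standard definition of centered kernel alignment; `linear_cka` is a name chosen for this example.

```python
import numpy as np

def linear_cka(x, y):
    """Linear-kernel CKA between two representations.

    x: (n_examples, d1) activations, y: (n_examples, d2) activations.
    Assumes the standard linear form of centered kernel alignment.
    """
    # Center each feature so the comparison ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)
```

In this form the index is unchanged by orthogonal transformations and isotropic scaling of either representation, but not by arbitrary invertible linear transformations, which is the property the abstract contrasts with CCA.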

584 citations

Posted Content•

Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang, James Lucas, Geoffrey E. Hinton, Jimmy Ba

19 Jul 2019-arXiv: Learning

TL;DR: Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost, and can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings.

Abstract: The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
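
The "k steps forward, 1 step back" idea can be sketched with plain gradient descent as the inner optimizer. This is an illustrative sketch under stated assumptions, not the authors' implementation: the interpolation rule `slow += alpha * (fast - slow)`, the function name `lookahead_sgd`, and all hyperparameter values are assumptions made for the example.

```python
import numpy as np

def lookahead_sgd(grad_fn, init, outer_steps=50, k=5, inner_lr=0.1, alpha=0.5):
    """Lookahead wrapped around plain gradient descent (assumed inner optimizer)."""
    slow = np.asarray(init, dtype=float).copy()
    for _ in range(outer_steps):
        # k steps forward: the fast weights start from the slow weights.
        fast = slow.copy()
        for _ in range(k):
            fast -= inner_lr * grad_fn(fast)
        # 1 step back: move the slow weights part of the way toward the fast weights.
        slow += alpha * (fast - slow)
    return slow

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
result = lookahead_sgd(lambda w: w, init=[1.0, -2.0])
```

Only one extra copy of the weights is kept, which is consistent with the abstract's remark that the memory and computation overhead is negligible.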

432 citations

Proceedings Article•

When does label smoothing help

Rafael Muller¹, Simon Kornblith², Geoffrey E. Hinton²•Institutions (2)

Telecom SudParis¹, Google²

06 Jun 2019

TL;DR: This article showed that label smoothing encourages the representations of training examples from the same class to group in tight clusters, which results in loss of information in the logits about resemblances between instances of different classes, but does not hurt generalization or calibration of the model's predictions.

Abstract: The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.

274 citations

Proceedings Article•

Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang¹, James Lucas¹, Jimmy Ba¹, Geoffrey E. Hinton²•Institutions (2)

University of Toronto¹, Google²

19 Jul 2019

TL;DR: A new optimization algorithm, Lookahead, is proposed that is orthogonal to adaptive learning rate schemes and accelerated schemes; it iteratively updates two sets of weights and chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer.

Abstract: The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.

217 citations

Proceedings Article•

Stacked capsule autoencoders

Adam R. Kosiorek¹, Sara Sabour², Yee Whye Teh¹, Geoffrey E. Hinton²•Institutions (2)

University of Oxford¹, Google²

17 Jun 2019

TL;DR: This work introduces an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects, and finds that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN and MNIST.

Abstract: Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, the SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%).

180 citations

Posted Content•

Simon Kornblith¹, Mohammad Norouzi¹, Honglak Lee², Geoffrey E. Hinton¹•Institutions (2)

Google¹, University of Michigan²

01 May 2019-arXiv: Learning

TL;DR: A similarity index is introduced that measures the relationship between representational similarity matrices and, unlike CCA, can measure meaningful similarities between representations of higher dimension than the number of data points.

156 citations

Posted Content•

NASA: Neural Articulated Shape Approximation

Boyang Deng¹, John P. Lewis¹, Timothy Jeruzalski¹, Gerard Pons-Moll¹, Geoffrey E. Hinton¹, Mohammad Norouzi¹, Andrea Tagliasacchi¹•Institutions (1)

Max Planck Society¹

06 Dec 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables efficient representation of articulated deformable objects using neural indicator functions that are conditioned on pose.

Abstract: Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent 3D objects using polygonal meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables efficient representation of articulated deformable objects using neural indicator functions that are conditioned on pose. Occupancy testing using NASA is straightforward, circumventing the complexity of meshes and the issue of water-tightness. We demonstrate the effectiveness of NASA for 3D tracking applications, and discuss other potential extensions.

90 citations

Posted Content•

Learning Sparse Networks Using Targeted Dropout

Aidan N. Gomez, Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, Geoffrey E. Hinton

31 May 2019-arXiv: Learning

TL;DR: Targeted dropout is introduced, a method for training a neural network so that it is robust to subsequent pruning; it improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.

Abstract: Neural networks are easier to optimise when they have many more weights than are required for modelling the mapping from inputs to outputs. This suggests a two-stage learning procedure that first learns a large net and then prunes away connections or hidden units. But standard training does not necessarily encourage nets to be amenable to pruning. We introduce targeted dropout, a method for training a neural network so that it is robust to subsequent pruning. Before computing the gradients for each weight update, targeted dropout stochastically selects a set of units or weights to be dropped using a simple self-reinforcing sparsity criterion and then computes the gradients for the remaining weights. The resulting network is robust to post hoc pruning of weights or units that frequently occur in the dropped sets. The method improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.
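
The selection step described above can be sketched as follows. The abstract does not state the sparsity criterion, so this sketch assumes a magnitude-based one: the smallest-magnitude fraction of weights is targeted, and each targeted weight is dropped independently with a fixed probability. The function name `targeted_weight_dropout` and the parameters `target_frac` and `drop_rate` are hypothetical names for this example.

```python
import numpy as np

def targeted_weight_dropout(weights, target_frac=0.5, drop_rate=0.5, rng=None):
    """Drop low-magnitude weights at random (assumed magnitude-based criterion).

    target_frac: fraction of weights, smallest magnitude first, eligible for dropping.
    drop_rate: probability that each eligible weight is zeroed on this update.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_targeted = int(target_frac * weights.size)
    # Indices of all weights sorted by magnitude, smallest first.
    order = np.argsort(np.abs(weights), axis=None)
    targeted = np.zeros(weights.size, dtype=bool)
    targeted[order[:num_targeted]] = True
    targeted = targeted.reshape(weights.shape)
    # Only targeted weights can be dropped; the rest always pass through.
    dropped = targeted & (rng.random(weights.shape) < drop_rate)
    return np.where(dropped, 0.0, weights)
```

In training, the masked weights would be used in the forward pass before each gradient step, so large weights are never dropped while small ones are repeatedly exposed to removal.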

83 citations

Posted Content•

CvxNet: Learnable Convex Decomposition

Boyang Deng¹, Kyle Genova¹, Soroosh Yazdani¹, Sofien Bouaziz¹, Geoffrey E. Hinton¹, Andrea Tagliasacchi¹•Institutions (1)

Google¹

12 Sep 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces a network architecture to represent a low-dimensional family of convexes, automatically derived via an auto-encoding process, and investigates applications including automatic convex decomposition, image-to-3D reconstruction, and part-based shape retrieval.

Abstract: Any solid object can be decomposed into a collection of convex polytopes (in short, convexes). When a small number of convexes are used, such a decomposition can be thought of as a piece-wise approximation of the geometry. This decomposition is fundamental in computer graphics, where it provides one of the most common ways to approximate geometry, for example, in real-time physics simulation. A convex object also has the property of being simultaneously an explicit and implicit representation: one can interpret it explicitly as a mesh derived by computing the vertices of a convex hull, or implicitly as the collection of half-space constraints or support functions. Their implicit representation makes them particularly well suited for neural network training, as they abstract away from the topology of the geometry they need to represent. However, at testing time, convexes can also generate explicit representations -- polygonal meshes -- which can then be used in any downstream application. We introduce a network architecture to represent a low dimensional family of convexes. This family is automatically derived via an auto-encoding process. We investigate the applications of this architecture including automatic convex decomposition, image to 3D reconstruction, and part-based shape retrieval.

78 citations

Posted Content•

Analyzing and Improving Representations with the Soft Nearest Neighbor Loss

Nicholas Frosst¹, Nicolas Papernot¹, Geoffrey E. Hinton¹•Institutions (1)

Google¹

05 Feb 2019-arXiv: Machine Learning

TL;DR: Surprisingly, it is found that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination in the final layer, possibly because it encourages representations to identify class-independent similarity structures.

Abstract: We explore and expand the Soft Nearest Neighbor Loss to measure the entanglement of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. We demonstrate several use cases of the loss. As an analytical tool, it provides insights into the evolution of class similarity structures during learning. Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination in the final layer, possibly because it encourages representations to identify class-independent similarity structures. Maximizing the soft nearest neighbor loss in the hidden layers leads not only to improved generalization but also to better-calibrated estimates of uncertainty on outlier data. Data that is not from the training distribution can be recognized by observing that in the hidden layers, it has fewer than the normal number of neighbors from the predicted class.
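
One way to make the notion of entanglement concrete is the sketch below. The abstract gives no formula, so the form used here is an assumption: for each point, the loss is the negative log of the share of its exponentially weighted neighbors (by squared Euclidean distance at a given temperature) that belong to its own class. The name `soft_nearest_neighbor_loss` and the `temperature` default are choices made for this example.

```python
import numpy as np

def soft_nearest_neighbor_loss(x, y, temperature=1.0):
    """Entanglement of class manifolds for points x (n, d) with labels y (n,).

    Assumed form: mean over points of -log(same-class neighbor mass / total neighbor mass).
    Low values mean classes are well separated; high values mean they are entangled.
    """
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    sims = np.exp(-sq_dists / temperature)
    np.fill_diagonal(sims, 0.0)  # a point is not its own neighbor
    same_class = y[:, None] == y[None, :]
    numer = (sims * same_class).sum(axis=1)
    denom = sims.sum(axis=1)
    eps = 1e-12  # guards against log(0) and division by zero
    return -np.mean(np.log(eps + numer / (eps + denom)))
```

With two tight, well-separated clusters, labels that follow the clusters give a loss near zero, while labels that cut across the clusters give a much larger value.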

Proceedings Article•

Analyzing and Improving Representations with the Soft Nearest Neighbor Loss

Nicholas Frosst¹, Nicolas Papernot¹, Geoffrey E. Hinton¹•Institutions (1)

Google¹

24 May 2019

TL;DR: In this article, the authors explore and expand the soft nearest neighbor loss to measure the entanglement of class manifolds in representation space, i.e., how close pairs of points from the same class are relative to pairs of points from different classes.

Posted Content•

Stacked Capsule Autoencoders

Adam R. Kosiorek¹, Sara Sabour², Yee Whye Teh¹, Geoffrey E. Hinton²•Institutions (2)

University of Oxford¹, Google²

17 Jun 2019-arXiv: Machine Learning

TL;DR: The authors introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects; because these relationships do not depend on the viewpoint, the model is robust to viewpoint changes.

Abstract: Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects. Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes. SCAE consists of two stages. In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%). The code is available at this https URL

Posted Content•

Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions

Yao Qin¹, Nicholas Frosst², Sara Sabour², Colin Raffel², Garrison W. Cottrell¹, Geoffrey E. Hinton²•Institutions (2)

University of California, San Diego¹, Google²

05 Jul 2019-arXiv: Learning

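TL;DR: Adversarial or otherwise corrupted images are detected using a class-conditional reconstruction of the input; across the attacks considered, CapsNets always perform better than convolutional networks, and the perturbations produced by the reconstructive attack can make the input image appear visually more like the target class and hence become non-adversarial.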

Abstract: Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack, which seeks both to cause a misclassification and a low reconstruction error. This reconstructive attack produces undetected adversarial examples but with a much smaller success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial. This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples.

Posted Content•

Deflecting Adversarial Attacks

Yao Qin, Nicholas Frosst, Colin Raffel, Garrison W. Cottrell, Geoffrey E. Hinton

25 Sep 2019-arXiv: Learning

TL;DR: A stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks is proposed and it is shown that undetected attacks against the authors' defense often perceptually resemble the adversarial target class.

Abstract: There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we "deflect" adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this end, we first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks. We then show that undetected attacks against our defense often perceptually resemble the adversarial target class by performing a human study where participants are asked to label images produced by the attack. These attack images can no longer be called "adversarial" because our network classifies them the same way as humans do.

Posted Content•

Cerberus: A Multi-headed Derenderer

Boyang Deng, Simon Kornblith, Geoffrey E. Hinton¹•Institutions (1)

Google¹

28 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Cerberus, the multi-headed derenderer, outperforms previous methods for extracting 3D parts from single images without part annotations, and it does quite well at extracting natural parts of human figures.

Abstract: To generalize to novel visual scenes with new viewpoints and new object poses, a visual system needs representations of the shapes of the parts of an object that are invariant to changes in viewpoint or pose. 3D graphics representations disentangle visual factors such as viewpoints and lighting from object structure in a natural way. It is possible to learn to invert the process that converts 3D graphics representations into 2D images, provided the 3D graphics representations are available as labels. When only the unlabeled images are available, however, learning to derender is much harder. We consider a simple model which is just a set of free floating parts. Each part has its own relation to the camera and its own triangular mesh which can be deformed to model the shape of the part. At test time, a neural network looks at a single image and extracts the shapes of the parts and their relations to the camera. Each part can be viewed as one head of a multi-headed derenderer. During training, the extracted parts are used as input to a differentiable 3D renderer and the reconstruction error is backpropagated to train the neural net. We make the learning task easier by encouraging the deformations of the part meshes to be invariant to changes in viewpoint and invariant to the changes in the relative positions of the parts that occur when the pose of an articulated body changes. Cerberus, our multi-headed derenderer, outperforms previous methods for extracting 3D parts from single images without part annotations, and it does quite well at extracting natural parts of human figures.

Patent•

Processing text using neural networks

Jamie Ryan Kiros¹, William Chan¹, Geoffrey E. Hinton•Institutions (1)

Google¹

22 Feb 2019

TL;DR: This patent describes a method for generating a data set that associates each text segment in a vocabulary of text segments with a respective numeric embedding, where the embedding for a text segment is generated from the image embeddings of image search results returned for that segment.

Abstract: Methods, systems, and apparatus including computer programs encoded on a computer storage medium, for generating a data set that associates each text segment in a vocabulary of text segments with a respective numeric embedding. In one aspect, a method includes providing, to an image search engine, a search query that includes the text segment; obtaining image search results that have been classified as being responsive to the search query by the image search engine, wherein each image search result identifies a respective image; for each image search result, processing the image identified by the image search result using a convolutional neural network, wherein the convolutional neural network has been trained to process the image to generate an image numeric embedding for the image; and generating a numeric embedding for the text segment from the image numeric embeddings for the images identified by the image search results.
