
Showing papers by "Geoffrey E. Hinton" published in 2018


Proceedings Article
15 Feb 2018
TL;DR: Capsule networks, as described in this paper, use a logistic unit to represent the presence of an entity and a 4x4 matrix to learn the relationship between that entity and the viewer (the pose).
Abstract: A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules [a group of capsules forms a capsule layer and can be used in place of a traditional layer in a neural net]. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose). A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attack than our baseline convolutional neural network.

891 citations
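
The abstract above outlines EM routing between capsule layers. Below is a minimal NumPy sketch of one routing step, simplified for illustration: it fits a diagonal Gaussian to each higher capsule's weighted votes and recomputes assignment coefficients, but omits the activation logits and the cost-based activation update used in the paper. All shapes and names are illustrative, not the reference implementation.

```python
import numpy as np

# Simplified EM routing between two capsule layers (illustrative only).
rng = np.random.default_rng(0)
n_lower, n_upper = 8, 4                           # capsules in layer L and layer L+1
poses = rng.normal(size=(n_lower, 4, 4))          # pose matrix per lower capsule
W = rng.normal(size=(n_lower, n_upper, 4, 4))     # trainable viewpoint-invariant transforms

# Each lower capsule votes for every higher capsule's pose:
# vote_ij = pose_i @ W_ij, flattened to a 16-d vector.
votes = np.einsum('iab,ijbc->ijac', poses, W).reshape(n_lower, n_upper, 16)

# Assignment coefficients start uniform and are refined by EM.
R = np.full((n_lower, n_upper), 1.0 / n_upper)

for _ in range(3):                                # unrolled EM iterations
    # M-step: each higher capsule fits a diagonal Gaussian to its weighted votes.
    mu = (R[..., None] * votes).sum(0) / R.sum(0)[:, None]                    # (n_upper, 16)
    var = (R[..., None] * (votes - mu) ** 2).sum(0) / R.sum(0)[:, None] + 1e-6

    # E-step: re-assign each lower capsule to the higher capsules whose
    # Gaussian explains its vote best (a cluster of similar votes wins).
    log_p = -0.5 * (((votes - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(-1)
    R = np.exp(log_p - log_p.max(1, keepdims=True))
    R /= R.sum(1, keepdims=True)

print(R.round(2))   # routing coefficients after 3 iterations
```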


Journal ArticleDOI
18 Sep 2018-JAMA
TL;DR: The purpose of this Viewpoint is to give health care professionals an intuitive understanding of deep learning, the technology now used on billions of digital devices for complex tasks such as speech recognition, image interpretation, and language translation.
Abstract: Widespread application of artificial intelligence in health care has been anticipated for half a century. For most of that time, the dominant approach to artificial intelligence was inspired by logic: researchers assumed that the essence of intelligence was manipulating symbolic expressions, using rules of inference. This approach produced expert systems and graphical models that attempted to automate the reasoning processes of experts. In the last decade, however, a radically different approach to artificial intelligence, called deep learning, has produced major breakthroughs and is now used on billions of digital devices for complex tasks such as speech recognition, image interpretation, and language translation. The purpose of this Viewpoint is to give health care professionals an intuitive understanding of the technology underlying deep learning. In an accompanying Viewpoint, Naylor [1] outlines some of the factors propelling adoption of this technology in medicine and health care.

433 citations


Posted Content
TL;DR: This paper claims that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible, and that it can still speed up training even after the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent.
Abstract: Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.

301 citations
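
As a rough illustration of the online distillation (codistillation) scheme described above, the sketch below trains one of two peer models with an added term that matches the softened predictions of a stale copy of the other model. The toy model, data, and refresh schedule are assumptions for illustration only, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def make_model():
    return torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 10))

model_a, model_b = make_model(), make_model()
stale_b = make_model()
stale_b.load_state_dict(model_b.state_dict())    # rarely refreshed copy of the peer
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)

x = torch.randn(32, 20)                          # one batch from model A's data shard
y = torch.randint(0, 10, (32,))

logits_a = model_a(x)
with torch.no_grad():                            # teacher targets come from the stale copy of B
    teacher = F.softmax(stale_b(x), dim=-1)

# Ordinary supervised loss plus agreement with the peer's (stale) predictions.
loss = F.cross_entropy(logits_a, y) \
     + F.kl_div(F.log_softmax(logits_a, dim=-1), teacher, reduction='batchmean')

opt_a.zero_grad(); loss.backward(); opt_a.step()

# Only every K steps would stale_b.load_state_dict(model_b.state_dict()) run,
# so the exchanged weights need to be communicated only rarely.
```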


Posted Content
TL;DR: This paper presents results on scaling up biologically motivated models of deep learning to datasets that need deep networks with appropriate architectures to achieve good performance; the results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.
Abstract: The backpropagation of error algorithm (BP) is impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might approximate BP. As of yet, none of these proposals have been rigorously evaluated on tasks where BP-guided deep learning has proved critical, or in architectures more structured than simple fully-connected networks. Here we present results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance. We present results on the MNIST, CIFAR-10, and ImageNet datasets and explore variants of target-propagation (TP) and feedback alignment (FA) algorithms, and explore performance in both fully- and locally-connected architectures. We also introduce weight-transport-free variants of difference target propagation (DTP) modified to remove backpropagation from the penultimate layer. Many of these algorithms perform well for MNIST, but for CIFAR and ImageNet we find that TP and FA variants perform significantly worse than BP, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. Our results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.

169 citations
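
One of the algorithm families evaluated above is feedback alignment (FA), in which the backward pass uses a fixed random matrix instead of the transpose of the forward weights, removing the need for weight transport. Below is a minimal NumPy sketch on a toy two-layer network; the sizes, data, and learning rate are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Feedback alignment on a toy 2-layer network: the error is propagated with a
# fixed random matrix B instead of W2.T, avoiding "weight transport".
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 20))                    # batch of inputs
y = np.eye(10)[rng.integers(0, 10, 64)]          # one-hot targets

W1 = rng.normal(scale=0.1, size=(20, 50))
W2 = rng.normal(scale=0.1, size=(50, 10))
B = rng.normal(scale=0.1, size=(10, 50))         # fixed random feedback weights

for step in range(200):
    h = np.tanh(x @ W1)                                  # forward pass
    p = np.exp(h @ W2); p /= p.sum(1, keepdims=True)     # softmax output
    e = p - y                                            # output error (cross-entropy grad)

    dW2 = h.T @ e / len(x)
    dh = (e @ B) * (1 - h ** 2)                          # FA: random B replaces W2.T
    dW1 = x.T @ dh / len(x)

    W1 -= 0.5 * dW1
    W2 -= 0.5 * dW2

print('final loss:', -np.mean(np.sum(y * np.log(p + 1e-9), axis=1)))
```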


Proceedings Article
29 Apr 2018
TL;DR: In this paper, the authors propose making use of the information about which expert produced which label, modeling the experts individually and learning weights for combining them, to give a better estimate of the unobserved ground truth.
Abstract: Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona (2010); Mnih and Hinton (2012). Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.

119 citations
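
The core modeling idea above, learning per-expert predictors plus sample-specific averaging weights, can be sketched as follows. The architecture, sizes, and names are illustrative assumptions rather than the paper's exact model; in practice each expert head would be trained only on the labels that expert actually provided.

```python
import torch
import torch.nn.functional as F

class ExpertAveraging(torch.nn.Module):
    """One prediction head per labeller plus a gate producing sample-specific weights."""
    def __init__(self, n_features=64, n_classes=5, n_experts=7):
        super().__init__()
        self.trunk = torch.nn.Linear(n_features, 32)
        self.expert_heads = torch.nn.ModuleList(
            [torch.nn.Linear(32, n_classes) for _ in range(n_experts)])
        self.gate = torch.nn.Linear(32, n_experts)       # sample-specific averaging weights

    def forward(self, x):
        h = torch.relu(self.trunk(x))
        per_expert = torch.stack([head(h) for head in self.expert_heads], dim=1)
        w = F.softmax(self.gate(h), dim=-1)              # (batch, n_experts)
        combined = (w.unsqueeze(-1) * per_expert.softmax(-1)).sum(dim=1)
        return per_expert, combined                      # per-expert logits + averaged prediction

# Toy usage: the gate learns which experts to trust for which inputs.
model = ExpertAveraging()
x = torch.randn(8, 64)
per_expert_logits, averaged_probs = model(x)
print(averaged_probs.shape)   # torch.Size([8, 5])
```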


Proceedings Article
15 Feb 2018
TL;DR: In this paper, the authors explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters.
Abstract: Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.

73 citations


Proceedings Article
15 Jun 2018
TL;DR: In this article, the authors present results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance, and explore variants of target-propagation (TP) and feedback alignment (FA) algorithms.
Abstract: The backpropagation of error algorithm (BP) is impossible to implement in a real brain. The recent success of deep networks in machine learning and AI, however, has inspired proposals for understanding how the brain might learn across multiple layers, and hence how it might approximate BP. As of yet, none of these proposals have been rigorously evaluated on tasks where BP-guided deep learning has proved critical, or in architectures more structured than simple fully-connected networks. Here we present results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance. We present results on the MNIST, CIFAR-10, and ImageNet datasets and explore variants of target-propagation (TP) and feedback alignment (FA) algorithms, and explore performance in both fully- and locally-connected architectures. We also introduce weight-transport-free variants of difference target propagation (DTP) modified to remove backpropagation from the penultimate layer. Many of these algorithms perform well for MNIST, but for CIFAR and ImageNet we find that TP and FA variants perform significantly worse than BP, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. Our results and implementation details help establish baselines for biologically motivated deep learning schemes going forward.

72 citations
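
The other algorithm family explored above is target propagation. In difference target propagation (DTP), each layer receives a local target computed with approximate inverses of the layers above and then minimizes a purely local loss. The sketch below uses an analytic pseudo-inverse in place of a learned inverse and toy data, so it illustrates only the target computation, not the paper's DTP variants.

```python
import numpy as np

# Toy illustration of difference target propagation's target computation.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
W1 = rng.normal(scale=0.5, size=(8, 8))
W2 = rng.normal(scale=0.5, size=(8, 8))

h1 = np.tanh(x @ W1)                     # forward pass
h2 = np.tanh(h1 @ W2)
y = rng.normal(size=h2.shape)            # some desired output

def inv2(h):
    # Idealised (analytic) inverse of layer 2; DTP would learn this inverse instead.
    return np.arctanh(np.clip(h, -0.999, 0.999)) @ np.linalg.pinv(W2)

# Top target nudges the output toward the desired value; the difference
# correction keeps lower targets consistent when the inverse is imperfect.
t2 = h2 - 0.1 * (h2 - y)
t1 = h1 - inv2(h2) + inv2(t2)

# Each layer then minimises a purely local loss ||h_l - t_l||^2.
print('layer-1 local loss:', np.mean((h1 - t1) ** 2))
print('layer-2 local loss:', np.mean((h2 - t2) ** 2))
```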


Proceedings ArticleDOI
01 Jul 2018
TL;DR: Picturebook, a large-scale lookup operation to ground language via ‘snapshots’ of the physical world accessed through image search, is introduced, and it is shown that gate activations corresponding to Picturebook embeddings are highly correlated with human judgments of concreteness ratings.
Abstract: We introduce Picturebook, a large-scale lookup operation to ground language via ‘snapshots’ of our physical world accessed through image search. For each word in a vocabulary, we extract the top-k images from Google image search and feed the images through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism to map a Picturebook embedding back into words. We experiment and report results across a wide range of tasks: word similarity, natural language inference, semantic relatedness, sentiment/topic classification, image-sentence ranking and machine translation. We also show that gate activations corresponding to Picturebook embeddings are highly correlated to human judgments of concreteness ratings.

56 citations
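
The multimodal gating function mentioned above can be illustrated roughly as a learned sigmoid gate that mixes a text embedding with the image-derived Picturebook embedding per dimension. The exact gating form in the paper may differ; the module, dimensions, and names below are assumptions for illustration.

```python
import torch

class GatedFusion(torch.nn.Module):
    """Learned per-dimension gate mixing a text embedding with a Picturebook embedding."""
    def __init__(self, dim=300):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, text_emb, picturebook_emb):
        g = torch.sigmoid(self.gate(torch.cat([text_emb, picturebook_emb], dim=-1)))
        return g * text_emb + (1 - g) * picturebook_emb, g

fuse = GatedFusion()
text_emb = torch.randn(4, 300)           # e.g. standard word vectors for 4 words
picturebook_emb = torch.randn(4, 300)    # embeddings pooled from top-k image-search results
fused, gates = fuse(text_emb, picturebook_emb)
print(fused.shape, gates.mean().item())  # gate activations correlate with concreteness in the paper
```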


Posted Content
TL;DR: It is shown that setting a threshold on the $\ell_2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets.
Abstract: We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reconstruction is produced from the top-level capsule for that class. We show that setting a threshold on the $\ell_2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax. We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique but in order to make the model change its prediction to another class, the attack must typically make the "adversarial" image resemble images of the other class.

42 citations
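
The detection rule described above reduces to thresholding a reconstruction error. Below is a minimal sketch, assuming reconstructions are already available from the winning capsule (or from a CNN's last hidden layer); the threshold-selection rule and all data here are illustrative placeholders.

```python
import numpy as np

def reconstruction_error(x, x_recon):
    # L2 distance between each image and its reconstruction.
    return np.linalg.norm((x - x_recon).reshape(len(x), -1), axis=1)

def fit_threshold(clean_errors, percentile=95):
    # e.g. keep ~95% of clean images below threshold (~5% false-positive rate).
    return np.percentile(clean_errors, percentile)

def is_adversarial(x, x_recon, threshold):
    return reconstruction_error(x, x_recon) > threshold

# Toy usage with random stand-ins for real images and reconstructions.
rng = np.random.default_rng(0)
clean = rng.random((100, 28, 28))
clean_recon = clean + 0.05 * rng.normal(size=clean.shape)   # good reconstructions
tau = fit_threshold(reconstruction_error(clean, clean_recon))

suspect = rng.random((10, 28, 28))
suspect_recon = rng.random((10, 28, 28))                    # poor reconstructions
print(is_adversarial(suspect, suspect_recon, tau))          # mostly True
```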