
Showing papers on "MNIST database" published in 2019


Proceedings Article
04 Mar 2019
TL;DR: This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations, and articulate the "lottery ticket hypothesis".
Abstract: Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
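
A minimal sketch of one common reading of the iterative magnitude-pruning loop the abstract describes, in PyTorch; the synthetic data, layer sizes, prune rate and number of rounds are illustrative assumptions rather than the paper's exact setup.

```python
# Toy illustration, not the paper's code: synthetic data, assumed prune rate and rounds.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X, y = torch.randn(512, 784), torch.randint(0, 10, (512,))
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = copy.deepcopy(model.state_dict())               # remember the original initialization
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

def train(steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(X), y).backward()
        for n, p in model.named_parameters():                # keep pruned weights frozen at zero
            if n in masks:
                p.grad *= masks[n]
        opt.step()

for _ in range(3):                                           # prune/reset rounds
    train()
    for n, p in model.named_parameters():                    # drop the smallest-magnitude 20% per layer
        if n in masks:
            alive = p.data[masks[n].bool()].abs()
            thresh = alive.kthvalue(max(1, int(0.2 * alive.numel()))).values
            masks[n][p.data.abs() < thresh] = 0.0
    model.load_state_dict(init_state)                        # reset surviving weights to their original init
    for n, p in model.named_parameters():
        if n in masks:
            p.data *= masks[n]
train()   # the masked, re-initialized subnetwork is the candidate "winning ticket"
```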

1,751 citations


Posted Content
TL;DR: A new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching, which allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons.
Abstract: We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.

983 citations


Posted Content
TL;DR: Three continual learning scenarios are described based on whether at test time task identity is provided and--in case it is not--whether it must be inferred, and it is found that regularization-based approaches fail and that replaying representations of previous experiences seems required for solving this scenario.
Abstract: Standard artificial neural networks suffer from the well-known issue of catastrophic forgetting, making continual or lifelong learning difficult for machine learning. In recent years, numerous methods have been proposed for continual learning, but due to differences in evaluation protocols it is difficult to directly compare their performance. To enable more structured comparisons, we describe three continual learning scenarios based on whether at test time task identity is provided and--in case it is not--whether it must be inferred. Any sequence of well-defined tasks can be performed according to each scenario. Using the split and permuted MNIST task protocols, for each scenario we carry out an extensive comparison of recently proposed continual learning methods. We demonstrate substantial differences between the three scenarios in terms of difficulty and in terms of how efficient different methods are. In particular, when task identity must be inferred (i.e., class incremental learning), we find that regularization-based approaches (e.g., elastic weight consolidation) fail and that replaying representations of previous experiences seems required for solving this scenario.
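
A minimal sketch of how the permuted MNIST task protocol mentioned above can be constructed; random arrays stand in for the flattened 28x28 MNIST images.

```python
# Toy sketch: random arrays stand in for flattened 28x28 MNIST images.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((1000, 784))
labels = rng.integers(0, 10, 1000)

tasks = []
for t in range(5):                            # one fixed pixel permutation per task
    perm = rng.permutation(784)
    tasks.append((images[:, perm], labels))   # pixels are shuffled, labels stay the same
# in the hardest scenario, the learner must also infer WHICH permutation (task) a test image came from
```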

479 citations


Posted Content
TL;DR: This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
Abstract: Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.
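
A hedged sketch of the kind of data-dependent, kernel-based complexity measure the abstract refers to, computed from a Gram matrix induced by a randomly initialized 2-layer ReLU net; the exact constants and the ridge term are assumptions made for illustration, not the paper's precise definition.

```python
# Illustrative constants; the ridge term is added purely for numerical stability.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm inputs
y = np.sign(X[:, 0])                                 # toy +/-1 labels

G = np.clip(X @ X.T, -1.0, 1.0)
H = G * (np.pi - np.arccos(G)) / (2 * np.pi)         # Gram matrix of a 2-layer ReLU net at random init
complexity = np.sqrt(2.0 * y @ np.linalg.solve(H + 1e-8 * np.eye(n), y) / n)
print(complexity)                                    # smaller values correspond to easier-to-generalize labels
```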

476 citations


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This paper proposes an effective structured pruning approach that jointly prunes filters and other structures, solving the optimization problem by generative adversarial learning (GAL), which learns a sparse soft mask in a label-free, end-to-end manner.
Abstract: Structured pruning of filters or neurons has received increased focus for compressing convolutional neural networks. Most existing methods rely on multi-stage, layer-wise optimizations that iteratively prune and retrain, which may not be optimal and can be computationally intensive. Besides, these methods are designed to prune a specific structure, such as filters or blocks, without jointly pruning heterogeneous structures. In this paper, we propose an effective structured pruning approach that jointly prunes filters as well as other structures in an end-to-end manner. To accomplish this, we first introduce a soft mask to scale the output of these structures and define a new objective function with sparsity regularization to align the output of the baseline network with that of the masked network. We then effectively solve the optimization problem by generative adversarial learning (GAL), which learns a sparse soft mask in a label-free, end-to-end manner. By forcing more scale factors in the soft mask to zero, the fast iterative shrinkage-thresholding algorithm (FISTA) can be leveraged to quickly and reliably remove the corresponding structures. Extensive experiments demonstrate the effectiveness of GAL on different datasets, including MNIST, CIFAR-10 and ImageNet ILSVRC 2012. For example, on ImageNet ILSVRC 2012, the pruned ResNet-50 achieves 10.88% Top-5 error and a 3.7x speedup. This significantly outperforms state-of-the-art methods.
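
A minimal sketch of the soft-mask idea underlying GAL: each filter's output is scaled by a mask entry, and the shrinkage (soft-thresholding) step at the core of FISTA drives mask entries exactly to zero so the corresponding structures can be removed. The shapes, the stand-in gradient and the hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random(64)                    # one scale factor per filter (the "soft mask")
grad = rng.standard_normal(64)           # stand-in gradient of the output-alignment loss w.r.t. the mask
lr, lam = 0.1, 0.05                      # assumed step size and sparsity weight

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

mask = soft_threshold(mask - lr * grad, lr * lam)    # gradient step + proximal (shrinkage) step for the L1 term
removable = np.flatnonzero(mask == 0.0)              # filters whose scale factor hit exactly zero can be pruned
```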

447 citations


Proceedings Article
24 Jan 2019
TL;DR: In this paper, training and generalization are analyzed for a simple 2-layer ReLU network with random initialization, and a generalization bound independent of the network size is obtained via a data-dependent complexity measure.
Abstract: Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.

403 citations


Posted Content
TL;DR: In this article, a disentangled-representation framework is proposed for image-to-image translation without paired training images; the model combines content features encoded from a given input with attribute vectors sampled from the attribute space to produce diverse outputs at test time.
Abstract: Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.

383 citations


Posted Content
TL;DR: This work uses transfer learning and knowledge distillation to develop a universal framework that enables federated learning when each agent owns not only their private data, but also uniquely designed models.
Abstract: Federated learning enables the creation of a powerful centralized model without compromising data privacy of multiple participants. While successful, it does not incorporate the case where each participant independently designs its own model. Due to intellectual property concerns and heterogeneous nature of tasks and data, this is a widespread requirement in applications of federated learning to areas such as health care and AI as a service. In this work, we use transfer learning and knowledge distillation to develop a universal framework that enables federated learning when each agent owns not only their private data, but also uniquely designed models. We test our framework on the MNIST/FEMNIST dataset and the CIFAR10/CIFAR100 dataset and observe fast improvement across all participating models. With 10 distinct participants, the final test accuracy of each model on average receives a 20% gain on top of what's possible without collaboration and is only a few percent lower than the performance each model would have obtained if all private datasets were pooled and made directly available for all participants.
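
A hedged sketch of one communication round in the spirit of the framework described above: heterogeneous models share only predictions on a public dataset, a consensus is formed by averaging, and each model is distilled toward it. The models, data and loss are toy assumptions, not the authors' implementation.

```python
# Toy stand-ins for the public data and the participants' heterogeneous models.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
public_x = torch.randn(256, 784)
models = [nn.Sequential(nn.Linear(784, h), nn.ReLU(), nn.Linear(h, 10))
          for h in (64, 128, 256)]       # each participant designs its own architecture

for _ in range(5):                       # communication rounds
    with torch.no_grad():                # share only predictions on the public set, never private data
        consensus = torch.stack([m(public_x) for m in models]).mean(0)
    for m in models:                     # each participant distills its model toward the consensus
        opt = torch.optim.Adam(m.parameters(), lr=1e-3)
        for _ in range(10):
            opt.zero_grad()
            F.mse_loss(m(public_x), consensus).backward()
            opt.step()
    # (omitted: each participant also keeps training on its own private labeled data)
```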

337 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper presents Transferrable Prototypical Networks (TPN) for unsupervised domain adaptation, trained end-to-end by jointly minimizing the distance across the prototypes computed on three types of data and the KL-divergence between the score distributions output by each pair of prototypes.
Abstract: In this paper, we introduce a new idea for unsupervised domain adaptation via a remold of Prototypical Networks, which learn an embedding space and perform classification via the distances to the prototype of each class. Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes for each class in the source and target domains are close in the embedding space and the score distributions predicted by the prototypes separately on source and target data are similar. Technically, TPN initially matches each target example to the nearest prototype in the source domain and assigns it a "pseudo" label. The prototype of each class can then be computed on source-only, target-only and source-target data, respectively. TPN is trained end-to-end by jointly minimizing the distance across the prototypes on the three types of data and the KL-divergence between the score distributions output by each pair of prototypes. Extensive experiments are conducted on transfers across the MNIST, USPS and SVHN datasets, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, we obtain an accuracy of 80.4% with a single model on the VisDA 2017 dataset.
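
A minimal sketch of the pseudo-labelling step described above, with random embeddings standing in for the learned embedding space; the way source and target prototypes are pooled here is an assumption made for illustration.

```python
# Random stand-ins for learned embeddings of source and target examples.
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 10, 64
src_emb = rng.standard_normal((500, dim))
src_lbl = rng.integers(0, n_classes, 500)
tgt_emb = rng.standard_normal((300, dim))

src_proto = np.stack([src_emb[src_lbl == c].mean(0) for c in range(n_classes)])
dists = np.linalg.norm(tgt_emb[:, None, :] - src_proto[None, :, :], axis=-1)
pseudo = dists.argmin(1)                                  # nearest-prototype "pseudo" labels for target data

tgt_proto = np.stack([tgt_emb[pseudo == c].mean(0) if np.any(pseudo == c) else src_proto[c]
                      for c in range(n_classes)])
joint_proto = 0.5 * (src_proto + tgt_proto)               # one simple way to pool source-target prototypes
```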

303 citations


Posted Content
TL;DR: This work combines the principles of confident learning, building on the assumption of a classification noise process, to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels, resulting in a generalized CL that is provably consistent and experimentally performant.
Abstract: Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a classification noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeds seven state-of-the-art approaches for learning with noisy labels on the CIFAR dataset. The CL framework is not coupled to a specific data modality or model: we use CL to find errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews. We also employ CL on ImageNet to quantify ontological class overlap (e.g. finding approximately 645 "missile" images are mislabeled as their parent class "projectile"), and moderately increase model accuracy (e.g. for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.
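
A hedged sketch of the counting-with-probabilistic-thresholds principle mentioned above; the predicted probabilities are random stand-ins, and this simplified rule illustrates the idea rather than reproducing the cleanlab implementation.

```python
# Random stand-ins for out-of-sample predicted probabilities and noisy labels.
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 10
probs = rng.dirichlet(np.ones(k), size=n)
given = rng.integers(0, k, n)

thresh = np.array([probs[given == j, j].mean() for j in range(k)])   # per-class probabilistic thresholds
above = probs >= thresh                                              # classes whose probability clears the threshold
latent = np.where(above.any(1), np.argmax(probs * above, 1), given)  # best guess at the uncorrupted label
suspect = np.flatnonzero(latent != given)                            # candidate label errors
```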

272 citations


Journal ArticleDOI
17 Jul 2019
TL;DR: This paper proposes AutoZOOM, a query-efficient black-box attack framework built on an adaptive random gradient estimation strategy to balance query counts and distortion, and an autoencoder that is either trained offline with unlabeled data or a bilinear resizing operation for attack acceleration.
Abstract: Recent studies have shown that adversarial examples in state-of-the-art image classifiers trained by deep neural networks (DNN) can be easily generated when the target model is transparent to an attacker, known as the white-box setting. However, when attacking a deployed machine learning service, one can only acquire the input-output correspondences of the target model; this is the so-called black-box attack setting. The major drawback of existing black-box attacks is the need for excessive model queries, which may give a false sense of model robustness due to inefficient query designs. To bridge this gap, we propose a generic framework for query-efficient black-box attacks. Our framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order Optimization Method, has two novel building blocks towards efficient black-box attacks: (i) an adaptive random gradient estimation strategy to balance query counts and distortion, and (ii) an autoencoder that is either trained offline with unlabeled data or a bilinear resizing operation for attack acceleration. Experimental results suggest that, by applying AutoZOOM to a state-of-the-art black-box attack (ZOO), a significant reduction in model queries can be achieved without sacrificing the attack success rate and the visual quality of the resulting adversarial examples. In particular, when compared to the standard ZOO method, AutoZOOM can consistently reduce the mean query counts in finding successful adversarial examples (or reaching the same distortion level) by at least 93% on MNIST, CIFAR-10 and ImageNet datasets, leading to novel insights on adversarial robustness.
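
A hedged sketch of random-direction zeroth-order gradient estimation, the kind of query-based estimator AutoZOOM builds on; the loss function, smoothing parameter and query budget are toy assumptions, and the paper additionally performs the search in an autoencoder or bilinearly resized latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                      # stand-in for the scalar black-box attack loss
    return np.sum(x ** 2)

def zo_gradient(f, x, beta=1e-2, q=50):
    d, g, fx = x.size, np.zeros(x.size), f(x)  # one query for the base point
    for _ in range(q):                         # one extra query per random direction
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (f(x + beta * u) - fx) / beta * u
    return d * g / q                           # scaled average approximates the true gradient

x = rng.standard_normal(100)
est, true = zo_gradient(f, x, q=500), 2 * x
print(est @ true / (np.linalg.norm(est) * np.linalg.norm(true)))   # cosine similarity, should be close to 1
```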

Posted Content
TL;DR: It is shown that class prototypes, obtained using either an encoder or through class specific k-d trees, significantly speed up the search for counterfactual instances and result in more interpretable explanations.
Abstract: We propose a fast, model-agnostic method for finding interpretable counterfactual explanations of classifier predictions by using class prototypes. We show that class prototypes, obtained using either an encoder or through class-specific k-d trees, significantly speed up the search for counterfactual instances and result in more interpretable explanations. We introduce two novel metrics to quantitatively evaluate local interpretability at the instance level. We use these metrics to illustrate the effectiveness of our method on an image dataset and a tabular dataset, respectively MNIST and Breast Cancer Wisconsin (Diagnostic). The method also eliminates the computational bottleneck that arises because of numerical gradient evaluation for black-box models.

Journal ArticleDOI
TL;DR: A universal Frequency Principle (F-Principle), namely that DNNs often fit target functions from low to high frequencies, is demonstrated on high-dimensional benchmark datasets such as MNIST/CIFAR10 and deep neural networks such as VGG16.
Abstract: We study the training process of Deep Neural Networks (DNNs) from the Fourier analysis perspective. We demonstrate a very universal Frequency Principle (F-Principle) --- DNNs often fit target functions from low to high frequencies --- on high-dimensional benchmark datasets such as MNIST/CIFAR10 and deep neural networks such as VGG16. This F-Principle of DNNs is opposite to the behavior of most conventional iterative numerical schemes (e.g., Jacobi method), which exhibit faster convergence for higher frequencies for various scientific computing problems. With a simple theory, we illustrate that this F-Principle results from the regularity of the commonly used activation functions. The F-Principle implies an implicit bias that DNNs tend to fit training data by a low-frequency function. This understanding provides an explanation of good generalization of DNNs on most real datasets and bad generalization of DNNs on parity function or randomized dataset.

Journal ArticleDOI
TL;DR: An energy-efficient static random access memory (SRAM) with embedded dot-product computation capability, for binary-weight convolutional neural networks, using a 10T bit-cell-based SRAM array to store the 1-b filter weights.
Abstract: This paper presents an energy-efficient static random access memory (SRAM) with embedded dot-product computation capability, for binary-weight convolutional neural networks. A 10T bit-cell-based SRAM array is used to store the 1-b filter weights. The array implements dot-product as a weighted average of the bitline voltages, which are proportional to the digital input values. Local integrating analog-to-digital converters compute the digital convolution outputs, corresponding to each filter. We have successfully demonstrated functionality (>98% accuracy) with the 10 000 test images in the MNIST hand-written digit recognition data set, using 6-b inputs/outputs. Compared to conventional full-digital implementations using small bitwidths, we achieve similar or better energy efficiency, by reducing data transfer, due to the highly parallel in-memory analog computations.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper proposes ComDefend, an end-to-end image compression model to defend against adversarial examples, consisting of a compression convolutional neural network (ComCNN), which maintains the structure information of the original image and purifies adversarial perturbations, and a reconstruction CNN (ResCNN), which reconstructs the original image with high quality.
Abstract: Deep neural networks (DNNs) have been demonstrated to be vulnerable to adversarial examples. Specifically, adding imperceptible perturbations to clean images can fool the well trained deep neural networks. In this paper, we propose an end-to-end image compression model to defend adversarial examples: ComDefend. The proposed model consists of a compression convolutional neural network (ComCNN) and a reconstruction convolutional neural network (ResCNN). The ComCNN is used to maintain the structure information of the original image and purify adversarial perturbations. And the ResCNN is used to reconstruct the original image with high quality. In other words, ComDefend can transform the adversarial image to its clean version, which is then fed to the trained classifier. Our method is a pre-processing module, and does not modify the classifier’s structure during the whole process. Therefore it can be combined with other model-specific defense models to jointly improve the classifier’s robustness. A series of experiments conducted on MNIST, CIFAR10 and ImageNet show that the proposed method outperforms the state-of-the-art defense methods, and is consistently effective to protect classifiers against adversarial attacks.

Proceedings Article
06 Sep 2019
TL;DR: In this article, samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching, yielding samples comparable to GANs on MNIST, CelebA and CIFAR-10.
Abstract: We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.
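
A hedged sketch of the annealed Langevin dynamics sampler described above, using the analytic score of a toy 2-D Gaussian mixture in place of a trained score network; the noise schedule, step size and step counts are illustrative assumptions.

```python
# Analytic score of a toy 2-D Gaussian mixture stands in for the trained score network.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-4.0, 0.0], [4.0, 0.0]])

def score(x, sigma):                                         # gradient of the log density of the sigma-smoothed mixture
    var = 1.0 + sigma ** 2
    diffs = x[:, None, :] - means[None, :, :]
    logw = -np.sum(diffs ** 2, axis=-1) / (2 * var)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return -(w[..., None] * diffs).sum(axis=1) / var

sigmas = np.geomspace(5.0, 0.1, num=10)                      # gradually decreasing noise levels
x = rng.standard_normal((2000, 2)) * sigmas[0]
base = 1e-3
for sigma in sigmas:
    step = base * (sigma / sigmas[-1]) ** 2                  # step size scaled with the noise level
    for _ in range(100):                                     # a few Langevin steps per level
        x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * rng.standard_normal(x.shape)
# x now roughly clusters around the mixture modes at (-4, 0) and (4, 0)
```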

Proceedings ArticleDOI
15 Jun 2019
TL;DR: In this article, an efficient approach is proposed to generate gradient-based attacks that induce misclassifications with low L2 norm, by decoupling the direction and the norm of the adversarial perturbation that is added to the image.
Abstract: Research on adversarial examples in computer vision tasks has shown that small, often imperceptible changes to an image can induce misclassification, which has security implications for a wide range of image processing systems. Considering L2 norm distortions, the Carlini and Wagner attack is presently the most effective white-box attack in the literature. However, this method is slow since it performs a line-search for one of the optimization terms, and often requires thousands of iterations. In this paper, an efficient approach is proposed to generate gradient-based attacks that induce misclassifications with low L2 norm, by decoupling the direction and the norm of the adversarial perturbation that is added to the image. Experiments conducted on the MNIST, CIFAR-10 and ImageNet datasets indicate that our attack achieves comparable results to the state-of-the-art (in terms of L2 norm) with considerably fewer iterations (as few as 100 iterations), which opens the possibility of using these attacks for adversarial training. Models trained with our attack achieve state-of-the-art robustness against white-box gradient-based L2 attacks on the MNIST and CIFAR-10 datasets, outperforming the Madry defense when the attacks are limited to a maximum norm.
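
A hedged sketch of decoupling the direction and the norm of the perturbation: each iteration takes a normalized gradient step to pick the direction, then projects the perturbation onto an L2 sphere whose radius shrinks when the example already fools the model and grows otherwise. The stand-in classifier and all hyperparameters are assumptions, not the paper's settings.

```python
# Stand-in linear classifier; a real attack would use the trained network under evaluation.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(784, 10)
x, label = torch.rand(1, 784), torch.tensor([3])

delta = torch.zeros_like(x, requires_grad=True)
eps, gamma, alpha = 1.0, 0.05, 0.5                           # assumed budget, schedule and step size
for _ in range(100):
    F.cross_entropy(model(x + delta), label).backward()
    with torch.no_grad():
        delta += alpha * delta.grad / (delta.grad.norm() + 1e-12)   # direction: normalized gradient step
        fooled = (model(x + delta).argmax(1) != label).item()
        eps *= (1 - gamma) if fooled else (1 + gamma)                # norm: shrink the budget if already adversarial
        delta *= eps / (delta.norm() + 1e-12)                        # project the perturbation onto the L2 sphere of radius eps
    delta.grad.zero_()
print(eps, (model(x + delta).argmax(1) != label).item())             # final norm and whether the example is adversarial
```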

Posted Content
TL;DR: This work introduces a new architecture called conditional invertible neural network (cINN), which combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features.
Abstract: In this work, we address the task of natural image generation guided by a conditioning input. We introduce a new architecture called conditional invertible neural network (cINN). The cINN combines the purely generative INN model with an unconstrained feed-forward network, which efficiently preprocesses the conditioning input into useful features. All parameters of the cINN are jointly optimized with a stable, maximum likelihood-based training procedure. By construction, the cINN does not experience mode collapse and generates diverse samples, in contrast to e.g. cGANs. At the same time our model produces sharp images since no reconstruction loss is required, in contrast to e.g. VAEs. We demonstrate these properties for the tasks of MNIST digit generation and image colorization. Furthermore, we take advantage of our bi-directional cINN architecture to explore and manipulate emergent properties of the latent space, such as changing the image style in an intuitive way.

Proceedings Article
01 Jan 2019
TL;DR: It is shown that 20% of corrupt workers are sufficient to degrade the accuracy of a CIFAR10 model by 50%, as well as to introduce backdoors into MNIST and CIFAR10 models without hurting their accuracy.
Abstract: Distributed learning is central for large-scale training of deep-learning models. However, it is exposed to a security threat in which Byzantine participants can interrupt or control the learning process. Previous attack models assume that the rogue participants (a) are omniscient (know the data of all other participants), and (b) introduce large changes to the parameters. Accordingly, most defense mechanisms make a similar assumption and attempt to use statistically robust methods to identify and discard values whose reported gradients are far from the population mean. We observe that if the empirical variance between the gradients of workers is high enough, an attacker could take advantage of this and launch a non-omniscient attack that operates within the population variance. We show that the variance is indeed high enough even for simple datasets such as MNIST, allowing an attack that is not only undetected by existing defenses, but also uses their power against them, causing those defense mechanisms to consistently select the Byzantine workers while discarding legitimate ones. We demonstrate our attack method works not only for preventing convergence but also for repurposing of the model behavior ("backdooring"). We show that less than 25% of colluding workers are sufficient to degrade the accuracy of models trained on MNIST, CIFAR10 and CIFAR100 by 50%, as well as to introduce backdoors without hurting the accuracy for MNIST and CIFAR10 datasets, but with a degradation for CIFAR100.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work introduces DeepCaps, a deep capsule network architecture which uses a novel 3D convolution based dynamic routing algorithm and proposes a class independent decoder network, which strengthens the use of reconstruction loss as a regularization term.
Abstract: The capsule network is a promising concept in deep learning, yet its true potential has not been fully realized thus far, providing sub-par performance on several key benchmark datasets with complex data. Drawing intuition from the success achieved by Convolutional Neural Networks (CNNs) by going deeper, we introduce DeepCaps, a deep capsule network architecture which uses a novel 3D-convolution-based dynamic routing algorithm. With DeepCaps, we surpass the state-of-the-art results of capsule-domain networks on CIFAR10, SVHN and Fashion MNIST, while achieving a 68% reduction in the number of parameters. Further, we propose a class-independent decoder network, which strengthens the use of reconstruction loss as a regularization term. This leads to an interesting property of the decoder, which allows us to identify and control the physical attributes of the images represented by the instantiation parameters.

Posted Content
TL;DR: A comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences.
Abstract: Overconfidence and underconfidence in machine learning classifiers are measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on that prediction. How one measures calibration remains a challenge: expected calibration error, the most popular metric, has numerous flaws which we outline, and there is no clear empirical understanding of how its choices affect conclusions in practice, and what recommendations there are to counteract its flaws. In this paper, we perform a comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences. To analyze the sensitivity of calibration measures, we study the impact of optimizing directly for each variant with recalibration techniques. Across MNIST, Fashion MNIST, CIFAR-10/100, and ImageNet, we find that conclusions on the rank ordering of recalibration methods are drastically impacted by the choice of calibration measure. We find that conditioning on the class leads to more effective calibration evaluations, and that using the L2 norm rather than the L1 norm improves both optimization for calibration metrics and the rank correlation measuring metric consistency. Adaptive binning schemes lead to more stability of metric rank ordering when the number of bins varies, and are also recommended. We open source a library for the use of our calibration measures.
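
A hedged sketch of two of the design choices studied above, equal-width versus adaptive (equal-mass) binning and the L1 versus L2 norm, applied to a top-label calibration error; the predictions are random stand-ins, and the paper's measures cover more variants (all-probability, thresholding, class-conditional).

```python
# Random stand-ins for predicted probabilities and labels.
import numpy as np

rng = np.random.default_rng(0)
n, k = 5000, 10
probs = rng.dirichlet(np.ones(k) * 0.3, size=n)
labels = rng.integers(0, k, n)
conf = probs.max(1)                                      # top-label confidence
correct = (probs.argmax(1) == labels).astype(float)

def calib_error(conf, correct, n_bins=15, adaptive=False, norm=1):
    edges = (np.quantile(conf, np.linspace(0, 1, n_bins + 1)) if adaptive   # equal-mass bins
             else np.linspace(0, 1, n_bins + 1))                            # equal-width bins
    idx = np.clip(np.searchsorted(edges, conf, side="right") - 1, 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            err += m.mean() * abs(conf[m].mean() - correct[m].mean()) ** norm
    return err ** (1.0 / norm)

print(calib_error(conf, correct))                        # ECE-style: equal-width bins, L1 norm
print(calib_error(conf, correct, adaptive=True, norm=2)) # adaptive bins, L2 norm
```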

Posted Content
TL;DR: In this article, a neural structure optimization component and a parameter learning and/or fine-tuning component are proposed to handle the catastrophic forgetting problem in continual learning with DNNs.
Abstract: Addressing catastrophic forgetting is one of the key challenges in continual learning where machine learning systems are trained with sequential or streaming tasks. Despite recent remarkable progress in state-of-the-art deep learning, deep neural networks (DNNs) are still plagued with the catastrophic forgetting problem. This paper presents a conceptually simple yet general and effective framework for handling catastrophic forgetting in continual learning with DNNs. The proposed method consists of two components: a neural structure optimization component and a parameter learning and/or fine-tuning component. By separating the explicit neural structure learning and the parameter estimation, not only is the proposed method capable of evolving neural structures in an intuitively meaningful way, but it also shows strong capabilities of alleviating catastrophic forgetting in experiments. Furthermore, the proposed method outperforms all other baselines on the permuted MNIST dataset, the split CIFAR100 dataset and the Visual Domain Decathlon dataset in the continual learning setting.

Proceedings Article
01 Jan 2019
TL;DR: The proposed approach (CURL) performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting.
Abstract: Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d setting, or when adapting the technique to supervised tasks such as incremental class learning.

Proceedings ArticleDOI
13 Jul 2019
TL;DR: GenAttack is introduced, a gradient-free optimization technique that uses genetic algorithms for synthesizing adversarial examples in the black-box setting and can successfully attack some state-of-the-art ImageNet defenses, including ensemble adversarial training and non-differentiable or randomized input transformations.
Abstract: Deep neural networks are vulnerable to adversarial examples, even in the black-box setting, where the attacker is restricted solely to query access. Existing black-box approaches to generating adversarial examples typically require a significant number of queries, either for training a substitute network or performing gradient estimation. We introduce GenAttack, a gradient-free optimization technique that uses genetic algorithms for synthesizing adversarial examples in the black-box setting. Our experiments on different datasets (MNIST, CIFAR-10, and ImageNet) show that GenAttack can successfully generate visually imperceptible adversarial examples against state-of-the-art image recognition models with orders of magnitude fewer queries than previous approaches. Against MNIST and CIFAR-10 models, GenAttack required roughly 2,126 and 2,568 times fewer queries respectively, than ZOO, the prior state-of-the-art black-box attack. In order to scale up the attack to large-scale high-dimensional ImageNet models, we perform a series of optimizations that further improve the query efficiency of our attack leading to 237 times fewer queries against the Inception-v3 model than ZOO. Furthermore, we show that GenAttack can successfully attack some state-of-the-art ImageNet defenses, including ensemble adversarial training and non-differentiable or randomized input transformations. Our results suggest that evolutionary algorithms open up a promising area of research into effective black-box attacks.
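
A hedged sketch of a gradient-free, genetic-algorithm search in the spirit of GenAttack: a population of perturbations is scored by querying the model, parents are sampled in proportion to fitness, and children are formed by crossover plus sparse mutation. The stand-in model, the fitness function and all hyperparameters are toy assumptions.

```python
# Stand-in classifier queried only for class scores (no gradients).
import numpy as np

rng = np.random.default_rng(0)
d, pop_size, eps, target = 100, 6, 0.3, 7
x = rng.random(d)                                              # clean input
W = rng.standard_normal((10, d)) / d                           # weights of the stand-in classifier

def query(inp):                                                # black-box query returning class scores
    return W @ np.clip(inp, 0.0, 1.0)

def fitness(delta):                                            # higher when the target class dominates
    s = query(x + delta)
    return s[target] - np.log(np.exp(np.delete(s, target)).sum())

pop = rng.uniform(-eps, eps, size=(pop_size, d))
for _ in range(200):
    fit = np.array([fitness(p) for p in pop])
    elite = pop[fit.argmax()]                                  # elitism: keep the best member
    probs = np.exp(fit - fit.max()); probs /= probs.sum()
    children = [elite]
    for _ in range(pop_size - 1):
        a, b = pop[rng.choice(pop_size, 2, p=probs)]           # fitness-proportional parent selection
        child = np.where(rng.random(d) < 0.5, a, b)            # uniform crossover
        mut = rng.random(d) < 0.05                             # sparse mutation
        child[mut] += rng.uniform(-0.1, 0.1, mut.sum())
        children.append(np.clip(child, -eps, eps))             # stay inside the L-infinity budget
    pop = np.array(children)
adv = np.clip(x + elite, 0.0, 1.0)                             # best adversarial candidate found
```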

Journal ArticleDOI
TL;DR: In situ training of a five-level convolutional neural network that self-adapts to non-idealities of the one-transistor one-memristor array to classify the MNIST dataset is experimentally demonstrated, achieving a 75% reduction in weights without compromising accuracy.
Abstract: The explosive growth of machine learning is largely due to the recent advancements in hardware and architecture. The engineering of network structures, taking advantage of the spatial or temporal translational isometry of patterns, naturally leads to bio-inspired, shared-weight structures such as convolutional neural networks, which have markedly reduced the number of free parameters. State-of-the-art microarchitectures commonly rely on weight-sharing techniques, but still suffer from the von Neumann bottleneck of transistor-based platforms. Here, we experimentally demonstrate the in situ training of a five-level convolutional neural network that self-adapts to non-idealities of the one-transistor one-memristor array to classify the MNIST dataset, achieving similar accuracy to the memristor-based multilayer perceptron with a reduction in trainable parameters of ~75% owing to the shared weights. In addition, the memristors encoded both spatial and temporal translational invariance simultaneously in a convolutional long short-term memory network—a memristor-based neural network with intrinsic 3D input processing—which was trained in situ to classify a synthetic MNIST sequence dataset using just 850 weights. These proof-of-principle demonstrations combine the architectural advantages of weight sharing and the area/energy efficiency boost of the memristors, paving the way to future edge artificial intelligence. Memristive devices can provide energy-efficient neural network implementations, but they must be tailored to suit different network architectures. Wang et al. develop a trainable weight-sharing mechanism for memristor-based CNNs and ConvLSTMs, achieving a 75% reduction in weights without compromising accuracy.

Journal ArticleDOI
TL;DR: The experimental results show that M3CE can enhance the cross-entropy, and it is an effective supplement to the cross-entropy criterion.
Abstract: Based on an analysis of the error backpropagation algorithm, we propose an innovative training criterion for deep neural networks, the maximum interval minimum classification error (M3CE). At the same time, the cross-entropy and M3CE are analyzed and combined to obtain better results. Finally, we tested our proposed M3CE-CEc on two standard deep learning databases, MNIST and CIFAR-10. The experimental results show that M3CE can enhance the cross-entropy, and it is an effective supplement to the cross-entropy criterion. M3CE-CEc obtained good results on both databases.

Journal ArticleDOI
Gregory K. Chen, Raghavan Kumar, H. Ekin Sumbul, Phil Knag, Ram Krishnamurthy
TL;DR: A reconfigurable 4096-neuron, 1M-synapse chip in 10-nm FinFET CMOS is developed to accelerate inference and learning for many classes of spiking neural networks (SNNs) with less than 2% overhead for storing connections.
Abstract: A reconfigurable 4096-neuron, 1M-synapse chip in 10-nm FinFET CMOS is developed to accelerate inference and learning for many classes of spiking neural networks (SNNs). The SNN features digital circuits for leaky integrate and fire neuron models, on-chip spike-timing-dependent plasticity (STDP) learning, and high-fan-out multicast spike communication. Structured fine-grained weight sparsity reduces synapse memory by up to 16× with less than 2% overhead for storing connections. Approximate computing co-optimizes the dropping flow control and benefits from algorithmic noise to process spatiotemporal spike patterns with up to 9.4× lower energy. The SNN achieves a peak throughput of 25.2 GSOP/s at 0.9 V, peak energy efficiency of 3.8 pJ/SOP at 525 mV, and 2.3 µW/neuron operation at 450 mV. On-chip unsupervised STDP trains a spiking restricted Boltzmann machine to de-noise Modified National Institute of Standards and Technology (MNIST) digits and to reconstruct natural scene images with an RMSE of 0.036. Near-threshold operation, in conjunction with temporal and spatial sparsity, reduces energy by 17.4× to 1.0 µJ/classification in a 236 × 20 feed-forward network that is trained to classify MNIST digits using supervised STDP. A binary-activation multilayer perceptron with 50% sparse weights is trained offline with error backpropagation to classify MNIST digits with 97.9% accuracy at 1.7 µJ/classification.

Proceedings Article
01 Jan 2019
TL;DR: In this article, a novel robust classification model was proposed that performs analysis by synthesis using learned class-conditional data distributions, which yields state-of-the-art robustness on MNIST against L0, L2 and L-infinity perturbations.
Abstract: Despite much effort, deep neural networks remain highly susceptible to tiny input perturbations and even for MNIST, one of the most common toy datasets in computer vision, no neural network model exists for which adversarial perturbations are large and make semantic sense to humans. We show that even the widely recognized and by far most successful defense by Madry et al. (1) overfits on the L-infinity metric (it's highly susceptible to L2 and L0 perturbations), (2) classifies unrecognizable images with high certainty, (3) performs not much better than simple input binarization and (4) features adversarial perturbations that make little sense to humans. These results suggest that MNIST is far from being solved in terms of adversarial robustness. We present a novel robust classification model that performs analysis by synthesis using learned class-conditional data distributions. We derive bounds on the robustness and go to great length to empirically evaluate our model using maximally effective adversarial attacks by (a) applying decision-based, score-based, gradient-based and transfer-based attacks for several different Lp norms, (b) by designing a new attack that exploits the structure of our defended model and (c) by devising a novel decision-based attack that seeks to minimize the number of perturbed pixels (L0). The results suggest that our approach yields state-of-the-art robustness on MNIST against L0, L2 and L-infinity perturbations and we demonstrate that most adversarial examples are strongly perturbed towards the perceptual boundary between the original and the adversarial class.

Posted ContentDOI
Norman Mu, Justin Gilmer
TL;DR: This work demonstrates that several previously published adversarial defenses significantly degrade robustness as measured by MNIST-C, a comprehensive suite of 15 corruptions applied to the MNIST test set, and hopes that this benchmark serves as a useful tool for future work in designing systems that are able to learn robust feature representations that capture the underlying semantics of the input.
Abstract: We introduce the MNIST-C dataset, a comprehensive suite of 15 corruptions applied to the MNIST test set, for benchmarking out-of-distribution robustness in computer vision. Through several experiments and visualizations we demonstrate that our corruptions significantly degrade performance of state-of-the-art computer vision models while preserving the semantic content of the test images. In contrast to the popular notion of adversarial robustness, our model-agnostic corruptions do not seek worst-case performance but are instead designed to be broad and diverse, capturing multiple failure modes of modern models. In fact, we find that several previously published adversarial defenses significantly degrade robustness as measured by MNIST-C. We hope that our benchmark serves as a useful tool for future work in designing systems that are able to learn robust feature representations that capture the underlying semantics of the input.

Posted Content
TL;DR: This work proposes a search method for neural network architectures that can already perform a task without any explicit weight training, and demonstrates that it can find minimal neural network architectures that perform several reinforcement learning tasks without weight training.
Abstract: Not all neural network architectures are created equal, some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much higher than chance accuracy on MNIST using random weights. Interactive version of this paper at this https URL
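
A hedged sketch of the single-shared-weight evaluation described above: a fixed, hand-written topology is evaluated with every connection set to the same weight value sampled from a range, and performance is averaged over samples. The tiny network and the task are toy assumptions, not the paper's search procedure.

```python
# Toy task and a tiny hand-written topology; all connections share one weight value.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def evaluate(shared_w):
    h = np.tanh(X @ (shared_w * np.ones((4, 3))))        # 4 inputs -> 3 tanh hidden units
    out = h @ (shared_w * np.ones((3, 1)))               # -> 1 output, same shared weight everywhere
    return ((out[:, 0] > 0).astype(int) == y).mean()

weights = rng.uniform(-2.0, 2.0, size=20)                # sample the single shared weight several times
print(np.mean([evaluate(w) for w in weights]))           # expected accuracy of the architecture itself
```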