Showing papers in "arXiv: Machine Learning in 2019"

Posted Content•

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz

05 Jul 2019-arXiv: Machine Learning

TL;DR: This work introduces Invariant Risk Minimization, a learning paradigm to estimate invariant correlations across multiple training distributions and shows how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.

Abstract: We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
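
As a toy illustration of the matching-classifier idea (not the authors' algorithm), the sketch below fits a one-dimensional least-squares classifier separately in each environment. The data, the environment scales, and the names `ols_slope` and `make_environment` are hypothetical. On the invariant feature the per-environment optimum is the same everywhere, while on the spurious feature it changes with the environment.

```python
def ols_slope(xs, ys):
    """Least-squares slope through the origin: argmin_w sum((w * x - y) ** 2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def make_environment(scale):
    """Hypothetical environment: y depends on `invariant` identically everywhere,
    while `spurious` is a copy of y rescaled by an environment-specific factor."""
    invariant = [-2.0, -1.0, 1.0, 2.0]
    ys = [2.0 * x for x in invariant]
    spurious = [scale * y for y in ys]
    return invariant, spurious, ys


environments = [make_environment(scale) for scale in (0.5, 2.0, 4.0)]

# Optimal classifier on each candidate representation, one per environment.
invariant_slopes = [ols_slope(inv, ys) for inv, _, ys in environments]
spurious_slopes = [ols_slope(spu, ys) for _, spu, ys in environments]

print("invariant feature:", invariant_slopes)
print("spurious feature: ", spurious_slopes)
```

A representation that keeps only the invariant feature would satisfy the matching condition in this toy setup; one that keeps the spurious feature would not.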

1,029 citations

Posted Content•

On the Convergence of FedAvg on Non-IID Data

Xiang Li¹, Kaixuan Huang¹, Wenhao Yang¹, Shusen Wang², Zhihua Zhang¹•Institutions (2)

Peking University¹, Stevens Institute of Technology²

04 Jul 2019-arXiv: Machine Learning

TL;DR: This paper analyzes the convergence of Federated Averaging on non-iid data and establishes a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGDs.

Abstract: Federated learning enables a large number of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (FedAvg) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of FedAvg on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGDs. Importantly, our bound demonstrates a trade-off between communication-efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. Furthermore, we provide a necessary condition for FedAvg on non-iid data: the learning rate $\eta$ must decay, even if full-gradient is used; otherwise, the solution will be $\Omega(\eta)$ away from the optimal.
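
The remark about decaying learning rates can be made concrete with a minimal sketch. It assumes a toy setting that is not taken from the paper: two devices with scalar quadratic objectives $f_k(w) = \frac{1}{2} a_k (w - c_k)^2$, full participation, full gradients, and five local steps per round. The function `fedavg` and all constants are hypothetical.

```python
def fedavg(curvatures, centers, rounds, local_steps, lr_schedule):
    """Full-participation FedAvg on f_k(w) = 0.5 * a_k * (w - c_k) ** 2."""
    w = 0.0
    for r in range(rounds):
        lr = lr_schedule(r)
        local_models = []
        for a, c in zip(curvatures, centers):
            w_k = w
            for _ in range(local_steps):
                w_k -= lr * a * (w_k - c)  # full-gradient step on device k
            local_models.append(w_k)
        # Server step: average the local models once per round.
        w = sum(local_models) / len(local_models)
    return w


# Non-iid toy data: devices disagree on both curvature and minimizer.
curvatures, centers = [1.0, 3.0], [0.0, 1.0]
# Minimizer of the equally weighted average objective.
w_star = sum(a * c for a, c in zip(curvatures, centers)) / sum(curvatures)

w_const = fedavg(curvatures, centers, 2000, 5, lambda r: 0.2)
w_decay = fedavg(curvatures, centers, 2000, 5, lambda r: 0.2 / (1.0 + 0.1 * r))

print("optimum:", w_star)
print("constant learning rate:", w_const)
print("decaying learning rate:", w_decay)
```

In this toy problem the constant-rate run settles at a point visibly away from the optimum, while the decaying-rate run approaches it, which is the qualitative behavior the abstract describes.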

919 citations

Posted Content•

Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

Yaniv Ovadia¹, Emily Fertig¹, Jie Ren¹, Zachary Nado¹, D. Sculley¹, Sebastian Nowozin¹, Joshua V. Dillon¹, Balaji Lakshminarayanan¹, Jasper Snoek¹•Institutions (1)

Google¹

06 Jun 2019-arXiv: Machine Learning

TL;DR: A large-scale benchmark of existing state-of-the-art methods on classification problems and the effect of dataset shift on accuracy and calibration is presented, finding that traditional post-hoc calibration does indeed fall short, as do several other previous methods.

Abstract: Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other previous methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.
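
Calibration is often summarized with the expected calibration error (ECE). The sketch below is one common equal-width-bin formulation; the abstract does not say which metrics the benchmark uses, so treat the choice of ECE, the bin count, and the sample data as assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(ece)
```

A perfectly calibrated model would have an ECE of zero; under dataset shift one would track how this value changes alongside accuracy.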

754 citations

Journal Article•DOI•

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Jaehoon Lee¹, Lechao Xiao¹, Samuel S. Schoenholz¹, Yasaman Bahri¹, Roman Novak¹, Jascha Sohl-Dickstein¹, Jeffrey Pennington¹•Institutions (1)

Google¹

18 Feb 2019-arXiv: Machine Learning

TL;DR: In this article, the authors show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
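
For reference, the first-order Taylor expansion mentioned above can be written as follows. The notation is an assumption made here for illustration: $f(x; \theta)$ is the network output, $\theta_0$ the parameters at initialization, and $f^{\mathrm{lin}}$ the linearized model.

```latex
f^{\mathrm{lin}}(x; \theta)
  = f(x; \theta_0)
  + \nabla_\theta f(x; \theta)\big|_{\theta = \theta_0} \, (\theta - \theta_0)
```

The model is linear in $\theta$ but still nonlinear in the input $x$, since both $f(x; \theta_0)$ and its Jacobian depend on $x$.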

738 citations

Posted Content•

Normalizing Flows for Probabilistic Modeling and Inference

George Papamakarios¹, Eric Nalisnick¹, Danilo Jimenez Rezende¹, Shakir Mohamed¹, Balaji Lakshminarayanan¹•Institutions (1)

Google¹

05 Dec 2019-arXiv: Machine Learning

TL;DR: This review places special emphasis on the fundamental principles of flow design, and discusses foundational topics such as expressive power and computational trade-offs, and summarizes the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.

Abstract: Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.
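
The base-distribution-plus-bijection recipe rests on the change-of-variables formula. A minimal sketch, assuming the simplest possible case (a standard normal base and a single one-dimensional affine bijection $x = \text{scale} \cdot u + \text{shift}$); the function names are hypothetical:

```python
import math


def standard_normal_logpdf(u):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (u * u + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, scale, shift):
    """Log-density of x = scale * u + shift with u ~ N(0, 1).

    Change of variables: log p_x(x) = log p_u(T^{-1}(x)) + log |d T^{-1} / dx|,
    and for an affine map the second term is -log |scale|.
    """
    u = (x - shift) / scale  # invert the bijection
    return standard_normal_logpdf(u) - math.log(abs(scale))


print(affine_flow_logpdf(1.3, scale=2.0, shift=-0.5))
```

Stacking several bijections works the same way: invert them in reverse order and add up the log-determinant terms.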

716 citations

Posted Content•

Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

Bryan Lim¹, Sercan O. Arik², Nicolas Loeff², Tomas Pfister²•Institutions (2)

University of Oxford¹, Google²

19 Dec 2019-arXiv: Machine Learning

TL;DR: The Temporal Fusion Transformer is introduced -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics and three practical interpretability use-cases of TFT are showcased.

Abstract: Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i.e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layers for learning long-term dependencies. The TFT also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate significant performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of TFT.

422 citations

Posted Content•

Deep Ensembles: A Loss Landscape Perspective

Stanislav Fort¹, Huiyi Hu², Balaji Lakshminarayanan²•Institutions (2)

Stanford University¹, Google²

25 Sep 2019-arXiv: Machine Learning

TL;DR: Developing the concept of the diversity--accuracy plane, it is shown that the decorrelation power of random initializations is unmatched by popular subspace sampling methods and the experimental results validate the hypothesis that deep ensembles work well under dataset shift.

Abstract: Deep ensembles have been empirically shown to be a promising approach for improving accuracy, uncertainty and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, while functions along an optimization trajectory or sampled from the subspace thereof cluster within a single mode predictions-wise, while often deviating significantly in the weight space. Developing the concept of the diversity--accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods. Finally, we evaluate the relative effects of ensembling, subspace based methods and ensembles of subspace based methods, and the experimental results validate our hypothesis.
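
One simple way to compare two models in the space of predictions is the fraction of inputs on which their predicted labels differ. The sketch below assumes that measure; the abstract does not specify which similarity measures are used, and the function name and sample labels are hypothetical.

```python
def prediction_disagreement(labels_a, labels_b):
    """Fraction of inputs on which two models predict different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must be evaluated on the same inputs")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


# Hypothetical predicted classes from two models on the same four inputs.
model_a = [0, 1, 2, 1]
model_b = [0, 2, 2, 0]
print(prediction_disagreement(model_a, model_b))
```

Under the abstract's hypothesis, two independently initialized networks would tend to score higher on such a measure than two checkpoints taken from the same training trajectory.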

356 citations

Journal Article•DOI•

Informed Machine Learning -- A Taxonomy and Survey of Integrating Knowledge into Learning Systems

Laura von Rueden, Sebastian Mayer, Katharina Beckh, Bogdan Georgiev, Sven Giesselbach, Raoul Heese, Birgit Kirsch, Julius Pfrommer, Annika Pick, Rajkumar Ramamurthy, Michal Walczak, Jochen Garcke, Christian Bauckhage, Jannis Schuecker

29 Mar 2019-arXiv: Machine Learning

TL;DR: A definition and proposed concept for informed machine learning is provided, which illustrates its building blocks and distinguishes it from conventional machine learning, and a taxonomy is introduced that serves as a classification framework for informed machine learning approaches.

Abstract: Despite its great success, machine learning can have its limits when dealing with insufficient training data. A potential solution is the additional integration of prior knowledge into the training process which leads to the notion of informed machine learning. In this paper, we present a structured overview of various approaches in this field. We provide a definition and propose a concept for informed machine learning which illustrates its building blocks and distinguishes it from conventional machine learning. We introduce a taxonomy that serves as a classification framework for informed machine learning approaches. It considers the source of knowledge, its representation, and its integration into the machine learning pipeline. Based on this taxonomy, we survey related research and describe how different knowledge representations such as algebraic equations, logic rules, or simulation results can be used in learning systems. This evaluation of numerous papers on the basis of our taxonomy uncovers key methods in the field of informed machine learning.

297 citations

Posted Content•

Revisiting Graph Neural Networks: All We Have is Low-Pass Filters

Hoang Nt, Takanori Maehara

23 May 2019-arXiv: Machine Learning

TL;DR: The results indicate that graph neural networks only perform low-pass filtering on feature vectors and do not have the non-linear manifold learning property, and some insights on GCN-based graph neural network design are proposed.

Abstract: Graph neural networks have become one of the most important techniques to solve machine learning problems on graph-structured data. Recent work on vertex classification proposed deep and distributed learning models to achieve high performance and scalability. However, we find that the feature vectors of benchmark datasets are already quite informative for the classification task, and the graph structure only provides a means to denoise the data. In this paper, we develop a theoretical framework based on graph signal processing for analyzing graph neural networks. Our results indicate that graph neural networks only perform low-pass filtering on feature vectors and do not have the non-linear manifold learning property. We further investigate their resilience to feature noise and propose some insights on GCN-based graph neural network design.
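
The low-pass behavior can be seen on a tiny example. The sketch below assumes the usual GCN propagation matrix, the symmetrically normalized adjacency with self-loops, and applies it to two signals on a hypothetical 4-node cycle graph: a constant (smooth) signal and an alternating (rough) one.

```python
import math


def normalized_adjacency_with_self_loops(adjacency):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adjacency)
    a_tilde = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degrees = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]


def apply_filter(matrix, signal):
    """Matrix-vector product: one propagation step on a scalar node feature."""
    return [sum(w * s for w, s in zip(row, signal)) for row in matrix]


cycle = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
graph_filter = normalized_adjacency_with_self_loops(cycle)
smooth_out = apply_filter(graph_filter, [1.0, 1.0, 1.0, 1.0])
rough_out = apply_filter(graph_filter, [1.0, -1.0, 1.0, -1.0])
print(smooth_out)
print(rough_out)
```

On this graph the constant signal passes through unchanged, while the alternating signal shrinks to one third of its amplitude, which is the sense in which the propagation step acts as a low-pass filter.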

279 citations

Posted Content•

Confident Learning: Estimating Uncertainty in Dataset Labels

Curtis G. Northcutt¹, Lu Jiang², Isaac L. Chuang¹•Institutions (2)

Massachusetts Institute of Technology¹, Google²

31 Oct 2019-arXiv: Machine Learning

TL;DR: This work combines the principles of pruning noisy data, counting with probabilistic thresholds, and ranking examples, building on the assumption of a classification noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels, resulting in a generalized CL which is provably consistent and experimentally performant.

Abstract: Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a classification noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeds seven state-of-the-art approaches for learning with noisy labels on the CIFAR dataset. The CL framework is not coupled to a specific data modality or model: we use CL to find errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews. We also employ CL on ImageNet to quantify ontological class overlap (e.g. finding approximately 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g. for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.
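
The "counting with probabilistic thresholds" step can be sketched in plain Python. This is a simplified reading rather than the cleanlab implementation: it assumes the per-class threshold is the mean predicted probability of that class over examples carrying that label, and that every class appears at least once among the given labels. The function name and data are hypothetical.

```python
def confident_joint(probs, labels):
    """Count examples by (given label, confidently predicted class)."""
    num_classes = len(probs[0])
    # Per-class threshold: average self-confidence of examples labeled with that class.
    thresholds = []
    for j in range(num_classes):
        scores = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(scores) / len(scores))
    counts = [[0] * num_classes for _ in range(num_classes)]
    suspects = []
    for index, (p, y) in enumerate(zip(probs, labels)):
        candidates = [j for j in range(num_classes) if p[j] >= thresholds[j]]
        if not candidates:
            continue  # no class is confidently predicted for this example
        likely = max(candidates, key=lambda j: p[j])
        counts[y][likely] += 1
        if likely != y:
            suspects.append(index)  # given label disagrees with the confident class
    return counts, suspects


# Hypothetical predicted probabilities and (possibly noisy) given labels.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7], [0.7, 0.3]]
labels = [0, 0, 0, 1, 1, 1]
counts, suspects = confident_joint(probs, labels)
print(counts)
print(suspects)
```

The off-diagonal entries of the count matrix estimate how often one class is mislabeled as another, and the suspect indices are the candidates for pruning before retraining.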

272 citations

Posted Content•

Data Shapley: Equitable Valuation of Data for Machine Learning

Amirata Ghorbani¹, James Zou¹•Institutions (1)

Stanford University¹

05 Apr 2019-arXiv: Machine Learning

TL;DR: This work develops a principled framework to address data valuation in the context of supervised machine learning by proposing data Shapley as a metric to quantify the value of each training datum to the predictor performance.

Abstract: As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley value uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.
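
A plain Monte Carlo estimate of Shapley values samples random orderings of the training points and averages each point's marginal contribution to a utility function. The sketch below shows that idea only; it omits any refinements the authors add for efficiency, and the `utility` callback, the additive toy utility, and the point worths are hypothetical stand-ins for "train on this subset and measure performance".

```python
import random


def monte_carlo_shapley(num_points, utility, num_permutations, seed=0):
    """Average marginal contribution of each point over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * num_points
    for _ in range(num_permutations):
        order = list(range(num_points))
        rng.shuffle(order)
        subset = []
        previous = utility(subset)
        for i in order:
            subset.append(i)
            current = utility(subset)
            values[i] += current - previous  # marginal contribution of point i
            previous = current
    return [v / num_permutations for v in values]


# Toy utility: each point adds a fixed amount, so its Shapley value is that amount.
point_worth = [0.5, -0.2, 0.1]


def additive_utility(subset):
    return sum(point_worth[i] for i in subset)


estimates = monte_carlo_shapley(len(point_worth), additive_utility, 50)
print(estimates)
```

With a real learner the utility would be expensive to evaluate, which is why the number of permutations and any early stopping within a permutation matter in practice.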

Posted Content•

Explaining individual predictions when features are dependent: More accurate approximations to Shapley values

Kjersti Aas¹, Martin Jullum¹, Anders Løland¹•Institutions (1)

Norwegian Computing Center¹

25 Mar 2019-arXiv: Machine Learning

TL;DR: In this paper, the authors extend the Kernel SHAP method to handle dependent features, and provide several examples of linear and non-linear models with various degrees of feature dependence, where their method gives more accurate approximations to the true Shapley values.

Abstract: Explaining complex or seemingly simple machine learning models is an important practical problem. We want to explain individual predictions from a complex machine learning model by learning simple, interpretable explanations. Shapley values is a game theoretic concept that can be used for this purpose. The Shapley value framework has a series of desirable theoretical properties, and can in principle handle any predictive model. Kernel SHAP is a computationally efficient approximation to Shapley values in higher dimensions. Like several other existing methods, this approach assumes that the features are independent, which may give very wrong explanations. This is the case even if a simple linear model is used for predictions. In this paper, we extend the Kernel SHAP method to handle dependent features. We provide several examples of linear and non-linear models with various degrees of feature dependence, where our method gives more accurate approximations to the true Shapley values. We also propose a method for aggregating individual Shapley values, such that the prediction can be explained by groups of dependent variables.

Posted Content•

Robust Aggregation for Federated Learning

Krishna Pillutla, Sham M. Kakade, Zaid Harchaoui

31 Dec 2019-arXiv: Machine Learning

TL;DR: The experiments show that RFA is competitive with the classical aggregation when the level of corruption is low, while demonstrating greater robustness under high corruption, and establishes the convergence of the robust federated learning algorithm for the stochastic learning of additive models with least squares.

Abstract: We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The proposed approach relies on a robust secure aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of calls to a regular non-robust secure average oracle. The robust aggregation oracle is privacy-preserving, similar to the secure average oracle it builds upon. We provide experimental results of the proposed approach with linear models and deep networks for two tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption.
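
A geometric median can be approximated with a smoothed Weiszfeld iteration, in which every step is just a reweighted average of the device updates. The sketch below assumes that scheme and ignores the secure-aggregation layer entirely; the iteration count, the smoothing constant, and the sample updates are hypothetical.

```python
import math


def weighted_average(points, weights):
    """The 'regular' aggregate: a weighted mean of the device updates."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]


def geometric_median(points, iterations=100, smoothing=1e-8):
    """Smoothed Weiszfeld iteration: repeatedly re-average with weights 1 / distance."""
    estimate = weighted_average(points, [1.0] * len(points))
    for _ in range(iterations):
        weights = [1.0 / max(smoothing, math.dist(estimate, p)) for p in points]
        estimate = weighted_average(points, weights)
    return estimate


# Four honest updates near the origin and one corrupted update far away.
updates = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
mean_update = weighted_average(updates, [1.0] * len(updates))
median_update = geometric_median(updates)
print("mean:  ", mean_update)
print("median:", median_update)
```

The plain mean is dragged far toward the corrupted update, while the median estimate stays with the honest cluster.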

Posted Content•

Adversarial Examples Are Not Bugs, They Are Features

Andrew Ilyas¹, Shibani Santurkar¹, Dimitris Tsipras¹, Logan Engstrom¹, Brandon Tran¹, Aleksander Madry¹•Institutions (1)

Massachusetts Institute of Technology¹

06 May 2019-arXiv: Machine Learning

TL;DR: The authors demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans.

Abstract: Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.

Posted Content•

On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks

Sunil Thulasidasan¹, Gopinath Chennupati¹, Jeff A. Bilmes², Tanmoy Bhattacharya¹, Sarah E. Michalak¹•Institutions (2)

Los Alamos National Laboratory¹, University of Washington²

27 May 2019-arXiv: Machine Learning

TL;DR: DNNs trained with mixup are significantly better calibrated and are less prone to over-confident predictions on out-of-distribution and random-noise data, suggesting that mixup be employed for classification tasks where predictive uncertainty is a significant concern.

Abstract: Mixup (Zhang et al., 2017) is a recently proposed method for training deep neural networks where additional samples are generated during training by convexly combining random pairs of images and their associated labels. While simple to implement, it has been shown to be a surprisingly effective method of data augmentation for image classification: DNNs trained with mixup show noticeable gains in classification performance on a number of image classification benchmarks. In this work, we discuss a hitherto untouched aspect of mixup training -- the calibration and predictive uncertainty of models trained with mixup. We find that DNNs trained with mixup are significantly better calibrated -- i.e., the predicted softmax scores are much better indicators of the actual likelihood of a correct prediction -- than DNNs trained in the regular fashion. We conduct experiments on a number of image classification architectures and datasets -- including large-scale datasets like ImageNet -- and find this to be the case. Additionally, we find that merely mixing features does not result in the same calibration benefit and that the label smoothing in mixup training plays a significant role in improving calibration. Finally, we also observe that mixup-trained DNNs are less prone to over-confident predictions on out-of-distribution and random-noise data. We conclude that the typical overconfidence seen in neural networks, even on in-distribution data, is likely a consequence of training with hard labels, suggesting that mixup be employed for classification tasks where predictive uncertainty is a significant concern.
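
The convex-combination step is small enough to show directly. The sketch below mixes one pair of examples with one-hot labels; drawing the mixing weight from a symmetric Beta distribution is an assumption here, as are the function names and the sample vectors.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam):
    """Convexly combine two inputs and their one-hot labels with weight lam."""
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y


def sample_mixing_weight(alpha, rng):
    """Assumed sampling scheme: lam ~ Beta(alpha, alpha), which lies in [0, 1]."""
    return rng.betavariate(alpha, alpha)


# Two hypothetical flattened inputs with one-hot labels for a 2-class problem.
mixed_x, mixed_y = mixup_pair([0.0, 4.0], [1.0, 0.0], [4.0, 0.0], [0.0, 1.0], lam=0.25)
print(mixed_x, mixed_y)

rng = random.Random(0)
print(sample_mixing_weight(0.4, rng))
```

The mixed label is soft rather than one-hot, which is the label-smoothing effect the abstract credits for the calibration benefit.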

Posted Content•

Towards Automatic Concept-based Explanations

Amirata Ghorbani¹, James Wexler², James Zou¹, Been Kim²•Institutions (2)

Stanford University¹, Google²

07 Feb 2019-arXiv: Machine Learning

TL;DR: This work proposes principles and desiderata for concept based explanation, which goes beyond per-sample features to identify higher-level human-understandable concepts that apply across the entire dataset.

Abstract: Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Most of the current explanation methods provide explanations through feature importance scores, which identify features that are important for each individual input. However, how to systematically summarize and interpret such per-sample feature importance scores is itself challenging. In this work, we propose principles and desiderata for concept-based explanation, which goes beyond per-sample features to identify higher-level human-understandable concepts that apply across the entire dataset. We develop a new algorithm, ACE, to automatically extract visual concepts. Our systematic experiments demonstrate that ACE discovers concepts that are human-meaningful, coherent and important for the neural network's predictions.

Posted Content•

Bayesian Nonparametric Federated Learning of Neural Networks

Mikhail Yurochkin¹, Mayank Agarwal², Soumya Ghosh¹, Kristjan Greenewald¹, Trong Nghia Hoang¹, Yasaman Khazaeni¹•Institutions (2)

IBM¹, University of Massachusetts Amherst²

28 May 2019-arXiv: Machine Learning

TL;DR: In this article, a Bayesian nonparametric framework for federated learning with neural networks is proposed, where each data server is assumed to provide local neural network weights, which are modeled through the framework.

Abstract: In federated learning problems, data is scattered across different servers and exchanging or pooling it is often impractical or prohibited. We develop a Bayesian nonparametric framework for federated learning with neural networks. Each data server is assumed to provide local neural network weights, which are modeled through our framework. We then develop an inference approach that allows us to synthesize a more expressive global network without additional supervision, data pooling and with as few as a single communication round. We then demonstrate the efficacy of our approach on federated learning problems simulated from two popular image classification datasets.

Journal Article•DOI•

Dying ReLU and Initialization: Theory and Numerical Examples

Lu Lu¹, Yeonjong Shin², Yanhui Su³, George Em Karniadakis²•Institutions (3)

Massachusetts Institute of Technology¹, Brown University², Fuzhou University³

15 Mar 2019-arXiv: Machine Learning

TL;DR: In this article, the authors prove that a deep ReLU network will eventually die in probability as the depth goes to infinity, and propose a randomized asymmetric initialization procedure that can effectively prevent the dying ReLU.

Abstract: The dying ReLU refers to the problem when ReLU neurons become inactive and only output 0 for any input. There are many empirical and heuristic explanations of why ReLU neurons die. However, little is known about its theoretical analysis. In this paper, we rigorously prove that a deep ReLU network will eventually die in probability as the depth goes to infinity. Several methods have been proposed to alleviate the dying ReLU. Perhaps one of the simplest treatments is to modify the initialization procedure. One common way of initializing weights and biases uses symmetric probability distributions, which suffers from the dying ReLU. We thus propose a new initialization procedure, namely, a randomized asymmetric initialization. We prove that the new initialization can effectively prevent the dying ReLU. All parameters required for the new initialization are theoretically designed. Numerical examples are provided to demonstrate the effectiveness of the new initialization procedure.

Posted Content•

Residual Flows for Invertible Generative Modeling

Ricky T. Q. Chen¹, Jens Behrmann², David Duvenaud¹, Jörn-Henrik Jacobsen¹•Institutions (2)

University of Toronto¹, University of Bremen²

06 Jun 2019-arXiv: Machine Learning

TL;DR: The resulting approach, called Residual Flows, achieves state-of-the-art performance on density estimation amongst flow-based models, and outperforms networks that use coupling blocks at joint generative and discriminative modeling.

Abstract: Flow-based generative models parameterize probability distributions through an invertible transformation and can be trained by maximum likelihood. Invertible residual networks provide a flexible family of transformations where only Lipschitz conditions rather than strict architectural constraints are needed for enforcing invertibility. However, prior work trained invertible residual networks for density estimation by relying on biased log-density estimates whose bias increased with the network's expressiveness. We give a tractable unbiased estimate of the log density using a "Russian roulette" estimator, and reduce the memory required during training by using an alternative infinite series for the gradient. Furthermore, we improve invertible residual blocks by proposing the use of activation functions that avoid derivative saturation and generalizing the Lipschitz condition to induced mixed norms. The resulting approach, called Residual Flows, achieves state-of-the-art performance on density estimation amongst flow-based models, and outperforms networks that use coupling blocks at joint generative and discriminative modeling.
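
The "Russian roulette" idea is to truncate an infinite series at a random point and reweight the surviving terms so the result stays unbiased. The sketch below applies it to a scalar stand-in, the power series of $\log(1 + x)$, rather than to the log-density of a network; the geometric stopping rule, the continuation probability, and the sample count are assumptions.

```python
import math
import random


def russian_roulette_sum(term, continue_prob, rng):
    """Unbiased estimate of sum_{k>=1} term(k) from a randomly truncated series.

    After each term the sum continues with probability continue_prob, so term k
    is reached with probability continue_prob ** (k - 1) and is divided by that.
    """
    total = 0.0
    k = 1
    survival = 1.0  # probability that term k is reached
    while True:
        total += term(k) / survival
        if rng.random() >= continue_prob:
            return total
        k += 1
        survival *= continue_prob


def log1p_term(x):
    """k-th term of log(1 + x) = sum_{k>=1} (-1)^(k+1) * x^k / k."""
    return lambda k: (-1.0) ** (k + 1) * x ** k / k


rng = random.Random(0)
samples = [russian_roulette_sum(log1p_term(0.5), 0.5, rng) for _ in range(20000)]
estimate = sum(samples) / len(samples)
print(estimate, math.log(1.5))
```

Each sample evaluates only a handful of terms, yet the average matches the full series in expectation, unlike a fixed truncation, which would carry a bias.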

Posted Content•

Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning

[...]

Daniel Kuhn¹, Peyman Mohajerin Esfahani², Viet Anh Nguyen¹, Soroosh Shafieezadeh-Abadeh•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Delft University of Technology²

23 Aug 2019-arXiv: Machine Learning

TL;DR: This tutorial argues that Wasserstein distributionally robust optimization has interesting ramifications for statistical learning and motivates new approaches for fundamental learning tasks such as classification, regression, maximum likelihood estimation or minimum mean square error estimation, among others.

Abstract: Many decision problems in science, engineering and economics are affected by uncertain parameters whose distribution is only indirectly observable through samples. The goal of data-driven decision-making is to learn a decision from finitely many training samples that will perform well on unseen test samples. This learning task is difficult even if all training and test samples are drawn from the same distribution---especially if the dimension of the uncertainty is large relative to the training sample size. Wasserstein distributionally robust optimization seeks data-driven decisions that perform well under the most adverse distribution within a certain Wasserstein distance from a nominal distribution constructed from the training samples. In this tutorial we will argue that this approach has many conceptual and computational benefits. Most prominently, the optimal decisions can often be computed by solving tractable convex optimization problems, and they enjoy rigorous out-of-sample and asymptotic consistency guarantees. We will also show that Wasserstein distributionally robust optimization has interesting ramifications for statistical learning and motivates new approaches for fundamental learning tasks such as classification, regression, maximum likelihood estimation or minimum mean square error estimation, among others.
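
The decision problem described in this abstract can be written compactly. The notation below is an assumption introduced here, not taken from the listing: $\theta$ is the decision, $\ell$ the loss, $W$ a Wasserstein distance, $\varepsilon$ the radius of the ambiguity set, and the nominal distribution is taken to be the empirical distribution of the training samples $\widehat{\xi}_1, \dots, \widehat{\xi}_N$.

```latex
\min_{\theta \in \Theta} \;
\sup_{Q \,:\, W(Q, \widehat{P}_N) \le \varepsilon} \;
\mathbb{E}_{\xi \sim Q}\big[ \ell(\theta, \xi) \big],
\qquad
\widehat{P}_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{\widehat{\xi}_i}
```

Setting $\varepsilon = 0$ recovers ordinary empirical risk minimization, while a larger $\varepsilon$ hedges against distributions further from the training samples.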

Posted Content•

Explanations can be manipulated and geometry is to blame

[...]

Ann-Kathrin Dombrowski¹, Maximilian Alber², Christopher J. Anders², Marcel Ackermann, Klaus-Robert Müller³, Pan Kessel²•Institutions (3)

RWTH Aachen University¹, Technical University of Berlin², Max Planck Society³

19 Jun 2019-arXiv: Machine Learning

TL;DR: It is shown that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant, and theoretically this phenomenon can be related to certain geometrical properties of neural networks.

Abstract: Explanation methods aim to make neural networks more trustworthy and interpretable. In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. Namely, we show that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant. We establish theoretically that this phenomenon can be related to certain geometrical properties of neural networks. This allows us to derive an upper bound on the susceptibility of explanations to manipulations. Based on this result, we propose effective mechanisms to enhance the robustness of explanations.

Posted Content•

Transfer learning enhanced physics informed neural network for phase-field modeling of fracture

[...]

Somdatta Goswami¹, Cosmin Anitescu¹, Souvik Chakraborty², Souvik Chakraborty³, Timon Rabczuk¹•Institutions (3)

Bauhaus University, Weimar¹, University of Notre Dame², University of British Columbia³

04 Jul 2019-arXiv: Machine Learning

TL;DR: The proposed PINN algorithm uses the concept of transfer learning to obtain the crack path in an efficient manner and is found to yield better accuracy compared to conventional residual based PINN algorithms.

Abstract: We present a new physics informed neural network (PINN) algorithm for solving brittle fracture problems. While most of the PINN algorithms available in the literature minimize the residual of the governing partial differential equation, the proposed approach takes a different path by minimizing the variational energy of the system. Additionally, we modify the neural network output such that the boundary conditions associated with the problem are exactly satisfied. Compared to conventional residual based PINN, the proposed approach has two major advantages. First, the imposition of boundary conditions is relatively simpler and more robust. Second, the order of derivatives present in the functional form of the variational energy is of lower order than in the residual form. Hence, training the network is faster. To compute the total variational energy of the system, an efficient scheme that takes as input a geometry described by spline based CAD model and employs Gauss quadrature rules for numerical integration has been proposed. Moreover, we utilize the concept of transfer learning to obtain the crack path in an efficient manner. The proposed approach is used to solve four fracture mechanics problems. For all the examples, results obtained using the proposed approach match closely with the results available in the literature. For the first two examples, we compare the results obtained using the proposed approach with the conventional residual based neural network results. For both the problems, the proposed approach is found to yield better accuracy compared to conventional residual based PINN algorithms.

Posted Content•

Adversarial Robustness through Local Linearization

[...]

Chongli Qin¹, James Martens¹, Sven Gowal¹, Dilip Krishnan¹, Krishnamurthy Dvijotham¹, Alhussein Fawzi¹, Soham De², Robert Stanforth¹, Pushmeet Kohli¹•Institutions (2)

Google¹, University of Maryland, College Park²

04 Jul 2019-arXiv: Machine Learning

TL;DR: A novel regularizer is introduced that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. Extensive experiments on CIFAR-10 and ImageNet show that models trained with this regularizer avoid gradient obfuscation and can be trained significantly faster than with adversarial training.

Abstract: Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under attacks that are stronger. This is often attributed to the phenomenon of gradient obfuscation; such models have a highly non-linear loss surface in the vicinity of training examples, making it hard for gradient-based attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR-10 and ImageNet, that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack. Additionally, we match state of the art results for CIFAR-10 at 8/255.
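
The abstract does not state the regularizer itself, so the sketch below only illustrates what "behaving linearly in the vicinity of a point" can mean. It measures the largest gap between a one-dimensional loss and its first-order Taylor approximation over a small interval. The function name `linearity_gap`, the grid search over perturbations, and the scalar setting are assumptions for illustration, not the paper's method.

```python
def linearity_gap(loss, grad, x, eps, steps=101):
    """Largest |loss(x + d) - loss(x) - d * grad(x)| over a grid of d in [-eps, eps]."""
    worst = 0.0
    for i in range(steps):
        d = -eps + 2.0 * eps * i / (steps - 1)  # perturbation on an even grid
        gap = abs(loss(x + d) - loss(x) - d * grad(x))
        worst = max(worst, gap)
    return worst

# A linear loss has (numerically) zero gap; a quadratic loss has gap about eps**2.
print(linearity_gap(lambda v: 3.0 * v + 1.0, lambda v: 3.0, x=1.0, eps=0.1))
print(linearity_gap(lambda v: v * v, lambda v: 2.0 * v, x=1.0, eps=0.1))
```

A penalty on a quantity of this kind would push the loss towards its linear approximation near each training point, which is the behaviour the abstract describes.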

Posted Content•

Transferable Clean-Label Poisoning Attacks on Deep Neural Nets

[...]

Chen Zhu, W. Ronny Huang, Ali Shafahi, Hengduo Li, Gavin Taylor, Christoph Studer, Tom Goldstein

15 May 2019-arXiv: Machine Learning

TL;DR: A new "polytope attack" is proposed in which poison images are designed to surround the targeted image in feature space, and it is demonstrated that using Dropout during poison creation helps to enhance transferability of this attack.

Abstract: Clean-label poisoning attacks inject innocuous looking (and "correctly" labeled) poison images into training data, causing a model to misclassify a targeted image after being trained on this data. We consider transferable poisoning attacks that succeed without access to the victim network's outputs, architecture, or (in some cases) training data. To achieve this, we propose a new "polytope attack" in which poison images are designed to surround the targeted image in feature space. We also demonstrate that using Dropout during poison creation helps to enhance transferability of this attack. We achieve transferable attack success rates of over 50% while poisoning only 1% of the training set.

Posted Content•

Adversarial Robustness as a Prior for Learned Representations

[...]

Logan Engstrom¹, Andrew Ilyas¹, Shibani Santurkar¹, Dimitris Tsipras¹, Brandon Tran¹, Aleksander Madry¹•Institutions (1)

Massachusetts Institute of Technology¹

25 Sep 2019-arXiv: Machine Learning

TL;DR: This work shows that robust optimization can be re-cast as a tool for enforcing priors on the features learned by deep neural networks, and indicates adversarial robustness as a promising avenue for improving learned representations.

Abstract: An important goal in deep learning is to learn versatile, high-level feature representations of input data. However, standard networks' representations seem to possess shortcomings that, as we illustrate, prevent them from fully realizing this goal. In this work, we show that robust optimization can be re-cast as a tool for enforcing priors on the features learned by deep neural networks. It turns out that representations learned by robust models address the aforementioned shortcomings and make significant progress towards learning a high-level encoding of inputs. In particular, these representations are approximately invertible, while allowing for direct visualization and manipulation of salient input features. More broadly, our results indicate adversarial robustness as a promising avenue for improving learned representations. Our code and models for reproducing these results are available at this https URL.

Posted Content•

On the Inductive Bias of Neural Tangent Kernels

[...]

Alberto Bietti¹, Julien Mairal¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

29 May 2019-arXiv: Machine Learning

TL;DR: In this article, the authors study the inductive bias of learning in such a regime by analyzing the neural tangent kernel and the corresponding function space (RKHS), and compare to other known kernels for similar architectures.

Abstract: State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel. We study the inductive bias of learning in such a regime by analyzing this kernel and the corresponding function space (RKHS). In particular, we study smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compare to other known kernels for similar architectures.

Posted Content•

Label-Consistent Backdoor Attacks

[...]

Alexander Turner¹, Dimitris Tsipras¹, Aleksander Madry¹•Institutions (1)

Massachusetts Institute of Technology¹

05 Dec 2019-arXiv: Machine Learning

TL;DR: This work leverages adversarial perturbations and generative models to execute efficient, yet label-consistent, backdoor attacks, based on injecting inputs that appear plausible, yet are hard to classify, hence causing the model to rely on the (easier-to-learn) backdoor trigger.

Abstract: Deep neural networks have been demonstrated to be vulnerable to backdoor attacks. Specifically, by injecting a small number of maliciously constructed inputs into the training set, an adversary is able to plant a backdoor into the trained model. This backdoor can then be activated during inference by a backdoor trigger to fully control the model's behavior. While such attacks are very effective, they crucially rely on the adversary injecting arbitrary inputs that are---often blatantly---mislabeled. Such samples would raise suspicion upon human inspection, potentially revealing the attack. Thus, for backdoor attacks to remain undetected, it is crucial that they maintain label-consistency---the condition that injected inputs are consistent with their labels. In this work, we leverage adversarial perturbations and generative models to execute efficient, yet label-consistent, backdoor attacks. Our approach is based on injecting inputs that appear plausible, yet are hard to classify, hence causing the model to rely on the (easier-to-learn) backdoor trigger.

Posted Content•

Unlabeled Data Improves Adversarial Robustness

[...]

Yair Carmon¹, Aditi Raghunathan¹, Ludwig Schmidt², Percy Liang¹, John C. Duchi¹•Institutions (2)

Stanford University¹, University of California, Berkeley²

31 May 2019-arXiv: Machine Learning

TL;DR: This paper shows, theoretically and empirically, that unlabeled data bridges the sample complexity gap between standard and robust classification: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy.

Abstract: We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.
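
The self-training step named in this abstract can be sketched on a toy Gaussian model. The sketch covers only plain (non-robust) self-training with a mean-based linear estimator. The dimension, the mean vector `mu`, the sample sizes, and all function names are assumptions for illustration, and the adversarial training used in the paper's experiments is not reproduced here.

```python
import random

def sign(v):
    return 1 if v >= 0 else -1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_direction(xs, ys):
    """Average of y_i * x_i: a simple linear estimator for the symmetric Gaussian model."""
    d = len(xs[0])
    return [sum(y * x[j] for x, y in zip(xs, ys)) / len(xs) for j in range(d)]

def self_train(labeled_x, labeled_y, unlabeled_x):
    theta0 = mean_direction(labeled_x, labeled_y)         # step 1: fit on labeled data only
    pseudo = [sign(dot(theta0, x)) for x in unlabeled_x]  # step 2: pseudo-label unlabeled data
    theta1 = mean_direction(unlabeled_x, pseudo)          # step 3: refit on pseudo-labeled data
    return theta0, pseudo, theta1

def sample(rng, mu, n):
    """Draw n points with label y = +/-1 and features x = y * mu + standard Gaussian noise."""
    ys = [rng.choice([-1, 1]) for _ in range(n)]
    xs = [[y * m + rng.gauss(0.0, 1.0) for m in mu] for y in ys]
    return xs, ys

rng = random.Random(0)
mu = [0.5] * 20
lab_x, lab_y = sample(rng, mu, 10)   # few labeled examples
unl_x, _ = sample(rng, mu, 500)      # many examples whose labels are discarded
theta0, pseudo, theta1 = self_train(lab_x, lab_y, unl_x)
print(dot(theta0, mu), dot(theta1, mu))
```

The refit estimator uses 500 points instead of 10, which is the mechanism by which unlabeled data can reduce the estimation error of the final classifier in this toy setting.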

Posted Content•

Benign Overfitting in Linear Regression

[...]

Peter L. Bartlett¹, Philip M. Long², Gábor Lugosi³, Alexander Tsigler¹•Institutions (3)

University of California, Berkeley¹, Google², Pompeu Fabra University³

26 Jun 2019-arXiv: Machine Learning

TL;DR: In this paper, a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy is given, in terms of two notions of the effective rank of the data covariance.

Abstract: The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lies in an infinite dimensional space versus when the data lies in a finite dimensional space whose dimension grows faster than the sample size.
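
For reference, the minimum norm interpolating rule mentioned above has a standard closed form. The notation is an assumption introduced here: $X \in \mathbb{R}^{n \times p}$ is the design matrix with more parameters than samples ($p > n$), $y \in \mathbb{R}^n$ is the response vector, and $X X^{\top}$ is assumed invertible.

```latex
\hat{\theta}
\;=\; \arg\min_{\theta} \big\{ \|\theta\|_2 \;:\; X\theta = y \big\}
\;=\; X^{\top} \big( X X^{\top} \big)^{-1} y
```

Because $X \hat{\theta} = y$ exactly, this estimator fits the noise in the training labels perfectly, which is why its prediction accuracy is the object of study.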

Posted Content•

Contrastive Learning of Structured World Models

[...]

Thomas Kipf¹, Elise van der Pol¹, Max Welling¹•Institutions (1)

University of Amsterdam¹

27 Nov 2019-arXiv: Machine Learning

TL;DR: These experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.

Abstract: A structured understanding of our world in terms of objects, relations, and hierarchies is an important component of human cognition. Learning such a structured world model from raw sensory data remains a challenge. As a step towards this goal, we introduce Contrastively-trained Structured World Models (C-SWMs). C-SWMs utilize a contrastive approach for representation learning in environments with compositional structure. We structure each state embedding as a set of object representations and their relations, modeled by a graph neural network. This allows objects to be discovered from raw pixel observations without direct supervision as part of the learning process. We evaluate C-SWMs on compositional environments involving multiple interacting objects that can be manipulated independently by an agent, simple Atari games, and a multi-object physics simulation. Our experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.
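
The abstract names a contrastive approach without giving the objective, so the sketch below shows one common hinge-style contrastive loss for a latent transition model rather than the paper's exact loss. The function names, the additive transition `z + delta`, the squared Euclidean energy, and the `margin` parameter are all assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_loss(z, delta, z_next, z_neg, margin=1.0):
    """Pull the predicted next state towards z_next; push a negative sample away from it."""
    predicted = [zi + di for zi, di in zip(z, delta)]      # assumed additive latent transition
    positive = sq_dist(predicted, z_next)                  # small when the prediction is right
    negative = max(0.0, margin - sq_dist(z_neg, z_next))   # zero once the negative is far enough
    return positive + negative

# Perfect prediction with a distant negative gives zero loss.
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [3.0, 0.0]))
# A negative sample that sits too close to the true next state is penalized.
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.5]))
```

A loss of this shape operates entirely in the embedding space, so no pixel reconstruction term is needed, which matches the contrast the abstract draws with reconstruction-based models.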
