Showing papers by "Michael I. Jordan published in 2018"


Journal ArticleDOI
TL;DR: Single-cell variational inference (scVI) is a ready-to-use generative deep learning tool for large-scale single-cell RNA-seq data that enables raw data processing and a wide range of rapid and accurate downstream analyses.
Abstract: Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells ( https://github.com/YosefLab/scVI ). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.

1,052 citations
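
For orientation, a minimal usage sketch against the modern scvi-tools package (whose API postdates the 2018 release and may differ from it; the synthetic counts and the "batch" column are placeholders):

    # Hedged sketch: current scvi-tools API, not the original 2018 code.
    import numpy as np
    import anndata
    import scvi

    counts = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    adata = anndata.AnnData(counts)                     # toy count matrix
    adata.obs["batch"] = np.random.choice(["a", "b"], 200)

    scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
    model = scvi.model.SCVI(adata, n_latent=10)         # deep generative model
    model.train(max_epochs=5)                           # stochastic optimization
    latent = model.get_latent_representation()          # batch-corrected embedding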


Proceedings Article
01 Jan 2018
TL;DR: Conditional domain adversarial networks (CDANs) as discussed by the authors are designed with two novel conditioning strategies: multilinear conditioning that captures the cross-covariance between feature representations and classifier predictions to improve the discriminability, and entropy conditioning that controls the uncertainty of classifier prediction to guarantee the transferability.
Abstract: Adversarial learning has been embedded into deep networks to learn disentangled and transferable representations for domain adaptation. Existing adversarial domain adaptation methods may struggle to align different domains of multimodal distributions that are native in classification problems. In this paper, we present conditional adversarial domain adaptation, a principled framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks (CDANs) are designed with two novel conditioning strategies: multilinear conditioning that captures the cross-covariance between feature representations and classifier predictions to improve the discriminability, and entropy conditioning that controls the uncertainty of classifier predictions to guarantee the transferability. Experiments testify that the proposed approach exceeds the state-of-the-art results on five benchmark datasets.

760 citations
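
The two conditioning strategies are short to state in code. A hedged PyTorch sketch (shapes are illustrative, not the paper's exact architecture):

    import torch

    def multilinear_map(f, g):
        # Outer product of features f (B, d) and predictions g (B, c) captures
        # their cross-covariance; flattened as input to the domain discriminator.
        return torch.bmm(g.unsqueeze(2), f.unsqueeze(1)).flatten(1)   # (B, c*d)

    def entropy_weights(g, eps=1e-8):
        # Entropy conditioning: confident predictions (low entropy) receive
        # larger weight in the adversarial loss; uncertain ones are damped.
        h = -(g * (g + eps).log()).sum(dim=1)
        return 1.0 + torch.exp(-h)

    f = torch.randn(16, 256)                        # feature representations
    g = torch.softmax(torch.randn(16, 10), dim=1)   # classifier predictions
    d_input = multilinear_map(f, g)                 # conditioned discriminator input
    w = entropy_weights(g)                          # per-example loss weights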


Journal ArticleDOI
TL;DR: Nanopore direct RNA-seq is demonstrated: a highly parallel, real-time, single-molecule method that circumvents reverse transcription or amplification steps and enables the direct detection of nucleotide analogs in RNA.
Abstract: Direct sequencing of RNA molecules in real time using nanopores allows for the detection of splice variants and holds promise for profiling RNA modifications.

757 citations


Proceedings ArticleDOI
08 Oct 2018
TL;DR: Ray, as discussed in this paper, is a distributed system that implements a unified interface for expressing both task-parallel and actor-based computations, supported by a single dynamic execution engine, and employs a distributed scheduler and a distributed, fault-tolerant store to manage the control state.
Abstract: The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray--a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.

600 citations
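
Ray's unified interface is easiest to see in a toy example using its public task and actor APIs (trivial workload, not from the paper):

    import ray
    ray.init()

    @ray.remote
    def square(x):              # stateless task-parallel computation
        return x * x

    @ray.remote
    class Counter:              # stateful actor-based computation
        def __init__(self):
            self.n = 0
        def incr(self):
            self.n += 1
            return self.n

    futures = [square.remote(i) for i in range(4)]   # scheduled by Ray
    counter = Counter.remote()
    print(ray.get(futures), ray.get(counter.incr.remote()))  # [0, 1, 4, 9] 1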


Proceedings ArticleDOI
01 Jun 2018
TL;DR: In this article, a Selective Adversarial Network (SAN) is proposed that simultaneously circumvents negative transfer by selecting out the outlier source classes and promotes positive transfer by maximally matching the data distributions in the shared label space.
Abstract: Adversarial learning has been successfully embedded into deep networks to learn transferable features, which reduce distribution discrepancy between the source and target domains. Existing domain adversarial networks assume fully shared label space across domains. In the presence of big data, there is strong motivation of transferring both classification and representation models from existing large-scale domains to unknown small-scale domains. This paper introduces partial transfer learning, which relaxes the shared label space assumption to that the target label space is only a subspace of the source label space. Previous methods typically match the whole source domain to the target domain, which are prone to negative transfer for the partial transfer problem. We present Selective Adversarial Network (SAN), which simultaneously circumvents negative transfer by selecting out the outlier source classes and promotes positive transfer by maximally matching the data distributions in the shared label space. Experiments demonstrate that our models exceed state-of-the-art results for partial transfer learning tasks on several benchmark datasets.

299 citations
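
One way to see the class-weighting intuition (SAN itself uses per-class domain discriminators; this reduction is only a hedged sketch):

    import torch

    def source_class_weights(target_logits):
        # Average the classifier's predictions over target data; outlier source
        # classes that the target never predicts get near-zero weight, which is
        # what suppresses negative transfer in the partial setting.
        probs = torch.softmax(target_logits, dim=1)
        w = probs.mean(dim=0)
        return w / w.max()

    target_logits = torch.randn(64, 31)       # e.g., an Office-31 label space
    w = source_class_weights(target_logits)   # one weight per source class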


Proceedings Article
21 Feb 2018
TL;DR: In this article, instancewise feature selection is introduced as a methodology for model interpretation, which is based on learning a function to extract a subset of features that are most informative for each given example.
Abstract: We introduce instancewise feature selection as a methodology for model interpretation. Our method is based on learning a function to extract a subset of features that are most informative for each given example. This feature selector is trained to maximize the mutual information between selected features and the response variable, where the conditional distribution of the response variable given the input is the model to be explained. We develop an efficient variational approximation to the mutual information, and show the effectiveness of our method on a variety of synthetic and real data sets using both quantitative metrics and human evaluation.

257 citations
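
A hedged sketch of one training step (linear networks and sizes are placeholders; the relaxed top-k sampling follows the standard Gumbel-softmax construction for subset selection):

    import torch
    import torch.nn as nn

    d, k, tau = 20, 5, 0.5
    explainer = nn.Linear(d, d)           # one selection logit per feature
    approximator = nn.Linear(d, 2)        # variational approx. to the model

    def sample_mask(logits):
        # Relaxed top-k: take k independent Gumbel-softmax draws and max them.
        g = -torch.log(-torch.log(torch.rand(k, *logits.shape)))
        s = torch.softmax((logits + g) / tau, dim=-1)
        return s.max(dim=0).values        # soft mask in [0, 1]^d

    x = torch.randn(32, d)
    y = torch.randint(0, 2, (32,))        # predictions of the model to explain
    mask = sample_mask(explainer(x))
    loss = nn.functional.cross_entropy(approximator(x * mask), y)
    loss.backward()                       # maximizes a variational MI bound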


Proceedings Article
01 Jan 2018
TL;DR: In this article, the authors show that Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps.
Abstract: Model-free reinforcement learning (RL) algorithms directly parameterize and update value functions or policies, bypassing the modeling of the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that they require large numbers of samples to learn. The theoretical question of whether or not model-free algorithms are in fact \emph{sample efficient} is one of the most fundamental questions in RL. The problem is unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$ where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. Our regret matches the optimal regret up to a single $\sqrt{H}$ factor. Thus we establish the sample efficiency of a classical model-free approach. Moreover, to the best of our knowledge, this is the first model-free analysis to establish $\sqrt{T}$ regret \emph{without} requiring access to a ``simulator.''

243 citations
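
The core update is short; a hedged tabular sketch following the paper's step-size and bonus schedules (constants are illustrative):

    import numpy as np

    H, S, A, T = 10, 5, 2, 10_000      # horizon, states, actions, total steps
    c, p = 1.0, 0.05                   # bonus constant, failure probability
    Q = np.full((H, S, A), float(H))   # optimistic initialization
    N = np.zeros((H, S, A))            # visit counts

    def q_ucb_update(h, s, a, r, s_next):
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)                              # paper's learning rate
        bonus = c * np.sqrt(H**3 * np.log(S * A * T / p) / t)  # UCB bonus
        v_next = Q[h + 1, s_next].max() if h + 1 < H else 0.0
        Q[h, s, a] += alpha * (r + v_next + bonus - Q[h, s, a])

    q_ucb_update(0, 0, 1, 1.0, 2)      # one (step, state, action) visit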


Journal Article
Abstract: EPFL record EPFL-REPORT-229237; available at https://arxiv.org/abs/1611.02189

233 citations


Journal ArticleDOI
TL;DR: In this paper, the tradeoff between privacy guarantees and the risk of the resulting statistical estimators is studied under a model of privacy in which data remain private even from the statistician.
Abstract: Working under a model of privacy in which data remain private even from the statistician, we study the tradeoff between privacy guarantees and the risk of the resulting statistical estimators. We develop private versions of classical information-theoretical bounds, in particular those due to Le Cam, Fano, and Assouad. These inequalities allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures. We provide a treatment of several canonical families of problems: mean estimation and median estimation, generalized linear models, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds. Additionally, we present a variety of experimental results for estimation problems involving sensitive data.

232 citations
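
A hedged sketch of the local model for mean estimation, where each individual perturbs their own datum before it ever reaches the statistician (the Laplace mechanism here is a standard instance, not the paper's full optimal machinery):

    import numpy as np

    def privatize(x, eps, B=1.0):
        # x in [-B, B]; Laplace noise with scale 2B/eps makes each released
        # value eps-locally differentially private.
        return x + np.random.laplace(scale=2 * B / eps, size=x.shape)

    data = np.clip(np.random.normal(0.3, 0.2, size=10_000), -1, 1)
    for eps in (0.5, 1.0, 4.0):
        print(eps, privatize(data, eps).mean())   # error grows as eps shrinks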


Posted Content
TL;DR: By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, this work improves value estimation, which, in turn, reduces the sample complexity of learning.
Abstract: Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
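
The fixed-depth target is the heart of the method. A hedged sketch (all function arguments are placeholder callables, not the paper's code):

    def mve_target(s, model, policy, value, reward, gamma=0.99, H=3):
        # Imagine exactly H steps with the learned dynamics model, then
        # bootstrap with the value function; the fixed depth is what
        # controls for model uncertainty.
        ret, discount = 0.0, 1.0
        for _ in range(H):
            a = policy(s)
            ret += discount * reward(s, a)
            s = model(s, a)
            discount *= gamma
        return ret + discount * value(s)

    # toy usage with scalar dynamics
    v = mve_target(1.0, model=lambda s, a: 0.9 * s + a, policy=lambda s: 0.1,
                   value=lambda s: -s, reward=lambda s, a: -s * s)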

Proceedings Article
01 Jan 2018
TL;DR: This paper proposes a novel Deep Calibration Network (DCN) approach towards this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and uncertainty of target classes.
Abstract: A technical challenge of deep learning is recognizing target classes without seen data. Zero-shot learning leverages semantic representations such as attributes or class prototypes to bridge source and target classes. Existing standard zero-shot learning methods may be prone to overfitting the seen data of source classes, as they are blind to the semantic representations of target classes. In this paper, we study generalized zero-shot learning, which assumes that target classes of unseen data are accessible during training, and in which prediction on unseen data is made by searching over both source and target classes. We propose a novel Deep Calibration Network (DCN) approach to this generalized zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes. Our approach maps visual features of images and semantic representations of class prototypes to a common embedding space such that the compatibility of seen data with both source and target classes is maximized. We show superior accuracy of our approach over the state of the art on benchmark datasets for generalized zero-shot learning, including AwA, CUB, SUN, and aPY.
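
A hedged sketch of the calibration objective: confidence (cross-entropy) on source classes plus an uncertainty (entropy) term on target classes, computed from compatibility scores in the shared embedding space (shapes and the weight lam are illustrative):

    import torch
    import torch.nn.functional as F

    def dcn_loss(scores_src, scores_tgt, labels, lam=0.1):
        # scores_*: compatibilities of embedded images with source/target
        # class prototypes in the common embedding space.
        confidence = F.cross_entropy(scores_src, labels)          # seen classes
        p_tgt = torch.softmax(scores_tgt, dim=1)
        uncertainty = -(p_tgt * p_tgt.clamp_min(1e-8).log()).sum(1).mean()
        return confidence + lam * uncertainty

    loss = dcn_loss(torch.randn(32, 40), torch.randn(32, 10),
                    torch.randint(0, 40, (32,)))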

Proceedings Article
03 Jul 2018
TL;DR: In this paper, the authors show that the OLS estimator attains nearly minimax optimal performance for the identification of linear dynamical systems from a single observed trajectory, using a generalization of Mendelson's small-ball method to dependent data, eschewing the use of standard mixing-time arguments.
Abstract: We prove that the ordinary least-squares (OLS) estimator attains nearly minimax optimal performance for the identification of linear dynamical systems from a single observed trajectory. Our upper bound relies on a generalization of Mendelson's small-ball method to dependent data, eschewing the use of standard mixing-time arguments. Our lower bounds reveal that these upper bounds match up to logarithmic factors. In particular, we capture the correct signal-to-noise behavior of the problem, showing that more unstable linear systems are easier to estimate. This behavior is qualitatively different from arguments which rely on mixing-time calculations that suggest that unstable systems are more difficult to estimate. We generalize our technique to provide bounds for a more general class of linear response time-series.
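
The estimator itself is one line of linear algebra. A hedged simulation sketch (dimensions and the slightly unstable A are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0.9, 0.2], [0.0, 1.1]])      # one eigenvalue outside the unit circle
    X = np.zeros((500, 2))
    for t in range(499):
        X[t + 1] = A @ X[t] + rng.normal(size=2)   # x_{t+1} = A x_t + w_t

    # Regress x_{t+1} on x_t over the single trajectory.
    A_hat = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T
    print(np.linalg.norm(A_hat - A))             # unstable modes are estimated easily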

Posted Content
TL;DR: Both overdamped and underdamped Langevin MCMC are studied, and upper bounds are established on the number of steps required to obtain a sample from a distribution that is within $\epsilon$ of $p^*$ in $1$-Wasserstein distance.
Abstract: We study the problem of sampling from a distribution $p^*(x) \propto \exp\left(-U(x)\right)$, where the function $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$, but potentially nonconvex inside this ball. We study both overdamped and underdamped Langevin MCMC and establish upper bounds on the number of steps required to obtain a sample from a distribution that is within $\epsilon$ of $p^*$ in $1$-Wasserstein distance. For the first-order method (overdamped Langevin MCMC), the iteration complexity is $\tilde{\mathcal{O}}\left(e^{cLR^2}d/\epsilon^2\right)$, where $d$ is the dimension of the underlying space. For the second-order method (underdamped Langevin MCMC), the iteration complexity is $\tilde{\mathcal{O}}\left(e^{cLR^2}\sqrt{d}/\epsilon\right)$ for an explicit positive constant $c$. Surprisingly, the iteration complexity for both these algorithms is only polynomial in the dimension $d$ and the target accuracy $\epsilon$. It is exponential, however, in the problem parameter $LR^2$, which is a measure of non-log-concavity of the target distribution.
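
The overdamped (first-order) iteration is a single line. A hedged sketch on a double-well potential, which is nonconvex inside a ball as in the paper's setting:

    import numpy as np

    def langevin_step(x, grad_U, eta, rng):
        # x_{k+1} = x_k - eta * grad U(x_k) + sqrt(2 eta) * N(0, I)
        return x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.normal(size=x.shape)

    grad_U = lambda x: 4 * x * (x**2 - 1)     # U(x) = (x^2 - 1)^2
    rng, x, samples = np.random.default_rng(0), np.zeros(1), []
    for _ in range(5000):
        x = langevin_step(x, grad_U, eta=0.01, rng=rng)
        samples.append(x[0])                  # approximate draws from exp(-U)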

Proceedings Article
03 Jul 2018
TL;DR: In this article, a simple variant of Nesterov's accelerated gradient descent (AGD) was shown to achieve faster convergence rate than GD in the nonconvex setting.
Abstract: Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
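
A hedged sketch of the Hamiltonian bookkeeping (the momentum reset below stands in for the paper's negative-curvature-exploitation step; parameters are illustrative):

    import numpy as np

    def agd_with_hamiltonian(f, grad, x, eta=1e-3, theta=0.1, iters=1000):
        # Monitor the Hamiltonian f(x) + ||v||^2 / (2 eta); plain AGD steps
        # decrease it, and a momentum reset is a simplified response when
        # they fail to (the paper exploits negative curvature instead).
        v = np.zeros_like(x)
        for _ in range(iters):
            y = x + (1 - theta) * v
            x_new = y - eta * grad(y)
            v_new = x_new - x
            ham = f(x) + v @ v / (2 * eta)
            ham_new = f(x_new) + v_new @ v_new / (2 * eta)
            if ham_new > ham:
                v_new = np.zeros_like(x)      # simplified reset
            x, v = x_new, v_new
        return x

    # toy run: escapes the saddle of f(x, y) = (x^2 - 1)^2 + y^2 at the origin
    x_star = agd_with_hamiltonian(lambda z: (z[0]**2 - 1)**2 + z[1]**2,
                                  lambda z: np.array([4*z[0]*(z[0]**2 - 1), 2*z[1]]),
                                  np.array([1e-3, 1e-3]))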

Proceedings Article
01 Jan 2018
TL;DR: In this paper, a stochastic variant of the cubic-regularized Newton method is proposed that finds approximate local minima for general smooth, nonconvex functions in only $\tilde{O}(\epsilon^{-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations.
Abstract: This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon^{-3.5})$ stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon^{-4})$ rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.
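
The key computational primitive is the cubic-regularized subproblem, which needs only Hessian-vector products. A hedged gradient-descent subsolver sketch (step counts and rates are illustrative):

    import numpy as np

    def cubic_subsolver(g, hvp, rho, steps=100, eta=0.01):
        # minimize m(d) = g.d + 0.5 d'Hd + (rho/6)||d||^3 using only
        # Hessian-vector products hvp(d) = Hd.
        d = np.zeros_like(g)
        for _ in range(steps):
            grad_m = g + hvp(d) + 0.5 * rho * np.linalg.norm(d) * d
            d -= eta * grad_m
        return d

    H = np.diag([1.0, -0.5])                  # indefinite Hessian: a saddle
    d = cubic_subsolver(np.array([0.1, 0.0]), lambda v: H @ v, rho=1.0)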

Proceedings Article
27 Sep 2018
TL;DR: In this article, the authors study instancewise feature importance scoring as a method for model interpretation, and develop two algorithms with linear complexity for instancewise feature importance scoring on graph-structured data.
Abstract: We study instancewise feature importance scoring as a method for model interpretation. Any such method yields, for each predicted instance, a vector of importance scores associated with the feature vector. Methods based on the Shapley score have been proposed as a fair way of computing feature attributions of this kind, but incur an exponential complexity in the number of features. This combinatorial explosion arises from the definition of the Shapley value and prevents these methods from being scalable to large data sets and complex models. We focus on settings in which the data have a graph structure, and the contribution of features to the target variable is well-approximated by a graph-structured factorization. In such settings, we develop two algorithms with linear complexity for instancewise feature importance scoring. We establish the relationship of our methods to the Shapley value and another closely related concept known as the Myerson value from cooperative game theory. We demonstrate on both language and image data that our algorithms compare favorably with other methods for model interpretation.
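
A hedged sketch of the locality idea on a chain graph: score feature i using only coalitions inside its k-neighborhood, so the cost stays linear in the number of features (uniform averaging rather than the exact Shapley weights; the model below is a toy placeholder):

    from itertools import combinations

    def mask_keep(x, keep):
        return [v if j in keep else 0 for j, v in enumerate(x)]

    def local_score(model, x, i, k=2):
        # Marginal contribution of feature i averaged over subsets of its
        # k-neighborhood only (a simplification of the L-Shapley flavor).
        neigh = [j for j in range(max(0, i - k), min(len(x), i + k + 1)) if j != i]
        contribs = []
        for r in range(len(neigh) + 1):
            for S in combinations(neigh, r):
                keep = set(S)
                contribs.append(model(mask_keep(x, keep | {i})) -
                                model(mask_keep(x, keep)))
        return sum(contribs) / len(contribs)

    print(local_score(sum, [1, 2, 3, 4, 5], i=2))   # toy additive model: 3.0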

Posted Content
TL;DR: In this paper, an alternative limiting process that yields high-resolution ODEs was proposed, which can be used to distinguish between Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method.
Abstract: Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms---Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the identification of a term that we refer to as "gradient correction" that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result---that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
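
In symbols, with step size $s$ and strong-convexity parameter $\mu$, the two high-resolution ODEs differ exactly by the gradient-correction term (a sketch reconstructed from the paper's setup, to be checked against the source):

    % Polyak heavy-ball, high-resolution limit:
    \ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + (1 + \sqrt{\mu s})\,\nabla f(X(t)) = 0
    % NAG-SC: the same plus the gradient correction \sqrt{s}\,\nabla^2 f(X)\dot{X}:
    \ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + \sqrt{s}\,\nabla^2 f(X(t))\,\dot{X}(t)
                + (1 + \sqrt{\mu s})\,\nabla f(X(t)) = 0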

Posted Content
TL;DR: This paper provides a systematic methodology for converting continuous-time dynamics into discrete-time algorithms while retaining oracle rates, based on ideas from Hamiltonian dynamical systems and symplectic integration.
Abstract: Accelerated gradient methods have had significant impact in machine learning -- in particular the theoretical side of machine learning -- due to their ability to achieve oracle lower bounds. But their heuristic construction has hindered their full integration into the practical machine-learning algorithmic toolbox, and has limited their scope. In this paper we build on recent work which casts acceleration as a phenomenon best explained in continuous time, and we augment that picture by providing a systematic methodology for converting continuous-time dynamics into discrete-time algorithms while retaining oracle rates. Our framework is based on ideas from Hamiltonian dynamical systems and symplectic integration. These ideas have had major impact in many areas of applied mathematics, but have not yet been seen to have a relationship with optimization.
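
The prototypical symplectic method is the leapfrog (Stormer-Verlet) integrator for a separable Hamiltonian H(x, p) = U(x) + ||p||^2/2; a generic hedged sketch, not the paper's particular discretization:

    import numpy as np

    def leapfrog(x, p, grad_U, h, steps):
        # Symplectic: preserves the phase-space structure, so energy error
        # stays bounded over long horizons instead of drifting.
        for _ in range(steps):
            p = p - 0.5 * h * grad_U(x)   # half step in momentum
            x = x + h * p                 # full step in position
            p = p - 0.5 * h * grad_U(x)   # half step in momentum
        return x, p

    x, p = leapfrog(np.array([1.0]), np.array([0.0]), lambda x: x, h=0.1, steps=100)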

Proceedings Article
03 Jul 2018
TL;DR: In this article, the authors provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion, and control-variate underdamped Langevin diffusion.
Abstract: We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof technique combining ideas from finite-sum optimization and the analysis of sampling methods. Our sharp theoretical bounds allow us to identify regimes of interest where each method performs better than the others. Our theory is verified with experiments on real-world and synthetic datasets.
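
A hedged sketch of one epoch of the SVRG variant for a finite-sum potential U = sum_i U_i (batch size one; step size and epoch length are illustrative):

    import numpy as np

    def svrg_langevin_epoch(x, grads, eta, rng, inner=50):
        # grads: list of callables for grad U_i.
        n = len(grads)
        snapshot = x.copy()
        full_grad = sum(g(snapshot) for g in grads)
        for _ in range(inner):
            i = rng.integers(n)
            # unbiased, variance-reduced estimate of grad U near the snapshot
            g_vr = n * (grads[i](x) - grads[i](snapshot)) + full_grad
            x = x - eta * g_vr + np.sqrt(2 * eta) * rng.normal(size=x.shape)
        return x

    grads = [lambda x, a=a: x - a for a in (-1.0, 1.0)]   # U_i = (x - a)^2 / 2
    x = svrg_langevin_epoch(np.zeros(1), grads, eta=0.05, rng=np.random.default_rng(0))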

Posted Content
TL;DR: A linear approximation to the dependence of the fitting procedure on the weights is used, producing results that can be faster than repeated re-fitting by an order of magnitude and support the application of the infinitesimal jackknife to a wide variety of practical problems in machine learning.
Abstract: The error or variability of machine learning algorithms is often assessed by repeatedly re-fitting a model with different weighted versions of the observed data. The ubiquitous tools of cross-validation (CV) and the bootstrap are examples of this technique. These methods are powerful in large part due to their model agnosticism but can be slow to run on modern, large data sets due to the need to repeatedly re-fit the model. In this work, we use a linear approximation to the dependence of the fitting procedure on the weights, producing results that can be faster than repeated re-fitting by an order of magnitude. This linear approximation is sometimes known as the "infinitesimal jackknife" in the statistics literature, where it is mostly used as a theoretical tool to prove asymptotic results. We provide explicit finite-sample error bounds for the infinitesimal jackknife in terms of a small number of simple, verifiable assumptions. Our results apply whether the weights and data are stochastic or deterministic, and so can be used as a tool for proving the accuracy of the infinitesimal jackknife on a wide variety of problems. As a corollary, we state mild regularity conditions under which our approximation consistently estimates true leave-$k$-out cross-validation for any fixed $k$. These theoretical results, together with modern automatic differentiation software, support the application of the infinitesimal jackknife to a wide variety of practical problems in machine learning, providing a "Swiss Army infinitesimal jackknife". We demonstrate the accuracy of our methods on a range of simulated and real datasets.
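
For an M-estimator, the linear approximation amounts to one Newton step per left-out datum. A hedged sketch (per-datum gradients and the Hessian are assumed precomputed, e.g. by automatic differentiation):

    import numpy as np

    def loo_infinitesimal_jackknife(theta_hat, grads, hess):
        # grads: (n, p) per-datum loss gradients at theta_hat; hess: (p, p)
        # total Hessian. Linearizing in the weight w_i (from 1 to 0) gives
        # theta_{-i} ~= theta_hat + H^{-1} g_i, with no re-fitting.
        return theta_hat + np.linalg.solve(hess, grads.T).T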

Journal ArticleDOI
TL;DR: Mean-field variational Bayes (MFVB) is an approximate Bayesian posterior inference technique that is increasingly popular due to its fast runtimes on large-scale data sets.
Abstract: Mean-field Variational Bayes (MFVB) is an approximate Bayesian posterior inference technique that is increasingly popular due to its fast runtimes on large-scale data sets.

Posted Content
TL;DR: This work examines a class of nonconvex objective functions that arise in mixture modeling and multistable systems and finds that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.
Abstract: Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these two kinds of methodology, and limited understanding of relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multi-stable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

Posted Content
TL;DR: This article proposes a framework for learning representations that relies on Auto-Encoding Variational Bayes and whose search space is constrained via kernel-based measures of independence, such as the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC).
Abstract: Parameterizing the approximate posterior of a generative model with neural networks has become a common theme in recent machine learning research. While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on Auto-Encoding Variational Bayes and whose search space is constrained via kernel-based measures of independence. In particular, our method employs the $d$-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors. We show how to apply this method to a range of problems, including the problems of learning invariant representations and the learning of interpretable representations. We also present a full-fledged application to single-cell RNA sequencing (scRNA-seq). In this setting the biological signal is mixed in complex ways with sequencing errors and sampling effects. We show that our method out-performs the state-of-the-art in this domain.
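
The independence penalty reduces to a kernel statistic. A hedged sketch of the two-variable HSIC estimate with Gaussian kernels (dHSIC generalizes this to d variables; bandwidths are illustrative):

    import numpy as np

    def gaussian_gram(x, sigma=1.0):
        sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma**2))

    def hsic(x, y):
        # Biased empirical HSIC: large values indicate dependence between the
        # latent code x and the nuisance factor y; added to the VAE objective
        # as a penalty to drive them toward independence.
        n = x.shape[0]
        Hc = np.eye(n) - np.ones((n, n)) / n          # centering matrix
        K, L = gaussian_gram(x), gaussian_gram(y)
        return np.trace(K @ Hc @ L @ Hc) / (n - 1) ** 2

    print(hsic(np.random.randn(100, 5), np.random.randn(100, 1)))  # near 0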

Posted Content
TL;DR: A perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, are derived that illustrate various tradeoffs in the design of attacks.
Abstract: We present a probabilistic framework for studying adversarial attacks on discrete data. Based on this framework, we derive a perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, that illustrate various tradeoffs in the design of attacks. We demonstrate the effectiveness of these methods using both quantitative metrics and human evaluation on various state-of-the-art models for text classification, including a word-based CNN, a character-based CNN and an LSTM. As an example of our results, we show that the accuracy of character-based convolutional networks drops to the level of random selection by modifying only five characters through Greedy Attack.
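
A hedged sketch of the greedy recipe for text (the model and vocabulary are toy placeholders; the paper additionally derives its scoring from the probabilistic framework):

    def greedy_attack(tokens, model, vocab, label, budget=5, mask_tok="?"):
        # Rank positions by the drop in true-class score under masking, then
        # greedily substitute at the top-scoring positions.
        score = lambda t: model(t)[label]
        base = score(tokens)
        drops = sorted(((base - score(tokens[:i] + [mask_tok] + tokens[i+1:]), i)
                        for i in range(len(tokens))), reverse=True)
        for _, i in drops[:budget]:
            best = min(vocab, key=lambda c: score(tokens[:i] + [c] + tokens[i+1:]))
            tokens = tokens[:i] + [best] + tokens[i+1:]
        return tokens

    toy = lambda t: [t.count("a") / len(t), 1 - t.count("a") / len(t)]
    print(greedy_attack(list("banana"), toy, vocab=list("xyz"), label=0))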

Proceedings Article
03 Jul 2018
TL;DR: In this paper, a geometric framework was developed to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on a Riemannian manifold to an averaged iterate sequence with a robust and fast $O(1/n) convergence rate.
Abstract: We consider the minimization of a function defined on a Riemannian manifold $\mathcal{M}$ accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on $\mathcal{M}$ to an averaged iterate sequence with a robust and fast $O(1/n)$ convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming $k$-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.
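
A hedged sketch of the idea for streaming 1-PCA on the sphere: Riemannian SGD iterates are combined through a geodesic running average (sphere exponential/log maps; step size and data stream are illustrative, not the paper's exact framework):

    import numpy as np

    def sphere_exp(x, v):
        nv = np.linalg.norm(v)
        return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

    def sphere_log(x, y):
        p = y - np.dot(x, y) * x                  # tangent component
        npn = np.linalg.norm(p)
        if npn < 1e-12:
            return np.zeros_like(x)
        return np.arccos(np.clip(np.dot(x, y), -1, 1)) * p / npn

    rng = np.random.default_rng(0)
    C = np.diag([3.0, 1.0, 0.5])                  # covariance of the stream
    x = xbar = np.array([0.0, 1.0, 0.0])
    for n in range(1, 2001):
        a = rng.multivariate_normal(np.zeros(3), C)              # one sample
        g = -(np.eye(3) - np.outer(x, x)) @ (np.outer(a, a) @ x) # Riemannian grad
        x = sphere_exp(x, -0.05 * g)                             # SGD step
        xbar = sphere_exp(xbar, sphere_log(xbar, x) / n)         # geodesic average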

Journal ArticleDOI
TL;DR: In this article, a notion of exponential families for CRMs, called exponential CRMs, was introduced, allowing automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods.
Abstract: We demonstrate how to calculate posteriors for general Bayesian nonparametric priors and likelihoods based on completely random measures (CRMs). We further show how to represent Bayesian nonparametric priors as a sequence of finite draws using a size-biasing approach – and how to represent full Bayesian nonparametric models via finite marginals. Motivated by conjugate priors based on exponential family representations of likelihoods, we introduce a notion of exponential families for CRMs, which we call exponential CRMs. This construction allows us to specify automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods. We demonstrate that our exponential CRMs allow particularly straightforward recipes for size-biased and marginal representations of Bayesian nonparametric models. Along the way, we prove that the gamma process is a conjugate prior for the Poisson likelihood process and the beta prime process is a conjugate prior for a process we call the odds Bernoulli process. We deliver a size-biased representation of the gamma process and a marginal representation of the gamma process coupled with a Poisson likelihood process.
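
The gamma process / Poisson conjugacy established in the paper has the familiar finite-dimensional shadow (rate parametrization assumed in this hedged sketch):

    \theta \sim \mathrm{Gamma}(a, b), \qquad
    x_1, \dots, x_N \mid \theta \overset{\text{iid}}{\sim} \mathrm{Poisson}(\theta)
    \;\Longrightarrow\;
    \theta \mid x_{1:N} \sim \mathrm{Gamma}\!\Big(a + \sum_{n=1}^{N} x_n,\; b + N\Big)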

Posted Content
TL;DR: A simple algorithm based on stochastic gradient descent on a smoothed version of the empirical risk $f$ is proposed, and it achieves optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries of $f$.
Abstract: Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex nonsmooth losses (such as modern deep networks), the population risk is generally significantly more well-behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function $F$ (population risk) given only access to an approximation $f$ (empirical risk) that is pointwise close to $F$ (i.e., $\|F-f\|_{\infty} \le \nu$). Our objective is to find the $\epsilon$-approximate local minima of the underlying function $F$ while avoiding the shallow local minima---arising because of the tolerance $\nu$---which exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to achieve our goal as long as $\nu \le O(\epsilon^{1.5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries of $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
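
A hedged sketch of the smoothing device: run SGD on $f$ convolved with a perturbation $Z$, using a one-sample gradient estimate (the Gaussian perturbation and its scale here are illustrative; the paper analyzes a specific smoothing radius tied to $\epsilon$ and $d$):

    import numpy as np

    def smoothed_sgd(grad_f, x, sigma=0.1, eta=0.01, iters=1000, seed=0):
        # grad E[f(x + Z)] is estimated by grad_f(x + z) for a single draw z,
        # which averages away the shallow local minima of the empirical risk f.
        rng = np.random.default_rng(seed)
        for _ in range(iters):
            z = rng.normal(scale=sigma, size=x.shape)
            x = x - eta * grad_f(x + z)
        return x

    x_min = smoothed_sgd(lambda x: 4 * x * (x**2 - 1), np.array([0.2]))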