
Showing papers by "Michael I. Jordan" published in 2015


Journal ArticleDOI
17 Jul 2015-Science
TL;DR: The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.
Abstract: Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing.

4,545 citations


Proceedings Article
06 Jul 2015
TL;DR: A method for optimizing control policies with guaranteed monotonic improvement is described; by making several approximations to the theoretically-justified scheme, a practical algorithm called Trust Region Policy Optimization (TRPO) is developed.
Abstract: In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

3,479 citations
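
For readers who want the concrete form of the update described above, the practical TRPO step is commonly written as a surrogate-objective maximization under a KL-divergence trust region (a standard formulation, not a quote from the paper; the radius $\delta$ and the advantage estimator are implementation choices):

$$\max_{\theta}\; \mathbb{E}_{s,a\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\,A_{\pi_{\theta_{\mathrm{old}}}}(s,a)\right] \quad\text{subject to}\quad \mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\,\|\,\pi_{\theta}(\cdot\mid s)\big)\right]\le\delta.$$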


Posted Content
TL;DR: A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural networks to the domain adaptation scenario, can learn transferable features with statistical guarantees, and can scale linearly via an unbiased estimate of the kernel embedding.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural networks to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly via an unbiased estimate of the kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.

3,351 citations
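
The mean-embedding matching described in the abstract amounts to penalizing a multi-kernel maximum mean discrepancy (MMD) between source and target activations of each task-specific layer. The following numpy sketch computes a simple biased multi-kernel MMD^2 with uniform kernel weights; the bandwidths, batch shapes, and uniform weighting are illustrative assumptions (DAN instead selects kernel weights optimally), not the paper's implementation.

    import numpy as np

    def rbf_kernel(x, y, gamma):
        # Pairwise RBF (Gaussian) kernel matrix between the rows of x and y.
        sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    def multi_kernel_mmd2(source, target, gammas=(0.5, 1.0, 2.0)):
        # Biased estimate of MMD^2, averaged over several RBF bandwidths.
        total = 0.0
        for gamma in gammas:
            k_ss = rbf_kernel(source, source, gamma).mean()
            k_tt = rbf_kernel(target, target, gamma).mean()
            k_st = rbf_kernel(source, target, gamma).mean()
            total += k_ss + k_tt - 2.0 * k_st
        return total / len(gammas)

    # Hypothetical usage: activations of one task-specific layer from each domain.
    src = np.random.randn(64, 256)
    tgt = np.random.randn(64, 256)
    print(multi_kernel_mmd2(src, tgt))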


Posted Content
TL;DR: Trust Region Policy Optimization (TRPO) as mentioned in this paper is an iterative procedure for optimizing policies, with guaranteed monotonic improvement, which is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks.
Abstract: We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

3,171 citations


Proceedings Article
06 Jul 2015
TL;DR: Deep Adaptation Network (DAN) as mentioned in this paper embeds hidden representations of all task-specific layers in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural networks to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly via an unbiased estimate of the kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.

1,272 citations


Posted Content
TL;DR: The authors propose a trust region optimization procedure for both the policy and the value function, each represented by a neural network; the approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground.
Abstract: Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.

667 citations
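
The "exponentially-weighted estimator of the advantage function that is analogous to TD(lambda)" mentioned in the abstract is usually written as follows (standard notation, not quoted from the paper), where $\gamma$ is the discount factor and $\lambda\in[0,1]$ trades bias against variance:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}.$$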


Journal ArticleDOI
TL;DR: In this article, the authors consider derivative-free algorithms for stochastic and non-stochastic convex optimization problems that use only function values rather than gradients, and show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods.
Abstract: We consider derivative-free algorithms for stochastic and nonstochastic convex optimization problems that use only function values rather than gradients. Focusing on nonasymptotic bounds on convergence rates, we show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods. We establish such results for both smooth and nonsmooth cases, sharpening previous analyses that suggested a worse dimension dependence, and extend our results to the case of multiple ($m \ge 2$) evaluations. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, establishing the sharpness of our achievable results up to constant (sometimes logarithmic) factors.

342 citations
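
The "gradient estimates based on random perturbations" from pairs of function values can be sketched in a few lines. In the sketch below, the Gaussian perturbation direction, smoothing parameter, and step size are illustrative choices rather than the paper's exact scheme.

    import numpy as np

    def two_point_gradient(f, x, delta=1e-3):
        # Gradient estimate built from a pair of function values along a random direction.
        u = np.random.randn(x.size)
        return (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u

    def zeroth_order_sgd(f, x0, steps=2000, step_size=0.05):
        # Derivative-free stochastic gradient descent using only function evaluations.
        x = np.array(x0, dtype=float)
        for _ in range(steps):
            x -= step_size * two_point_gradient(f, x)
        return x

    # Hypothetical usage on a smooth convex quadratic in 10 dimensions.
    x_hat = zeroth_order_sgd(lambda x: 0.5 * np.dot(x, x), np.ones(10))
    print(np.linalg.norm(x_hat))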


Journal ArticleDOI
TL;DR: A stochastic variational inference algorithm is derived for the model, which enables efficient inference for massive collections of text documents and alleviates the rigid, single-path formulation of the nCRP.
Abstract: We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP generalizes the nested Chinese restaurant process (nCRP) to allow each word to follow its own path to a topic node according to a per-document distribution over the paths on a shared tree. This alleviates the rigid, single-path formulation assumed by the nCRP, allowing documents to easily express complex thematic borrowings. We derive a stochastic variational inference algorithm for the model, which enables efficient inference for massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 2.7 million documents from Wikipedia.

210 citations


Posted Content
TL;DR: In this article, the authors provide a new proof of the linear convergence of the alternating direction method of multipliers (ADMM) when one of the objective terms is strongly convex, based on a framework for analyzing optimization algorithms introduced in Lessard et al.
Abstract: We provide a new proof of the linear convergence of the alternating direction method of multipliers (ADMM) when one of the objective terms is strongly convex. Our proof is based on a framework for analyzing optimization algorithms introduced in Lessard et al. (2014), reducing algorithm convergence to verifying the stability of a dynamical system. This approach generalizes a number of existing results and obviates any assumptions about specific choices of algorithm parameters. On a numerical example, we demonstrate that minimizing the derived bound on the convergence rate provides a practical approach to selecting algorithm parameters for particular ADMM instances. We complement our upper bound by constructing a nearly-matching lower bound on the worst-case rate of convergence.

195 citations
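
For reference, the iteration whose worst-case rate is being bounded is the standard scaled-form ADMM for minimizing $f(x)+g(z)$ subject to $Ax+Bz=c$; the penalty parameter $\rho$ is among the algorithm parameters the derived bound can help select:

$$x^{k+1} = \arg\min_x\, f(x) + \tfrac{\rho}{2}\|Ax + Bz^{k} - c + u^{k}\|^2,\qquad z^{k+1} = \arg\min_z\, g(z) + \tfrac{\rho}{2}\|Ax^{k+1} + Bz - c + u^{k}\|^2,\qquad u^{k+1} = u^{k} + Ax^{k+1} + Bz^{k+1} - c.$$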


Proceedings Article
06 Jul 2015
TL;DR: A novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization, which allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging.
Abstract: Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization. Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to cover non-smooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of CoCoA+ on several real-world distributed datasets, especially when scaling up the number of machines.

151 citations
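
The contrast between conservative averaging and the additive combination that CoCoA+ permits can be summarized as follows, where $\Delta w_k$ is machine $k$'s local update and $K$ is the number of machines (a sketch in generic notation; in CoCoA+ the local subproblems are rescaled so that adding remains safe):

$$w \leftarrow w + \frac{1}{K}\sum_{k=1}^{K}\Delta w_k \ \ \text{(averaging)} \qquad\text{versus}\qquad w \leftarrow w + \sum_{k=1}^{K}\Delta w_k \ \ \text{(adding)}.$$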


Posted Content
TL;DR: SparkNet as mentioned in this paper is a framework for training deep networks in Spark, which includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library.
Abstract: Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
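
The "simple parallelization scheme for stochastic gradient descent" described above alternates broadcasting the model to workers, running a fixed number of local SGD steps on each data partition, and averaging the resulting parameters. The sketch below simulates this on a toy least-squares problem; the function names, loss, and iteration counts are illustrative assumptions, not SparkNet's actual API (in Spark the per-partition loop would be a mapPartitions call against Caffe).

    import numpy as np

    def local_sgd(w, data, iters=50, lr=0.01):
        # A fixed number of SGD steps on one worker's partition (toy least-squares loss).
        X, y = data
        for _ in range(iters):
            i = np.random.randint(len(y))
            grad = (X[i] @ w - y[i]) * X[i]
            w = w - lr * grad
        return w

    def parallel_train(partitions, w0, rounds=20):
        # Broadcast weights, run local SGD on every partition, then average the results.
        w = w0
        for _ in range(rounds):
            local_ws = [local_sgd(w.copy(), part) for part in partitions]  # parallel on a cluster
            w = np.mean(local_ws, axis=0)
        return w

    # Hypothetical usage with two simulated worker partitions.
    X, y = np.random.randn(200, 5), np.random.randn(200)
    parts = [(X[:100], y[:100]), (X[100:], y[100:])]
    w_final = parallel_train(parts, np.zeros(5))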

Proceedings ArticleDOI
27 Aug 2015
TL;DR: An architecture for automatic machine learning at scale comprised of a cost-based cluster resource allocation estimator, advanced hyper-parameter tuning techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching and optimal resource allocation is proposed.
Abstract: The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. A major obstacle to supporting these predictive applications is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, limiting their usefulness for applications driven by large-scale datasets. In this work, we build upon these recent efforts and propose an architecture for automatic machine learning at scale comprised of a cost-based cluster resource allocation estimator, advanced hyper-parameter tuning techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching and optimal resource allocation. The result is TuPAQ, a component of the MLbase system that automatically finds and trains models for a user's predictive application with comparable quality to those found using exhaustive strategies, but an order of magnitude more efficiently than the standard baseline approach. TuPAQ scales to models trained on terabytes of data across hundreds of machines.

Proceedings Article
06 Jul 2015
TL;DR: This work provides a new proof of the linear convergence of the alternating direction method of multipliers when one of the objective terms is strongly convex, and demonstrates that minimizing the derived bound on the convergence rate provides a practical approach to selecting algorithm parameters for particular ADMM instances.
Abstract: We provide a new proof of the linear convergence of the alternating direction method of multipliers (ADMM) when one of the objective terms is strongly convex. Our proof is based on a framework for analyzing optimization algorithms introduced in Lessard et al. (2014), reducing algorithm convergence to verifying the stability of a dynamical system. This approach generalizes a number of existing results and obviates any assumptions about specific choices of algorithm parameters. On a numerical example, we demonstrate that minimizing the derived bound on the convergence rate provides a practical approach to selecting algorithm parameters for particular ADMM instances. We complement our upper bound by constructing a nearly-matching lower bound on the worst-case rate of convergence.

Posted Content
TL;DR: In this paper, a new stochastic L-BFGS algorithm was proposed and proved to have a linear convergence rate for strongly convex and smooth functions, and the algorithm was shown to perform well for a wide range of step sizes.
Abstract: We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed in Byrd et al. (2014) as well as a recent approach to variance reduction for stochastic gradient descent from Johnson and Zhang (2013). We demonstrate experimentally that our algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving the optimization problems to high levels of precision. Furthermore, we show that our algorithm performs well for a wide range of step sizes, often differing by several orders of magnitude.
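
One way to sketch the combination the abstract describes, pairing an L-BFGS-style metric with SVRG-style variance reduction, is the following update (notation chosen here for illustration, not taken from the paper): the stochastic gradient is corrected with a periodically refreshed reference point $\tilde{x}$, and then preconditioned by the L-BFGS inverse-Hessian approximation $H_t$ built from recent curvature pairs:

$$g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(\tilde{x}) + \nabla F(\tilde{x}), \qquad x_{t+1} = x_t - \eta\, H_t\, g_t.$$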

Journal ArticleDOI
TL;DR: A Bayesian nonparametric approach to a general family of latent class problems in which individuals can belong simultaneously to multiple classes and where each class can be exhibited multiple times by an individual is developed.
Abstract: We develop a Bayesian nonparametric approach to a general family of latent class problems in which individuals can belong simultaneously to multiple classes and where each class can be exhibited multiple times by an individual. We introduce a combinatorial stochastic process known as the negative binomial process (NBP) as an infinite-dimensional prior appropriate for such problems. We show that the NBP is conjugate to the beta process, and we characterize the posterior distribution under the beta-negative binomial process (BNBP) and hierarchical models based on the BNBP (the HBNBP). We study the asymptotic properties of the BNBP and develop a three-parameter extension of the BNBP that exhibits power-law behavior. We derive MCMC algorithms for posterior inference under the HBNBP, and we present experiments using these algorithms in the domains of image segmentation, object recognition, and document analysis.

Posted Content
TL;DR: This work presents a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods.
Abstract: With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.

Posted Content
TL;DR: Using the perturbed iterate framework, this work provides new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent, that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates.
Abstract: We introduce and analyze stochastic optimization methods where the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms. Under this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs. Using our perturbed iterate framework, we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate descent that are simpler than earlier analyses, remove many assumptions of previous models, and in some cases yield improved upper bounds on the convergence rates. We proceed to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.
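
The perturbed iterate viewpoint treats an asynchronous update as a serial stochastic step evaluated at a noisy copy of the shared iterate (schematic form, with notation chosen here for illustration):

$$\hat{x}_t = x_t + n_t, \qquad x_{t+1} = x_t - \gamma\, g\big(\hat{x}_t, \xi_t\big),$$

where $\hat{x}_t$ is the possibly stale view of the iterate read by a processor, $n_t$ is the bounded perturbation induced by asynchrony, $\xi_t$ indexes the sampled data point, and $g$ is an unbiased stochastic gradient.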

Proceedings Article
07 Dec 2015
TL;DR: The linear response variational Bayes (LRVB) as mentioned in this paper generalizes linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables, both for individual variables and coherently across variables.
Abstract: Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its fast runtime on large-scale data sets. However, a well known major failing of MFVB is that it underestimates the uncertainty of model variables (sometimes severely) and provides no information about model variable covariance. We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables—both for individual variables and coherently across variables. We call our method linear response variational Bayes (LRVB). When the MFVB posterior approximation is in the exponential family, LRVB has a simple, analytic form, even for non-conjugate models. Indeed, we make no assumptions about the form of the true posterior. We demonstrate the accuracy and scalability of our method on a range of models for both simulated and real data.

Posted Content
TL;DR: C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and provably achieve nearly linear speedups, are presented.
Abstract: Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, KwikCluster in practice requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio. We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.
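
For context, the serial KwikCluster procedure that C4 and ClusterWild! parallelize repeatedly picks a random pivot, forms a cluster from the pivot and its still-unclustered neighbors, and removes them. A minimal Python illustration (adjacency given as a dict of sets; not the paper's implementation):

    import random

    def kwik_cluster(adj):
        # Serial 3-approximation for correlation clustering via random pivots.
        remaining = set(adj)
        clusters = []
        while remaining:
            pivot = random.choice(sorted(remaining))
            cluster = {pivot} | (adj[pivot] & remaining)
            clusters.append(cluster)
            remaining -= cluster
        return clusters

    # Hypothetical usage on a tiny similarity graph.
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
    print(kwik_cluster(graph))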

Posted Content
TL;DR: CoCoA+ as discussed by the authors generalizes the primal-dual framework for distributed optimization by allowing for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging.
Abstract: Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization. Our framework, CoCoA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both CoCoA as well as our new variants, and generalize the theory for both methods to cover non-smooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of CoCoA+ on several real-world distributed datasets, especially when scaling up the number of machines.

Proceedings Article
07 Dec 2015
TL;DR: C4 and ClusterWild! as discussed by the authors use concurrency control to enforce serializability of a parallel clustering process, and guarantee a 3-approximation ratio for large graphs.
Abstract: Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, in practice KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio. We demonstrate experimentally that both algorithms outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.

Proceedings Article
07 Dec 2015
TL;DR: The variational consensus Monte Carlo (VCMC) as mentioned in this paper is a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target.
Abstract: Practitioners of Bayesian statistics have long depended on Markov chain Monte Carlo (MCMC) to obtain samples from intractable posterior distributions. Unfortunately, MCMC algorithms are typically serial, and do not scale to the large datasets typical of modern machine learning. The recently proposed consensus Monte Carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel [22]. A fixed aggregation function then combines these samples, yielding approximate posterior samples. We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. The resulting objective contains an intractable entropy term; we therefore derive a relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions. We illustrate the advantages of our algorithm on three inference tasks from the literature, demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step. Our algorithm achieves a relative error reduction (measured against serial MCMC) of up to 39% compared to consensus Monte Carlo on the task of estimating 300-dimensional probit regression parameter expectations; similarly, it achieves an error reduction of 92% on the task of estimating cluster comembership probabilities in a Gaussian mixture model with 8 components in 8 dimensions. Furthermore, these gains come at moderate cost compared to the runtime of serial MCMC—achieving near-ideal speedup in some instances.
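
Concretely, consensus Monte Carlo combines per-partition draws through a fixed aggregation function, whereas VCMC optimizes over that function variationally. A common parametric family is a weighted combination (notation chosen here for illustration):

$$\theta^{(i)} = F\big(\theta_1^{(i)},\dots,\theta_K^{(i)}\big), \qquad \text{e.g.}\quad F(\theta_1,\dots,\theta_K) = \sum_{k=1}^{K} W_k\,\theta_k,$$

where $\theta_k^{(i)}$ is the $i$-th sample drawn on data partition $k$ and the weights $W_k$ are fixed in consensus Monte Carlo but optimized in VCMC.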

Proceedings ArticleDOI
26 May 2015
TL;DR: This work shows that combining an optimistic exploration strategy with model-predictive control can achieve very good sample complexity for a range of nonlinear systems and achieves some of the most sample-efficient learning rates on several benchmark problems.
Abstract: Tasks with unknown dynamics and costly system interaction time present a serious challenge for reinforcement learning. If a model of the dynamics can be learned quickly, interaction time can be reduced substantially. We show that combining an optimistic exploration strategy with model-predictive control can achieve very good sample complexity for a range of nonlinear systems. Our method learns a Dirichlet process mixture of linear models using an exploration strategy based on optimism in the face of uncertainty. Trajectory optimization is used to plan paths in the learned model that both minimize the cost and perform exploration. Experimental results show that our approach achieves some of the most sample-efficient learning rates on several benchmark problems, and is able to successfully learn to control a simulated helicopter during hover and autorotation with only seconds of interaction time. The computational requirements are substantial.

Journal ArticleDOI
TL;DR: In this paper, a scalable divide-and-conquer framework for noisy matrix factorization is proposed, in which the statistical errors introduced by the divide step and control their magnitude in the conquer step are characterized.
Abstract: If learning methods are to scale to the massive sizes of modern data sets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce a scalable divide-and-conquer framework for noisy matrix factorization. We present a thorough theoretical analysis of this framework in which we characterize the statistical errors introduced by the "divide" step and control their magnitude in the "conquer" step, so that the overall algorithm enjoys high-probability estimation guarantees comparable to those of its base algorithm. We also present experiments in collaborative filtering and video background modeling that demonstrate the near-linear to superlinear speed-ups attainable with this approach.

Posted Content
TL;DR: It is shown that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$.
Abstract: We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$. As a consequence, the algorithm achieves arbitrary generalization error $\epsilon>0$ with ${\rm poly}(d,1/\epsilon)$ sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability $\eta<1/2$.


Proceedings ArticleDOI
27 May 2015
TL;DR: As the home of high-value, data-driven applications for over four decades, a natural question for database researchers to ask is: what role should the database community play in these new data- driven machine-learning-based applications?
Abstract: Machine learning seems to be eating the world with a new breed of high-value data-driven applications in image analysis, search, voice recognition, mobile, and office productivity products. To paraphrase Mike Stonebraker, machine learning is no longer a zero-billion-dollar business. As the home of high-value, data-driven applications for over four decades, a natural question for database researchers to ask is: what role should the database community play in these new data-driven machine-learning-based applications?

Posted Content
TL;DR: This work generalizes linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables—both for individual variables and coherently across variables.
Abstract: Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its fast runtime on large-scale data sets. However, it is well known that a major failing of MFVB is that it underestimates the uncertainty of model variables (sometimes severely) and provides no information about model variable covariance. We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables---both for individual variables and coherently across variables. We call our method linear response variational Bayes (LRVB). When the MFVB posterior approximation is in the exponential family, LRVB has a simple, analytic form, even for non-conjugate models. Indeed, we make no assumptions about the form of the true posterior. We demonstrate the accuracy and scalability of our method on a range of models for both simulated and real data.

Posted Content
TL;DR: TuPAQ, a component of the MLbase system, is proposed, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.
Abstract: The proliferation of massive datasets combined with the development of sophisticated analytical techniques have enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. These and many other applications can be supported by Predictive Analytic Queries (PAQs). A major obstacle to supporting PAQs is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single node implementations and have assumed that model training itself is a black box, thus limiting the effectiveness of such approaches on large-scale problems. In this work, we build upon these recent efforts and propose an integrated PAQ planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The result is TuPAQ, a component of the MLbase system, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.

Posted Content
TL;DR: This paper proposes Splash, a general framework for parallelizing stochastic algorithms on multi-node distributed systems; Splash consists of a programming interface and an execution engine, and the paper provides a theoretical justification of the optimal rate of convergence.
Abstract: Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems. Splash consists of a programming interface and an execution engine. Using the programming interface, the user develops sequential stochastic algorithms without needing to handle any details of distributed computing. The algorithm is then automatically parallelized by a communication-efficient execution engine. We provide a theoretical justification of the optimal rate of convergence for parallelizing stochastic gradient descent. Splash is built on top of Apache Spark. The real-data experiments on logistic regression, collaborative filtering and topic modeling verify that Splash yields order-of-magnitude speedup over single-thread stochastic algorithms and over state-of-the-art implementations on Spark.