
Showing papers by "Kilian Q. Weinberger" published in 2014


Proceedings Article
21 Jun 2014
TL;DR: This work presents constrained Bayesian optimization, which places a prior distribution on both the objective and the constraint functions, and evaluates the method on simulated and real data, demonstrating that constrained Bayesian optimization can quickly find optimal and feasible points, even when small feasible regions cause standard methods to fail.
Abstract: Bayesian optimization is a powerful framework for minimizing expensive objective functions while using very few function evaluations. It has been successfully applied to a variety of problems, including hyperparameter tuning and experimental design. However, this framework has not been extended to the inequality-constrained optimization setting, particularly the setting in which evaluating feasibility is just as expensive as evaluating the objective. Here we present constrained Bayesian optimization, which places a prior distribution on both the objective and the constraint functions. We evaluate our method on simulated and real data, demonstrating that constrained Bayesian optimization can quickly find optimal and feasible points, even when small feasible regions cause standard methods to fail.

333 citations
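
To make the idea concrete, here is a minimal sketch of the constrained expected-improvement acquisition the abstract describes: separate GP posteriors for the objective and the constraint, with the standard EI weighted by the probability of feasibility. The toy objective, constraint, and candidate grid below are illustrative assumptions, not the paper's benchmarks.

```python
# Minimal constrained-BO sketch: cEI(x) = EI(x) * P(constraint(x) <= 0).
# Toy 1-D objective/constraint and the dense candidate grid are assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

f = lambda x: np.sin(3 * x) + x ** 2      # expensive objective (toy)
c = lambda x: np.cos(2 * x)               # feasible iff c(x) <= 0 (toy)

X = np.array([[-1.5], [-0.5], [0.4], [1.2]])      # points evaluated so far
gp_f = GaussianProcessRegressor().fit(X, f(X).ravel())
gp_c = GaussianProcessRegressor().fit(X, c(X).ravel())

cand = np.linspace(-2, 2, 400).reshape(-1, 1)     # candidate locations
mu_f, sd_f = gp_f.predict(cand, return_std=True)
mu_c, sd_c = gp_c.predict(cand, return_std=True)

feasible = c(X).ravel() <= 0
best = f(X).ravel()[feasible].min()               # best feasible value so far
z = (best - mu_f) / np.maximum(sd_f, 1e-9)
ei = (best - mu_f) * norm.cdf(z) + sd_f * norm.pdf(z)   # standard EI
p_feas = norm.cdf(-mu_c / np.maximum(sd_c, 1e-9))       # P(c(x) <= 0)
print("next evaluation at", cand[np.argmax(ei * p_feas)])
```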


Proceedings ArticleDOI
24 Aug 2014
TL;DR: Gradient Boosted Feature Selection (GBFS) is a feature selection algorithm based on a modification of gradient boosted trees; it is shown to match or outperform other state-of-the-art feature selection algorithms.
Abstract: A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straightforward to implement, as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real-world data sets and show that it matches or outperforms other state-of-the-art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for domain-specific side information.

147 citations
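
The key mechanism in GBFS is that a boosting round only pays for a feature the first time a tree splits on it. A rough, self-contained sketch of that idea with boosted stumps follows; the single-threshold stump learner, the penalty lam, and the toy data are all assumptions, not the paper's implementation.

```python
# GBFS-style boosted stumps: the split gain of a feature not yet selected is
# penalized by lam, so boosting only "buys" a new feature when it pays off.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(float)  # only features 0 and 3 matter
lam, lr, selected = 0.1, 0.1, set()
pred = np.zeros(len(y))

for _ in range(50):                               # boosting rounds
    r = y - pred                                  # residuals (squared loss)
    best = None
    for j in range(X.shape[1]):
        t = np.median(X[:, j])                    # single candidate threshold
        left = X[:, j] <= t
        gain = left.sum() * r[left].mean() ** 2 + (~left).sum() * r[~left].mean() ** 2
        gain -= 0.0 if j in selected else lam * len(y)  # feature-acquisition cost
        if best is None or gain > best[0]:
            best = (gain, j, t, r[left].mean(), r[~left].mean())
    _, j, t, vl, vr = best
    selected.add(j)
    pred += lr * np.where(X[:, j] <= t, vl, vr)   # add the stump
print("selected features:", sorted(selected))
```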


Proceedings Article
21 Jun 2014
TL;DR: The marginalized Denoising Auto-encoder (mDAE) is presented, which (approximately) marginalizes out the corruption during training and is able to match or outperform the DAE in far fewer training epochs.
Abstract: Denoising auto-encoders (DAEs) have been successfully used to learn new representations for a wide range of machine learning tasks. During training, DAEs make many passes over the training dataset and reconstruct it from partial corruption generated from a pre-specified corrupting distribution. This process learns robust representations, though at the expense of requiring many training epochs, in which the data is explicitly corrupted. In this paper we present the marginalized Denoising Auto-encoder (mDAE), which (approximately) marginalizes out the corruption during training. Effectively, the mDAE takes into account infinitely many corrupted copies of the training data in every epoch, and therefore is able to match or outperform the DAE in far fewer training epochs. We analyze our proposed algorithm and show that it can be understood as a classic auto-encoder with a special form of regularization. In empirical evaluations we show that it attains a one-to-two order-of-magnitude speedup in training time over other competing approaches.

123 citations
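
The abstract's central trick, integrating out the corruption instead of sampling it, is easiest to see in the closely related *linear* marginalized denoiser (as in the earlier mSDA work); mDAE itself handles nonlinear auto-encoders via an approximation. A sketch of the closed-form linear case, with dropout probability p assumed:

```python
# Linear marginalized denoising: for dropout noise, E[x~ x^T] and E[x~ x~^T]
# have closed forms, so the reconstruction weights W = P Q^{-1} come from one
# linear solve; no explicitly corrupted copies of the data are ever created.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))   # d x n data matrix (features in rows)
p = 0.5                           # dropout (corruption) probability
q = 1.0 - p
S = X @ X.T                       # feature scatter matrix
P = q * S                                      # sum over data of E[x x~^T]
Q = q * q * S + q * p * np.diag(np.diag(S))    # sum over data of E[x~ x~^T]
W = np.linalg.solve(Q.T, P.T).T                # W = P Q^{-1}
hidden = np.tanh(W @ X)                        # nonlinearity after the fact
print(hidden.shape)
```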


Journal ArticleDOI
TL;DR: Two algorithms are developed to efficiently balance classifier performance against test-time cost in real-world settings; the trained classifiers are found to reach high accuracies at a small fraction of the computational cost.
Abstract: Machine learning algorithms have successfully entered industry through many real-world applications (e.g., search engines and product recommendations). In these applications, the test-time CPU cost must be budgeted and accounted for. In this paper, we examine two main components of the test-time CPU cost, classifier evaluation cost and feature extraction cost, and show how to balance these costs with the classifier accuracy. Since the computation required for feature extraction dominates the test-time cost of a classifier in these settings, we develop two algorithms to efficiently balance the performance with the test-time cost. Our first contribution describes how to construct and optimize a tree of classifiers, through which test inputs traverse along individual paths. Each path extracts different features and is optimized for a specific sub-partition of the input space. Our second contribution is a natural reduction of the tree of classifiers into a cascade. The cascade is particularly useful for class-imbalanced data sets as the majority of instances can be early-exited out of the cascade when the algorithm is sufficiently confident in its prediction. Because both approaches only compute features for inputs that benefit from them the most, we find our trained classifiers lead to high accuracies at a small fraction of the computational cost.

94 citations
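
The early-exit cascade in the second contribution can be pictured with a few stage classifiers over nested feature sets: confident predictions leave early and never pay for the expensive features. The stage feature groups, confidence threshold, and logistic models below are assumptions for illustration, not the paper's trained cascade.

```python
# Toy early-exit cascade: cheap features first, expensive ones only when needed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
stages = [slice(0, 5), slice(0, 12), slice(0, 20)]  # cheap -> expensive features
models = [LogisticRegression(max_iter=1000).fit(X[:, s], y) for s in stages]

def predict_with_budget(x, threshold=0.9):
    """Classify x, exiting at the first stage that is confident enough."""
    for s, m in zip(stages, models):
        proba = m.predict_proba(x[s].reshape(1, -1))[0]
        if proba.max() >= threshold or s is stages[-1]:
            return int(proba.argmax()), s.stop      # label, #features extracted

used = [predict_with_budget(x)[1] for x in X[:200]]
print("mean features extracted per input:", float(np.mean(used)))
```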


Proceedings Article
21 Jun 2014
TL;DR: Stochastic Neighbor Compression is presented, an algorithm to compress a dataset for the purpose of k-nearest neighbor (kNN) classification; it is complementary to existing state-of-the-art methods for speeding up kNN classification and leads to substantial further improvements.
Abstract: We present Stochastic Neighbor Compression (SNC), an algorithm to compress a dataset for the purpose of k-nearest neighbor (kNN) classification. Given training data, SNC learns a much smaller synthetic data set that minimizes the stochastic 1-nearest neighbor classification error on the training data. This approach has several appealing properties: due to its small size, the compressed set speeds up kNN testing drastically (up to several orders of magnitude, in our experiments); it makes the kNN classifier substantially more robust to label noise; on 4 of 7 data sets it yields lower test error than kNN on the entire training set, even at compression ratios as low as 2%; finally, the SNC compression leads to impressive speedups over kNN even when kNN and SNC are both used with ball-tree data structures, hashing, and LMNN dimensionality reduction, demonstrating that it is complementary to existing state-of-the-art algorithms for speeding up kNN classification and leads to substantial further improvements.

67 citations
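
The SNC objective is differentiable: each training point is softly assigned to the compressed points, and the synthetic points are moved by gradient descent to maximize the probability of correct 1-NN labels. A small numpy sketch of that loop follows; the kernel width gamma2, step size, and initialization from random training points are assumptions.

```python
# SNC-style compression: learn m synthetic points Z (labels fixed at init) by
# gradient descent on the stochastic 1-NN loss over the training set.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
idx = rng.choice(len(X), size=10, replace=False)
Z, zy = X[idx].copy(), y[idx]              # compressed set: 2% of the data
gamma2, lr = 1.0, 0.05

for _ in range(200):
    diff = X[:, None, :] - Z[None, :, :]               # n x m x d
    w = np.exp(-gamma2 * (diff ** 2).sum(-1))
    w /= w.sum(1, keepdims=True)                       # soft 1-NN assignment
    match = zy[None, :] == y[:, None]
    p = np.maximum((w * match).sum(1), 1e-12)          # P(correct label)
    a = w * (match / p[:, None] - 1.0)                 # grad of -log p wrt logits
    Z -= lr * (-2.0 * gamma2) * np.einsum('ij,ijk->jk', a, diff)
print("mean stochastic 1-NN accuracy:", round(float(p.mean()), 3))
```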


Proceedings Article
27 Jul 2014
TL;DR: This work proposes a different relaxation using approximate submodularity, called Approximately Submodular Tree of Classifiers (ASTC), which is much simpler to implement, yields equivalent results but requires no optimization hyperparameter tuning and is up to two orders of magnitude faster to train.
Abstract: During the past decade, machine learning algorithms have become commonplace in large-scale real-world industrial applications. In these settings, the computation time to train and test machine learning algorithms is a key consideration. At training-time the algorithms must scale to very large data set sizes. At testing-time, the cost of feature extraction can dominate the CPU runtime. Recently, a promising method was proposed to account for the feature extraction cost at testing time, called Cost-sensitive Tree of Classifiers (CSTC). Although the CSTC problem is NP-hard, the authors suggest an approximation through a mixed-norm relaxation across many classifiers. This relaxation is slow to train and requires involved optimization hyperparameter tuning. We propose a different relaxation using approximate submodularity, called Approximately Submodular Tree of Classifiers (ASTC). ASTC is much simpler to implement, yields equivalent results but requires no optimization hyperparameter tuning and is up to two orders of magnitude faster to train.

60 citations
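
The submodular view buys a very simple training procedure: greedily add whichever feature improves accuracy the most per unit of extraction cost, until the budget runs out or no feature helps. The sketch below shows that greedy loop for a single node with a cross-validated logistic scorer; the per-feature costs and the budget are assumptions.

```python
# Greedy cost-aware feature selection in the spirit of ASTC (single node).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, random_state=1)
cost = np.linspace(1, 3, 15)          # per-feature extraction cost (assumed)
budget, chosen = 10.0, []

def score(feats):
    if not feats:
        return 0.5                    # chance level for balanced binary labels
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, feats], y, cv=3).mean()

while True:
    cand = [j for j in range(15) if j not in chosen and cost[j] <= budget]
    if not cand:
        break
    base = score(chosen)
    gains = [(score(chosen + [j]) - base) / cost[j] for j in cand]
    if max(gains) <= 0:               # no feature is worth its cost anymore
        break
    j = cand[int(np.argmax(gains))]
    chosen.append(j)
    budget -= cost[j]
print("chosen features:", chosen, "remaining budget:", round(budget, 2))
```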


Posted Content
TL;DR: This paper provides the first comparison of algorithms with explicit and implicit parallelization and finds an approximate implicitly parallel algorithm which is surprisingly efficient, permits a much simpler implementation, and leads to unprecedented speedups in SVM training.
Abstract: In this paper, we evaluate the performance of various parallel optimization methods for Kernel Support Vector Machines on multicore CPUs and GPUs. In particular, we provide the first comparison of algorithms with explicit and implicit parallelization. Most existing parallel implementations for multi-core or GPU architectures are based on explicit parallelization of Sequential Minimal Optimization (SMO): the programmers identified parallelizable components and hand-parallelized them, specifically tuned for a particular architecture. We compare these approaches with each other and with implicitly parallelized algorithms, where the algorithm is expressed such that most of the work is done within a few iterations with large dense linear algebra operations. These can be computed with highly optimized libraries that are carefully parallelized for a large variety of parallel platforms. We highlight the advantages and disadvantages of both approaches and compare them on various benchmark data sets. We find an approximate implicitly parallel algorithm which is surprisingly efficient, permits a much simpler implementation, and leads to unprecedented speedups in SVM training.

21 citations
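
The contrast between explicit and implicit parallelization comes down to where the parallelism lives: in hand-written kernels, or inside dense linear algebra calls that a BLAS library already parallelizes. The toy timing below illustrates only that principle on an RBF kernel matrix, not the paper's SVM solvers; the data sizes and gamma are arbitrary.

```python
# Implicit parallelism: express the work as one big GEMM and let BLAS use all
# cores, instead of hand-written per-element loops.
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 100))
gamma = 0.01

t0 = time.time()                      # explicit loops: no BLAS, single core
K_loop = np.empty((200, 200))
for i in range(200):
    for j in range(200):
        K_loop[i, j] = np.exp(-gamma * np.sum((X[i] - X[j]) ** 2))

t1 = time.time()                      # one GEMM plus broadcasting
sq = (X ** 2).sum(1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

t2 = time.time()
assert np.allclose(K[:200, :200], K_loop)
print(f"loops (200x200): {t1 - t0:.3f}s   dense (3000x3000): {t2 - t1:.3f}s")
```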


Proceedings ArticleDOI
24 Aug 2014
TL;DR: This paper proposes a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification that can be learned in minutes on datasets with millions of samples, for which most existing nonlinear methods will be prohibitively expensive in space and time.
Abstract: In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification. Unlike other existing methods, FFD combines the efficiency and interpretability of linear models with the accuracy of nonlinear models. It is also sparse and naturally handles mixed data types. It works by decomposing the kernel density estimation in the entire feature space into selected low-dimensional subspaces. Since there are many possible subspaces, we propose a submodular optimization framework for subspace selection. The selected subspace predictions are then transformed to new features on which a linear model can be learned. Moreover, because the transformed features naturally call for non-negative weights, the optimization remains smooth even with L1 regularization. Unlike other nonlinear models such as kernel methods, the FFD model is interpretable, as it gives importance weights on the original features. Its training and testing are also much faster than traditional kernel models. We carry out extensive empirical studies on real-world datasets and show that the proposed model achieves state-of-the-art classification results with sparsity, interpretability, and exceptional scalability. Our model can be learned in minutes on datasets with millions of samples, for which most existing nonlinear methods would be prohibitively expensive in space and time.

15 citations
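
FFD's pipeline, minus the submodular selection step, fits in a few lines: estimate class-conditional structure in low-dimensional subspaces, convert each subspace's posterior estimate into a new feature, and fit a linear model on top. Here the subspaces are single features summarized by 1-D histograms; the bin count and the omission of subspace selection are simplifying assumptions.

```python
# FFD-flavored pipeline: histogram-based subspace posteriors as new features,
# then a linear classifier on the transformed representation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
bins = 16

def subspace_feature(col):
    """Estimate P(y=1 | bin of feature col) from a 1-D histogram."""
    edges = np.histogram_bin_edges(X[:, col], bins=bins)
    b = np.clip(np.digitize(X[:, col], edges) - 1, 0, bins - 1)
    p1 = np.array([(y[b == k] == 1).mean() if (b == k).any() else 0.5
                   for k in range(bins)])
    return p1[b]

F = np.column_stack([subspace_feature(c) for c in range(X.shape[1])])
clf = LogisticRegression(max_iter=1000).fit(F, y)
print("train accuracy on transformed features:", round(clf.score(F, y), 3))
```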


Book ChapterDOI
15 Sep 2014
TL;DR: It is shown that the proposed Transductive MPM (TMPM) outperforms almost all of the other algorithms in both accuracy and speed.
Abstract: The Minimax Probability Machine (MPM) is an elegant machine learning algorithm for inductive learning. It learns a classifier that minimizes an upper bound on its own generalization error. In this paper, we extend its celebrated inductive formulation to an equally elegant transductive learning algorithm. In the transductive setting, the label assignment of a test set is already optimized during training. This optimization problem is an intractable mixed-integer program, so we provide an efficient label-switching approach to solve it approximately. The resulting method scales naturally to large data sets and is very efficient to run. In comparison with nine competitive algorithms on eleven data sets, we show that the proposed Transductive MPM (TMPM) outperforms almost all of the other algorithms in both accuracy and speed.

13 citations
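
The label-switching approximation is a plain local search: initialize the test labels with an inductive rule, then keep flipping single labels while the objective improves. The skeleton below is runnable with a stand-in objective (negative within-class scatter); TMPM's actual objective is its minimax probability bound, which is not reproduced here.

```python
# Label-switching local search for transduction, with a stand-in objective.
import numpy as np

rng = np.random.default_rng(0)
Xtr = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
ytr = np.array([0] * 50 + [1] * 50)
Xte = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])

def objective(yte):
    """Negative within-class scatter over train + test (stand-in score)."""
    Xall, yall = np.vstack([Xtr, Xte]), np.concatenate([ytr, yte])
    s = 0.0
    for c in (0, 1):
        Xc = Xall[yall == c]
        if len(Xc) == 0:
            return -np.inf               # forbid empty classes
        s += ((Xc - Xc.mean(0)) ** 2).sum()
    return -s

# inductive initialization: nearest training-class centroid
c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
yte = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)

best, improved = objective(yte), True
while improved:                          # flip labels while the score improves
    improved = False
    for i in range(len(yte)):
        yte[i] ^= 1                      # tentative single-label flip
        s = objective(yte)
        if s > best:
            best, improved = s, True     # keep the flip
        else:
            yte[i] ^= 1                  # revert
print("final objective:", round(best, 2))
```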


Journal ArticleDOI
TL;DR: In this article, a decision tree learning algorithm was used to derive rules for estimating nanoparticle-dependent complement response from data generated in hemolytic assay studies; the results indicate that physicochemical properties of nanoparticles, namely size, polydispersity index, zeta potential, and mole percentage of the active surface ligand, can serve as good descriptors for predicting nanoparticle-dependent complement activation in the decision tree modeling framework.
Abstract: Nanoparticles are potentially powerful therapeutic tools that have the capacity to target drug payloads and imaging agents. However, some nanoparticles can activate complement, a branch of the innate immune system, and cause adverse side-effects. Recently, we employed an in vitro hemolysis assay to measure the serum complement activity of perfluorocarbon nanoparticles that differed by size, surface charge, and surface chemistry, quantifying the nanoparticle-dependent complement activity using a metric called Residual Hemolytic Activity (RHA). In the present work, we have used a decision tree learning algorithm to derive the rules for estimating nanoparticle-dependent complement response based on the data generated from the hemolytic assay studies. Our results indicate that physicochemical properties of nanoparticles, namely, size, polydispersity index, zeta potential, and mole percentage of the active surface ligand of a nanoparticle, can serve as good descriptors for prediction of nanoparticle-dependent complement activation in the decision tree modeling framework.

12 citations
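
The modeling framework itself is compact: a decision tree over the four physicochemical descriptors. The sketch below uses entirely synthetic data and an invented rule, only to show the shape of the setup; the paper's RHA measurements and learned tree are not reproduced.

```python
# Decision tree over the four descriptors named in the abstract (synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 120
X = np.column_stack([
    rng.uniform(80, 300, n),      # size (nm)
    rng.uniform(0.05, 0.4, n),    # polydispersity index
    rng.uniform(-40, 10, n),      # zeta potential (mV)
    rng.uniform(0, 20, n),        # active surface ligand (mol %)
])
y = (X[:, 2] < -20).astype(int)   # invented rule standing in for measured RHA

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["size", "PDI", "zeta", "ligand_pct"]))
```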


Posted Content
TL;DR: This paper introduces a formal and practical reduction between two of the most widely used machine learning algorithms, from the Elastic Net to the Support Vector Machine, and shows that it yields results identical to the popular glmnet implementation but is up to two orders of magnitude faster.
Abstract: The past years have witnessed many dedicated open-source projects that built and maintain implementations of Support Vector Machines (SVM), parallelized for GPU, multi-core CPUs and distributed systems. Up to this point, no comparable effort has been made to parallelize the Elastic Net, despite its popularity in many high impact applications, including genetics, neuroscience and systems biology. The first contribution in this paper is of theoretical nature. We establish a tight link between two seemingly different algorithms and prove that Elastic Net regression can be reduced to SVM with squared hinge loss classification. Our second contribution is to derive a practical algorithm based on this reduction. The reduction enables us to utilize prior efforts in speeding up and parallelizing SVMs to obtain a highly optimized and parallel solver for the Elastic Net and Lasso. With a simple wrapper, consisting of only 11 lines of MATLAB code, we obtain an Elastic Net implementation that naturally utilizes GPU and multi-core CPUs. We demonstrate on twelve real-world data sets that our algorithm yields results identical to the popular (and highly optimized) glmnet implementation but is one or several orders of magnitude faster.
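
For reference, this is the Elastic Net problem being reduced, solved here by plain coordinate descent rather than by the paper's SVM reduction (which is what enables the parallel speedups). The lam/alpha parameterization follows the glmnet convention, and the toy data is an assumption.

```python
# Elastic Net via coordinate descent: min (1/2n)||y - Xw||^2
#   + lam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2).
import numpy as np

def elastic_net_cd(X, y, lam=0.1, alpha=0.5, iters=200):
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(0)
    r = y - X @ w                            # residual, updated incrementally
    for _ in range(iters):
        for j in range(p):
            rho = X[:, j] @ r + col_sq[j] * w[j]
            wj = np.sign(rho) * max(abs(rho) - lam * alpha * n, 0.0)
            wj /= col_sq[j] + lam * (1 - alpha) * n
            r += X[:, j] * (w[j] - wj)       # account for the coordinate change
            w[j] = wj
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(elastic_net_cd(X, y), 2))     # sparse estimate of w_true
```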

Posted Content
TL;DR: This paper proposes a third, alternative approach to combating overfitting, called marginalized corrupted features (MCF): the training set is extended with infinitely many artificial training examples obtained by corrupting the original training data, and robust predictors are trained by minimizing the expected value of the loss function under the corruption model.
Abstract: The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on an almost infinitely large training data set that captures all variations in the data distribution. In practical learning settings, however, we do not have infinite data and our predictors may overfit. Overfitting may be combatted, for example, by adding a regularizer to the training objective or by defining a prior over the model parameters and performing Bayesian inference. In this paper, we propose a third, alternative approach to combat overfitting: we extend the training set with infinitely many artificial training examples that are obtained by corrupting the original training data. We show that this approach is practical and efficient for a range of predictors and corruption models. Our approach, called marginalized corrupted features (MCF), trains robust predictors by minimizing the expected value of the loss function under the corruption model. We show empirically on a variety of data sets that MCF classifiers can be trained efficiently, may generalize substantially better to test data, and are also more robust to feature deletion at test time.
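
For the quadratic loss with dropout ("blankout") corruption, the expectation in MCF has a closed form, so training on infinitely many corrupted copies reduces to one ridge-like linear solve. A minimal sketch, with the corruption rate p and the toy regression data as assumptions:

```python
# MCF, quadratic loss, dropout corruption: replace sampled corrupted copies
# with E[x~] = q*x and E[x~ x~^T] (q = 1 - p), then solve the normal equations.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=300)
p = 0.3
q = 1.0 - p

A = q * q * (X.T @ X) + q * p * np.diag((X ** 2).sum(0))  # sum of E[x~ x~^T]
b = q * (X.T @ y)                                          # sum of E[x~] * y
w_mcf = np.linalg.solve(A, b)
print(np.round(w_mcf, 2))
```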


Posted Content
TL;DR: Stochastic Covariance Compression is presented, an algorithm that compresses a data set of SPD matrices to a much smaller set with similar kNN characteristics, and sometimes even outperforms the original data set, while requiring only a fraction of the space and drastically reduced test-time computation.
Abstract: Covariance matrices are an effective way to capture global spread across local interest points in images. Often, these image descriptors are more compact, robust and informative than, for example, bags of visual words. However, they are symmetric and positive definite (SPD) and therefore live on a non-Euclidean Riemannian manifold, which gives rise to non-Euclidean metrics. These are slow to compute and can make the use of covariance features prohibitive in many settings, in particular k-nearest neighbors (kNN) classification. In this paper we present Stochastic Covariance Compression, an algorithm that compresses a data set of SPD matrices to a much smaller set with similar kNN characteristics. We show that we can reduce the data sets to 1/6 and in some cases even up to 1/50 of their original size, while approximately matching the test error of full kNN classification. In fact, because the compressed set is learned to perform well on kNN tasks, it sometimes even outperforms the original data set, while requiring only a fraction of the space and drastically reduced test-time computation.
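
The expense that motivates compression is the non-Euclidean metric itself. One standard choice is the log-Euclidean distance, under which each SPD matrix pays a one-time matrix logarithm and all subsequent distances are plain Frobenius norms, so a smaller reference set directly cuts both costs. A sketch (the matrix sizes and data are illustrative):

```python
# Log-Euclidean distance between SPD matrices: ||logm(A) - logm(B)||_F.
import numpy as np
from scipy.linalg import logm

def random_spd(d, rng):
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)            # well-conditioned SPD matrix

rng = np.random.default_rng(0)
mats = [random_spd(5, rng) for _ in range(20)]
logs = np.stack([logm(m).real for m in mats])  # one-time cost per matrix

def log_euclidean(i, j):
    return np.linalg.norm(logs[i] - logs[j])   # cheap Frobenius distance

print("d(0,1) =", round(log_euclidean(0, 1), 3))
```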


Posted Content
TL;DR: This paper presents two methods to compress the size of covariance and histogram datasets with only marginal increases in test error for k-nearest neighbor classification, and shows that they can reduce data sets to 16% and in some cases as little as 2% of their original size.
Abstract: Covariance and histogram image descriptors provide an effective way to capture information about images. Both excel when used in combination with special-purpose distance metrics. For covariance descriptors these metrics measure the distance along the non-Euclidean Riemannian manifold of symmetric positive definite matrices. For histogram descriptors the Earth Mover's distance measures the optimal transport between two histograms. Although more precise, these distance metrics are very expensive to compute, making them impractical in many applications, even for data sets of only a few thousand examples. In this paper we present two methods to compress the size of covariance and histogram datasets with only marginal increases in test error for k-nearest neighbor classification. Specifically, we show that we can reduce data sets to 16% and in some cases as little as 2% of their original size, while approximately matching the test error of kNN classification on the full training set. In fact, because the compressed set is learned in a supervised fashion, it sometimes even outperforms the full data set, while requiring only a fraction of the space and drastically reducing test-time computation.
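
For histogram descriptors, the special-purpose metric is the Earth Mover's distance; in one dimension on a shared bin grid it reduces to the L1 distance between cumulative sums, which makes a brute-force kNN query easy to sketch. The random histograms below are illustrative only.

```python
# 1-D Earth Mover's distance as |CDF(a) - CDF(b)|_1, used for a kNN query.
import numpy as np

rng = np.random.default_rng(0)
H = rng.dirichlet(np.ones(16), size=100)       # 100 normalized histograms
query = rng.dirichlet(np.ones(16))

def emd_1d(a, b):
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

d = np.array([emd_1d(query, h) for h in H])
print("3 nearest histograms:", np.argsort(d)[:3])
```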