scispace - formally typeset
Search or ask a question

Showing papers by "Ran El-Yaniv published in 2017"


Journal Article
TL;DR: In this paper, a method to train quantized neural networks (QNNs) with extremely low precision (e.g., 1-bit) weights and activations, at run-time is introduced.
Abstract: We introduce a method to train Quantized Neural Networks (QNNs) -- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.

919 citations


Proceedings Article
01 May 2017
TL;DR: A method to construct a selective classifier given a trained neural network, which allows a user to set a desired risk level and the classifier rejects instances as needed, to grant the desired risk (with high probability).
Abstract: Selective classification techniques (also known as reject option) have not yet been considered in the context of deep neural networks (DNNs). These techniques can potentially significantly improve DNNs prediction performance by trading-off coverage. In this paper we propose a method to construct a selective classifier given a trained neural network. Our method allows a user to set a desired risk level. At test time, the classifier rejects instances as needed, to grant the desired risk (with high probability). Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, with almost 60% test coverage.

330 citations


Posted Content
TL;DR: A novel active learning algorithm that queries consecutive points from the pool using farthest-first traversals in the space of neural activation over a representation layer shows consistent and overwhelming improvement in sample complexity over passive learning (random sampling) for three datasets: MNIST, CIFar-10, and CIFAR-100.
Abstract: This paper is concerned with pool-based active learning for deep neural networks. Motivated by coreset dataset compression ideas, we present a novel active learning algorithm that queries consecutive points from the pool using farthest-first traversals in the space of neural activation over a representation layer. We show consistent and overwhelming improvement in sample complexity over passive learning (random sampling) for three datasets: MNIST, CIFAR-10, and CIFAR-100. In addition, our algorithm outperforms the traditional uncertainty sampling technique (obtained using softmax activations), and we identify cases where uncertainty sampling is only slightly better than random sampling.

99 citations


Journal ArticleDOI
TL;DR: Novel model transfer-learning methods that refine a decision forest model by considering an ensemble that contains the union of the two forests and exhibit impressive experimental results over a range of problems are proposed.
Abstract: We propose novel model transfer-learning methods that refine a decision forest model $M$ learned within a “source” domain using a training set sampled from a “target” domain, assumed to be a variation of the source. We present two random forest transfer algorithms. The first algorithm searches greedily for locally optimal modifications of each tree structure by trying to locally expand or reduce the tree around individual nodes. The second algorithm does not modify structure, but only the parameter (thresholds) associated with decision nodes. We also propose to combine both methods by considering an ensemble that contains the union of the two forests. The proposed methods exhibit impressive experimental results over a range of problems.

89 citations


Posted Content
TL;DR: This paper proposed a method to construct a selective classifier given a trained neural network, which allows a user to set a desired risk level, and the classifier rejects instances as needed, to grant the desired risk (with high probability).
Abstract: Selective classification techniques (also known as reject option) have not yet been considered in the context of deep neural networks (DNNs). These techniques can potentially significantly improve DNNs prediction performance by trading-off coverage. In this paper we propose a method to construct a selective classifier given a trained neural network. Our method allows a user to set a desired risk level. At test time, the classifier rejects instances as needed, to grant the desired risk (with high probability). Empirical results over CIFAR and ImageNet convincingly demonstrate the viability of our method, which opens up possibilities to operate DNNs in mission-critical applications. For example, using our method an unprecedented 2% error in top-5 ImageNet classification can be guaranteed with probability 99.9%, and almost 60% test coverage.

33 citations


Patent
04 Apr 2017
TL;DR: In this paper, a neural network is constructed with neurons arranged in layers and connected by connections associated with quantized connection weight functions adapted to output quantised connection weight values, and a plurality of weight gradients are calculated during backpropagation sub-processes by computing neuron gradients, each of an output of a respective activation function in one layer with respect to an input of the respective quantized activation function.
Abstract: Training neural networks by constructing a neural network model having neurons each associated with a quantized activation function adapted to output a quantized activation value. The neurons are arranged in layers and connected by connections associated quantized connection weight functions adapted to output quantized connection weight values. During a training process a plurality of weight gradients are calculated during backpropagation sub-processes by computing neuron gradients, each of an output of a respective the quantized activation function in one layer with respect to an input of the respective quantized activation function. Each neuron gradient is calculated such that when an absolute value of the input is smaller than a positive constant threshold value, the respective neuron gradient is set as a positive constant output value and when the absolute value of the input is smaller than the positive constant threshold value the neuron gradient is set to zero.

30 citations


Posted Content
TL;DR: In this article, it was shown that a fast rejection rate is achieved if the rejection mass is bounded from above by O(1/m) where m is the number of labeled examples used to train the classifier and O hides logarithmic factors.
Abstract: A selective classifier (f,g) comprises a classification function f and a binary selection function g, which determines if the classifier abstains from prediction, or uses f to predict. The classifier is called pointwise-competitive if it classifies each point identically to the best classifier in hindsight (from the same class), whenever it does not abstain. The quality of such a classifier is quantified by its rejection mass, defined to be the probability mass of the points it rejects. A "fast" rejection rate is achieved if the rejection mass is bounded from above by O(1/m) where m is the number of labeled examples used to train the classifier (and O hides logarithmic factors). Pointwise-competitive selective (PCS) classifiers are intimately related to disagreement-based active learning and it is known that in the realizable case, a fast rejection rate of a known PCS algorithm (called Consistent Selective Strategy) is equivalent to an exponential speedup of the well-known CAL active algorithm. We focus on the agnostic setting, for which there is a known algorithm called LESS that learns a PCS classifier and achieves a fast rejection rate (depending on Hanneke's disagreement coefficient) under strong assumptions. We present an improved PCS learning algorithm called ILESS for which we show a fast rate (depending on Hanneke's disagreement coefficient) without any assumptions. Our rejection bound smoothly interpolates the realizable and agnostic settings. The main result of this paper is an equivalence between the following three entities: (i) the existence of a fast rejection rate for any PCS learning algorithm (such as ILESS); (ii) a poly-logarithmic bound for Hanneke's disagreement coefficient; and (iii) an exponential speedup for a new disagreement-based active learner called ActiveiLESS.

7 citations


Posted Content
TL;DR: It is argued that among several known alternative performance measures, PA is the best (and only) quantity ensuring meaningfulness for all noise and imbalance levels.
Abstract: We introduce the Prediction Advantage (PA), a novel performance measure for prediction functions under any loss function (eg, classification or regression) The PA is defined as the performance advantage relative to the Bayesian risk restricted to knowing only the distribution of the labels We derive the PA for well-known loss functions, including 0/1 loss, cross-entropy loss, absolute loss, and squared loss In the latter case, the PA is identical to the well-known R-squared measure, widely used in statistics The use of the PA ensures meaningful quantification of prediction performance, which is not guaranteed, for example, when dealing with noisy imbalanced classification problems We argue that among several known alternative performance measures, PA is the best (and only) quantity ensuring meaningfulness for all noise and imbalance levels

4 citations


Posted Content
TL;DR: In this paper, the authors consider online learning of portfolios of stocks whose prices are governed by arbitrary (unknown) stationary and ergodic processes, where the goal is to maximize wealth while keeping the conditional value at risk (CVaR) below a desired threshold.
Abstract: Online portfolio selection research has so far focused mainly on minimizing regret defined in terms of wealth growth. Practical financial decision making, however, is deeply concerned with both wealth and risk. We consider online learning of portfolios of stocks whose prices are governed by arbitrary (unknown) stationary and ergodic processes, where the goal is to maximize wealth while keeping the conditional value at risk (CVaR) below a desired threshold. We characterize the asymptomatically optimal risk-adjusted performance and present an investment strategy whose portfolios are guaranteed to achieve the asymptotic optimal solution while fulfilling the desired risk constraint. We also numerically demonstrate and validate the viability of our method on standard datasets.

3 citations


Posted Content
TL;DR: In this article, the authors extend the multi-objective framework to the case of stationary and ergodic processes, thus allowing dependencies among observations, and present an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.
Abstract: Online-learning research has mainly been focusing on minimizing one objective function. In many real-world applications, however, several objective functions have to be considered simultaneously. Recently, an algorithm for dealing with several objective functions in the i.i.d. case has been presented. In this paper, we extend the multi-objective framework to the case of stationary and ergodic processes, thus allowing dependencies among observations. We first identify an asymptomatic lower bound for any prediction strategy and then present an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.

3 citations


Proceedings Article
01 Jan 2017
TL;DR: The multi-objective framework is extended to the case of stationary and ergodic processes, thus allowing dependencies among observations and presenting an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.
Abstract: Online-learning research has mainly been focusing on minimizing one objective function. In many real-world applications, however, several objective functions have to be considered simultaneously. Recently, an algorithm for dealing with several objective functions in the i.i.d. case has been presented. In this paper, we extend the multi-objective framework to the case of stationary and ergodic processes, thus allowing dependencies among observations. We first identify an asymptomatic lower bound for any prediction strategy and then present an algorithm whose predictions achieve the optimal solution while fulfilling any continuous and convex constraining criterion.