Showing papers on "Empirical risk minimization published in 1996"


Journal ArticleDOI
TL;DR: It is shown, loosely speaking, that for loss functions other than zero-one (e.g., quadratic loss) there are a priori distinctions between learning algorithms; even so, any algorithm is equivalent on average to its "randomized" version, and therefore still has no first-principles justification in terms of average error.
Abstract: This is the second of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. The first paper discusses a particular set of ways to compare learning algorithms, according to which there are no distinctions between learning algorithms. This second paper concentrates on different ways of comparing learning algorithms from those used in the first paper; in particular, it discusses the associated a priori distinctions that do exist between learning algorithms. It is shown, loosely speaking, that for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, it is shown here that any algorithm is equivalent on average to its “randomized” version, and in this sense still has no first-principles justification in terms of average error. Nonetheless, as this paper discusses, it may be that (for example) cross-validation has better head-to-head minimax properties than “anti-cross-validation” (choosing the learning algorithm with the largest cross-validation error). This may be true even for zero-one loss, a loss function for which the notion of “randomization” is not relevant. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors over targets. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!).
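
As a concrete illustration of the head-to-head comparison the paper alludes to, the sketch below selects a learner by cross-validation and by "anti-cross-validation" (choosing the learner with the largest cross-validation error). The data, candidate learners, and scoring function are illustrative choices, not taken from the paper.

```python
# Minimal sketch (illustrative, not from the paper): cross-validation picks the
# learner with the smallest CV error, anti-cross-validation picks the largest.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)

learners = {
    "linear": LinearRegression(),
    "5-nn": KNeighborsRegressor(n_neighbors=5),
}

# Mean squared CV error (scikit-learn returns negated scores for this metric).
cv_error = {
    name: -cross_val_score(est, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    for name, est in learners.items()
}

chosen_by_cv = min(cv_error, key=cv_error.get)        # cross-validation
chosen_by_anti_cv = max(cv_error, key=cv_error.get)   # anti-cross-validation
print(cv_error, chosen_by_cv, chosen_by_anti_cv)
```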

123 citations


Journal ArticleDOI
TL;DR: The network parameters are obtained through empirical risk minimization, and the optimal RBF nets are shown to be consistent both for nonlinear function approximation and for nonparametric classification.
Abstract: This paper studies convergence properties of radial basis function (RBF) networks for a large class of basis functions, and reviews the methods and results related to this topic. The authors obtain the network parameters through empirical risk minimization and show the optimal nets to be consistent in the problem of nonlinear function approximation and in nonparametric classification. For the classification problem the authors consider two approaches: selection of the RBF classifier via nonlinear function estimation, and the direct method of minimizing the empirical error probability. The tools used in the analysis include distribution-free nonasymptotic probability inequalities and covering numbers for classes of functions.
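
A minimal sketch of fitting an RBF network by empirical risk minimization under quadratic loss is given below; the Gaussian basis, fixed centers, and width are illustrative assumptions, not the construction analyzed in the paper.

```python
# Minimal sketch (illustrative assumptions): a Gaussian RBF network whose output
# weights are chosen by empirical risk minimization under quadratic loss, which
# reduces to ordinary least squares on the basis expansion.
import numpy as np

def rbf_design(X, centers, width):
    """Matrix of Gaussian basis functions evaluated at the inputs."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.cos(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)

centers = np.linspace(-1, 1, 10).reshape(-1, 1)   # fixed centers (assumption)
Phi = rbf_design(X, centers, width=0.2)

# Empirical risk minimization for quadratic loss: least-squares output weights.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("empirical risk:", np.mean((Phi @ w - y) ** 2))
```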

86 citations


Proceedings Article
03 Dec 1996
TL;DR: An adaptive on-line algorithm extending the learning of learning idea can be applied to learning continuous functions or distributions, even when no explicit loss function is given and the Hessian is not available.
Abstract: An adaptive on-line algorithm extending the learning of learning idea is proposed and theoretically motivated. Relying only on gradient flow information, it can be applied to learning continuous functions or distributions, even when no explicit loss function is given and the Hessian is not available. Its efficiency is demonstrated for a non-stationary blind separation task of acoustic signals.
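
The sketch below shows one common way to adapt the learning rate on-line from gradient information alone, in the spirit of the proposed algorithm; the specific update rule, constants, and toy drifting objective are assumptions for illustration rather than the paper's exact method.

```python
# Generic sketch (a common heuristic, not necessarily the paper's exact rule):
# track a leaky average of the gradient flow and grow the learning rate while
# that average stays large (drifting target), shrink it near a stable optimum.
import numpy as np

def online_adaptive_sgd(grad, w0, steps, eta0=0.1, alpha=0.02, beta=0.05, delta=0.1):
    """On-line gradient descent with a step size adapted from the gradient flow."""
    w = np.asarray(w0, dtype=float)
    eta = eta0
    r = np.zeros_like(w)                      # leaky average of observed gradients
    for t in range(steps):
        g = grad(w, t)
        r = (1.0 - delta) * r + delta * g     # track the gradient flow
        eta = eta + alpha * eta * (beta * np.linalg.norm(r) - eta)
        w = w - eta * g
    return w, eta

# Toy non-stationary objective: a quadratic bowl whose minimum drifts in time.
grad = lambda w, t: 2.0 * (w - np.array([np.sin(0.01 * t), 0.0]))
w_final, eta_final = online_adaptive_sgd(grad, [1.0, 1.0], steps=2000)
print(w_final, eta_final)
```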

76 citations


Journal ArticleDOI
TL;DR: Using stochastic methods valid for small learning parameters η, the authors conclude that almost-cyclic learning (learning with random cycles) is a better alternative to batch-mode learning than cyclic learning (learning with a fixed cycle).
Abstract: We study and compare different neural network learning strategies: batch-mode learning, online learning, cyclic learning, and almost-cyclic learning. Incremental learning strategies require less storage capacity than batch-mode learning. However, due to the arbitrariness in the presentation order of the training patterns, incremental learning is a stochastic process, whereas batch-mode learning is deterministic. In zeroth order, i.e., as the learning parameter η tends to zero, all learning strategies approximate the same ordinary differential equation, referred to for convenience as the "ideal behavior". Using stochastic methods valid for small learning parameters η, we derive differential equations describing the evolution of the lowest-order deviations from this ideal behavior. We compute how the asymptotic misadjustment, measuring the average asymptotic distance from a stable fixed point of the ideal behavior, scales as a function of the learning parameter and the number of training patterns. Knowing the asymptotic misadjustment, we calculate the typical number of learning steps necessary to generate a weight within order ε of this fixed point, both with fixed and time-dependent learning parameters. We conclude that almost-cyclic learning (learning with random cycles) is a better alternative to batch-mode learning than cyclic learning (learning with a fixed cycle).
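
For concreteness, the sketch below contrasts the four presentation orders on a toy LMS problem; the linear model, learning parameter, and data are illustrative, not the setting of the paper's analysis.

```python
# Minimal sketch (illustrative): batch, online, cyclic, and almost-cyclic
# presentation orders for single-pattern LMS updates on a linear model.
import numpy as np

def pattern_step(w, x, target, eta):
    """Single-pattern LMS gradient step."""
    return w + eta * (target - w @ x) * x

def train(X, T, eta, epochs, mode, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    idx = np.arange(len(X))
    for _ in range(epochs):
        if mode == "batch":                     # one deterministic step per epoch
            w = w + eta * ((T - X @ w) @ X) / len(X)
            continue
        if mode == "online":                    # patterns drawn at random each step
            order = rng.integers(0, len(X), size=len(X))
        elif mode == "cyclic":                  # same fixed order every epoch
            order = idx
        else:                                   # "almost-cyclic": fresh permutation per epoch
            order = rng.permutation(idx)
        for i in order:
            w = pattern_step(w, X[i], T[i], eta)
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
T = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(50)
for mode in ["batch", "online", "cyclic", "almost-cyclic"]:
    print(mode, train(X, T, eta=0.01, epochs=200, mode=mode))
```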

62 citations


Journal ArticleDOI
TL;DR: Lower bounds on the performance of any algorithm for this learning problem are proved, and a similar analysis is given for the closely related problem of learning to predict in a model where the learner must produce predictions for a whole batch of observations before receiving reinforcement.
Abstract: We study the behavior of a family of learning algorithms based on Sutton’s method of temporal differences. In our on-line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will be received in the future. In this setting, we are able to prove general upper bounds on the performance of a slightly modified version of Sutton’s so-called TD(λ) algorithm. These bounds are stated in terms of the performance of the best linear predictor on the given training sequence, and are proved without making any statistical assumptions of any kind about the process producing the learner’s observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem, and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions for a whole batch of observations before receiving reinforcement.
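
The following is a minimal sketch of linear TD(λ) with eligibility traces, the kind of predictor the bounds concern; the trial data and constants here are illustrative, not the worst-case sequences of the analysis.

```python
# Minimal sketch: TD(lambda) with linear function approximation and eligibility
# traces, estimating the discounted sum of future reinforcements as w . x_t.
import numpy as np

def td_lambda(features, rewards, gamma=0.9, lam=0.8, eta=0.05):
    w = np.zeros(features.shape[1])
    e = np.zeros_like(w)                                   # eligibility trace
    for t in range(len(rewards) - 1):
        x, x_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * (w @ x_next) - (w @ x)  # TD error
        e = gamma * lam * e + x
        w = w + eta * delta * e
    return w

# Toy training sequence (illustrative, not from the paper).
rng = np.random.default_rng(3)
features = rng.standard_normal((500, 4))
true_w = np.array([0.5, -1.0, 0.2, 0.0])
rewards = features @ true_w + 0.1 * rng.standard_normal(500)
print(td_lambda(features, rewards))
```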

45 citations



01 Dec 1996
TL;DR: This paper focuses on the dependencies between answers to different questions, and emphasizes the need for their empirical measurement and control and for a more explicit treatment in theory.
Abstract: In learning problems, available information is usually divided into two categories: examples of function values (or training data) and prior information (e.g., a smoothness constraint). This paper (1) studies aspects in which these two categories usually differ, such as their relevance for generalization and their role in the loss function; (2) presents a unifying formalism in which both types of information are identified with answers to generalized questions; (3) shows what kind of generalized information is necessary to enable learning; (4) aims to put usual training data and prior information on a more equal footing by discussing possibilities and variants of measurement and control for generalized questions, including the examples of smoothness and symmetries; (5) briefly reviews the measurement of linguistic concepts based on fuzzy priors, and principles for combining preprocessors; (6) uses a Bayesian decision-theoretic framework, contrasting parallel and inverse decision problems; (7) proposes, for problems with non-approximation aspects, a Bayesian two-step approximation consisting of posterior maximization and a subsequent risk minimization; (8) analyses empirical risk minimization under the aspect of nonlocal information; (9) compares the Bayesian two-step approximation with empirical risk minimization, including their interpretations of Occam's razor; and (10) formulates examples of stationarity conditions for the maximum posterior approximation with nonlocal and nonconvex priors, leading to inhomogeneous nonlinear equations, similar, for example, to equations in scattering theory in physics. In summary, this paper focuses on the dependencies between answers to different questions. Because it is such dependencies, and not training examples alone, that enable generalization, the paper emphasizes the need for their empirical measurement and control and for a more explicit treatment in theory.

11 citations


Proceedings Article
03 Dec 1996
TL;DR: The method of complexity regularization, involving empirical risk minimization, is applied to derive estimation bounds for nonlinear function estimation using a single-hidden-layer radial basis function network.
Abstract: In this paper we apply the method of complexity regularization to derive estimation bounds for nonlinear function estimation using a single hidden layer radial basis function network. Our approach differs from the previous complexity regularization neural network function learning schemes in that we operate with random covering numbers and l1 metric entropy, making it possible to consider much broader families of activation functions, namely functions of bounded variation. Some constraints previously imposed on the network parameters are also eliminated this way. The network is trained by means of complexity regularization involving empirical risk minimization. Bounds on the expected risk in terms of the sample size are obtained for a large class of loss functions. Rates of convergence to the optimal loss are also derived.
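
As a rough illustration of complexity regularization as model selection, the sketch below picks the number of RBF units by minimizing empirical risk plus a complexity penalty; the penalty form and constants are assumptions for illustration, not the bounds derived in the paper.

```python
# Minimal sketch (illustrative): complexity regularization as model selection.
# The penalty C * k * log(n) / n is an assumed form, not the paper's bound.
import numpy as np

def rbf_empirical_risk(X, y, k, width=0.3):
    """Fit a k-unit Gaussian RBF net by least squares; return its training risk."""
    centers = np.linspace(X.min(), X.max(), k).reshape(1, -1)
    Phi = np.exp(-((X - centers) ** 2) / (2.0 * width ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((Phi @ w - y) ** 2)

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(4 * X[:, 0]) + 0.2 * rng.standard_normal(n)

C = 5.0                                              # penalty constant (assumption)
penalized = {k: rbf_empirical_risk(X, y, k) + C * k * np.log(n) / n
             for k in range(1, 21)}
k_star = min(penalized, key=penalized.get)
print("selected network size:", k_star)
```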

3 citations