
Showing papers on "Empirical risk minimization published in 2006"


Journal ArticleDOI
TL;DR: The support vector machine (SVM) is presented as a promising method for hydrological prediction, and it is demonstrated that SVM is a strong candidate for the prediction of long-term discharges.
Abstract: Accurate time- and site-specific forecasts of streamflow and reservoir inflow are important in effective hydropower reservoir management and scheduling. Traditionally, autoregressive moving-average (ARMA) models have been used in modelling water resource time series as a standard representation of stochastic time series. Recently, artificial neural network (ANN) approaches have been proven to be efficient when applied to hydrological prediction. In this paper, the support vector machine (SVM) is presented as a promising method for hydrological prediction. Over-fitting and convergence to local optima are unlikely to occur with SVM, which implements the structural risk minimization principle rather than the empirical risk minimization principle. In order to identify appropriate parameters of the SVM prediction model, a shuffled complex evolution algorithm is performed through exponential transformation. The SVM prediction model is tested using the long-term observations of discharges of monthly river fl...

517 citations
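To make the SRM-versus-ERM point above concrete, here is a minimal sketch (not the paper's setup) of epsilon-SVR with an RBF kernel forecasting a synthetic monthly discharge series from its own lagged values, using scikit-learn; hyperparameters are tuned by a plain grid search rather than the shuffled complex evolution algorithm the paper uses, and the data are simulated.

```python
# Sketch only: SVR forecasting a synthetic monthly "discharge" series from lags.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
t = np.arange(360)                                       # 30 years of monthly data
flow = 100 + 40 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, t.size)

lags = 12
X = np.array([flow[i:i + lags] for i in range(len(flow) - lags)])
y = flow[lags:]

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(model,
                    {"svr__C": [10, 100, 1000], "svr__gamma": [0.01, 0.1, 1.0]},
                    cv=TimeSeriesSplit(n_splits=5))
grid.fit(X[:300], y[:300])                               # train on the earlier months
pred = grid.predict(X[300:])
print("test RMSE:", np.sqrt(np.mean((pred - y[300:]) ** 2)))
```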


Journal ArticleDOI
TL;DR: In this article, the authors study an empirical risk minimization problem, where the goal is to obtain very general upper bounds on the excess risk of the empirical minimizer over a class of measurable functions, expressed in terms of relevant geometric parameters of the class.
Abstract: Let $\mathcal{F}$ be a class of measurable functions $f: S \mapsto [0, 1]$ defined on a probability space $(S, \mathcal{A}, P)$. Given a sample $(X_1, \dots, X_n)$ of i.i.d. random variables taking values in $S$ with common distribution $P$, let $P_n$ denote the empirical measure based on $(X_1, \dots, X_n)$. We study an empirical risk minimization problem $P_n f \to \min$, $f \in \mathcal{F}$. Given a solution $\hat{f}_n$ of this problem, the goal is to obtain very general upper bounds on its excess risk $$\mathcal{E}_{P}(\hat{f}_{n}) := P\hat{f}_{n} - \inf_{f\in \mathcal{F}} Pf,$$ expressed in terms of relevant geometric parameters of the class $\mathcal{F}$. Using concentration inequalities and other empirical processes tools, we obtain both distribution-dependent and data-dependent upper bounds on the excess risk that are of asymptotically correct order in many examples. The bounds involve localized sup-norms of empirical and Rademacher processes indexed by functions from the class. We use these bounds to develop model selection techniques in abstract risk minimization problems that can be applied to more specialized frameworks of regression and classification.

381 citations
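A small numerical illustration of the quantities defined in the abstract may help: the empirical risk $P_n f$, its minimizer over a class, and the excess risk $P\hat{f}_n - \inf_{f} Pf$. The class of threshold classifiers, the noise level, and the Monte Carlo approximation of $P$ below are illustrative assumptions, not taken from the paper.

```python
# Empirical risk, its minimizer over a finite class F, and the excess risk,
# with the "true" risk P f approximated on a large independent sample.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.random(n) < 0.1              # 10% label noise
    return x, np.where(flip, 1 - y, y)

thresholds = np.linspace(0, 1, 101)          # the class F = {x -> 1{x > t}}

def risks(x, y):
    preds = x[None, :] > thresholds[:, None]            # every f in F on (x, y)
    return (preds != y[None, :]).mean(axis=1)            # 0-1 losses

x_n, y_n = sample(200)                       # training sample of size n
x_big, y_big = sample(200_000)               # proxy for the distribution P

emp = risks(x_n, y_n)                        # P_n f for every f in F
true = risks(x_big, y_big)                   # approximate P f
f_hat = emp.argmin()                         # empirical risk minimizer
excess = true[f_hat] - true.min()            # excess risk of f_hat
print(f"chosen threshold {thresholds[f_hat]:.2f}, excess risk {excess:.4f}")
```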


Proceedings ArticleDOI
25 Jun 2006
TL;DR: The first active learning algorithm which works in the presence of arbitrary forms of noise is stated and analyzed, and it is shown that A2 achieves an exponential improvement over the usual sample complexity of supervised learning.
Abstract: We state and analyze the first active learning algorithm which works in the presence of arbitrary forms of noise. The algorithm, A2 (for Agnostic Active), relies only upon the assumption that the samples are drawn i.i.d. from a fixed distribution. We show that A2 achieves an exponential improvement (i.e., requires only O(ln 1/ε) samples to find an ε-optimal classifier) over the usual sample complexity of supervised learning, for several settings considered before in the realizable case. These include learning threshold classifiers and learning homogeneous linear separators with respect to an input distribution which is uniform over the unit sphere.

327 citations
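The A2 algorithm itself is more involved; the sketch below only illustrates, in the realizable threshold setting the abstract mentions, why active label queries can be exponentially cheaper than passive labeling: binary search over a sorted unlabeled pool needs O(log n) queries instead of n. The pool, the oracle, and the noise-free assumption are simplifications of this sketch, not the paper's agnostic setting.

```python
# Realizable threshold learning with a label-query oracle: O(log n) queries.
import numpy as np

rng = np.random.default_rng(0)
true_threshold = 0.37
pool = np.sort(rng.uniform(0, 1, 10_000))        # unlabeled pool, sorted

def query_label(x):
    return int(x > true_threshold)               # label oracle (noise-free here)

lo, hi, queries = 0, len(pool) - 1, 0
while lo < hi:                                   # binary search for the boundary
    mid = (lo + hi) // 2
    queries += 1
    if query_label(pool[mid]) == 1:
        hi = mid
    else:
        lo = mid + 1

print(f"estimated threshold ~ {pool[lo]:.4f} after {queries} label queries")
```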


Journal ArticleDOI
TL;DR: Results from the SVM modeling are compared with predictions obtained from ANN models and show that SVM models performed better for soil moisture forecasting than ANN models.
Abstract: Herein, a recently developed methodology, Support Vector Machines (SVMs), is presented and applied to the challenge of soil moisture prediction. Support Vector Machines are derived from statistical learning theory and can be used to predict a quantity forward in time based on training that uses past data, hence providing a statistically sound approach to solving inverse problems. The principal strength of SVMs lies in the fact that they employ Structural Risk Minimization (SRM) instead of Empirical Risk Minimization (ERM). The SVMs formulate a quadratic optimization problem that ensures a global optimum, which makes them superior to traditional learning algorithms such as Artificial Neural Networks (ANNs). The resulting model is sparse and not characterized by the “curse of dimensionality.” Soil moisture distribution and variation is helpful in predicting and understanding various hydrologic processes, including weather changes, energy and moisture fluxes, drought, irrigation scheduling, and rainfall/runoff generation. Soil moisture and meteorological data are used to generate SVM predictions for four and seven days ahead. Predictions show good agreement with actual soil moisture measurements. Results from the SVM modeling are compared with predictions obtained from ANN models and show that SVM models performed better for soil moisture forecasting than ANN models.

237 citations
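As a rough stand-in for the SVM-versus-ANN comparison described above (the paper's soil moisture and meteorological data are not used here), one can compare scikit-learn's SVR and MLPRegressor on a synthetic regression task; the data set and hyperparameters are arbitrary choices of this sketch.

```python
# SVM (structural risk minimization) vs. a small ANN (empirical risk minimization).
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "SVR": SVR(kernel="rbf", C=10.0),
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: test RMSE = {rmse:.3f}")
```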


Journal ArticleDOI
TL;DR: In this article, leave-one-out (LOO) stability is defined as a statistical form of well-posedness, and it is shown that for bounded loss classes LOO stability is sufficient for generalization and necessary and sufficient for consistency of ERM.
Abstract: Solutions of learning problems by Empirical Risk Minimization (ERM) – and almost-ERM when the minimizer does not exist – need to be consistent, so that they may be predictive. They also need to be well-posed in the sense of being stable, so that they might be used robustly. We propose a statistical form of stability, defined as leave-one-out (LOO) stability. We prove that for bounded loss classes LOO stability is (a) sufficient for generalization, that is convergence in probability of the empirical error to the expected error, for any algorithm satisfying it and, (b) necessary and sufficient for consistency of ERM. Thus LOO stability is a weak form of stability that represents a sufficient condition for generalization for symmetric learning algorithms while subsuming the classical conditions for consistency of ERM. In particular, we conclude that a certain form of well-posedness and consistency are equivalent for ERM.

227 citations
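The toy computation below only illustrates the flavor of leave-one-out stability, not the paper's formal definitions: for ERM with squared loss over constant predictors (i.e. the sample mean), it compares the loss at each point when that point is left out of training versus kept in, a quantity that shrinks as the sample grows.

```python
# Rough empirical illustration of a leave-one-out stability quantity.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0, 1, 200)

full_mean = y.mean()                              # ERM on the full sample S
diffs = []
for i in range(len(y)):
    loo_mean = np.delete(y, i).mean()             # ERM on S \ {z_i}
    loss_in = (y[i] - full_mean) ** 2             # loss of f_S at z_i
    loss_out = (y[i] - loo_mean) ** 2             # loss of f_{S\i} at z_i
    diffs.append(abs(loss_in - loss_out))

print(f"average LOO loss change: {np.mean(diffs):.5f}  (shrinks as n grows)")
```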


Journal ArticleDOI
01 Dec 2006-Neuron
TL;DR: It is shown that the performance of the estimators is controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn, and finite-sample performance bounds are obtained in terms of VC dimension and related quantities.

149 citations


Journal Article
TL;DR: This paper proposes a new active learning method, also based on weighted least-squares learning, and proves that the proposed active learning criterion is a more accurate predictor of the single-trial generalization error than the existing criterion.
Abstract: The goal of active learning is to determine the locations of training input points so that the generalization error is minimized. We discuss the problem of active learning in linear regression scenarios. Traditional active learning methods using least-squares learning often assume that the model used for learning is correctly specified. In many practical situations, however, this assumption may not be fulfilled. Recently, active learning methods using "importance"-weighted least-squares learning have been proposed, which are shown to be robust against misspecification of models. In this paper, we propose a new active learning method also using the weighted least-squares learning, which we call ALICE (Active Learning using the Importance-weighted least-squares learning based on Conditional Expectation of the generalization error). An important difference from existing methods is that we predict the conditional expectation of the generalization error given training input points, while existing methods predict the full expectation of the generalization error. Due to this difference, the training input design can be fine-tuned depending on the realization of training input points. Theoretically, we prove that the proposed active learning criterion is a more accurate predictor of the single-trial generalization error than the existing criterion. Numerical studies with toy and benchmark data sets show that the proposed method compares favorably to existing methods.

147 citations
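A sketch of the importance-weighted least-squares fit that ALICE builds on (the ALICE criterion itself is not reproduced): with a misspecified linear model and known train and test input densities, each squared error is weighted by the density ratio p_test(x)/p_train(x). The densities and the sinusoidal target below are illustrative choices of this sketch.

```python
# Importance-weighted least squares under model misspecification.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                        # true function
p_train, p_test = norm(0.0, 1.0), norm(0.5, 0.7)   # known densities in this toy case

x = p_train.rvs(300, random_state=0)
y = f(x) + rng.normal(0, 0.2, x.size)
w = p_test.pdf(x) / p_train.pdf(x)                 # importance weights

# misspecified model: a*x + b, fit by ordinary and by weighted least squares
Phi = np.column_stack([x, np.ones_like(x)])
ols = np.linalg.lstsq(Phi, y, rcond=None)[0]
wls = np.linalg.lstsq(Phi * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None)[0]

x_te = p_test.rvs(10_000, random_state=1)          # evaluate under the test distribution
Phi_te = np.column_stack([x_te, np.ones_like(x_te)])
for name, coef in [("unweighted LS", ols), ("importance-weighted LS", wls)]:
    err = np.mean((Phi_te @ coef - f(x_te)) ** 2)
    print(f"{name}: test error {err:.4f}")
```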


Journal ArticleDOI
TL;DR: A high-dimensional simulation study of a "boosting type" classification procedure based on empirical risk minimization is presented, and it is shown that one should aim to find the best subset among those of size of order o(n/log(n)).
Abstract: Let (Y, X_1, ..., X_m) be a random vector. It is desired to predict Y based on (X_1, ..., X_m). Examples of prediction methods are regression, classification using logistic regression or separating hyperplanes, and so on. We consider the problem of best subset selection, and study it in the context m = n^α, α > 1, where n is the number of observations. We investigate procedures that are based on empirical risk minimization. It is shown that, in common cases, we should aim to find the best subset among those of size of order o(n/log(n)). It is also shown that, in some "asymptotic sense," when assuming a certain sparsity condition, there is no loss in letting m be much larger than n, for example, m = n^α, α > 1. This is in comparison to starting with the "best" subset of size smaller than n, regardless of the value of α. We then study conditions under which empirical risk minimization subject to an l_1 constraint yields nearly the best subset. These results extend some recent results obtained by Greenshtein and Ritov. Finally we present a high-dimensional simulation study of a "boosting type" classification procedure.

127 citations
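As a hedged illustration of empirical risk minimization under an l_1 constraint in the m >> n regime, the penalized (Lagrangian) Lasso can stand in for the constrained problem studied in the paper; the sparse linear model below is synthetic.

```python
# l1-regularized ERM recovering a small subset when m >> n.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, m, k = 100, 2000, 5                       # m = n^alpha with alpha > 1; k true variables
X = rng.normal(size=(n, m))
beta = np.zeros(m)
beta[:k] = 3.0
y = X @ beta + rng.normal(0, 1.0, n)

lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"selected {selected.size} variables; true support recovered:",
      set(range(k)) <= set(selected.tolist()))
```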


Journal ArticleDOI
TL;DR: It is shown that a new family of decision trees, dyadic decision trees (DDTs), attain nearly optimal rates of convergence for a broad range of classification problems.
Abstract: Decision trees are among the most popular types of classifiers, with interpretability and ease of implementation being among their chief attributes. Despite the widespread use of decision trees, theoretical analysis of their performance has only begun to emerge in recent years. In this paper, it is shown that a new family of decision trees, dyadic decision trees (DDTs), attain nearly optimal (in a minimax sense) rates of convergence for a broad range of classification problems. Furthermore, DDTs are surprisingly adaptive in three important respects: they automatically 1) adapt to favorable conditions near the Bayes decision boundary; 2) focus on data distributed on lower dimensional manifolds; and 3) reject irrelevant features. DDTs are constructed by penalized empirical risk minimization using a new data-dependent penalty and may be computed exactly with computational complexity that is nearly linear in the training sample size. DDTs comprise the first classifiers known to achieve nearly optimal rates for the diverse class of distributions studied here while also being practical and implementable. This is also the first study (of which we are aware) to consider rates for adaptation to intrinsic data dimension and relevant features.

119 citations
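Dyadic decision trees with the paper's data-dependent penalty are not available in standard libraries; as a loose analogue only, cost-complexity pruning in scikit-learn also selects a subtree by penalized empirical risk, which the sketch below uses on synthetic data.

```python
# Penalized empirical risk over subtrees via cost-complexity pruning (analogue only).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:                    # candidate penalty weights
    score = cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                            X_tr, y_tr, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)
print(f"chosen alpha={best_alpha:.4f}, test accuracy={tree.score(X_te, y_te):.3f}")
```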


Journal ArticleDOI
TL;DR: This paper provides guidelines within the SVM framework (use of different costs for different classes and use of the distance to the decision boundary, respectively) so that one can readily use the paper as a quick reference for SVM response modeling.
Abstract: Support Vector Machine (SVM) employs Structural Risk Minimization (SRM) principle to generalize better than conventional machine learning methods employing the traditional Empirical Risk Minimization (ERM) principle. When applying SVM to response modeling in direct marketing, however, one has to deal with the practical difficulties: large training data, class imbalance and scoring from binary SVM output. For the first difficulty, we propose a way to alleviate or solve it through a novel informative sampling. For the latter two difficulties, we provide guidelines within SVM framework so that one can readily use the paper as a quick reference for SVM response modeling: use of different costs for different classes and use of distance to decision boundary, respectively. This paper also provides various evaluation measures for response models in terms of accuracies, lift chart analysis, and computational efficiency.

95 citations
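A short sketch of the two guidelines quoted above, using scikit-learn rather than any code from the paper: unequal misclassification costs for the imbalanced classes via class weights, and response scores taken from the signed distance to the decision boundary. The data and cost ratio are arbitrary.

```python
# Class-weighted SVM plus distance-to-boundary scoring for an imbalanced response task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

svm = SVC(kernel="rbf", class_weight={0: 1.0, 1: 10.0})   # heavier cost on rare responders
svm.fit(X_tr, y_tr)

scores = svm.decision_function(X_te)           # signed distance to the boundary
top = np.argsort(scores)[::-1][: len(scores) // 10]        # top decile for targeting
print(f"responders captured in top decile: {y_te[top].sum()} of {y_te.sum()}")
```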


Proceedings Article
04 Dec 2006
TL;DR: It is shown that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, a change of Ω(n^{1/2}) samples defines the transition between stability and instability.
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class ℋ_K and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of ℋ_K with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n^{1/2}) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in ℋ_K implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.
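A practical stability diagnostic in the spirit of this analysis (not the paper's covering-number argument): run K-means on two disjoint halves of the data and measure how far apart the matched centers end up, for several values of K. The blob data and the matching by assignment below are illustrative choices.

```python
# K-means stability across two independent halves of the data.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=3, cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
A, B = X[idx[:1000]], X[idx[1000:]]

for k in (2, 3, 4, 5):
    ca = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A).cluster_centers_
    cb = KMeans(n_clusters=k, n_init=10, random_state=1).fit(B).cluster_centers_
    cost = cdist(ca, cb)                        # match centers before comparing
    r, c = linear_sum_assignment(cost)
    print(f"K={k}: mean center displacement {cost[r, c].mean():.3f}")
```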

Proceedings Article
04 Dec 2006
TL;DR: This paper proposes a graph learning method for the harmonic energy minimization method, by minimizing the leave-one-out prediction error on labeled data points, and designs an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation.
Abstract: Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and designed an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.
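For reference, the harmonic energy minimization step itself (labels clamped on the labeled nodes, harmonic elsewhere) can be written in a few lines. The RBF graph width below is fixed by hand, which is exactly the kind of hyperparameter the paper proposes to learn by a leave-one-out gradient procedure; that procedure is not reproduced here.

```python
# Harmonic solution on a graph: f_u = L_uu^{-1} W_ul y_l.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
labeled = np.arange(10)                        # only 10 labeled points
unlabeled = np.arange(10, len(X))

W = rbf_kernel(X, gamma=10.0)                  # graph weights (hand-picked width)
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W                                      # graph Laplacian

L_uu = L[np.ix_(unlabeled, unlabeled)]
W_ul = W[np.ix_(unlabeled, labeled)]
f_u = np.linalg.solve(L_uu, W_ul @ y[labeled].astype(float))

acc = ((f_u > 0.5).astype(int) == y[unlabeled]).mean()
print(f"accuracy on unlabeled points: {acc:.3f}")
```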

Posted Content
TL;DR: This investigation presents a hybrid SVM model to exploit the unique strength of the linear and nonlinear SVM models in forecasting exchange rate and shows that the proposed model outperforms the other approaches in the literature.
Abstract: Support vector machines (SVMs) have been successfully used to solve nonlinear regression and times series problems. Unlike most conventional neural network models, which are based on the empirical risk minimization principle, SVMs apply the structural risk minimization principle to minimize an upper bound of the generalization error, rather than minimizing the training error. However, one particular model can not capture all data patterns easily. This investigation presents a hybrid SVM model to exploit the unique strength of the linear and nonlinear SVM models in forecasting exchange rate. Furthermore, parameters of both the linear and nonlinear SVM models are determined by Genetic Algorithms (GAs). A numerical example from an existing literature is employed to compare the performance of the proposed model. Experiment results show that the proposed model outperforms the other approaches in the literature.
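Below is a very small hand-rolled genetic search over (C, gamma) of an RBF SVR, in the spirit of the GA tuning described above; the paper's hybrid combination of linear and nonlinear SVMs and its exchange rate data are not reproduced, and the population size and mutation scale are arbitrary.

```python
# Tiny genetic search for SVR hyperparameters on a synthetic regression task.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

def fitness(log_c, log_g):
    m = SVR(C=10.0 ** log_c, gamma=10.0 ** log_g).fit(X_tr, y_tr)
    return -np.mean((m.predict(X_va) - y_va) ** 2)       # higher is better

pop = rng.uniform([-1, -3], [3, 1], size=(12, 2))         # genomes: (log10 C, log10 gamma)
for gen in range(10):
    scores = np.array([fitness(*g) for g in pop])
    parents = pop[np.argsort(scores)[-4:]]                # keep the 4 fittest
    children = parents[rng.integers(0, 4, 8)] + rng.normal(0, 0.2, (8, 2))  # mutate
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(*g) for g in pop])]
print(f"best C={10 ** best[0]:.2f}, gamma={10 ** best[1]:.4f}")
```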

Proceedings ArticleDOI
18 Dec 2006
TL;DR: One of the proposed algorithms was consistently the top performer, with Closest Sampling from the literature often coming in second; when good posterior probability estimates were available, the authors' heuristics were by far the best.
Abstract: In active learning, a machine learning algorithm is given an unlabeled set of examples U, and is allowed to request labels for a relatively small subset of U to use for training. The goal is then to judiciously choose which examples in U to have labeled in order to optimize some performance criterion, e.g. classification accuracy. We study how active learning affects AUC. We examine two existing algorithms from the literature and present our own active learning algorithms designed to maximize the AUC of the hypothesis. One of our algorithms was consistently the top performer, and Closest Sampling from the literature often came in second behind it. When good posterior probability estimates were available, our heuristics were by far the best.

Journal ArticleDOI
TL;DR: In this magnificent paper, Professor Koltchinskii offers general and powerful performance bounds for empirical risk minimization, a fundamental principle of statistical learning theory, and develops a powerful new methodology, iterative localization, which is able to explain most of the recent results and go significantly beyond them in many cases.
Abstract: In this magnificent paper, Professor Koltchinskii offers general and powerful performance bounds for empirical risk minimization, a fundamental principle of statistical learning theory. Since the elegant pioneering work of Vapnik and Chervonenkis in the early 1970s, various such bounds have been known that relate the performance of empirical risk minimizers to combinatorial and geometrical features of the class over which the minimization is performed. This area of research has been a rich source of motivation and a major field of applications of empirical process theory. The appearance of advanced concentration inequalities in the 1990s, primarily thanks to Talagrand’s influential work, provoked major advances in both empirical process theory and statistical learning theory and led to a much deeper understanding of some of the basic phenomena. In the discussed paper Professor Koltchinskii develops a powerful new methodology, iterative localization, which, with the help of concentration inequalities, is able to explain most of the recent results and go significantly beyond them in many cases. The main motivation behind Professor Koltchinskii’s paper is based on classical problems of statistical learning theory such as binary classification and regression in which, given a sample (X_i, Y_i), i = 1, ..., n, of independent and identically distributed pairs of random variables (where the X_i take their values in some feature space X and the Y_i are, say, real-valued), the goal is to find a function f : X → R whose risk, defined in terms of the expected value of an appropriately chosen loss function, is as small as possible. In the remaining part of this discussion we point out how the performance bounds of Professor Koltchinskii’s paper can be used to study a seemingly different model, motivated by nonparametric ranking problems, which has received increasing attention both in the statistical and machine learning literature. Indeed, in several applications, such as the search engine problem or credit risk screening, the goal is to learn how to rank—or to score—observations rather than just classify them. In this case, performance measures involve pairs of observations, as can be seen, for instance, with the AUC (Area Under an ROC Curve) criterion. In this

Proceedings Article
04 Dec 2006
TL;DR: This work gives a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function.
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.

Journal Article
TL;DR: The paper first introduces the mathematical model of regression least squares support vector machine (LSSVM) and analyzes its properties, then designs incremental and online learning algorithms based on LSSVM using block-matrix calculation formulas and properties of the kernel matrix.
Abstract: Support vector machine is a learning technique based on the structural risk minimization principle, and it is also a regression method with good generalization ability. The paper first introduces the mathematical model of regression least squares support vector machine (LSSVM) and analyzes its properties, then designs incremental and online learning algorithms based on LSSVM using the calculation formula for block matrices and properties of the kernel function matrix. The proposed learning algorithms fully utilize the historical training results, reducing storage space and computation time. Simulation results indicate the feasibility of the two learning algorithms.
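For context, the batch LSSVM regression solution is a single linear system; the paper's contribution is updating that solution incrementally with block-matrix formulas, which the sketch below does not attempt. The kernel and regularization parameters are arbitrary.

```python
# Batch LSSVM regression: solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

gamma_reg, kernel_gamma = 10.0, 0.5
K = rbf_kernel(X, gamma=kernel_gamma)

n = len(y)
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K + np.eye(n) / gamma_reg
sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
b, alpha = sol[0], sol[1:]

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
pred = rbf_kernel(X_test, X, gamma=kernel_gamma) @ alpha + b
print("prediction:", np.round(pred, 3))
print("target:    ", np.round(np.sin(X_test[:, 0]), 3))
```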

Journal Article
TL;DR: As the number n of samples grows, the L2-diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{-1/2}) converges to zero in probability, so that even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample to the training set will result in a large jump to a new hypothesis.
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2-diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{-1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlós-Major-Tusnády type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{-1/2}.

Book ChapterDOI
12 Sep 2006
TL;DR: This paper proposes a new method called importance-weighted cross-validation, which is still unbiased even under the covariate shift, and is successfully tested on toy data and further demonstrated in the brain-computer interface, where strong non-stationarity effects can be seen between calibration and feedback sessions.
Abstract: A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points used for testing. However, this assumption is not satisfied, for example, when we extrapolate outside the training region. The situation where the training input points and test input points follow different distributions is called the covariate shift. Under the covariate shift, standard machine learning techniques such as empirical risk minimization or cross-validation do not work well since their unbiasedness is no longer maintained. In this paper, we propose a new method called importance-weighted cross-validation, which is still unbiased even under the covariate shift. The usefulness of our proposed method is successfully tested on toy data and furthermore demonstrated in the brain-computer interface, where strong non-stationarity effects can be seen between calibration and feedback sessions.
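A minimal sketch of importance-weighted cross-validation with known train and test input densities (in practice the density ratio must itself be estimated): each held-out loss is weighted by p_test(x)/p_train(x) before averaging, here for choosing a polynomial degree. The densities and target function are illustrative assumptions.

```python
# Plain CV vs. importance-weighted CV under covariate shift.
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
f = lambda x: np.sinc(x)
p_train, p_test = norm(1.0, 0.5), norm(2.0, 0.5)       # shifted input distributions

x = p_train.rvs(200, random_state=0)
y = f(x) + rng.normal(0, 0.1, x.size)
w = p_test.pdf(x) / p_train.pdf(x)                     # importance weights

def cv_error(degree, weighted):
    errs = []
    for tr, va in KFold(5, shuffle=True, random_state=0).split(x):
        coef = np.polyfit(x[tr], y[tr], degree)
        e = (np.polyval(coef, x[va]) - y[va]) ** 2
        errs.append(np.average(e, weights=w[va]) if weighted else e.mean())
    return np.mean(errs)

for d in (1, 2, 3):
    print(f"degree {d}: plain CV {cv_error(d, False):.4f}, "
          f"importance-weighted CV {cv_error(d, True):.4f}")
```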

Journal Article
TL;DR: This work considers the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification, establishing a quantitative relationship between classification errors and convex risks.
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function.

Journal ArticleDOI
TL;DR: A comparison of various local and global learning algorithms in neural network modeling was performed for ore grade estimation in three deposits, revealing that no benefit was achieved with global algorithms for these case studies; nevertheless, it is better to apply global learning algorithms in neural network training, since many real-life applications of neural network modeling show local minima problems in the error surface.
Abstract: In this paper, comparative evaluation of various local and global learning algorithms in neural network modeling was performed for ore grade estimation in three deposits: gold, bauxite, and iron ore. Four local learning algorithms, standard back-propagation, back-propagation with momentum, quickprop back-propagation, and Levenberg–Marquardt back-propagation, along with two global learning algorithms, NOVEL and simulated annealing, were investigated for this purpose. The study results revealed that no benefit was achieved using global learning algorithms over local learning algorithms. The reason for the equivalent performance of global and local learning algorithms was the smooth error surface of neural network training for these specific case studies. However, a separate exercise involving local and global learning algorithms on a nonlinear multimodal optimization of a Rastrigin function, containing many local minima, clearly demonstrated the superior performance of global learning algorithms over local learning algorithms. Although no benefit was found from using global learning algorithms for neural network training in these specific case studies, as a safeguard against getting trapped in local minima, it is better to apply global learning algorithms in neural network training since many real-life applications of neural network modeling show local minima problems in the error surface.
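The Rastrigin comparison mentioned above is easy to reproduce in spirit (this is not the paper's NOVEL algorithm or its ore grade data): a local gradient-based optimizer started from a random point versus a global annealing-type optimizer from SciPy.

```python
# Local optimizer vs. global (annealing-type) optimizer on the multimodal Rastrigin function.
import numpy as np
from scipy.optimize import minimize, dual_annealing

def rastrigin(x):
    x = np.asarray(x)
    return 10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

bounds = [(-5.12, 5.12)] * 2
rng = np.random.default_rng(0)

local = minimize(rastrigin, rng.uniform(-5, 5, 2), method="BFGS")
annealed = dual_annealing(rastrigin, bounds, seed=0)
print(f"local BFGS minimum: {local.fun:.3f} at {np.round(local.x, 3)}")
print(f"dual annealing:     {annealed.fun:.3f} at {np.round(annealed.x, 3)}")
```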

Journal ArticleDOI
TL;DR: Several modifications to the Fuzzy ARTMAP neural network architecture are proposed for conducting classification in complex, possibly noisy, environments; the structural risk minimization theory, which reveals a trade-off between training error and classifier complexity in reducing generalization error, is exploited in the proposed learning algorithms.

01 Jan 2006
TL;DR: It is shown how marginal structural models for causal effects can be extended through the alternative techniques of local, penalized, and additive learning, and that nonparametric function estimation methods can be fruitfully applied for making causal inferences.
Abstract: Marginal structural models (MSMs) allow one to form causal inferences from data, by specifying a relationship between a treatment and the marginal distribution of a corresponding counterfactual outcome. Following their introduction in Robins (1997), MSMs have typically been fit after assuming a semiparametric model, and then estimating a finite dimensional parameter. van der Laan and Dudoit (2003) proposed to instead view MSM fitting not as a task of semiparametric parameter estimation, but of nonparametric function approximation. They introduced a class of causal effect estimators based on mapping loss functions suitable for the unavailable counterfactual data to those suitable for the data actually observed, and then applying what has been known in nonparametric statistics as empirical risk minimization, or global learning. However, it has long been recognized in the statistical learning community that global learning is only one of several paradigms for estimator construction. Building upon van der Laan and Dudoit’s work, we show how marginal structural models for causal effects can be extended through the alternative techniques of local, penalized, and additive learning. We discuss how these new methods can often be implemented by simply adding observation weights to existing algorithms, demonstrate the gains made possible by these extended MSMs through simulation results, and conclude that nonparametric function estimation methods can be fruitfully applied for making causal inferences.
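A toy illustration of the "just add observation weights" remark above: an inverse-probability-weighted least-squares fit of a marginal structural model E[Y(a)] = b0 + b1*a on simulated confounded data. The data-generating process and the use of logistic regression for the treatment model are assumptions of this sketch, not the paper's estimators.

```python
# Naive vs. inverse-probability-weighted estimate of a causal effect.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
L = rng.normal(size=n)                                # confounder
A = rng.binomial(1, 1 / (1 + np.exp(-1.5 * L)))       # treatment depends on L
Y = 2.0 * A + 3.0 * L + rng.normal(size=n)            # true causal effect of A is 2.0

ps = LogisticRegression().fit(L.reshape(-1, 1), A).predict_proba(L.reshape(-1, 1))[:, 1]
w = np.where(A == 1, 1 / ps, 1 / (1 - ps))            # inverse-probability weights

naive = LinearRegression().fit(A.reshape(-1, 1), Y).coef_[0]
ipw = LinearRegression().fit(A.reshape(-1, 1), Y, sample_weight=w).coef_[0]
print(f"naive effect estimate {naive:.2f}, IPW (MSM) estimate {ipw:.2f}, truth 2.0")
```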

Book ChapterDOI
22 Jun 2006
TL;DR: In this paper, the authors show that there is a learning problem that can be solved by a discriminative learning algorithm, but not by any generative learning algorithms (given minimal cryptographic assumptions).
Abstract: Generative algorithms for learning classifiers use training data to separately estimate a probability model for each class. New items are then classified by comparing their probabilities under these models. In contrast, discriminative learning algorithms try to find classifiers that perform well on all the training data. We show that there is a learning problem that can be solved by a discriminative learning algorithm, but not by any generative learning algorithm (given minimal cryptographic assumptions). This statement is formalized using a framework inspired by previous work of Goldberg [3].

Book ChapterDOI
TL;DR: A brief survey of regularization schemes in learning theory for the purposes of regression and classification, from an approximation theory point of view.
Abstract: We give a brief survey of regularization schemes in learning theory for the purposes of regression and classification, from an approximation theory point of view. First, the classical method of empirical risk minimization is reviewed for regression with a general convex loss function. Next, we explain ideas and methods for the error analysis of regression algorithms generated by Tikhonov regularization schemes associated with reproducing kernel Hilbert spaces. Then binary classification algorithms given by regularization schemes are described with emphasis on support vector machines and noise conditions for distributions. Finally, we mention further topics and some open problems in learning theory.
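As a concrete instance of the Tikhonov regularization scheme in a reproducing kernel Hilbert space that the survey analyzes, kernel ridge regression with a Gaussian kernel serves as a minimal example; the parameters below are hand-picked for illustration.

```python
# Tikhonov regularization in an RKHS: kernel ridge regression.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (150, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 150)

# minimizes (1/n) * sum_i (f(x_i) - y_i)^2 + alpha * ||f||_K^2 over the RKHS
model = KernelRidge(kernel="rbf", gamma=0.5, alpha=0.1).fit(X, y)
X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
print("prediction:", np.round(model.predict(X_test), 3))
print("target:    ", np.round(np.sin(X_test[:, 0]), 3))
```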

Dissertation
01 Jan 2006
TL;DR: This thesis derives tight performance guarantees for greedy error minimization methods, a family of computationally tractable algorithms, and proves a generalization of the bounded-difference concentration inequality for almost-everywhere smooth functions.
Abstract: This thesis studies two key properties of learning algorithms: their generalization ability and their stability with respect to perturbations. To analyze these properties, we focus on concentration inequalities and tools from empirical process theory. We obtain theoretical results and demonstrate their applications to machine learning. First, we show how various notions of stability upper- and lower-bound the bias and variance of several estimators of the expected performance for general learning algorithms. A weak stability condition is shown to be equivalent to consistency of empirical risk minimization. The second part of the thesis derives tight performance guarantees for greedy error minimization methods, a family of computationally tractable algorithms. In particular, we derive risk bounds for a greedy mixture density estimation procedure. We prove that, unlike what is suggested in the literature, the number of terms in the mixture is not a bias-variance trade-off for the performance. The third part of this thesis provides a solution to an open problem regarding the stability of Empirical Risk Minimization (ERM). This algorithm is of central importance in Learning Theory. By studying the suprema of the empirical process, we prove that ERM over Donsker classes of functions is stable in the L1 norm. Hence, as the number of samples grows, it becomes less and less likely that a perturbation of o(√n) samples will result in a very different empirical minimizer. Asymptotic rates of this stability are proved under metric entropy assumptions on the function class. Through the use of a ratio limit inequality, we also prove stability of expected errors of empirical minimizers. Next, we investigate applications of the stability result. In particular, we focus on procedures that optimize an objective function, such as k-means and other clustering methods. We demonstrate that stability of clustering, just like stability of ERM, is closely related to the geometry of the class and the underlying measure. Furthermore, our result on stability of ERM delineates a phase transition between stability and instability of clustering methods. In the last chapter, we prove a generalization of the bounded-difference concentration inequality for almost-everywhere smooth functions. This result can be utilized to analyze algorithms which are almost always stable. Next, we prove a phase transition in the concentration of almost-everywhere smooth functions. Finally, a tight concentration of empirical errors of empirical minimizers is shown under an assumption on the underlying space.

Posted Content
TL;DR: In this article, support vector machines (SVMs), which apply the structural risk minimization principle to minimize an upper bound of the generalization error rather than the training error, are used to forecast tourist arrivals.
Abstract: Accurate tourist demand forecasting systems are essential in tourism planning, particularly in tourism-based countries. Artificial neural networks are attracting attention for forecasting tourist arrivals due to their general nonlinear mapping capabilities. Unlike most conventional neural network models, which are based on the empirical risk minimization principle, support vector machines (SVMs) apply the structural risk minimization principle to minimize an upper bound of the generalization error, rather than minimizing the training error. This investigation presents an SVM model with genetic algorithms to forecast tourist arrivals. Genetic algorithms (GAs) are used to determine the free parameters in the SVM model. Empirical results that involve tourist arrival data for Barbados reveal that the proposed model outperforms other approaches in the literature.

Proceedings Article
01 Jan 2006
TL;DR: This paper presents an approach that uses independence tests based on chi square distribution, in order to find relationships between predictive variables, and shows that this algorithm is as good as some state-of-the-art Bayesian classifiers, like TAN and an implementation of the BAN model.
Abstract: In this paper we propose using dependency networks (Heckerman et al., 2000), a probabilistic graphical model similar to Bayesian networks, to model classifiers. The main difference between these two models is that in dependency networks cycles are allowed, which makes the automatic learning process much easier and parallelizable. These properties make dependency networks a valuable model, especially when dealing with large databases. Because of these promising characteristics we analyse the usefulness of dependency network-based Bayesian classifiers. We present an approach that uses independence tests based on the chi-square distribution in order to find relationships between predictive variables. We show that this algorithm is as good as some state-of-the-art Bayesian classifiers, like TAN and an implementation of the BAN model, and has, in addition, other interesting properties like scalability and good quality for visualizing relationships.
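A sketch of the chi-square independence test step mentioned above, on synthetic discrete variables; building the actual dependency-network classifier is beyond this snippet, and the variables and significance level are illustrative choices.

```python
# Pairwise chi-square independence tests between discrete predictive variables.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
a = rng.integers(0, 3, 1000)                    # three-valued discrete variables
b = (a + rng.integers(0, 2, 1000)) % 3          # b depends on a
c = rng.integers(0, 3, 1000)                    # c independent of a

for name, other in [("a vs b", b), ("a vs c", c)]:
    table = np.zeros((3, 3), dtype=int)
    np.add.at(table, (a, other), 1)             # contingency table
    chi2, p, dof, _ = chi2_contingency(table)
    verdict = "dependent" if p < 0.05 else "independent"
    print(f"{name}: chi2={chi2:.1f}, p={p:.3g} -> {verdict}")
```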

Posted Content
TL;DR: In this paper, the authors formulate the local ranking problem in the framework of bipartite ranking where the goal is to focus on the best instances and propose a methodology based on the construction of real-valued scoring functions.
Abstract: We formulate the local ranking problem in the framework of bipartite ranking where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances, which can be cast as a classification problem with a mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC/AROC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Finally, we state preliminary statistical results for the local ranking problem.
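One simple variant of a localized ranking criterion, offered only as an illustration and not necessarily the exact statistic studied in the paper: the AUC computed among the instances whose score exceeds the empirical (1 - u)-quantile of the scores.

```python
# Global AUC vs. AUC restricted to the top-scored fraction u of instances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).decision_function(X_te)

print(f"global AUC: {roc_auc_score(y_te, scores):.3f}")
for u in (0.2, 0.1, 0.05):
    top = scores >= np.quantile(scores, 1 - u)          # focus on the best instances
    if len(np.unique(y_te[top])) == 2:                   # AUC needs both classes present
        print(f"local AUC on top {u:.0%}: {roc_auc_score(y_te[top], scores[top]):.3f}")
```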