
Showing papers by "Yoshua Bengio published in 2002"


Journal ArticleDOI
TL;DR: This work shows how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine learning problems, while keeping control of the sparsity of the solution.
Abstract: Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the least-squares sense. We show how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine learning problems, while keeping control of the sparsity of the solution. We present a version of the algorithm that makes an optimal choice of both the next basis and the weights of all the previously chosen bases. Finally, links to boosting algorithms and RBF training procedures, as well as an extensive experimental comparison with SVMs for classification are given, showing comparable results with typically much sparser models.
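
As a concrete illustration of the greedy loop described above, here is a minimal sketch of squared-error kernel matching pursuit in Python. The Gaussian kernel, its width, and the simple weight update are assumptions of this sketch; the paper's non-squared losses and the variant that re-optimizes all previously chosen weights are not shown.

```python
import numpy as np

def gaussian_kernel(X, x, width=1.0):
    # k(x_i, x) for every row x_i of X
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * width ** 2))

def kernel_matching_pursuit(X, y, n_bases=10, width=1.0):
    """Greedily append kernel basis functions centered on training points."""
    n = len(y)
    # Column j is the basis function centered at training point X[j].
    K = np.column_stack([gaussian_kernel(X, X[j], width) for j in range(n)])
    residual = y.astype(float).copy()
    basis, weights = [], []
    for _ in range(n_bases):
        # Pick the center whose basis function best reduces the squared residual.
        scores = (K.T @ residual) ** 2 / np.sum(K ** 2, axis=0)
        j = int(np.argmax(scores))
        w = (K[:, j] @ residual) / (K[:, j] @ K[:, j])  # least-squares weight
        basis.append(j)
        weights.append(w)
        residual = residual - w * K[:, j]
    return basis, weights
```

The learned function is the weighted sum of the selected kernel columns, and sparsity is controlled directly by n_bases.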

343 citations


Proceedings Article
01 Jan 2002
TL;DR: A new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
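
A minimal sketch of the estimator's main idea: one Gaussian per training point, whose covariance keeps the leading eigendirections of a regularized local covariance. The neighborhood size, the number of retained directions, and the regularization floor below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.stats import multivariate_normal

def manifold_parzen_logdensity(X, x, k=10, d=2, sigma2=0.1):
    """Log-density at x: a mixture of Gaussians, one per training point, whose
    covariance keeps the d leading eigendirections of a local covariance."""
    n, D = X.shape
    logps = []
    for i in range(n):
        # Local covariance from the k nearest neighbors of X[i].
        dists = np.sum((X - X[i]) ** 2, axis=1)
        nbrs = X[np.argsort(dists)[1:k + 1]] - X[i]
        C = nbrs.T @ nbrs / k
        vals, vecs = np.linalg.eigh(C)
        # Keep the d leading eigenvectors; flatten the rest to a noise floor.
        U, lam = vecs[:, -d:], vals[-d:]
        cov = U @ np.diag(lam) @ U.T + sigma2 * np.eye(D)
        logps.append(multivariate_normal(mean=X[i], cov=cov).logpdf(x))
    return np.logaddexp.reduce(logps) - np.log(n)
```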

143 citations


Journal ArticleDOI
TL;DR: This work presents a new penalization method for performing model selection for regression that is appropriate even for small samples, based on an accurate estimator of the ratio of the expected training error and the expected generalization error, in terms of the expected eigenvalues of the input covariance matrix.
Abstract: Model selection is an important ingredient of many machine learning algorithms, in particular when the sample size is small, in order to strike the right trade-off between overfitting and underfitting. Previous classical results for linear regression are based on an asymptotic analysis. We present a new penalization method for performing model selection for regression that is appropriate even for small samples. Our penalization is based on an accurate estimator of the ratio of the expected training error and the expected generalization error, in terms of the expected eigenvalues of the input covariance matrix.
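
For intuition, a classical penalization of this kind for linear regression is Akaike's final prediction error, which inflates the training error by an estimate of the ratio between the expected generalization and training errors. The simple (n + d)/(n - d) ratio below is an illustrative stand-in; the paper's estimator refines this using the expected eigenvalues of the input covariance matrix.

```python
import numpy as np

def penalized_score(X, y):
    """Training MSE of least squares, inflated by a crude estimate of
    E[generalization error] / E[training error] (Akaike's FPE)."""
    n, d = X.shape
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_mse = np.mean((y - X @ w) ** 2)
    return train_mse * (n + d) / (n - d)

# Model selection: among nested input subsets, keep the best penalized score.
# best_d = min(range(1, X.shape[1] + 1), key=lambda d: penalized_score(X[:, :d], y))
```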

123 citations


Posted Content
TL;DR: In this article, a class of functions similar to multi-layer neural networks is proposed for modeling the price of call options, where the function to be learned is non-decreasing in its two arguments and convex in one of them.
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks that (1) has those properties and (2) is a universal approximator of continuous functions with these and other properties. We apply this new class of functions to the task of modeling the price of call options. Experiments show improvements on regressing the price of call options when using the new function classes that incorporate the a priori constraints.
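
One way to hard-wire such shape constraints is sketched below: positive combination weights, a convex non-decreasing activation (softplus) on the argument requiring convexity, and a bounded non-decreasing activation (sigmoid) on the other. This is a plausible construction in the spirit of the paper, not necessarily its exact function class.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # convex and non-decreasing

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # bounded and non-decreasing

def constrained_net(x1, x2, params):
    """params = (a, u, b, v, c): five arrays of shape (n_hidden,).
    f(x1, x2) = sum_i exp(a_i) * softplus(exp(u_i) x1 + b_i) * sigmoid(exp(v_i) x2 + c_i).
    All factors are positive and non-decreasing, so f is non-decreasing in both
    x1 and x2; each term is a positive constant (in x1) times a convex function
    of x1, so f is convex in x1."""
    a, u, b, v, c = params
    h = softplus(np.exp(u) * x1[:, None] + b) * sigmoid(np.exp(v) * x2[:, None] + c)
    return h @ np.exp(a)
```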

62 citations


Journal ArticleDOI
TL;DR: A new approach to robust regression tailored to deal with asymmetric noise distribution is proposed, to learn most of the parameters of the model using conditional quantile estimators and to learn a few remaining parameters to combine and correct these estimators, to minimize the average squared error in an unbiased way.
Abstract: In the presence of a heavy-tail noise distribution, regression becomes much more difficult. Traditional robust regression methods assume that the noise distribution is symmetric, and they downweight the influence of so-called outliers. When the noise distribution is asymmetric, these methods yield biased regression estimators. Motivated by data-mining problems for the insurance industry, we propose a new approach to robust regression tailored to deal with asymmetric noise distribution. The main idea is to learn most of the parameters of the model using conditional quantile estimators (which are biased but robust estimators of the regression) and to learn a few remaining parameters to combine and correct these estimators, to minimize the average squared error in an unbiased way. Theoretical analysis and experiments show the clear advantages of the approach. Results are on artificial data as well as insurance data, using both linear and neural network predictors.
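
A minimal sketch of the two-stage idea under stated assumptions: fit several linear conditional quantiles by subgradient descent on the pinball loss (robust but biased), then fit a small least-squares combination of their predictions (the correction). The linear parameterization and the particular quantile levels are illustrative choices.

```python
import numpy as np

def fit_quantile(X, y, tau, lr=0.01, epochs=500):
    """Linear conditional quantile via subgradient descent on the pinball loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        r = y - X @ w
        w -= lr * (-X.T @ np.where(r > 0, tau, tau - 1.0) / len(y))
    return w

def robust_fit(X, y, taus=(0.25, 0.5, 0.75)):
    # Stage 1: robust but biased pieces -- conditional quantile estimators.
    Ws = [fit_quantile(X, y, tau) for tau in taus]
    Q = np.column_stack([X @ w for w in Ws] + [np.ones(len(y))])
    # Stage 2: a few parameters combine and correct them under squared error.
    alpha, *_ = np.linalg.lstsq(Q, y, rcond=None)
    return Ws, alpha
```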

20 citations


Book ChapterDOI
TL;DR: This work proposes a "hard parallelizable mixture" methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a "gater" model in such a way that it becomes easy to learn an "expert" model separately in each region of the partition.
Abstract: A challenge for statistical learning is to deal with large data sets, e.g. in data mining. Popular learning algorithms such as Support Vector Machines have training time at least quadratic in the number of examples: they are hopeless for solving problems with a million examples. We propose a "hard parallelizable mixture" methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a "gater" model in such a way that it becomes easy to learn an "expert" model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably decreases a cost function that is an upper bound on the negative log-likelihood.
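
A toy sketch of the divide-and-train loop using scikit-learn, with KMeans standing in for the gater; the reassignment rule and the fixed number of passes are simplifications of the paper's gater/expert scheme.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def hard_mixture_of_svms(X, y, n_experts=4, n_iters=3):
    """y in {-1, +1}; assumes each region keeps examples of both classes."""
    assign = KMeans(n_clusters=n_experts, n_init=10).fit_predict(X)  # initial gater
    for _ in range(n_iters):
        # Train one expert per region of the partition (embarrassingly parallel).
        experts = [SVC().fit(X[assign == e], y[assign == e]) for e in range(n_experts)]
        # Re-partition: each example goes to the expert with the largest correct margin.
        margins = np.column_stack([ex.decision_function(X) for ex in experts]) * y[:, None]
        assign = np.argmax(margins, axis=1)
    return experts, assign
```

Each expert only ever trains on its own region, which is the source of the reported speed-up over a single SVM on the full set.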

14 citations


01 Jan 2002
TL;DR: The authors describe the complexity of automatically processing conversations and present topic segmentation as a tool to support information extraction.
Abstract: In this article, we describe the complexity of automatically processing conversations. In particular, we study the problem of extracting information from conversations, and we present topic segmentation as a tool to support such extraction.

7 citations


Journal ArticleDOI
TL;DR: The papers in this special issue present recent developments in model-complexity control for supervised learning, which represent three areas of significant current research on this subject: model selection, sparse models, and model combination.
Abstract: A classical challenge in fitting models to data is managing the complexity of the models to simultaneously avoid under-fitting and over-fitting the data; fulfilling the goal of producing models that generalize well to unseen data. The papers in this special issue present recent developments in model-complexity control for supervised learning. These thirteen papers represent three areas of significant current research on this subject: model selection (explicitly choosing model complexity), sparse models (reducing complexity by enforcing sparse representations), and model combination (combining multiple models to improve generalization).

5 citations


Posted Content
TL;DR: An out-of-sample statistic for time-series prediction that is analogous to the widely used in-sample R2 statistic is proposed, together with methods to estimate its variance, and is studied on financial time series.
Abstract: This paper studies an out-of-sample statistic for time-series prediction that is analogous to the widely used R2 in-sample statistic. We propose and study methods to estimate the variance of this out-of-sample statistic. We suggest that the out-of-sample statistic is more robust to the distributional and asymptotic assumptions behind many tests for in-sample statistics. Furthermore, we argue that it may be more important in some cases to choose a model that generalizes as well as possible rather than to choose the parameters that are closest to the true parameters. Comparative experiments are performed on a financial time series (daily and monthly returns of the TSE300 index). The experiments are performed for varying prediction horizons, and we study the relation between predictability (out-of-sample R2), the variability of the out-of-sample R2 statistic, and the prediction horizon.
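
By analogy with the in-sample R2, the out-of-sample statistic can be computed as one minus the ratio of the model's out-of-sample squared error to that of a naive benchmark; using the historical training mean as the benchmark is an assumption of this sketch.

```python
import numpy as np

def out_of_sample_r2(y_test, y_pred, y_train):
    """1 - SSE(model) / SSE(benchmark) on held-out data, where the benchmark
    predicts the historical (training-sample) mean for every test point."""
    sse_model = np.sum((y_test - y_pred) ** 2)
    sse_bench = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1.0 - sse_model / sse_bench
```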

4 citations


Posted Content
TL;DR: In this paper, the authors propose an empirical and hypothesis-free method to compare different option pricing systems by having them trade against each other or against the market, and use this criterion to train a non-parametric statistical model (here based on neural networks) to estimate a price for the option that maximizes the expected utility when trading against the market.
Abstract: Prior work on option pricing falls mostly in two categories: it either relies on strong distributional or economical assumptions, or it tries to mimic the Black-Scholes formula through statistical models, trained to fit today's market price based on information available today. The work presented here is closer to the second category but its objective is different: predict the future value of the option, and establish its current value based on a trading scenario. This work thus innovates in two ways: first, it proposes an empirical and hypothesis-free method to compare different option pricing systems (by having them trade against each other or against the market); second, it uses this criterion to train a non-parametric statistical model (here based on neural networks) to estimate a price for the option that maximizes the expected utility when trading against the market. Note that the price will depend on the utility function and current portfolio (i.e. current risks) of the trading agent. Preliminary experiments are presented on S&P 500 options.
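
To make the trading-based criterion concrete, here is a caricature of "maximize expected utility when trading against the market": buy when the model deems the option underpriced, sell when overpriced, and score the realized profits with a risk-averse utility. The exponential (CARA) utility, the unit trade size, and this particular trading rule are assumptions of the sketch, not the paper's exact setup.

```python
import numpy as np

def trading_loss(model_price, market_price, payoff, risk_aversion=1.0):
    """Training loss for an option-pricing model evaluated by trading against
    the market: one unit bought (sold) when the model price is above (below)
    the market price, settled at the option's discounted payoff."""
    side = np.where(model_price > market_price, 1.0, -1.0)  # +1 buy, -1 sell
    profit = side * (payoff - market_price)
    utility = -np.exp(-risk_aversion * profit)              # CARA utility
    return -np.mean(utility)  # minimize = maximize expected utility
```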

3 citations


Posted Content
TL;DR: In this paper, a particular multi-task learning method that forces the parameters of the models to lie on an affine manifold defined in parameter space and embedding domain information is explored.
Abstract: Multi-task learning is a process used to learn domain-specific bias. It consists in simultaneously training models on different tasks derived from the same domain and forcing them to exchange domain information. This transfer of knowledge is performed by imposing constraints on the parameters defining the models and can lead to improved generalization performance on each task, as well as rapid generalization (from few examples) on a new task from the same domain. In this paper, we explore a particular multi-task learning method that forces the parameters of the models to lie on an affine manifold defined in parameter space and embedding the domain information. We apply this method to the prediction of the prices of call options on the S&P 500 index for a period of time ranging from 1987 to 1993. An analysis of variance of the results is presented that shows significant improvements in generalization performance.
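
A sketch of the constraint for linear per-task models: every task's weight vector is forced onto a shared low-dimensional affine manifold, w_t = U z_t + b, and all parameters are trained jointly by gradient descent. Linear tasks, squared loss, and the manifold dimension are assumptions of this illustration.

```python
import numpy as np

def fit_affine_multitask(tasks, dim=2, lr=0.01, epochs=2000):
    """tasks: list of (X_t, y_t) pairs. Per-task weights w_t = U @ z_t + b,
    where U (D x dim) and b (D,) are shared and carry the domain information,
    while z_t is specific to task t."""
    D = tasks[0][0].shape[1]
    rng = np.random.default_rng(0)
    U, b = 0.1 * rng.normal(size=(D, dim)), np.zeros(D)
    Z = 0.1 * rng.normal(size=(len(tasks), dim))
    for _ in range(epochs):
        for t, (X, y) in enumerate(tasks):
            w = U @ Z[t] + b
            g = 2 * X.T @ (X @ w - y) / len(y)   # d(MSE)/dw
            gU, gz = np.outer(g, Z[t]), U.T @ g  # chain rule through w = Uz + b
            U -= lr * gU
            Z[t] -= lr * gz
            b -= lr * g
    return U, b, Z
```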

Posted Content
Abstract: Input/Output Hidden Markov Models (IOHMMs) are conditional hidden Markov models in which the emission (and possibly the transition) probabilities can be conditioned on an input sequence. For example, these conditional distributions can be linear, logistic, or non-linear (using, for example, multi-layer neural networks). We compare the generalization performance of several models which are special cases of Input/Output Hidden Markov Models on financial time-series prediction tasks: an unconditional Gaussian, a conditional linear Gaussian, a mixture of Gaussians, a mixture of conditional linear Gaussians, a hidden Markov model, and various IOHMMs. The experiments compare these models on predicting the conditional density of returns of market and sector indices. Note that the unconditional Gaussian estimates the first moment with the historical average. The results show that, although for the first moment the historical average gives the best results, for the higher moments the IOHMMs yielded significantly better performance, as estimated by the out-of-sample likelihood.
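
The conditional factorization that distinguishes an IOHMM from a standard HMM can be written as follows (a standard formulation, with hidden state q_t, input x_t, and output y_t; P(q_1 | q_0, x_1) is read as the initial state distribution):

```latex
P(y_{1:T} \mid x_{1:T}) \;=\; \sum_{q_{1:T}} \prod_{t=1}^{T}
  P(q_t \mid q_{t-1}, x_t)\, P(y_t \mid q_t, x_t)
```

A standard HMM is recovered when neither the transition nor the emission distribution depends on x_t.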

Posted Content
TL;DR: CIRANO is a non-profit organization incorporated under the Quebec Companies Act, whose infrastructure and research activities are funded by member contributions and by government grants and research mandates.
Abstract: CIRANO is a non-profit organization incorporated under the Quebec Companies Act. The funding of its infrastructure and of its research activities comes from the contributions of its member organizations, from an infrastructure grant from the Ministère de la Recherche, de la Science et de la Technologie, as well as from the grants and mandates obtained by its research teams.

Proceedings ArticleDOI
07 Nov 2002
TL;DR: Extensions are introduced that take advantage of the particular case of time-series data in which the task involves prediction with a horizon h: use at time t the h unlabeled examples that precede t for model selection, and take advantage of the different error distributions of cross-validation and the metric methods.
Abstract: Metric-based methods, which use unlabeled data to detect gross differences in behavior away from the training points, have recently been introduced for model selection, often yielding very significant improvements over alternatives (including cross-validation). We introduce extensions that take advantage of the particular case of time-series data in which the task involves prediction with a horizon h. The ideas are: (i) to use at time t the h unlabeled examples that precede t for model selection, and (ii) to take advantage of the different error distributions of cross-validation and the metric methods. Experimental results establish the effectiveness of these extensions in the context of feature subset selection.
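
A simplified sketch of the metric intuition adapted to this setting: the h unlabeled inputs preceding the prediction time t are used to flag models whose behavior on recent data drifts away from their behavior on the training inputs. The specific adjustment rule below is a loose illustration, not the exact metric criterion of the paper.

```python
import numpy as np

def metric_adjusted_scores(models, X_train, X_unlab, cv_errors):
    """models: fitted predictors with .predict; X_unlab holds the h unlabeled
    inputs that precede time t. A model's CV error is inflated when its average
    distance to the other candidates grows on the unlabeled inputs relative to
    the training inputs."""
    P_tr = np.column_stack([m.predict(X_train) for m in models])
    P_un = np.column_stack([m.predict(X_unlab) for m in models])
    scores = []
    for k in range(len(models)):
        d_tr = np.mean((P_tr - P_tr[:, [k]]) ** 2)  # behavior on labeled inputs
        d_un = np.mean((P_un - P_un[:, [k]]) ** 2)  # behavior on unlabeled inputs
        scores.append(cv_errors[k] * max(1.0, d_un / max(d_tr, 1e-12)))
    return scores  # select the model with the smallest adjusted score
```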

01 Jun 2002
TL;DR: A topic segmentation approach for facilitating information extraction from transcribed telephone conversations, based on a hidden Markov model that uses information from several linguistic levels.
Abstract: We present a topic segmentation approach that we use to facilitate information extraction from transcribed telephone conversations. We experiment with a hidden Markov model that uses information from different linguistic levels, extra-grammaticality markers, and named entities as an additional source of information. We compare the resulting model with our baseline model, which uses only the linguistic markers and extra-grammaticalities. The results show the effectiveness of the approach that uses named entities.

Posted Content
TL;DR: In this article, the authors introduce a new regularization method called "input decay" that exerts more relative penalty on the parameters associated with the inputs that contribute less to the learned function.
Abstract: To deal with the overfitting problems that occur when there are not enough examples compared to the number of input variables in supervised learning, traditional approaches are weight decay and greedy variable selection. An alternative that has recently started to attract attention is to keep all the variables but to put more emphasis on the most useful ones. We introduce a new regularization method called "input decay" that exerts more relative penalty on the parameters associated with the inputs that contribute less to the learned function. This method, like weight decay and variable selection, still requires performing a kind of model selection. Successful comparative experiments with this new method were performed both on a simulated regression task and a real-world financial prediction task.
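
One plausible form of such a penalty for a one-hidden-layer network is sketched below: the first-layer weights are grouped by input variable and passed through a saturating group penalty, so weakly used inputs feel a stronger relative pull toward zero than heavily used ones. This particular functional form is an assumption for illustration; the paper defines its own penalty.

```python
import numpy as np

def input_decay_penalty(W1, lam=0.01, c=1.0):
    """W1: (n_inputs, n_hidden) first-layer weights. Input i contributes
    lam * s_i / (c + s_i), with s_i the squared norm of its weight group:
    nearly quadratic for small groups, saturating for large ones, hence more
    relative penalty on inputs that contribute little to the learned function."""
    s = np.sum(W1 ** 2, axis=1)
    return lam * np.sum(s / (c + s))
```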

Posted Content
TL;DR: An empirical study analysing, graphically and quantitatively, the factors that make up an option's price; it focuses on the average difference between the option price and its discounted average value at maturity (the bias), and tries to detect temporal regularities in the pattern of this bias.
Abstract: The price of an option should reflect the average value that a buyer receives for it, as well as a risk premium. This report describes an empirical study analysing these factors in a graphical and quantitative manner. The analysis focuses on the average difference between the option price and its discounted average value at maturity (the "bias"), and tries to detect temporal regularities in the patterns of this bias. We find some very surprising almost-periodic patterns in these variations, in particular for calls with long maturities (less clearly for puts), which are studied with a spectral analysis.
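
The spectral analysis mentioned can be reproduced with a plain periodogram of the bias series; the mean removal and the assumption of regular sampling are simplifications of this sketch.

```python
import numpy as np

def periodogram(bias_series):
    """Return (frequencies, power) for a regularly sampled bias series;
    peaks in the power reveal almost-periodic patterns in the bias."""
    x = np.asarray(bias_series, dtype=float)
    x = x - x.mean()                         # drop the constant component
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0)   # cycles per observation
    return freqs, power
```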

Posted Content
TL;DR: A measure of generalization is developed for sequential data that is not necessarily iid, together with a recently proposed approach to optimizing hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters.
Abstract: We consider sequential data that is sampled from an unknown process, so that the data are not necessarily iid. We develop a measure of generalization for such data and we consider a recently proposed approach to optimizing hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters. The hyper-parameters are used to give varying weights to the examples in the historical data sequence. The approach is successfully applied to modeling the volatility of Canadian stock returns one month ahead.
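
A minimal sketch of the hyper-parameter gradient idea in the volatility setting: a single decay hyper-parameter weights the historical squared returns, the predictor is the resulting weighted average, and the decay is tuned by gradient descent on a sequential out-of-sample squared error. The exponential weighting and this criterion are assumptions of the illustration, not the paper's exact model.

```python
import torch

def tune_decay(returns, lr=0.05, steps=200):
    """Predict the next squared return by an exponentially weighted mean of
    past squared returns; the decay gamma (the hyper-parameter weighting the
    historical sequence) is optimized by gradient descent on the sequential
    out-of-sample squared error. O(T^2) per step -- fine for a sketch."""
    r2 = torch.as_tensor(returns, dtype=torch.float64) ** 2
    gamma = torch.tensor(0.1, dtype=torch.float64, requires_grad=True)
    opt = torch.optim.SGD([gamma], lr=lr)
    T = len(r2)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for t in range(1, T):
            ages = torch.arange(t - 1, -1, -1, dtype=torch.float64)
            w = torch.exp(-torch.relu(gamma) * ages)   # weights on r2[:t]
            pred = (w * r2[:t]).sum() / w.sum()
            loss = loss + (pred - r2[t]) ** 2
        (loss / (T - 1)).backward()
        opt.step()
    return float(torch.relu(gamma))
```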