
Showing papers by "Yoshua Bengio published in 2004"


Proceedings Article
01 Dec 2004
TL;DR: This framework motivates minimum entropy regularization, which allows unlabeled data to be incorporated into standard supervised learning and includes other approaches to the semi-supervised problem as particular or limiting cases.
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables unlabeled data to be incorporated into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. Performance clearly favors minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to violations of the "cluster assumption". Finally, we illustrate that the method can also be far superior to manifold learning in high-dimensional spaces.
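In symbols (notation chosen here for illustration, with theta the model parameters, L and U the labeled and unlabeled index sets, and lambda the trade-off weight, rather than copied from the paper), the minimum entropy regularized criterion penalizes uncertain class posteriors on unlabeled points:

C(\theta) = -\sum_{i \in L} \log P_\theta(y_i \mid x_i) + \lambda \sum_{j \in U} H\big(P_\theta(\cdot \mid x_j)\big), \qquad H(p) = -\sum_k p_k \log p_k

Minimizing C encourages confident predictions on unlabeled examples, which is one way to encode the cluster assumption mentioned above.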

1,606 citations


Journal Article
TL;DR: The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation; the accompanying analysis is based on the eigen-decomposition of the covariance matrix of errors.
Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare the performance of different algorithms (in particular, their proposed algorithm). In order to be able to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the very commonly used K-fold cross-validation estimator of generalization performance. The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation. The analysis that accompanies this result is based on the eigen-decomposition of the covariance matrix of errors, which has only three different eigenvalues corresponding to three degrees of freedom of the matrix and three components of the total variance. This analysis helps to better understand the nature of the problem and how it can make naive estimators (that don't take into account the error correlations due to the overlap between training and test sets) grossly underestimate variance. This is confirmed by numerical experiments in which the three components of the variance are compared when the difficulty of the learning problem and the number of folds are varied.
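As a concrete illustration of the naive estimator the paper warns about, here is a minimal Python sketch (the classifier choice, fold count, and data interface are assumptions of this sketch): it treats the n per-example error indicators from K-fold cross-validation as i.i.d., which ignores the correlations induced by overlapping training sets and shared test blocks and is exactly why it can grossly underestimate the variance.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def naive_cv_variance(X, y, k=10, seed=0):
    # Collect one 0/1 error indicator per held-out example across the K folds.
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        errors.extend(model.predict(X[test_idx]) != y[test_idx])
    errors = np.asarray(errors, dtype=float)
    # Naive variance estimate: sample variance / n, as if the n indicators were i.i.d.
    return errors.mean(), errors.var(ddof=1) / len(errors)
```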

869 citations


Journal ArticleDOI
TL;DR: A direct relation is shown between spectral embedding methods and kernel principal component analysis, and both are shown to be special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density.
Abstract: In this letter, we show a direct relation between spectral embedding methods and kernel principal components analysis and how both are special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods provided only coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for multidimensional scaling (MDS), spectral clustering, Laplacian eigenmaps, locally linear embedding (LLE), and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering, and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.
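One common way to write the Nyström out-of-sample extension referred to above (up to normalization conventions, which vary; the symbols are chosen here for illustration): if (λ_k, v_k) are the eigenvalue/eigenvector pairs of the n × n data-dependent kernel matrix K̃ computed on the training points, the k-th embedding coordinate of a new point x is

e_k(x) = \frac{1}{\lambda_k} \sum_{i=1}^{n} v_{ki}\, \tilde{K}(x, x_i)

so evaluating the embedding at a test point only requires the kernel between x and the training examples, not a new eigen-decomposition.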

333 citations


01 Jan 2004
TL;DR: The basics of HMMs are summarized, and several recent related learning algorithms and extensions of HMMs are reviewed, including in particular hybrids of HMMs with artificial neural networks.
Abstract: Hidden Markov Models (HMMs) are statistical models of sequential data that have been used successfully in many machine learning applications, especially for speech recognition. Furthermore, in the last few years, many new and promising probabilistic models related to HMMs have been proposed. We first summarize the basics of HMMs, and then review several recent related learning algorithms and extensions of HMMs, including in particular hybrids of HMMs with artificial neural networks, Input-Output HMMs (which are conditional HMMs using neural networks to compute probabilities), weighted transducers, variable-length Markov models, and Markov switching state-space models. Finally, we discuss some of the challenges of future research in this very active area.
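As a reminder of the basic machinery being extended, here is a minimal sketch (variable names and the discrete-emission assumption are choices of this sketch) of the scaled forward recursion that computes the log-likelihood of an observation sequence under an HMM:

```python
import numpy as np

def hmm_forward_loglik(pi, A, B, obs):
    # pi: (S,) initial state probs; A: (S, S) transition probs; B: (S, V) emission probs
    # obs: sequence of integer observation symbols. Returns log P(obs) via the scaled forward pass.
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_lik = np.log(c)
    alpha /= c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, then weight by the emission
        c = alpha.sum()                # rescale to avoid numerical underflow
        log_lik += np.log(c)
        alpha /= c
    return log_lik
```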

268 citations


Proceedings Article
01 Jan 2004
TL;DR: Previously proposed non-parametric algorithms, which provide an estimated continuous label for given unlabeled examples, are extended to function induction algorithms that minimize a regularization criterion applied to an out-of-sample example and that take the form of a Parzen windows regressor; experiments show the extension works well.
Abstract: There has been an increase of interest in semi-supervised learning recently, because of the many datasets with large amounts of unlabeled examples and only a few labeled ones. This paper follows up on previously proposed non-parametric algorithms which provide an estimated continuous label for the given unlabeled examples. It extends them to function induction algorithms that correspond to the minimization of a regularization criterion applied to an out-of-sample example, and that happen to have the form of a Parzen windows regressor. The advantage of the extension is that it allows the label of a new example to be predicted without solving again a linear system of dimension n (the number of unlabeled and labeled training examples), which can cost O(n^3). Experiments show that the extension works well, in the sense of predicting a label close to the one that would have been obtained if the test example had been included in the unlabeled set. This relatively efficient function induction procedure can also be used when n is large to approximate the solution by writing it only in terms of a kernel expansion with m ≪ n terms, reducing the linear system to m equations in m unknowns.
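With ŷ_i denoting the (given or estimated) labels of the n labeled and unlabeled training examples and W a similarity kernel, the out-of-sample prediction described above can be sketched in the familiar Parzen windows (Nadaraya-Watson) form, with notation chosen here for illustration:

\hat f(x) = \frac{\sum_{i=1}^{n} W(x, x_i)\, \hat y_i}{\sum_{j=1}^{n} W(x, x_j)}

so a new example is labeled by a similarity-weighted average of the training labels rather than by re-solving the n × n linear system.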

200 citations


Proceedings Article
01 Dec 2004
TL;DR: The authors claim and present arguments that a large class of manifold learning algorithms that are essentially local, and that can be framed as kernel learning algorithms, will suffer from the curse of dimensionality at the dimension of the true underlying manifold.
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests exploring non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails.
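For contrast with the non-local approach advocated above, the following is a minimal sketch (k-NN neighborhoods and plain PCA are assumptions of this sketch) of the purely local tangent-plane estimate that such algorithms implicitly rely on: the tangent plane at a point is estimated from its neighborhood alone, with no parameters shared across space.

```python
import numpy as np

def local_tangent_plane(X, x, k=10, d=2):
    # X: (n, D) data matrix; x: (D,) query point.
    # Returns a (d, D) orthonormal basis of the estimated tangent plane at x,
    # computed from the top-d principal directions of x's k nearest neighbors.
    dists = np.linalg.norm(X - x, axis=1)
    neighbors = X[np.argsort(dists)[:k]]
    centered = neighbors - neighbors.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:d]
```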

89 citations


Posted Content
TL;DR: In this article, a non-parametric kernel density estimation method is proposed to capture the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices.
Abstract: The similarity between objects is a fundamental element of many learning algorithms. Most non-parametric methods take this similarity to be fixed, but much recent work has shown the advantages of learning it, in particular to exploit the local invariances in the data or to capture the possibly non-linear manifold on which most of the data lies. We propose a new non-parametric kernel density estimation method which captures the local structure of an underlying manifold through the leading eigenvectors of regularized local covariance matrices. Experiments in density estimation show significant improvements with respect to Parzen density estimators. The density estimators can also be used within Bayes classifiers, yielding classification rates similar to SVMs and much superior to the Parzen classifier.
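A rough sketch of the idea (the neighborhood size, the isotropic regularizer, and other details below are placeholders, not the paper's exact recipe): each training point contributes a Gaussian whose covariance is a regularized local covariance, so probability mass spreads preferentially along the manifold's local directions rather than isotropically as in a spherical Parzen estimator.

```python
import numpy as np
from scipy.stats import multivariate_normal

def manifold_parzen_density(X, x, k=10, sigma2=1e-2):
    # X: (n, D) training data; x: (D,) query point. Mixture of n Gaussians,
    # each centered on a training point with a regularized local covariance.
    n, D = X.shape
    density = 0.0
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = X[np.argsort(dists)[1:k + 1]] - X[i]   # k nearest neighbors, centered on X[i]
        cov = nbrs.T @ nbrs / k + sigma2 * np.eye(D)  # regularized local covariance
        density += multivariate_normal.pdf(x, mean=X[i], cov=cov)
    return density / n
```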

64 citations


Posted Content
TL;DR: In this article, the authors show a direct equivalence between spectral clustering and kernel PCA, and how both are special cases of a more general learning problem, that of learning the principal eigenfunctions of a kernel, when the functions are from a function space whose scalar product is defined with respect to a density model.
Abstract: In this paper, we show a direct equivalence between spectral clustering and kernel PCA, and how both are special cases of a more general learning problem, that of learning the principal eigenfunctions of a kernel, when the functions are from a function space whose scalar product is defined with respect to a density model. This defines a natural mapping for new data points, for methods that only provided an embedding, such as spectral clustering and Laplacian eigenmaps. The analysis hinges on a notion of generalization for embedding algorithms based on the estimation of underlying eigenfunctions, and suggests ways to improve this generalization by smoothing the data empirical distribution.

63 citations


Proceedings ArticleDOI
21 Jul 2004
TL;DR: Two probabilistic models for unsupervised word-sense disambiguation using parallel corpora are described, one of which is a hierarchical model that uses a concept latent variable to relate different language specific sense labels.
Abstract: We describe two probabilistic models for unsupervised word-sense disambiguation using parallel corpora. The first model, which we call the Sense model, builds on the work of Diab and Resnik (2002) that uses both parallel text and a sense inventory for the target language, and recasts their approach in a probabilistic framework. The second model, which we call the Concept model, is a hierarchical model that uses a concept latent variable to relate different language specific sense labels. We show that both models improve performance on the word sense disambiguation task over previous unsupervised approaches, with the Concept model showing the largest improvement. Furthermore, in learning the Concept model, as a by-product, we learn a sense inventory for the parallel language.
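One plausible reading of the Concept model's structure, written here as a hedged sketch with generic symbols rather than the paper's exact notation: a latent concept c generates a language-specific sense in each language, and each sense generates the observed word in that language,

P(w_e, w_f, s_e, s_f, c) = P(c)\, P(s_e \mid c)\, P(s_f \mid c)\, P(w_e \mid s_e)\, P(w_f \mid s_f)

where w_e and w_f are aligned words from the two languages and s_e, s_f their language-specific sense labels; marginalizing over c ties the two languages' sense inventories together.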

30 citations


Posted Content
TL;DR: The robustness of the learning scheme is demonstrated: in situations where unlabeled examples do not convey information, minimum entropy returns a solution discarding unlabeled examples and performs as well as supervised learning.
Abstract: This paper introduces the minimum entropy regularizer for learning from partial labels. This learning problem encompasses the semi-supervised setting, where a decision rule is to be learned from labeled and unlabeled examples. The minimum entropy regularizer applies to diagnosis models, i.e. models of the posterior probabilities of classes. It is shown to include other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed criterion provides solutions that take advantage of unlabeled examples when the latter convey information. Even when the data are sampled from the distribution class spanned by a generative model, the proposed approach improves over the estimated generative model when the number of features is of the order of the sample size. Performance clearly favors minimum entropy when the generative model is slightly misspecified. Finally, the robustness of the learning scheme is demonstrated: in situations where unlabeled examples do not convey information, minimum entropy returns a solution discarding unlabeled examples and performs as well as supervised learning.

29 citations


Proceedings Article
01 Dec 2004
TL;DR: New reinforcement learning algorithms, inspired by neurological evidence, are developed and evaluated as potential new approaches to the feature construction problem.
Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task.

Journal ArticleDOI
TL;DR: Locally Linear Embedding (LLE), a local non-linear dimensionality reduction technique that can statistically discover a low-dimensional representation of chemical data, is introduced.
Abstract: Current practice in Quantitative Structure Activity Relationship (QSAR) methods usually involves generating a great number of chemical descriptors and then cutting them back with variable selection techniques. Variable selection is an effective method to reduce the dimensionality but may discard some valuable information. This paper introduces Locally Linear Embedding (LLE), a local non-linear dimensionality reduction technique, that can statistically discover a low-dimensional representation of the chemical data. LLE is shown to create more stable representations than other non-linear dimensionality reduction algorithms, and to be capable of capturing non-linearity in chemical data.
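A minimal, generic illustration of LLE as the dimensionality-reduction step (using scikit-learn's implementation; the random descriptor matrix and parameter values below are placeholders, not the paper's QSAR data or settings):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

descriptors = np.random.rand(200, 50)            # 200 compounds x 50 chemical descriptors (synthetic)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=3)
embedding = lle.fit_transform(descriptors)        # (200, 3) low-dimensional representation
print(embedding.shape)
```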

Posted Content
TL;DR: In this article, the authors put under a common framework a number of non-linear dimensionality reduction methods, such as Locally Linear Embedding, Isomap, Laplacian Eigenmaps and kernel PCA, which are based on performing an eigendecomposition.
Abstract: In this paper, we study and put under a common framework a number of non-linear dimensionality reduction methods, such as Locally Linear Embedding, Isomap, Laplacian Eigenmaps and kernel PCA, which are based on performing an eigen-decomposition (hence the name 'spectral'). That framework also includes classical methods such as PCA and metric multidimensional scaling (MDS). It also includes the data transformation step used in spectral clustering. We show that in all of these cases the learning algorithm estimates the principal eigenfunctions of an operator that depends on the unknown data density and on a kernel that is not necessarily positive semi-definite. This helps to generalize some of these algorithms so as to predict an embedding for out-of-sample examples without having to retrain the model. It also makes it more transparent what these algorithms are minimizing on the empirical data and gives a corresponding notion of generalization error.
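In symbols chosen here for illustration, the shared object these methods estimate can be written as the principal eigenfunctions f_k of a kernel operator taken with respect to the data density p,

\int \tilde K(x, y)\, f_k(y)\, p(y)\, dy = \lambda_k f_k(x)

where K̃ is the (data-dependent, not necessarily positive semi-definite) kernel associated with each method; the empirical algorithms solve the corresponding matrix eigenproblem obtained by replacing p with the empirical distribution of the training set.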

Posted Content
TL;DR: In this article, an interesting application of the principle of local learning to density estimation is described, where local weighted fitting of a Gaussian with a regularized full covariance matrix yields a density estimator which displays improved behavior in the case where much of the probability mass is concentrated along a low dimensional manifold.
Abstract: We describe an interesting application of the principle of local learning to density estimation. Locally weighted fitting of a Gaussian with a regularized full covariance matrix yields a density estimator which displays improved behavior in the case where much of the probability mass is concentrated along a low dimensional manifold. While the proposed estimator is not guaranteed to integrate to 1 with a finite sample size, we prove asymptotic convergence to the true density. Experimental results illustrating the advantages of this estimator over classic non-parametric estimators are presented.

01 Jan 2004
TL;DR: This analysis suggests non-local manifold learning algorithms that attempt to discover shared structure in the tangent planes at different positions, using a tangent plane prediction function whose parameters are shared across space rather than estimated from the local neighborhood, as in current non-parametric manifold learning algorithms.
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local will suffer from at least four generic problems associated with (1) noise in the data, (2) curvature of the manifold, (3) dimensionality of the manifold, and (4) the presence of many manifolds with little data per manifold. This analysis suggests non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented. The function has parameters that are shared across space rather than estimated based on the local neighborhood, as in current non-parametric manifold learning algorithms. The results show clearly the advantages of this approach with respect to local manifold learning algorithms.

01 Apr 2004
TL;DR: The authors present the results of a statistical approach developed for spotting informative words in spoken-language texts.
Abstract: We present the results of the statistical approach we developed for spotting informative words in spoken-language texts. This work is part of a project launched by the Canadian Department of National Defence to develop an information extraction system for the maritime Search and Rescue (SAR) domain. The task is to find the relevant words and annotate them with semantic labels corresponding to the concepts of a domain ontology (SAR). Our method combines two types of information: similarity vectors generated from the domain ontology and the Wordsmyth dictionary-thesaurus, and the utterance context represented by the topic. Evaluation is carried out by comparing the system output with the answers of predefined information extraction templates. The results obtained on spoken-language texts are comparable to those obtained on written texts within MUC7.

Posted Content
TL;DR: In this article, a generalized Pareto distribution (GPD), which can approximate finite, exponential, or subexponential tails, is used to model the tails of the distribution.
Abstract: We aim at modelling fat-tailed densities whose distributions are unknown but are potentially asymmetric. In this context, the standard normality assumption is not appropriate. In order to make as few distributional assumptions as possible, we use a non-parametric algorithm to model the center of the distribution. Density modelling becomes more difficult as we move further into the tail of the distribution since very few observations fall in the upper tail area. Hence we decide to use the generalized Pareto distribution (GPD) to model the tails of the distribution. The GPD can approximate finite, exponential or subexponential tails. The estimation of the parameters of the GPD is based solely on the extreme observations. An observation is defined as being extreme if it is greater than a given threshold. The main difficulty with GPD modelling is to determine an appropriate threshold.
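A sketch of the peaks-over-threshold step with SciPy's generalized Pareto implementation (the synthetic data and the fixed 95% threshold below are placeholders; choosing the threshold well is precisely the difficulty the abstract points out):

```python
import numpy as np
from scipy.stats import genpareto

returns = np.random.standard_t(df=3, size=5000)         # heavy-tailed synthetic sample
threshold = np.quantile(returns, 0.95)                    # placeholder threshold choice
exceedances = returns[returns > threshold] - threshold
shape, loc, scale = genpareto.fit(exceedances, floc=0)    # fit GPD to excesses over the threshold
print(f"GPD shape={shape:.3f}, scale={scale:.3f}, n_exceedances={len(exceedances)}")
```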

01 Jan 2004
TL;DR: The authors present a semantic labelling approach developed for spotting informative words in conversational texts.
Abstract: We present the results of a semantic labelling approach developed for spotting informative words in conversational texts. This work is part of the development of an information extraction system for the maritime search and rescue domain. The task is to detect the relevant words and annotate them with semantic labels corresponding to the concepts of a domain ontology. Our method combines a symbolic approach based on a finite-state automaton with a statistical approach exploiting two types of information: similarity-score vectors and the discourse context represented by the topic. The F-score obtained on manual transcriptions of telephone conversations in the maritime search and rescue domain is 82.2%.