
Showing papers by "Yoshua Bengio published in 2005"


Proceedings Article
01 Jan 2005
TL;DR: Introduces a hierarchical decomposition of the conditional probabilities, constrained by prior knowledge extracted from the WordNet semantic hierarchy, that yields a speed-up of about 200 both during training and recognition.
Abstract: In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed-up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.
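
For intuition, here is a minimal sketch of the hierarchical decomposition (not the paper's exact model): the probability of a word given its context is a product of binary decisions along the word's path in a binary tree over the vocabulary. The tree, the context vector h, and the per-node parameters below are placeholder assumptions; in the paper the tree is derived from WordNet. Evaluating one word then touches O(log |V|) internal nodes instead of all |V| output units, which is where the large speed-up comes from.

```python
# Minimal sketch of a hierarchical (tree-structured) decomposition of
# P(word | context), assuming a binary tree over the vocabulary is given.
# The context representation `h` would come from the neural language model;
# here it is just a placeholder vector.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(h, path_nodes, path_signs, node_vectors):
    """P(word | context) as a product of binary decisions along the word's path.

    path_nodes  : indices of the internal nodes on the root-to-word path
    path_signs  : +1/-1 depending on whether the path branches left or right
    node_vectors: one parameter vector per internal node
    """
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * node_vectors[node] @ h)
    return p

# Toy usage: 4-word vocabulary -> 3 internal nodes, paths of length 2.
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, 5))   # 3 internal nodes, 5-dim context
h = rng.normal(size=5)                   # hypothetical context representation
# Word 0 sits at path root -> left -> left, encoded as nodes [0, 1], signs [+1, +1].
print(word_probability(h, [0, 1], [+1, +1], node_vectors))
```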

1,008 citations


Proceedings Article
05 Dec 2005
TL;DR: A series of theoretical arguments supports the claim that a large class of modern learning algorithms relying solely on the smoothness prior, with similarity between examples expressed through a local kernel, is sensitive to the curse of dimensionality, or more precisely to the variability of the target function.
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior - with similarity between examples expressed with a local kernel - are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semi-supervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e., its Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge.
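
To make "local" concrete, here is a hypothetical Gaussian-kernel estimator in the Nadaraya-Watson style (an illustration, not an algorithm from the paper): the prediction at x is a kernel-weighted average of training targets, so examples far from x relative to the bandwidth contribute essentially nothing.

```python
# Sketch of a local, Gaussian-kernel estimator (Nadaraya-Watson regression).
# The prediction at x depends almost entirely on training points near x,
# which is the sense of "local" used in the argument above.
import numpy as np

def gaussian_kernel(x, xi, sigma):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def local_predict(x, X_train, y_train, sigma=0.5):
    w = np.array([gaussian_kernel(x, xi, sigma) for xi in X_train])
    if w.sum() == 0.0:          # x falls far from every training point
        return float(y_train.mean())
    return float(w @ y_train / w.sum())

# A target with many variations (a high-frequency sine) needs roughly one
# training example per "bump" for such a local estimator to track it.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(5 * X[:, 0])
print(local_predict(np.array([3.0]), X, y, sigma=0.2))
```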

206 citations


Proceedings Article
05 Dec 2005
TL;DR: Training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem, which involves an infinite number of variables but can be solved by incrementally inserting a hidden unit at a time.
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
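
A loose sketch of the incremental idea, under simplifying assumptions (squared loss and a heuristic choice of each new unit; the paper's procedure selects each unit by minimizing a weighted sum of errors): add one hidden unit at a time, then refit the output weights, which is a convex problem once the hidden units are fixed.

```python
# Rough sketch of incrementally growing a one-hidden-layer network:
# repeatedly add a hidden unit fit to the current residuals, then refit the
# output weights (convex once the units are fixed). This is an illustrative
# simplification, not the paper's exact algorithm.
import numpy as np

def fit_unit(X, residual):
    # Hypothetical weak step: a least-squares linear direction for the residual,
    # turned into a sign (threshold) hidden unit.
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], residual, rcond=None)
    return lambda Z: np.sign(np.c_[Z, np.ones(len(Z))] @ w)

def grow_network(X, y, n_units=10):
    units, H = [], np.ones((len(X), 1))              # start with a bias "unit"
    out = np.linalg.lstsq(H, y, rcond=None)[0]
    for _ in range(n_units):
        residual = y - H @ out                       # what the network still misses
        unit = fit_unit(X, residual)
        units.append(unit)
        H = np.c_[H, unit(X)]                        # append the new unit's activations
        out = np.linalg.lstsq(H, y, rcond=None)[0]   # convex refit of output weights
    return units, out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])                       # a simple nonlinear target
units, out = grow_network(X, y)
H = np.c_[np.ones((len(X), 1)), np.column_stack([u(X) for u in units])]
print("training accuracy:", float(np.mean(np.sign(H @ out) == y)))
```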

193 citations


01 Jan 2005
TL;DR: In this article, the authors consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data, and motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning.
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. Performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to violation of the "cluster assumption". Finally, we illustrate that the method can be far superior to manifold learning in high-dimensional spaces.
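
A hedged sketch of the criterion (the paper's exact parameterization and weighting differ): the usual log-loss on labeled examples plus a term proportional to the entropy of the predicted posteriors on unlabeled examples, which pushes decision boundaries away from dense regions of unlabeled data.

```python
# Sketch of a minimum-entropy-regularized objective: supervised cross-entropy
# on labeled data plus lambda times the average prediction entropy on unlabeled
# data. `p_labeled` / `p_unlabeled` are model posteriors P(y | x); how they are
# produced (e.g. a logistic regression) is left out of this sketch.
import numpy as np

def min_entropy_loss(p_labeled, y_labeled, p_unlabeled, lam=0.1, eps=1e-12):
    # Cross-entropy of the true labels under the model, on labeled examples.
    ce = -np.mean(np.log(p_labeled[np.arange(len(y_labeled)), y_labeled] + eps))
    # Entropy of the predicted posteriors on unlabeled examples.
    ent = -np.mean(np.sum(p_unlabeled * np.log(p_unlabeled + eps), axis=1))
    return ce + lam * ent

# Toy usage with made-up posteriors over 3 classes.
p_lab = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
y_lab = np.array([0, 1])
p_unl = np.array([[0.4, 0.3, 0.3], [0.9, 0.05, 0.05]])
print(min_entropy_loss(p_lab, y_lab, p_unl, lam=0.5))
```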

168 citations


01 Jan 2005
TL;DR: A series of theoretical arguments supports the claim that a large class of modern learning algorithms based on local kernels is sensitive to the curse of dimensionality, including local manifold learning algorithms such as Isomap and LLE, support vector classifiers with Gaussian or other local kernels, and graph-based semi-supervised learning algorithms using a local similarity function.
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms based on local kernels are sensitive to the curse of dimensionality. These include local manifold learning algorithms such as Isomap and LLE, support vector classifiers with Gaussian or other local kernels, and graph-based semisupervised learning algorithms using a local similarity function. These algorithms are shown to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. There is a large class of data distributions for which non-local solutions could be expressed compactly and potentially be learned with few examples, but which will require a large number of local bases and therefore a large number of training examples when using a local learning algorithm.
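
One standard illustration of "compactly describable but locally highly variable" targets, offered here as an example rather than a claim taken from this abstract: the parity function on d bits has a one-line description, yet its value flips between any two inputs that differ in a single bit, so a purely local learner effectively needs to see most of the 2^d inputs.

```python
# The d-bit parity function: trivially short to describe, yet it flips value
# between any two inputs at Hamming distance 1, so a learner that only
# interpolates locally must effectively observe most of the 2**d inputs.
import itertools
import numpy as np

def parity(x):
    return 1 if np.sum(x) % 2 == 0 else -1

d = 4
inputs = np.array(list(itertools.product([0, 1], repeat=d)))

# Every input disagrees with all of its d Hamming-distance-1 neighbours:
flips = sum(parity(x) != parity(np.logical_xor(x, e).astype(int))
            for x in inputs for e in np.eye(d, dtype=int))
print(f"{flips} sign changes among {len(inputs) * d} neighbouring pairs")
```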

79 citations


Proceedings Article
01 Jan 2005
TL;DR: A greedy procedure selects a subset of m examples based on the feature-space distance between a candidate example and the span of the previously chosen ones, and the embedding function is then estimated from all the data.
Abstract: Spectral dimensionality reduction methods and spectral clustering methods require computation of the principal eigenvectors of an n × n matrix where n is the number of examples. Following up on previously proposed techniques to speed up kernel methods by focusing on a subset of m examples, we study a greedy selection procedure for this subset, based on the feature-space distance between a candidate example and the span of the previously chosen ones. In the case of kernel PCA or spectral clustering this reduces computation to O(m²n). For the same computational complexity, we can also compute the feature-space projection of the non-selected examples onto the subspace spanned by the selected examples, to estimate the embedding function from all the data, which yields a considerably better estimate of the embedding function. This algorithm can be formulated in an online setting and we can bound the error on the approximation of the Gram matrix.
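
The selection rule can be expressed with kernel evaluations only: the squared feature-space distance from a candidate φ(x) to the span of the selected set S is k(x,x) − k_S(x)ᵀ K_SS⁻¹ k_S(x). Below is an unoptimized sketch of this greedy rule, recomputing the inverse at every step; the paper's incremental formulation is cheaper.

```python
# Greedy subset selection in feature space: repeatedly add the example whose
# kernel-induced feature vector is farthest from the span of those already
# chosen. Distances use only kernel values:
#   d^2(x, span(S)) = k(x, x) - k_S(x)^T K_SS^{-1} k_S(x)
# Unoptimized sketch; an incremental (e.g. Cholesky) update would be faster.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def greedy_select(X, m, kernel=rbf):
    n = len(X)
    selected = [0]                           # arbitrary first pick
    while len(selected) < m:
        K_SS = np.array([[kernel(X[i], X[j]) for j in selected] for i in selected])
        K_SS_inv = np.linalg.pinv(K_SS)      # pinv guards against near-singularity
        best, best_d2 = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            k_S = np.array([kernel(X[c], X[j]) for j in selected])
            d2 = kernel(X[c], X[c]) - k_S @ K_SS_inv @ k_S
            if d2 > best_d2:
                best, best_d2 = c, d2
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
print(greedy_select(X, m=5))
```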

64 citations


Journal ArticleDOI
TL;DR: The high rate of hits identified suggests that focused minilibraries may be desirable for developing bioactive ligands of cell surface receptors, and small, selective, proteolytically stable ligands with defined biological activity may have therapeutic potential.

64 citations


Proceedings Article
05 Dec 2005
TL;DR: This work presents a non-local non-parametric density estimator that builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold.
Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
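
For context, here is a sketch of the local building block such an estimator extends (a Manifold-Parzen-style mixture; the paper's contribution, predicting the local covariances non-locally, is not reproduced here): one Gaussian per training point, with covariance estimated from its nearest neighbours plus an isotropic regularizer, so each component stretches along the local manifold directions.

```python
# Sketch of a Parzen-style density with manifold-aware components: a Gaussian
# centred on each training point, whose covariance comes from the k nearest
# neighbours plus sigma0^2 * I for regularization. The non-local prediction of
# these covariances described in the paper is not part of this sketch.
import numpy as np
from scipy.stats import multivariate_normal

def fit_components(X, k=5, sigma0=0.1):
    comps = []
    for x in X:
        d = np.sum((X - x) ** 2, axis=1)
        neigh = X[np.argsort(d)[1:k + 1]]        # k nearest neighbours (excluding x)
        cov = np.cov((neigh - x).T) + (sigma0 ** 2) * np.eye(X.shape[1])
        comps.append((x, cov))
    return comps

def density(x, comps):
    return np.mean([multivariate_normal.pdf(x, mean=mu, cov=cov) for mu, cov in comps])

# Toy usage: points near a 1-D curve (a circle) embedded in 2-D.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(200, 2))
comps = fit_components(X)
print(density(np.array([1.0, 0.0]), comps), density(np.array([0.0, 0.0]), comps))
```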

61 citations


Book ChapterDOI
01 Jan 2005
TL;DR: The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation, based on a single computation of the K-fold cross-validation estimator.
Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare the performance of different algorithms (in particular, their proposed algorithm). In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the very commonly used K-fold cross-validation estimator of generalization performance. The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation, based on a single computation of the K-fold cross-validation estimator. The analysis that accompanies this result is based on the eigen-decomposition of the covariance matrix of errors, which has only three different eigenvalues corresponding to three degrees of freedom of the matrix and three components of the total variance. This analysis helps to better understand the nature of the problem and how it can make naive estimators (that do not take into account the error correlations due to the overlap between training and test sets) grossly underestimate variance. This is confirmed by numerical experiments in which the three components of the variance are compared when the difficulty of the learning problem and the number of folds are varied.
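
Up to notation, the three components referred to above are the individual error variance, the covariance between errors within the same test fold, and the covariance between errors in different folds; a naive variance estimate keeps only the first. The sketch below computes the naive estimate and the two empirical covariances it drops, from placeholder per-example losses (real cross-validation errors would exhibit the correlations the chapter analyzes).

```python
# Sketch contrasting the naive variance estimate of the K-fold CV score with
# the correlation structure it ignores. `errors` holds one loss per example
# (each example tested exactly once) and `folds` its fold index; both would
# come from an actual CV run, here they are i.i.d. placeholders (so the two
# covariances below will be near zero, unlike with real CV errors).
import numpy as np

def naive_variance(errors):
    # Treats the n per-example errors as i.i.d. -- the kind of estimator the
    # chapter shows cannot be made unbiased under all distributions.
    return errors.var(ddof=1) / len(errors)

def covariance_components(errors, folds):
    # Empirical within-fold and between-fold error covariances: the two kinds
    # of correlation (besides the individual variance) that enter the true
    # variance of the CV estimate but that the naive formula drops.
    mu = errors.mean()
    within, between = [], []
    for i in range(len(errors)):
        for j in range(i + 1, len(errors)):
            prod = (errors[i] - mu) * (errors[j] - mu)
            (within if folds[i] == folds[j] else between).append(prod)
    return errors.var(ddof=1), np.mean(within), np.mean(between)

rng = np.random.default_rng(0)
n, K = 100, 10
folds = np.repeat(np.arange(K), n // K)
errors = rng.binomial(1, 0.2, size=n).astype(float)   # placeholder 0/1 losses
print("naive variance estimate:", naive_variance(errors))
print("variance, within-fold cov, between-fold cov:", covariance_components(errors, folds))
```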

40 citations