
Showing papers by "Yoshua Bengio published in 2005"


Proceedings Article
01 Jan 2005
TL;DR: Introduces a hierarchical decomposition of the conditional probabilities, constrained by prior knowledge extracted from the WordNet semantic hierarchy, that yields a speed-up of about 200 both during training and recognition.
Abstract: In recent years, variants of a neural network architecture for statistical language modeling have been proposed and successfully applied, e.g. in the language modeling component of speech recognizers. The main advantage of these architectures is that they learn an embedding for words (or other symbols) in a continuous space that helps to smooth the language model and provide good generalization even when the number of training examples is insufficient. However, these models are extremely slow in comparison to the more commonly used n-gram models, both for training and recognition. As an alternative to an importance sampling method proposed to speed-up training, we introduce a hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition. The hierarchical decomposition is a binary hierarchical clustering constrained by the prior knowledge extracted from the WordNet semantic hierarchy.
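
For intuition, here is a minimal sketch of the hierarchical decomposition (not the paper's exact model): the probability of a word given its context is a product of binary decisions along the word's path in a binary tree over the vocabulary. The tree, the context vector h, and the per-node parameters below are placeholder assumptions; in the paper the tree is derived from WordNet. Evaluating one word then touches O(log |V|) internal nodes instead of all |V| output units, which is where the large speed-up comes from.

```python
# Minimal sketch of a hierarchical (tree-structured) decomposition of
# P(word | context), assuming a binary tree over the vocabulary is given.
# The context representation `h` would come from the neural language model;
# here it is just a placeholder vector.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(h, path_nodes, path_signs, node_vectors):
    """P(word | context) as a product of binary decisions along the word's path.

    path_nodes  : indices of the internal nodes on the root-to-word path
    path_signs  : +1/-1 depending on whether the path branches left or right
    node_vectors: one parameter vector per internal node
    """
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * node_vectors[node] @ h)
    return p

# Toy usage: 4-word vocabulary -> 3 internal nodes, paths of length 2.
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, 5))   # 3 internal nodes, 5-dim context
h = rng.normal(size=5)                   # hypothetical context representation
# Word 0 sits at path root -> left -> left, encoded as nodes [0, 1], signs [+1, +1].
print(word_probability(h, [0, 1], [+1, +1], node_vectors))
```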

1,008 citations


Proceedings Article
05 Dec 2005
TL;DR: A series of theoretical arguments supports the claim that a large class of modern learning algorithms relying solely on the smoothness prior, with similarity between examples expressed through a local kernel, is sensitive to the curse of dimensionality, or more precisely to the variability of the target function.
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior - with similarity between examples expressed with a local kernel - are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semi-supervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e., its Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge.
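
To make "local" concrete, here is a hypothetical Gaussian-kernel estimator in the Nadaraya-Watson style (an illustration, not an algorithm from the paper): the prediction at x is a kernel-weighted average of training targets, so examples far from x relative to the bandwidth contribute essentially nothing.

```python
# Sketch of a local, Gaussian-kernel estimator (Nadaraya-Watson regression).
# The prediction at x depends almost entirely on training points near x,
# which is the sense of "local" used in the argument above.
import numpy as np

def gaussian_kernel(x, xi, sigma):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def local_predict(x, X_train, y_train, sigma=0.5):
    w = np.array([gaussian_kernel(x, xi, sigma) for xi in X_train])
    if w.sum() == 0.0:          # x falls far from every training point
        return float(y_train.mean())
    return float(w @ y_train / w.sum())

# A target with many variations (a high-frequency sine) needs roughly one
# training example per "bump" for such a local estimator to track it.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(5 * X[:, 0])
print(local_predict(np.array([3.0]), X, y, sigma=0.2))
```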

206 citations


Proceedings Article
05 Dec 2005
TL;DR: Training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem, which involves an infinite number of variables but can be solved by incrementally inserting a hidden unit at a time.
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors.
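
A loose sketch of the incremental idea, under simplifying assumptions (squared loss and a heuristic choice of each new unit; the paper's procedure selects each unit by minimizing a weighted sum of errors): add one hidden unit at a time, then refit the output weights, which is a convex problem once the hidden units are fixed.

```python
# Rough sketch of incrementally growing a one-hidden-layer network:
# repeatedly add a hidden unit fit to the current residuals, then refit the
# output weights (convex once the units are fixed). This is an illustrative
# simplification, not the paper's exact algorithm.
import numpy as np

def fit_unit(X, residual):
    # Hypothetical weak step: a least-squares linear direction for the residual,
    # turned into a sign (threshold) hidden unit.
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], residual, rcond=None)
    return lambda Z: np.sign(np.c_[Z, np.ones(len(Z))] @ w)

def grow_network(X, y, n_units=10):
    units, H = [], np.ones((len(X), 1))              # start with a bias "unit"
    out = np.linalg.lstsq(H, y, rcond=None)[0]
    for _ in range(n_units):
        residual = y - H @ out                       # what the network still misses
        unit = fit_unit(X, residual)
        units.append(unit)
        H = np.c_[H, unit(X)]                        # append the new unit's activations
        out = np.linalg.lstsq(H, y, rcond=None)[0]   # convex refit of output weights
    return units, out

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])                       # a simple nonlinear target
units, out = grow_network(X, y)
H = np.c_[np.ones((len(X), 1)), np.column_stack([u(X) for u in units])]
print("training accuracy:", float(np.mean(np.sign(H @ out) == y)))
```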

193 citations


01 Jan 2005
TL;DR: In this article, the authors consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data, and motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning.
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. Performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to violation of the "cluster assumption". Finally, we illustrate that the method can be far superior to manifold learning in high-dimensional spaces.
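
A hedged sketch of the criterion (the paper's exact parameterization and weighting differ): the usual log-loss on labeled examples plus a term proportional to the entropy of the predicted posteriors on unlabeled examples, which pushes decision boundaries away from dense regions of unlabeled data.

```python
# Sketch of a minimum-entropy-regularized objective: supervised cross-entropy
# on labeled data plus lambda times the average prediction entropy on unlabeled
# data. `p_labeled` / `p_unlabeled` are model posteriors P(y | x); how they are
# produced (e.g. a logistic regression) is left out of this sketch.
import numpy as np

def min_entropy_loss(p_labeled, y_labeled, p_unlabeled, lam=0.1, eps=1e-12):
    # Cross-entropy of the true labels under the model, on labeled examples.
    ce = -np.mean(np.log(p_labeled[np.arange(len(y_labeled)), y_labeled] + eps))
    # Entropy of the predicted posteriors on unlabeled examples.
    ent = -np.mean(np.sum(p_unlabeled * np.log(p_unlabeled + eps), axis=1))
    return ce + lam * ent

# Toy usage with made-up posteriors over 3 classes.
p_lab = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
y_lab = np.array([0, 1])
p_unl = np.array([[0.4, 0.3, 0.3], [0.9, 0.05, 0.05]])
print(min_entropy_loss(p_lab, y_lab, p_unl, lam=0.5))
```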

168 citations


01 Jan 2005
TL;DR: A series of theoretical arguments supports the claim that a large class of modern learning algorithms based on local kernels is sensitive to the curse of dimensionality, including local manifold learning algorithms such as Isomap and LLE, support vector classifiers with Gaussian or other local kernels, and graph-based semi-supervised learning algorithms using a local similarity function.
Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms based on local kernels are sensitive to the curse of dimensionality. These include local manifold learning algorithms such as Isomap and LLE, support vector classifiers with Gaussian or other local kernels, and graph-based semisupervised learning algorithms using a local similarity function. These algorithms are shown to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. There is a large class of data distributions for which non-local solutions could be expressed compactly and potentially be learned with few examples, but which will require a large number of local bases and therefore a large number of training examples when using a local learning algorithm.
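
One standard illustration of "compactly describable but locally highly variable" targets, offered here as an example rather than a claim taken from this abstract: the parity function on d bits has a one-line description, yet its value flips between any two inputs that differ in a single bit, so a purely local learner effectively needs to see most of the 2^d inputs.

```python
# The d-bit parity function: trivially short to describe, yet it flips value
# between any two inputs at Hamming distance 1, so a learner that only
# interpolates locally must effectively observe most of the 2**d inputs.
import itertools
import numpy as np

def parity(x):
    return 1 if np.sum(x) % 2 == 0 else -1

d = 4
inputs = np.array(list(itertools.product([0, 1], repeat=d)))

# Every input disagrees with all of its d Hamming-distance-1 neighbours:
flips = sum(parity(x) != parity(np.logical_xor(x, e).astype(int))
            for x in inputs for e in np.eye(d, dtype=int))
print(f"{flips} sign changes among {len(inputs) * d} neighbouring pairs")
```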

79 citations


Proceedings Article
01 Jan 2005
TL;DR: A greedy procedure selects a subset of m examples based on the feature-space distance between a candidate example and the span of the previously chosen ones, and the embedding function is then estimated from all the data.
Abstract: Spectral dimensionality reduction methods and spectral clustering methods require computation of the principal eigenvectors of an n × n matrix where n is the number of examples. Following up on previously proposed techniques to speed up kernel methods by focusing on a subset of m examples, we study a greedy selection procedure for this subset, based on the feature-space distance between a candidate example and the span of the previously chosen ones. In the case of kernel PCA or spectral clustering this reduces computation to O(m²n). For the same computational complexity, we can also compute the feature-space projection of the non-selected examples onto the subspace spanned by the selected examples, to estimate the embedding function from all the data, which yields a considerably better estimate of the embedding function. This algorithm can be formulated in an online setting and we can bound the error on the approximation of the Gram matrix.
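
The selection rule can be expressed with kernel evaluations only: the squared feature-space distance from a candidate φ(x) to the span of the selected set S is k(x,x) − k_S(x)ᵀ K_SS⁻¹ k_S(x). Below is an unoptimized sketch of this greedy rule, recomputing the inverse at every step; the paper's incremental formulation is cheaper.

```python
# Greedy subset selection in feature space: repeatedly add the example whose
# kernel-induced feature vector is farthest from the span of those already
# chosen. Distances use only kernel values:
#   d^2(x, span(S)) = k(x, x) - k_S(x)^T K_SS^{-1} k_S(x)
# Unoptimized sketch; an incremental (e.g. Cholesky) update would be faster.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def greedy_select(X, m, kernel=rbf):
    n = len(X)
    selected = [0]                           # arbitrary first pick
    while len(selected) < m:
        K_SS = np.array([[kernel(X[i], X[j]) for j in selected] for i in selected])
        K_SS_inv = np.linalg.pinv(K_SS)      # pinv guards against near-singularity
        best, best_d2 = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            k_S = np.array([kernel(X[c], X[j]) for j in selected])
            d2 = kernel(X[c], X[c]) - k_S @ K_SS_inv @ k_S
            if d2 > best_d2:
                best, best_d2 = c, d2
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
print(greedy_select(X, m=5))
```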

64 citations


Journal ArticleDOI
TL;DR: The high rate of hits identified suggests that focused minilibraries may be desirable for developing bioactive ligands of cell surface receptors, and small, selective, proteolytically stable ligands with defined biological activity may have therapeutic potential.

64 citations


Proceedings Article
05 Dec 2005
TL;DR: This work presents a non-local non-parametric density estimator that builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold.
Abstract: To escape from the curse of dimensionality, we claim that one can learn non-local functions, in the sense that the value and shape of the learned function at x must be inferred using examples that may be far from x. With this objective, we present a non-local non-parametric density estimator. It builds upon previously proposed Gaussian mixture models with regularized covariance matrices to take into account the local shape of the manifold. It also builds upon recent work on non-local estimators of the tangent plane of a manifold, which are able to generalize in places with little training data, unlike traditional, local, non-parametric models.
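
For context, here is a sketch of the local building block such an estimator extends (a Manifold-Parzen-style mixture; the paper's contribution, predicting the local covariances non-locally, is not reproduced here): one Gaussian per training point, with covariance estimated from its nearest neighbours plus an isotropic regularizer, so each component stretches along the local manifold directions.

```python
# Sketch of a Parzen-style density with manifold-aware components: a Gaussian
# centred on each training point, whose covariance comes from the k nearest
# neighbours plus sigma0^2 * I for regularization. The non-local prediction of
# these covariances described in the paper is not part of this sketch.
import numpy as np
from scipy.stats import multivariate_normal

def fit_components(X, k=5, sigma0=0.1):
    comps = []
    for x in X:
        d = np.sum((X - x) ** 2, axis=1)
        neigh = X[np.argsort(d)[1:k + 1]]        # k nearest neighbours (excluding x)
        cov = np.cov((neigh - x).T) + (sigma0 ** 2) * np.eye(X.shape[1])
        comps.append((x, cov))
    return comps

def density(x, comps):
    return np.mean([multivariate_normal.pdf(x, mean=mu, cov=cov) for mu, cov in comps])

# Toy usage: points near a 1-D curve (a circle) embedded in 2-D.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(200, 2))
comps = fit_components(X)
print(density(np.array([1.0, 0.0]), comps), density(np.array([0.0, 0.0]), comps))
```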

61 citations


Book ChapterDOI
01 Jan 2005
TL;DR: The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation, based on a single computation of the K-fold cross-validation estimator.
Abstract: Most machine learning researchers perform quantitative experiments to estimate generalization error and compare the performance of different algorithms (in particular, their proposed algorithm). In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the very commonly used K-fold cross-validation estimator of generalization performance. The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation, based on a single computation of the K-fold cross-validation estimator. The analysis that accompanies this result is based on the eigen-decomposition of the covariance matrix of errors, which has only three different eigenvalues corresponding to three degrees of freedom of the matrix and three components of the total variance. This analysis helps to better understand the nature of the problem and how it can make naive estimators (that do not take into account the error correlations due to the overlap between training and test sets) grossly underestimate variance. This is confirmed by numerical experiments in which the three components of the variance are compared when the difficulty of the learning problem and the number of folds are varied.
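
Up to notation, the three components referred to above are the individual error variance, the covariance between errors within the same test fold, and the covariance between errors in different folds; a naive variance estimate keeps only the first. The sketch below computes the naive estimate and the two empirical covariances it drops, from placeholder per-example losses (real cross-validation errors would exhibit the correlations the chapter analyzes).

```python
# Sketch contrasting the naive variance estimate of the K-fold CV score with
# the correlation structure it ignores. `errors` holds one loss per example
# (each example tested exactly once) and `folds` its fold index; both would
# come from an actual CV run, here they are i.i.d. placeholders (so the two
# covariances below will be near zero, unlike with real CV errors).
import numpy as np

def naive_variance(errors):
    # Treats the n per-example errors as i.i.d. -- the kind of estimator the
    # chapter shows cannot be made unbiased under all distributions.
    return errors.var(ddof=1) / len(errors)

def covariance_components(errors, folds):
    # Empirical within-fold and between-fold error covariances: the two kinds
    # of correlation (besides the individual variance) that enter the true
    # variance of the CV estimate but that the naive formula drops.
    mu = errors.mean()
    within, between = [], []
    for i in range(len(errors)):
        for j in range(i + 1, len(errors)):
            prod = (errors[i] - mu) * (errors[j] - mu)
            (within if folds[i] == folds[j] else between).append(prod)
    return errors.var(ddof=1), np.mean(within), np.mean(between)

rng = np.random.default_rng(0)
n, K = 100, 10
folds = np.repeat(np.arange(K), n // K)
errors = rng.binomial(1, 0.2, size=n).astype(float)   # placeholder 0/1 losses
print("naive variance estimate:", naive_variance(errors))
print("variance, within-fold cov, between-fold cov:", covariance_components(errors, folds))
```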

40 citations