Showing papers on "MNIST database published in 2011"


Book ChapterDOI
14 Jun 2011
TL;DR: A novel convolutional auto-encoder (CAE) for unsupervised feature learning; initializing a CNN with the filters of a trained CAE stack yields superior performance on a digit and an object recognition benchmark.
Abstract: We present a novel convolutional auto-encoder (CAE) for unsupervised feature learning. A stack of CAEs forms a convolutional neural network (CNN). Each CAE is trained using conventional on-line gradient descent without additional regularization terms. A max-pooling layer is essential to learn biologically plausible features consistent with those found by previous approaches. Initializing a CNN with filters of a trained CAE stack yields superior performance on a digit (MNIST) and an object recognition (CIFAR10) benchmark.
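Conceptually, a single CAE layer is a convolutional encoder followed by max-pooling and a deconvolutional decoder trained to reconstruct its input. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the layer sizes, the sigmoid non-linearities and the unpooling-by-indices decoder are assumptions.

```python
# Minimal convolutional auto-encoder layer with max-pooling (illustrative only).
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self, in_ch=1, n_filters=8, k=5):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, n_filters, k, padding=k // 2)   # encoder filters
        self.pool = nn.MaxPool2d(2, return_indices=True)            # max-pooling layer
        self.unpool = nn.MaxUnpool2d(2)
        self.dec = nn.ConvTranspose2d(n_filters, in_ch, k, padding=k // 2)  # decoder

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        p, idx = self.pool(h)
        u = self.unpool(p, idx, output_size=h.shape)
        return torch.sigmoid(self.dec(u))

cae = ConvAutoEncoder()
x = torch.rand(16, 1, 28, 28)            # a batch of MNIST-sized images
loss = ((cae(x) - x) ** 2).mean()        # plain reconstruction loss, no extra regularizer
loss.backward()                          # trained by ordinary gradient descent
```

A stack of such layers, once trained, supplies the initial filters of a CNN that is then fine-tuned with labels.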

1,832 citations


Proceedings ArticleDOI
16 Jul 2011
TL;DR: A fast, fully parameterizable GPU implementation of Convolutional Neural Network variants whose feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way.
Abstract: We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep nets trained by simple back-propagation perform better than more shallow ones. Learning is surprisingly rapid. NORB is completely trained within five epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.

1,216 citations


Proceedings Article
28 Jun 2011
TL;DR: It is shown that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms.
Abstract: The predominant methodology for training deep learning models advocates the use of stochastic gradient descent methods (SGDs). Despite their ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the differences between L-BFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% error on the standard MNIST dataset. This is a state-of-the-art result on MNIST among algorithms that do not use distortions or pretraining.
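As a concrete (and deliberately tiny) illustration of the comparison studied here, the sketch below trains a logistic-regression objective on synthetic data, once with full-batch L-BFGS via SciPy and once with a hand-rolled SGD loop; the data, step size and epoch counts are arbitrary choices, not the paper's setup.

```python
# Batch L-BFGS vs. plain SGD on a simple logistic-regression loss (synthetic data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(float)

def loss_and_grad(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

# Off-the-shelf optimizer: hand it the full-batch objective and gradient.
res = minimize(loss_and_grad, np.zeros(20), jac=True, method="L-BFGS-B")

# Stochastic gradient descent for comparison: small steps on single examples.
w = np.zeros(20)
for epoch in range(5):
    for i in rng.permutation(len(y)):
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= 0.1 * (p - y[i]) * X[i]
```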

908 citations


Proceedings ArticleDOI
18 Sep 2011
TL;DR: This work applies the same architecture to NIST SD 19, a more challenging dataset including lower and upper case letters, and obtains the best results published so far for both NIST digits and NIST letters.
Abstract: In 2010, after many years of stagnation, the MNIST handwriting recognition benchmark record dropped from 0.40% error rate to 0.35%. Here we report 0.27% for a committee of seven deep CNNs trained on graphics cards, narrowing the gap to human performance. We also apply the same architecture to NIST SD 19, a more challenging dataset including lower and upper case letters. A committee of seven CNNs obtains the best results published so far for both NIST digits and NIST letters. The robustness of our method is verified by analyzing 78125 different 7-net committees.
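The combination step of such a committee is typically just an average of the individual nets' class posteriors; the short sketch below shows only that step, with random mock posteriors standing in for the seven trained CNNs.

```python
# Committee prediction by averaging per-net class posteriors (illustrative only).
import numpy as np

def committee_predict(list_of_posteriors):
    """Each element: an (n_samples, 10) array of class probabilities from one net."""
    return np.mean(np.stack(list_of_posteriors), axis=0).argmax(axis=1)

# Example with seven mock nets on 100 test images.
posteriors = [np.random.dirichlet(np.ones(10), size=100) for _ in range(7)]
labels = committee_predict(posteriors)
```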

504 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: The algorithm gives excellent results for hand-written digit recognition on MNIST and object recognition on the Caltech101 benchmark, marking the first time that such accuracies have been achieved using automatically learned features from the pixel level, rather than using hand-designed descriptors.
Abstract: We present a method for learning image representations using a two-layer sparse coding scheme at the pixel level. The first layer encodes local patches of an image. After pooling within local regions, the first layer codes are then passed to the second layer, which jointly encodes signals from the region. Unlike traditional sparse coding methods that encode local patches independently, this approach accounts for high-order dependency among patterns in a local image neighborhood. We develop algorithms for data encoding and codebook learning, and show in experiments that the method leads to more invariant and discriminative image representations. The algorithm gives excellent results for hand-written digit recognition on MNIST and object recognition on the Caltech101 benchmark. This marks the first time that such accuracies have been achieved using automatically learned features from the pixel level, rather than using hand-designed descriptors.
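The first-layer encoding can be pictured as ordinary l1 sparse coding of each patch against a codebook. Below is a generic ISTA sketch of that sub-problem only; the joint second-layer encoding and the codebook learning described in the paper are not reproduced, and all sizes and constants are invented.

```python
# Generic ISTA sparse coding of a signal x against a fixed codebook D.
import numpy as np

def ista_encode(D, x, lam=0.1, n_iter=100):
    """Approximately solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative shrinkage."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)              # gradient step
        a = a - g / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)   # soft threshold
    return a

D = np.random.randn(64, 128)
D /= np.linalg.norm(D, axis=0)             # unit-norm codebook atoms
code = ista_encode(D, np.random.randn(64)) # sparse code for one flattened 8x8 patch
```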

240 citations


Proceedings ArticleDOI
Dong Yu, Li Deng
27 Aug 2011
TL;DR: Results on both MNIST and TIMIT tasks evaluated thus far demonstrate superior performance of DCN over the DBN (Deep Belief Network) counterpart that forms the basis of the DNN, reflected not only in training scalability and CPU-only computation, but more importantly in classification accuracy in both tasks.
Abstract: We recently developed the context-dependent DNN-HMM (Deep-Neural-Net/Hidden-Markov-Model) for large-vocabulary speech recognition. While achieving impressive reductions in recognition error rate, we face the insurmountable problem of scalability in dealing with the virtually unlimited amount of training data available nowadays. To overcome the scalability challenge, we have designed the deep convex network (DCN) architecture. The learning problem in DCN is convex within each module. Additional structure-exploiting fine tuning further improves the quality of DCN. The full learning in DCN is batch-mode based instead of stochastic, making it naturally amenable to parallel training that can be distributed over many machines. Experimental results on both MNIST and TIMIT tasks evaluated thus far demonstrate superior performance of DCN over the DBN (Deep Belief Network) counterpart that forms the basis of the DNN. The superiority is reflected not only in training scalability and CPU-only computation, but more importantly in classification accuracy on both tasks.
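On one reading of the abstract, each DCN module feeds its input through a nonlinear hidden layer and then learns the upper (output) weights by solving a convex least-squares problem; the sketch below illustrates only that convex step with a ridge-regression solution. The fixed random lower weights and the tanh unit are simplifying assumptions, not the paper's RBM-based initialization or its fine tuning.

```python
# One deep-convex-network-style module: convex (ridge) solve for the upper weights.
import numpy as np

def dcn_module(X, Y, n_hidden=200, reg=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))    # lower weights (fixed here)
    H = np.tanh(X @ W)                                        # hidden representation
    # Convex step: closed-form ridge-regression solution for the upper weights.
    U = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, U, H @ U            # predictions are appended to the next module's input
```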

163 citations


02 Jul 2011
TL;DR: In this article, a hierarchical Bayesian model is proposed that transfers knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances, and discovers how to group categories into meaningful super-categories that express different priors for new classes.
Abstract: We develop a hierarchical Bayesian model that learns categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.

111 citations


Proceedings ArticleDOI
18 Sep 2011
TL;DR: A new method to train the members of a committee of one-hidden-layer neural nets is presented, which obtains a recognition error rate of 0.39% on the MNIST digit recognition benchmark, on par with state-of-the-art recognition rates of more complicated systems.
Abstract: We present a new method to train the members of a committee of one-hidden-layer neural nets. Instead of training various nets on subsets of the training data, we preprocess the training data for each individual model such that the corresponding errors are decorrelated. On the MNIST digit recognition benchmark set we obtain a recognition error rate of 0.39%, using a committee of 25 one-hidden-layer neural nets, which is on par with state-of-the-art recognition rates of more complicated systems.

79 citations


Proceedings Article
28 Jun 2011
TL;DR: This work proposes an SSL algorithmic framework which can utilize unlabeled examples for learning classifiers from a predefined set of fast classifiers and proposes a novel quantitative measure of the so-called cluster assumption.
Abstract: Semi-supervised learning (SSL) addresses the problem of training a classifier using a small number of labeled examples and many unlabeled examples. Most previous work on SSL focused on how the availability of unlabeled data can improve the accuracy of the learned classifiers. In this work we study how unlabeled data can be beneficial for constructing faster classifiers. We propose an SSL algorithmic framework which can utilize unlabeled examples for learning classifiers from a predefined set of fast classifiers. We formally analyze conditions under which our algorithmic paradigm obtains significant improvements by the use of unlabeled data. As a side benefit of our analysis we propose a novel quantitative measure of the so-called cluster assumption. We demonstrate the potential merits of our approach by conducting experiments on the MNIST data set, showing that, when a sufficiently large unlabeled sample is available, a fast classifier can be learned from far fewer labeled examples than without such a sample.

46 citations


Proceedings Article
07 Aug 2011
TL;DR: In this paper, l1/l2 regularization on the activation probabilities of hidden units in restricted Boltzmann machines is used to capture the local dependencies among hidden units, and the proposed SGRBMs are applied to model patches of natural images, handwritten digits and OCR English letters.
Abstract: Since learning in Boltzmann machines is typically quite slow, there is a need to restrict connections within hidden layers. However, the resulting states of hidden units exhibit statistical dependencies. Based on this observation, we propose using l1/l2 regularization upon the activation probabilities of hidden units in restricted Boltzmann machines to capture the local dependencies among hidden units. This regularization not only encourages hidden units of many groups to be inactive given observed data but also makes hidden units within a group compete with each other for modeling observed data. Thus, the l1/l2 regularization on RBMs yields sparsity at both the group and the hidden-unit levels. We call RBMs trained with this regularizer sparse group RBMs (SGRBMs). The proposed SGRBMs are applied to model patches of natural images, handwritten digits and OCR English letters. Then, to emphasize that SGRBMs can learn more discriminative features, we applied SGRBMs to pretrain deep networks for classification tasks. Furthermore, we illustrate that the regularizer can also be applied to deep Boltzmann machines, which leads to sparse group deep Boltzmann machines. When applied to the MNIST data set, a two-layer sparse group Boltzmann machine achieves an error rate of 0.84%, which is, to our knowledge, the best published result on the permutation-invariant version of the MNIST task.
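The l1/l2 (group) regularizer can be written as a sum over groups of the l2 norm of each group's hidden activation probabilities. The numpy sketch below shows just that penalty term, which would be added to the RBM training objective; the group sizes are arbitrary.

```python
# Group (l1/l2) sparsity penalty on RBM hidden activation probabilities.
import numpy as np

def group_sparsity_penalty(hidden_probs, groups):
    """Sum over groups of the per-sample l2 norms of that group's activations."""
    return sum(np.linalg.norm(hidden_probs[:, g], axis=1).sum() for g in groups)

probs = np.random.rand(8, 12)                          # batch of activation probabilities
groups = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
penalty = group_sparsity_penalty(probs, groups)        # added to the training objective
```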

Proceedings ArticleDOI
18 Sep 2011
TL;DR: An efficient and low-cost semi-automatic labeling system for character datasets; the results show that labeling less than 0.5% of the training data is sufficient to achieve an 86.21% recognition rate for a brand new script and 94.81% for the MNIST benchmark dataset.
Abstract: One of the major issues in handwritten character recognition is the efficient creation of ground truth to train and test the different recognizers. The manual labeling of the data by a human expert is a tedious and costly procedure. In this paper we propose an efficient and low-cost semi-automatic labeling system for character datasets. First, the data is represented at different abstraction levels and then clustered in an unsupervised manner. The different clusters are labeled by the human experts and finally a unanimity vote is used to decide whether a label is accepted or not. The experimental results show that labeling less than 0.5% of the training data is sufficient to achieve an 86.21% recognition rate for a brand new script (Lampung) and 94.81% for the MNIST benchmark dataset, considering only a K-nearest neighbor classifier for recognition.
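The full pipeline (multi-level representation, clustering, expert labels per cluster, unanimity voting) is richer than this, but the basic cluster-then-label idea can be sketched as below with scikit-learn; ask_expert is a hypothetical callback standing in for the human labeling step, and the unanimity vote is omitted.

```python
# Cluster-then-label sketch: one expert query per cluster, then a 1-NN recognizer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def semi_automatic_labels(X, ask_expert, n_clusters=50):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # One expert query per cluster labels every sample in that cluster.
    cluster_label = {c: ask_expert(X[km.labels_ == c][0]) for c in range(n_clusters)}
    y = np.array([cluster_label[c] for c in km.labels_])
    return KNeighborsClassifier(n_neighbors=1).fit(X, y)
```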

Posted Content
TL;DR: Another substantial improvement is reported: 0.31% error on MNIST, obtained using a committee of simple but deep MLPs, outperforming all previous, more complex methods.
Abstract: The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent substantial improvement by others dates back 7 years (error rate 0.4%). Recently we were able to significantly improve this result, using graphics cards to greatly speed up training of simple but deep MLPs, which achieved 0.35%, outperforming all the previous more complex methods. Here we report another substantial improvement: 0.31% obtained using a committee of MLPs.

Proceedings ArticleDOI
18 Sep 2011
TL;DR: Three part-based methods for handwritten character recognition are introduced and the relative superiority of the class distance method and the robustness of the multiple voting method against the reduction of training set are shown.
Abstract: The purpose of this paper is to introduce three part-based methods for handwritten character recognition and then compare their performances experimentally. All of these methods decompose handwritten characters into "parts". Some recognition processes are then done in a part-wise manner and, finally, the recognition results at all the parts are combined via voting to obtain the recognition result for the entire character. Since part-based methods do not rely on the global structure of the character, we can expect robustness against various deformations. Three voting methods have been investigated for the combination: single voting, multiple voting, and class distance. All of them use different strategies for voting. Experimental results on the MNIST database showed the relative superiority of the class distance method and the robustness of the multiple voting method against the reduction of the training set.
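As a rough illustration of the single-voting variant: split the character image into parts, classify each part independently, and let every part cast one vote for a class. part_classifier below is a hypothetical stand-in for whatever part-level recognizer is used; the part size and stride are arbitrary.

```python
# Part-based recognition by voting (single-vote variant, illustrative only).
import numpy as np
from collections import Counter

def extract_parts(img, size=7, stride=7):
    h, w = img.shape
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

def vote_classify(img, part_classifier):
    # part_classifier is assumed to map a part to a digit label (0-9).
    votes = [part_classifier(p) for p in extract_parts(img)]
    return Counter(votes).most_common(1)[0][0]
```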

Journal ArticleDOI
Yaping Huang, Jiali Zhao, Yunhui Liu, Siwei Luo, Qi Zou, Mei Tian
TL;DR: A new Nonlinear Neighborhood Preserving (NNP) technique is developed, by utilizing the temporal coherence principle to find an optimal low dimensional representation from the original high dimensional data.

Journal ArticleDOI
TL;DR: The proposed discriminative method to select GMM structures for pattern classification behaves better than the manual method and the generative counterparts, including Bayesian Information Criterion, Minimum Description Length (MDL) and AutoClass.

Proceedings ArticleDOI
Brian Cheung, Carl Sable
18 Dec 2011
TL;DR: This paper applies a hybrid evolutionary search procedure to define the initialization and architectural parameters of convolutional networks, one of the first successful deep network models, and makes use of stochastic diagonal Levenberg-Marquardt to accelerate the convergence of training.
Abstract: With the increasing trend of neural network models towards larger structures with more layers, we expect a corresponding exponential increase in the number of possible architectures. In this paper, we apply a hybrid evolutionary search procedure to define the initialization and architectural parameters of convolutional networks, one of the first successful deep network models. We make use of stochastic diagonal Levenberg-Marquardt to accelerate the convergence of training, lowering the time cost of fitness evaluation. Using parameters found from the evolutionary search together with absolute value and local contrast normalization preprocessing between layers, we achieve the best known performance on several of the MNIST Variations, rectangles-image and convex image datasets.
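In spirit, the search keeps a small population of architecture/initialization genomes, evaluates each by briefly training a network, and breeds the fittest. Below is a deliberately tiny sketch in which fitness is a caller-supplied placeholder for "train and return validation error" (the step the paper accelerates with stochastic diagonal Levenberg-Marquardt); the parameter names and ranges are invented.

```python
# Toy evolutionary search over CNN hyper-parameters (illustrative only).
import random

def random_genome():
    return {"n_feature_maps": random.choice([5, 10, 20, 50]),
            "kernel_size": random.choice([3, 5, 7]),
            "init_scale": random.choice([0.01, 0.05, 0.1])}

def mutate(g):
    g = dict(g)
    key = random.choice(list(g))
    g[key] = random_genome()[key]          # resample one gene
    return g

def evolve(fitness, pop_size=10, n_gen=5):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness)              # lower validation error is better
        parents = pop[: pop_size // 2]
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return min(pop, key=fitness)
```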

Proceedings ArticleDOI
30 Nov 2011
TL;DR: This paper presents a case study on the impact of using reduced precision arithmetic on learning in Restricted Boltzmann Machine (RBM) deep belief networks and demonstrates that RBM can be trained successfully using resource-efficient fixed point formats commonly found in current FPGA devices.
Abstract: This paper presents a case study on the impact of using reduced precision arithmetic on learning in Restricted Boltzmann Machine (RBM) deep belief networks. FPGAs provide a hardware accelerator framework to speed up many algorithms, including the learning and recognition tasks of ever growing neural network topologies and problem complexities. Current FPGAs include DSP blocks - hard blocks that let designers implement in hardware what would otherwise require a significant quantity of reconfigurable logic (slices), and that increase the clock performance of arithmetic operations. Accelerators on FPGAs can take advantage of, in some products, thousands of DSP blocks on a single chip to scale up the parallelism of designs. Conversely, IEEE floating point representation cannot be fully implemented in single DSP slices and requires a significant amount of general logic, thus reducing the resources available for parallelism in an accelerator design. Reduced precision fixed point arithmetic can fit within a single DSP slice without external logic. It has been used successfully for training MLP-BP neural networks on small problems. The merit of reduced precision computation in RBM networks for sizable problems has not been evaluated. In this work, a three layer RBM network linked to one classification layer (1.6M weights) is used to learn the classic MNIST dataset over a set of common limited precisions used in FPGA designs. Issues of parameter saturation and a method to overcome the inherent training difficulties are discussed. The results demonstrate that RBMs can be trained successfully using resource-efficient fixed point formats commonly found in current FPGA devices.
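One software way to mimic reduced-precision DSP arithmetic is to snap weights (and, if desired, activations) to a fixed-point grid after every update and watch for saturation. The quantizer below is such a sketch in numpy; the integer/fraction bit widths are arbitrary choices, not the formats evaluated in the paper.

```python
# Fixed-point quantization of RBM weights to mimic limited-precision arithmetic.
import numpy as np

def to_fixed_point(x, int_bits=3, frac_bits=12):
    scale = 2.0 ** frac_bits
    lo, hi = -(2.0 ** int_bits), 2.0 ** int_bits - 1.0 / scale   # saturation limits
    return np.clip(np.round(x * scale) / scale, lo, hi)

w = np.random.randn(784, 500) * 0.01
w_q = to_fixed_point(w)      # use w_q in the CD-k updates to study saturation effects
```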

Proceedings ArticleDOI
06 Nov 2011
TL;DR: A directed bilinear model that learns higher-order groupings among features of natural images and achieves high log-likelihood (−94 nats), surpassing the current state of the art for natural images achievable with an mcRBM model.
Abstract: We describe a directed bilinear model that learns higher-order groupings among features of natural images. The model represents images in terms of two sets of latent variables: one set of variables represents which feature groups are active, while the other specifies the relative activity within groups. Such a factorized representation is beneficial because it is stable in response to small variations in the placement of features while still preserving information about relative spatial relationships. When trained on MNIST digits, the resulting representation provides state of the art performance in classification using a simple classifier. When trained on natural images, the model learns to group features according to proximity in position, orientation, and scale. The model achieves high log-likelihood (−94 nats), surpassing the current state of the art for natural images achievable with an mcRBM model.

Proceedings ArticleDOI
03 Oct 2011
TL;DR: The M-DBN is introduced, an unsupervised modular DBN that addresses the forgetting problem and retains learned features even after those features are removed from the training data, while monolithic DBNs of comparable size forget feature mappings learned before.
Abstract: Deep belief networks (DBNs) are popular for learning compact representations of high-dimensional data. However, most approaches so far rely on having a single, complete training set. If the distribution of relevant features changes during subsequent training stages, the features learned in earlier stages are gradually forgotten. Often it is desirable for learning algorithms to retain what they have previously learned, even if the input distribution temporarily changes. This paper introduces the M-DBN, an unsupervised modular DBN that addresses the forgetting problem. M-DBNs are composed of a number of modules that are trained only on samples they best reconstruct. While modularization by itself does not prevent forgetting, the M-DBN additionally uses a learning method that adjusts each module's learning rate proportionally to the fraction of best reconstructed samples. On the MNIST handwritten digit dataset, module specialization largely corresponds to the digits discerned by humans. Furthermore, in several learning tasks with changing MNIST digits, M-DBNs retain learned features even after those features are removed from the training data, while monolithic DBNs of comparable size forget feature mappings learned before.

Proceedings ArticleDOI
22 May 2011
TL;DR: Experimental results on the MNIST benchmark indicate that the proposed classifier outperforms current state-of-the-art techniques, especially when very few labeled patterns are available.
Abstract: We propose a novel semi-supervised classifier for handwritten digit recognition problems that is based on the assumption that any digit can be obtained as a slight transformation of another sufficiently close digit. Given a number of labeled and unlabeled images, it is possible to determine the class membership of each unlabeled image by creating a sequence of such image transformations that connect it, through other unlabeled images, to a labeled image. In order to measure the total transformation, a robust and reliable metric of the path length is proposed, which combines a local dissimilarity between consecutive images along the path with a global connectivity-based metric. For the local dissimilarity we use a symmetrized version of the zero-order image deformation model (IDM) proposed by Keysers et al. in [1]. For the global distance we use a connectivity-based metric proposed by Chapelle and Zien in [2]. Experimental results on the MNIST benchmark indicate that the proposed classifier outperforms current state-of-the-art techniques, especially when very few labeled patterns are available.

Journal ArticleDOI
TL;DR: This paper views the problem of embedding a set of relational structures into a metric space for purposes of matching and categorisation from a Riemannian perspective and makes use of the concepts of charts on the manifold to define the embedding as a mixture of class-specific submersions.

Proceedings ArticleDOI
22 May 2011
TL;DR: A new Bayesian model is proposed that integrates dictionary learning and topic modeling into a unified framework; it is applied to clustering multiple images, a subset of which may be annotated, and demonstrates state-of-the-art performance.
Abstract: A new Bayesian model is proposed, integrating dictionary learning and topic modeling into a unified framework. The model is applied to cluster multiple images, and a subset of the images may be annotated. Example results are presented on the MNIST digit data and on the Microsoft MSRC multi-scene image data. These results reveal the working mechanisms of the model and demonstrate state-of-the-art performance.

Proceedings ArticleDOI
01 Nov 2011
TL;DR: CoDLib, a combination of distributed and parallel computing methods, is proposed; it shows a great speed-up in training on the MNIST dataset, with training time significantly reduced compared with standard LIBSVM without affecting the quality of the SVM.
Abstract: The Support Vector Machine (SVM) is an efficient data mining approach for data classification. However, the SVM algorithm requires very large memory and computational time to deal with very large datasets. To reduce the computational time of training the SVM, a combination of distributed and parallel computing methods, CoDLib, has been proposed. Instead of using a single machine for parallel computing, multiple machines in a cluster are used. The Message Passing Interface (MPI) is used for communication between machines in the cluster. The original dataset is split and distributed to the respective machines. Experimental results show a great speed-up in training on the MNIST dataset, where training time has been significantly reduced compared with standard LIBSVM without affecting the quality of the SVM.
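Illustrative only: the real CoDLib distributes LIBSVM training across cluster machines with MPI, whereas the sketch below stays in a single process with scikit-learn and simply splits the data, trains one SVM per chunk, and combines predictions by majority vote (integer digit labels are assumed).

```python
# Split-train-combine SVM sketch (single process; the paper uses MPI across machines).
import numpy as np
from sklearn.svm import SVC

def train_split_svms(X, y, n_chunks=4):
    idx = np.array_split(np.random.permutation(len(y)), n_chunks)
    return [SVC(kernel="rbf", gamma="scale").fit(X[i], y[i]) for i in idx]

def vote_predict(models, X):
    preds = np.stack([m.predict(X) for m in models]).astype(int)
    # Majority vote over the per-chunk SVMs.
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
```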

Posted Content
TL;DR: A comparison between a multivariate and a probabilistic approach is shown, concluding that both methods provide similar results in terms of test-error rate.
Abstract: Pattern recognition is one of the major challenges in the statistics framework. Its goal is to extract features that classify patterns into categories. A well-known example in this field is handwritten digit recognition, where digits have to be assigned to one of the 10 classes using some classification method. Our purpose is to present alternative classification methods based on statistical techniques. We show a comparison between a multivariate and a probabilistic approach, concluding that both methods provide similar results in terms of test-error rate. Experiments are performed on the well-known MNIST and USPS databases using binary-level images. Then, as an additional contribution, we introduce a novel method to binarize images, based on statistical concepts associated with the written trace of the digit.

Journal ArticleDOI
TL;DR: A simple method based on some statistical measurements for Latin handwritten digit recognition is proposed in this paper; six categories are created based on the relation between the number of termination points and the possible digits.
Abstract: A simple method based on some statistical measurements for Latin handwritten digit recognition is proposed in this paper. First, a preprocessing step thresholds the gray-scale digit image into a binary image and then performs noise removal, spur removal and thinning. Second, to reduce the search space, the region of interest (ROI) is cropped from the preprocessed image, a Freeman chain code template is applied, and five feature sets are extracted from each digit image: the number of termination points, their coordinates relative to the center of the ROI, Euclidean distances, orientations in terms of angles, and other statistical properties such as the minor-to-major axis length ratio and area. Finally, six categories are created based on the relation between the number of termination points and the possible digits. The present method is applied and tested on the training set (60,000 images) and test set (10,000 images) of the MNIST handwritten digit database. Our experiments report a correct classification rate of 92.9041% for the testing set and 95.0953% for the training set.

Proceedings Article
14 Jun 2011
TL;DR: The Restricted Boltzmann Machine (RBM) is an undirected graphical model with latent variables, exact inference, rather simple sampling procedures (block Gibbs), and several successful learning algorithms based on approximations of the log-likelihood gradient.
Abstract: The Restricted Boltzmann Machine (Smolensky, 1986; Hinton et al., 2006) has inspired much research in recent years, in particular as a building block for deep architectures (see Bengio (2009) for a review). The Restricted Boltzmann Machine (RBM) is an undirected graphical model with latent variables, exact inference, rather simple sampling procedures (block Gibbs), and several successful learning algorithms based on approximations of the log-likelihood gradient. However, when it comes to actually computing the distribution or density function, it is intractable, except when either the number of inputs or latent variables is very small (about 25 binary hidden units with current computers and about an hour of computing, on MNIST).
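The intractability is easy to see in code: computing the partition function exactly requires summing over every hidden configuration (the visibles can be summed out analytically), which is feasible only for toy sizes. A small numpy sketch, with made-up dimensions:

```python
# Exact log partition function of a tiny binary RBM by brute-force enumeration.
import numpy as np
from itertools import product

def exact_log_Z(W, b_vis, b_hid):
    log_terms = []
    for h in product([0, 1], repeat=len(b_hid)):      # 2^(num hidden) configurations
        h = np.array(h)
        # log of: exp(b_hid.h) * prod_i (1 + exp((W h + b_vis)_i)), visibles summed out
        log_terms.append(b_hid @ h + np.sum(np.log1p(np.exp(W @ h + b_vis))))
    m = max(log_terms)
    return m + np.log(np.sum(np.exp(np.array(log_terms) - m)))   # stable log-sum-exp

rng = np.random.default_rng(0)
print(exact_log_Z(rng.normal(size=(6, 5)) * 0.1, np.zeros(6), np.zeros(5)))  # 32 terms
```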

Li Deng, Dong Yu
01 Jun 2011
TL;DR: Experimental results on handwriting image recognition task (MNIST) and on phone state classification (TIMIT) demonstrate superior performance of DCN over DBN not only in training efficiency but also in classification accuracy.
Abstract: To overcome the scalability challenge associated with the Deep Belief Network (DBN), we have designed a novel deep learning architecture, the deep convex network (DCN). The learning problem in DCN is convex within each layer. Additional structure-exploiting fine tuning further improves the quality of DCN. The full learning in DCN is batch-mode based instead of stochastic, making it naturally amenable to parallel training that can be distributed over many machines. Experimental results on a handwriting image recognition task (MNIST) and on phone state classification (TIMIT) demonstrate superior performance of DCN over DBN not only in training efficiency but also in classification accuracy. On MNIST, DCN gives an error rate of 0.83%, the lowest without the use of additional training data produced by elastic distortion. The corresponding error rate of the best DBN which we have carefully tuned is 1.06%. On the TIMIT task, DCN also outperforms DBN, but by a relatively smaller margin so far.

Proceedings Article
14 Jun 2011
TL;DR: It is shown that class-irrelevant features help class-relevant features focus on the recognition task and introduce useful regularization effects that reduce the norms of class-relevant features in this hybrid third-order Restricted Boltzmann Machine.
Abstract: Restricted Boltzmann Machines are commonly used in unsupervised learning to extract features from training data. Since these features are learned for regenerating training data, a classifier based on them has to be trained. If only a few of the learned features are discriminative, other non-discriminative features will distract the classifier during the training process and thus waste computing resources during testing. In this paper, we present a hybrid third-order Restricted Boltzmann Machine in which class-relevant features (for recognizing) and class-irrelevant features (for generating only) are learned simultaneously. As the classification task uses only the class-relevant features, the test itself becomes very fast. We show that class-irrelevant features help class-relevant features focus on the recognition task and introduce useful regularization effects that reduce the norms of class-relevant features. Thus there is no need to use weight decay for the parameters of this model. Experiments on the MNIST, NORB and Caltech101 Silhouettes datasets show very promising results.

Proceedings ArticleDOI
16 Jul 2011
TL;DR: This work proposes an algorithm that can adapt a preexisting MMB trained with extensive data to a new link from which very limited data is available, and shows it can learn accurate models from data traces of about 1 minute, about 10 times shorter than needed if training an MMB from scratch.
Abstract: The mixture of multivariate Bernoulli distributions (MMB) is a statistical model for high-dimensional binary data in widespread use. Recently, the MMB has been used to model the sequence of packet receptions and losses of wireless links in sensor networks. Given an MMB trained on long data traces recorded from links of a deployed network, one can then use samples from the MMB to test different routing algorithms for as long as desired. However, learning an accurate model for a new link requires collecting from it long traces over periods of hours, a costly process in practice (e.g. limited battery life). We propose an algorithm that can adapt a preexisting MMB trained with extensive data to a new link from which very limited data is available. Our approach constrains the new MMB's parameters through a nonlinear transformation of the existing MMB's parameters. The transformation has a small number of parameters that are estimated using a generalized EM algorithm with an inner loop of BFGS iterations. We demonstrate the efficacy of the approach using the MNIST dataset of handwritten digits, and wireless link data from a sensor network. We show we can learn accurate models from data traces of about 1 minute, about 10 times shorter than needed if training an MMB from scratch.
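For reference, the underlying MMB is a Bernoulli mixture fitted by EM; the generic sketch below (binary data in, mixing weights and per-component Bernoulli means out) is not the paper's adaptation algorithm, which instead constrains a pre-trained MMB through a learned nonlinear transformation estimated with generalized EM and BFGS.

```python
# Generic EM for a mixture of multivariate Bernoullis on binary data.
import numpy as np

def mmb_em(X, K=5, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                          # mixing proportions
    mu = rng.uniform(0.25, 0.75, size=(K, d))         # per-component Bernoulli means
    for _ in range(n_iter):
        # E-step: responsibilities from Bernoulli log-likelihoods.
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means and mixing proportions.
        Nk = r.sum(axis=0)
        mu = np.clip((r.T @ X) / Nk[:, None], 1e-3, 1 - 1e-3)
        pi = Nk / n
    return pi, mu
```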