Showing papers on "MNIST database published in 2012"


Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art error rates on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations
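
The top-1 and top-5 error rates quoted in the abstract above count a prediction as wrong when the true label is not among the k highest-scoring classes. A minimal sketch of that metric follows; the function name `top_k_error` and the toy `scores` and `labels` are illustrative assumptions, not taken from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy data (hypothetical): 4 examples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [1, 1, 0, 2]

print(top_k_error(scores, labels, 1))
print(top_k_error(scores, labels, 2))
```

With a 1000-way softmax, the same function would be called with k=1 and k=5 on the per-class outputs.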


Posted Content
TL;DR: A novel per-dimension learning rate method for gradient descent called ADADELTA that dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent is presented.
Abstract: We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.

6,189 citations
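
The abstract above does not spell out the update rule. The sketch below assumes the commonly described form of ADADELTA, in which each dimension keeps a decaying average of squared gradients and a decaying average of squared updates, and the ratio of their root-mean-squares sets the step size. The decay `rho`, the constant `eps`, the function names, and the quadratic test objective are assumptions chosen for illustration.

```python
import math


def adadelta_minimize(grad, x0, steps, rho=0.95, eps=1e-6):
    """Run `steps` ADADELTA updates from x0; `grad` maps a point to its gradient."""
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)  # decaying average of squared gradients
    avg_sq_step = [0.0] * len(x)  # decaying average of squared updates
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g[i] ** 2
            # Per-dimension step: RMS of past updates over RMS of gradients.
            step = -math.sqrt(avg_sq_step[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g[i]
            avg_sq_step[i] = rho * avg_sq_step[i] + (1 - rho) * step ** 2
            x[i] += step
    return x


def quadratic_grad(x):
    """Gradient of the toy objective f(x) = sum(x_i ** 2)."""
    return [2.0 * xi for xi in x]


x_final = adadelta_minimize(quadratic_grad, [1.0], steps=20)
print(x_final)
```

No learning rate appears anywhere in the update, which is the property the abstract emphasizes; only `rho` and `eps` remain as settings.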


Proceedings ArticleDOI
16 Jun 2012
TL;DR: In this paper, biologically plausible, wide and deep artificial neural network architectures are proposed that match human performance on tasks such as the recognition of handwritten digits or traffic signs.
Abstract: Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.

3,717 citations


Journal ArticleDOI
Li Deng1
TL;DR: “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.
Abstract: In this issue, “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.

1,626 citations


Journal Article
TL;DR: In this article, the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research, are presented.
Abstract: In this issue, “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research. Handwritten digit recognition is an important problem in optical character recognition, and it has been used as a test case for theories of pattern recognition and machine learning algorithms for many years. Historically, to promote machine learning and pattern recognition research, several standard databases have emerged in which the handwritten digits are preprocessed, including segmentation and normalization, so that researchers can compare recognition results of their techniques on a common basis. The freely available MNIST database of handwritten digits has become a standard for fast-testing machine learning algorithms for this purpose. The simplicity of this task is analogous to the TIDigit (a speech database created by Texas Instruments) task in speech recognition. Just like there is a long list for more complex speech recognition tasks, there are many more difficult and challenging tasks for image recognition and computer vision, which will not be addressed in this column.

1,466 citations


Posted Content
TL;DR: This work considers the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous work that attacks these issues separately and shows classification performance often much better than using standard selection and hyperparameter optimization methods.
Abstract: Many different machine learning algorithms exist; taking into account each algorithm's hyperparameters, there is a staggeringly large number of possible alternatives overall. We consider the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous work that addresses these issues in isolation. We show that this problem can be addressed by a fully automated approach, leveraging recent innovations in Bayesian optimization. Specifically, we consider a wide range of feature selection techniques (combining 3 search and 8 evaluator methods) and all classification approaches implemented in WEKA, spanning 2 ensemble methods, 10 meta-methods, 27 base classifiers, and hyperparameter settings for each classifier. On each of 21 popular datasets from the UCI repository, the KDD Cup 09, variants of the MNIST dataset and CIFAR-10, we show classification performance often much better than using standard selection/hyperparameter optimization methods. We hope that our approach will help non-expert users to more effectively identify machine learning algorithms and hyperparameter settings appropriate to their applications, and hence to achieve improved performance.

1,004 citations


Journal ArticleDOI
TL;DR: A hybrid model of integrating the synergy of two superior classifiers: Convolutional Neural Network (CNN) and Support Vector Machine (SVM) which have proven results in recognizing different types of patterns is presented.

585 citations


Proceedings Article
03 Dec 2012
TL;DR: A new loss-augmented inference algorithm that is quadratic in the code length and inspired by latent structural SVMs is developed, showing strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.
Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.

562 citations
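
The abstract above reports classification with nothing more than kNN on the learned binary codes. A minimal sketch of that last step is shown here, assuming codes are stored as Python integers and compared by Hamming distance. The codes, labels, and function names are hypothetical, and the linear scan is a simplification: it does not implement the sub-linear search the abstract mentions, nor the learning of the codes themselves.

```python
from collections import Counter


def hamming(a, b):
    """Number of bit positions in which two integer-encoded codes differ."""
    return bin(a ^ b).count("1")


def knn_classify(query, codes, labels, k):
    """Majority label among the k stored codes closest to `query` in Hamming distance."""
    nearest = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 4-bit codes with class labels.
codes = [0b0000, 0b0001, 0b0011, 0b1111, 0b1110, 0b1100]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(0b0010, codes, labels, k=3))
print(knn_classify(0b1101, codes, labels, k=3))
```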


Journal ArticleDOI
TL;DR: A new learning algorithm for Boltzmann machines that contain many layers of hidden variables is presented and results on the MNIST and NORB data sets are presented showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects.
Abstract: We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feedforward neural nets, which are then discriminatively fine-tuned.

463 citations


Proceedings ArticleDOI
Li Deng1, Dong Yu1, John Platt1
25 Mar 2012
TL;DR: The Deep Stacking Network (DSN) is presented, which overcomes the problem of parallelizing learning algorithms for deep architectures and provides a method of stacking simple processing modules in building deep architectures, with a convex learning problem in each module.
Abstract: Deep Neural Networks (DNNs) have shown remarkable success in pattern recognition tasks. However, parallelizing DNN training across computers has been difficult. We present the Deep Stacking Network (DSN), which overcomes the problem of parallelizing learning algorithms for deep architectures. The DSN provides a method of stacking simple processing modules in building deep architectures, with a convex learning problem in each module. Additional fine tuning further improves the DSN, while introducing minor non-convexity. Full learning in the DSN is batch-mode, making it amenable to parallel training over many machines and thus scalable over the potentially huge size of the training data. Experimental results on both the MNIST (image) and TIMIT (speech) classification tasks demonstrate that the DSN learning algorithm developed in this work is not only parallelizable in implementation but it also attains higher classification accuracy than the DNN.

208 citations


Journal ArticleDOI
TL;DR: This work presents a simple and fast geometric method for modeling data by a union of affine subspaces, and gives extensive experimental evidence demonstrating the state of the art accuracy and speed of the suggested algorithms on these problems.
Abstract: We present a simple and fast geometric method for modeling data by a union of affine subspaces. The method begins by forming a collection of local best-fit affine subspaces, i.e., subspaces approximating the data in local neighborhoods. The correct sizes of the local neighborhoods are determined automatically by the Jones' β₂ numbers (we prove under certain geometric conditions that our method finds the optimal local neighborhoods). The collection of subspaces is further processed by a greedy selection procedure or a spectral method to generate the final model. We discuss applications to tracking-based motion segmentation and clustering of faces under different illuminating conditions. We give extensive experimental evidence demonstrating the state of the art accuracy and speed of the suggested algorithms on these problems and also on synthetic hybrid linear data as well as the MNIST handwritten digits data; and we demonstrate how to use our algorithms for fast determination of the number of affine subspaces.

Posted Content
TL;DR: This paper presents the transformation-invariant restricted Boltzmann machine that compactly represents data by its weights and their transformations, which achieves invariance of the feature representation via probabilistic max pooling.
Abstract: Learning invariant representations is an important problem in machine learning and pattern recognition. In this paper, we present a novel framework of transformation-invariant feature learning by incorporating linear transformations into the feature learning algorithms. For example, we present the transformation-invariant restricted Boltzmann machine that compactly represents data by its weights and their transformations, which achieves invariance of the feature representation via probabilistic max pooling. In addition, we show that our transformation-invariant feature learning framework can also be extended to other unsupervised learning methods, such as autoencoders or sparse coding. We evaluate our method on several image classification benchmark datasets, such as MNIST variations, CIFAR-10, and STL-10, and show competitive or superior classification performance when compared to the state-of-the-art. Furthermore, our method achieves state-of-the-art performance on phone classification tasks with the TIMIT dataset, which demonstrates wide applicability of our proposed algorithms to other domains.

Proceedings Article
03 Dec 2012
TL;DR: A different method of pretraining DBMs is developed that distributes the modelling work more evenly over the hidden layers and demonstrates that the new pretraining algorithm allows us to learn better generative models.
Abstract: We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modelling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.

Proceedings Article
27 Jun 2012
TL;DR: A hierarchical Bayesian model is developed that learns categories from single training examples and transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances.
Abstract: We develop a hierarchical Bayesian model that learns categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.

Journal ArticleDOI
Dong Yu1, Li Deng1
TL;DR: Experiments show that the algorithms proposed in this paper obtain significantly better classification accuracy than ELM when the same number of hidden units is used, and at the expense of 5 times or less the training cost incurred by the ELM training.

Posted Content
TL;DR: On the very competitive MNIST handwriting benchmark, this method is the first to achieve near-human performance and improves the state-of-the-art on a plethora of common image classification benchmarks.
Abstract: Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.

Journal ArticleDOI
TL;DR: A new structure, the folded autoencoder, based on the symmetric structure of the conventional autoencoder, is proposed for dimensionality reduction; it reduces the number of weights to be tuned and thus the computational cost.

Proceedings ArticleDOI
02 Jul 2012
TL;DR: It is demonstrated that this kind of representation coupled to a SVM improves classification error on MNIST over the usual deep learning approach where a logistic regression layer is added to the stack of denoising autoencoders.
Abstract: Deep learning allows automatically learning multiple levels of representations of the underlying distribution of the data to be modeled. In this work, a specific implementation called stacked denoising autoencoders is explored. We contribute by demonstrating that this kind of representation coupled to a SVM improves classification error on MNIST over the usual deep learning approach where a logistic regression layer is added to the stack of denoising autoencoders.

Book ChapterDOI
01 Jan 2012
TL;DR: All you need to achieve this until 2011 best result are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.
Abstract: The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent advancement by others dates back 8 years (error rate 0.4%). Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark with a single MLP and 0.31% with a committee of seven MLP. All we need to achieve this until 2011 best result are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.

Journal ArticleDOI
TL;DR: Experimental comparison between the LiRA perceptron and the modular assembly neural network is accomplished, which shows that recognition capability of the modules is somewhat better.

Proceedings ArticleDOI
25 Oct 2012
TL;DR: This paper proposes a hardware architecture for K-means with triangle inequality optimization on FPGA. An optimal 8-bit square calculator for 6-LUT architectures is described to minimize the hardware cost, and an approximation is proposed to avoid the square root calculation in the original triangle inequality optimization.
Abstract: One of the challenges to data mining raised by technology development is that both data size and dimensionality are growing rapidly. K-means, one of the most popular clustering algorithms in data mining, suffers in computational time when used for large data sets and data with high dimensionality. In this paper, we propose a hardware architecture for K-means with triangle inequality optimization on FPGA. An optimal 8-bit square calculator for 6-LUT architectures is described to minimize the hardware cost and an approximation solution is proposed to avoid square root calculation in the original triangle inequality optimization. Our software and hardware experiments are tested with the MNIST benchmark and uniform random numbers of various size. This approximation results in 2% more distance calculations for MNIST and 5% for uniform random numbers than the original optimization. Compared to the baseline hardware system without optimization, our approach achieves up to 77% improvement in processing time with about 10% logic overhead. We demonstrate that the hardware can achieve 55-fold speed up compared to software for the 1024 MNIST.
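
The triangle inequality optimization mentioned above rests on a simple bound: if the distance between the current best center and another center is at least twice the distance from the point to the best center, the other center cannot be closer, so its distance need not be computed. The software sketch below illustrates that pruning in the assignment step only. It uses an ordinary square root rather than the paper's hardware approximation, and all names and data are illustrative assumptions.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping provably farther centers."""
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(c1, c2) for c2 in centers] for c1 in centers]
    assignments, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        for j in range(1, len(centers)):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best): prune c_j.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
    return assignments, skipped


def assign_brute_force(points, centers):
    """Reference assignment that computes every point-to-center distance."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j])) for p in points]


points = [(0.0, 0.1), (0.2, 0.0), (10.0, 10.1), (9.8, 10.0), (4.0, 5.0)]
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]

assignments, skipped = assign_with_pruning(points, centers)
print(assignments, skipped)
```

The pruned pass returns the same assignments as the brute-force pass while computing fewer point-to-center distances.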

Journal ArticleDOI
TL;DR: In this article, latent log-linear models, an extension of log-linear models incorporating latent variables, are proposed with two applications: log-linear mixture models and image deformation-aware log-linear models. The resulting models are fully discriminative, can be trained efficiently, and their complexity can be controlled.
Abstract: We present latent log-linear models, an extension of log-linear models incorporating latent variables, and we propose two applications thereof: log-linear mixture models and image deformation-aware log-linear models. The resulting models are fully discriminative, can be trained efficiently, and the model complexity can be controlled. Log-linear mixture models offer additional flexibility within the log-linear modeling framework. Unlike previous approaches, the image deformation-aware model directly considers image deformations and allows for a discriminative training of the deformation parameters. Both are trained using alternating optimization. For certain variants, convergence to a stationary point is guaranteed and, in practice, even variants without this guarantee converge and find models that perform well. We tune the methods on the USPS data set and evaluate on the MNIST data set, demonstrating the generalization capabilities of our proposed models. Our models, although using significantly fewer parameters, are able to obtain competitive results with models proposed in the literature.

Book ChapterDOI
11 Sep 2012
TL;DR: Two extensions of the parallel tempering algorithm, a Markov Chain Monte Carlo method to approximate the likelihood gradient, are examined: one directed at a more effective exchange of information among the parallel sampling chains, and one that estimates gradients by averaging over chains from different temperatures.
Abstract: Restricted Boltzmann Machines (RBM's) are unsupervised probabilistic neural networks that can be stacked to form Deep Belief Networks. Given the recent popularity of RBM's and the increasing availability of parallel computing architectures, it becomes interesting to investigate learning algorithms for RBM's that benefit from parallel computations. In this paper, we look at two extensions of the parallel tempering algorithm, which is a Markov Chain Monte Carlo method to approximate the likelihood gradient. The first extension is directed at a more effective exchange of information among the parallel sampling chains. The second extension estimates gradients by averaging over chains from different temperatures. We investigate the efficiency of the proposed methods and demonstrate their usefulness on the MNIST dataset. Especially the weighted averaging seems to benefit Maximum Likelihood learning.

Journal Article
TL;DR: An overview of the Semantic Pointer Architecture: Unified Network (Spaun) model is presented and it is demonstrated that this biologically plausible spiking neuron model has the following features: Task Flexibility: No changes are made to the model between tasks.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: This study investigates a novel feature extraction technique, the hotspot technique, for representing handwritten characters and digits, and finds that the hotspot technique provides the largest average classification rates.
Abstract: Feature extraction techniques can be important in character recognition, because they can enhance the efficacy of recognition in comparison to featureless or pixel-based approaches. This study aims to investigate the novel feature extraction technique called the hotspot technique in order to use it for representing handwritten characters and digits. In the hotspot technique, the distance values between the closest black pixels and the hotspots in each direction are used as representation for a character. The hotspot technique is applied to three data sets including Thai handwritten characters (65 classes), Bangla numeric (10 classes), and MNIST (10 classes). The hotspot technique consists of two parameters including the number of hotspots and the number of chain code directions. The data sets are then classified by the k-Nearest Neighbors algorithm using the Euclidean distance as function for computing distances between data points. In this study, the classification rates obtained from the hotspot, mark direction, and direction of chain code techniques are compared. The results revealed that the hotspot technique provides the largest average classification rates.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: A novel local classifier, Parametric Nearest Neighbor (P-NN) and its extension Ensemble of P-NN (EP-NN), which parameterize the nearest neighbor algorithm based on the minimum weighted squared Euclidean distances between the data points and the prototypes.
Abstract: Linear SVMs are efficient in both training and testing, however the data in real applications is rarely linearly separable. Non-linear kernel SVMs are too computationally intensive for applications with large-scale data sets. Recently locally linear classifiers have gained popularity due to their efficiency whilst remaining competitive with kernel methods. The vanilla nearest neighbor algorithm is one of the simplest locally linear classifiers, but it lacks robustness due to the noise often present in real-world data. In this paper, we introduce a novel local classifier, Parametric Nearest Neighbor (P-NN) and its extension Ensemble of P-NN (EP-NN). We parameterize the nearest neighbor algorithm based on the minimum weighted squared Euclidean distances between the data points and the prototypes, where a prototype is represented by a locally linear combination of some data points. Meanwhile, our method attempts to jointly learn both the prototypes and the classifier parameters discriminatively via max-margin. This makes our classifiers suitable to approximate the classification decision boundaries locally based on nonlinear functions. During testing, the computational complexity of both classifiers is linear in the product of the dimension of data and the number of prototypes. Our classification results on MNIST, USPS, LETTER, and Chars 74K are comparable and in some cases are better than many other methods such as the state-of-the-art locally linear classifiers.

Book ChapterDOI
16 Dec 2012
TL;DR: A library written in C# for an online handwriting recognition system using the UNIPEN online handwritten training set is presented, together with a proposed handwriting segmentation algorithm that extracts sentences, words and characters from handwritten text.
Abstract: This paper presents a library written in C# for an online handwriting recognition system using the UNIPEN online handwritten training set. The recognition engine is based on convolutional neural networks and yields recognition rates of 99% on the MNIST training set, 97% on UNIPEN's digit training set (1a), 89% on a collection of 44022 capital letters and digits (1a,1b) and 89% on lower case letters (1c). These networks are combined to create a larger system which can recognize 62 English characters and digits. A proposed handwriting segmentation algorithm extracts sentences, words and characters from handwritten text. The characters are then given as input to the network.

Proceedings Article
Chunpeng Wu1, Wei Fan1, Yuan He1, Jun Sun1, Satoshi Naoi1 
01 Nov 2012
TL;DR: This paper presents a handwritten digit recognition method based on cascaded heterogeneous convolutional neural networks (CNNs) that achieves an error rate of 0.23% using only 5 CNNs, on par with the human vision system.
Abstract: This paper presents a handwritten digit recognition method based on cascaded heterogeneous convolutional neural networks (CNNs). The reliability and complementation of heterogeneous CNNs are investigated in our method. Each CNN recognizes a proportion of input samples with high confidence, and feeds the rejected samples into the next CNN. The samples rejected by the last CNN are recognized by a voting committee of all CNNs. Experiments on the MNIST dataset show that our method achieves an error rate of 0.23% using only 5 CNNs, on par with the human vision system. Using heterogeneous networks can reduce the number of CNNs needed to reach certain performance compared with networks built from the same type. Further improvements include fine-tuning the rejection threshold of each CNN and adding CNNs of more types.
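
The control flow described in the abstract above (accept at high confidence, otherwise pass the sample on, and fall back to a vote of all networks) can be sketched as follows. The classifiers here are hypothetical stand-in functions returning a label and a confidence, not CNNs, and the thresholds are arbitrary assumptions.

```python
from collections import Counter


def cascade_predict(sample, classifiers, thresholds):
    """Return (label, stage); stage is the accepting classifier's index, or None for a vote."""
    outputs = []
    for stage, (clf, threshold) in enumerate(zip(classifiers, thresholds)):
        label, confidence = clf(sample)
        outputs.append(label)
        if confidence >= threshold:
            return label, stage  # accepted with high confidence at this stage
    # Rejected by every stage: majority vote over all classifiers' labels.
    return Counter(outputs).most_common(1)[0][0], None


# Hypothetical stand-ins for trained networks: each returns (label, confidence).
def clf_a(x):
    return ("even" if x % 2 == 0 else "odd", 0.95 if x < 10 else 0.4)


def clf_b(x):
    return ("even" if x % 2 == 0 else "odd", 0.9 if x < 100 else 0.3)


def clf_c(x):
    return ("odd", 0.2)


classifiers = [clf_a, clf_b, clf_c]
thresholds = [0.9, 0.8, 0.8]

for sample in (4, 50, 200):
    print(sample, cascade_predict(sample, classifiers, thresholds))
```

By the time a sample reaches the vote, every classifier has already produced a label for it, so the committee step adds no further forward passes in this sketch.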

Proceedings ArticleDOI
10 Jun 2012
TL;DR: A spiking neural network of integrate-and-fire neurons to perform pattern recognition is presented and the synaptic dynamics is shown to be compatible with many experimental observations on induction of long-term modifications, like spike-timing-dependent plasticity (STDP).
Abstract: Many conventional methods have been widely studied to solve the pattern recognition task, but most of them lack biological plausibility. This paper presents a spiking neural network of integrate-and-fire neurons to perform pattern recognition. A biologically plausible supervised synaptic learning rule is used so that neurons can efficiently make a decision. The whole system contains encoding, learning and readout. It can classify complex patterns of activities stored in a vector, as well as real-world stimuli. We test the performance of the network with digit images from MNIST and images of alphabetic letters. It turns out to be able to classify these patterns correctly. In addition, the synaptic dynamics is shown to be compatible with many experimental observations on induction of long-term modifications, like spike-timing-dependent plasticity (STDP).

Book ChapterDOI
02 Jan 2012
TL;DR: An improved recognition rate is observed at the higher end by using a combination of both features, which shows the effectiveness of the dynamic directional feature in the classification of handwritten character patterns.
Abstract: In this work, we propose a modified chain code histogram (CCH) based feature extraction method for handwritten character recognition (HCR) applications. This modified approach explores the dynamic nature of directional information, available in character patterns, by introducing the Differential CCH, which is termed Delta (Δ) CCH. A comparable and higher recognition rate is reported, which emphasizes that the dynamic nature of directional information captured by the ΔCCH is as important as that of CCH. All the experiments are conducted on the MNIST handwritten numeral database. Finally, an improved recognition rate is observed at the higher end by using a combination of both features, which shows the effectiveness of the dynamic directional feature in the classification of handwritten character patterns.