scispace - formally typeset
Search or ask a question

Showing papers on "Deep learning published in 2011"


Proceedings Article
28 Jun 2011
TL;DR: This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.
Abstract: Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.

2,830 citations


Book ChapterDOI
14 Jun 2011
TL;DR: A novel convolutional auto-encoder (CAE) for unsupervised feature learning that initializing a CNN with filters of a trained CAE stack yields superior performance on a digit and an object recognition benchmark.
Abstract: We present a novel convolutional auto-encoder (CAE) for unsupervised feature learning. A stack of CAEs forms a convolutional neural network (CNN). Each CAE is trained using conventional on-line gradient descent without additional regularization terms. A max-pooling layer is essential to learn biologically plausible features consistent with those found by previous approaches. Initializing a CNN with filters of a trained CAE stack yields superior performance on a digit (MNIST) and an object recognition (CIFAR10) benchmark.

1,832 citations


Proceedings Article
28 Jun 2011
TL;DR: A deep learning approach is proposed which learns to extract a meaningful representation for each review in an unsupervised fashion and clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products.
Abstract: The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training data for all of them. Hence, this paper studies the problem of domain adaptation for sentiment classifiers, hereby a system is trained on labeled reviews from one source domain but is meant to be deployed on another. We propose a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products. Furthermore, this method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains.

1,769 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This paper presents an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data and discovered that this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations.
Abstract: Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previous published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8% respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, will also be discussed. You can download our code and learned spatio-temporal features here: http://ai.stanford.edu/∼wzou/

1,116 citations


Proceedings Article
28 Jun 2011
TL;DR: It is shown that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms.
Abstract: The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between L-BFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% on the standard MNIST dataset. This is a state-of-the-art result on MNIST among algorithms that do not use distortions or pretraining.

908 citations


01 Jan 2011
TL;DR: This paper uses speech recognition as an example task, and shows that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speed up over an aggressively optimized floating-point baseline at no cost in accuracy.
Abstract: Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

883 citations


Book ChapterDOI
16 Nov 2011
TL;DR: A fully automated deep model, which learns to classify human actions without using any prior knowledge is proposed, which outperforms existing deep models, and gives comparable results with the best related works.
Abstract: We propose in this paper a fully automated deep model, which learns to classify human actions without using any prior knowledge. The first step of our scheme, based on the extension of Convolutional Neural Networks to 3D, automatically learns spatio-temporal features. A Recurrent Neural Network is then trained to classify each sequence considering the temporal evolution of the learned features for each timestep. Experimental results on the KTH dataset show that the proposed approach outperforms existing deep models, and gives comparable results with the best related works.

788 citations


Journal ArticleDOI
TL;DR: The convolutional deep belief network is presented, a hierarchical generative model that scales to realistic image sizes and is translation-invariant and supports efficient bottom-up and top-down probabilistic inference.
Abstract: There has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks (DBNs); however, scaling such models to full-sized, high-dimensional images remains a difficult problem. To address this problem, we present the convolutional deep belief network, a hierarchical generative model that scales to realistic image sizes. This model is translation-invariant and supports efficient bottom-up and top-down probabilistic inference. Key to our approach is probabilistic max-pooling, a novel technique that shrinks the representations of higher layers in a probabilistically sound way. Our experiments show that the algorithm learns useful high-level visual features, such as object parts, from unlabeled images of objects and natural scenes. We demonstrate excellent performance on several visual recognition tasks and show that our model can perform hierarchical (bottom-up and top-down) inference over full-sized images.

388 citations


Proceedings ArticleDOI
03 Oct 2011
TL;DR: This work describes the approach that won the preliminary phase of the German traffic sign recognition benchmark with a better-than-human recognition rate, and obtains an even better recognition rate by further training the nets.
Abstract: We describe the approach that won the preliminary phase of the German traffic sign recognition benchmark with a better-than-human recognition rate of 98.98%.We obtain an even better recognition rate of 99.15% by further training the nets. Our fast, fully parameterizable GPU implementation of a Convolutional Neural Network does not require careful design of pre-wired feature extractors, which are rather learned in a supervised way. A CNN/MLP committee further boosts recognition performance.

380 citations


Proceedings Article
12 Dec 2011
TL;DR: It is proved there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units.
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning.

312 citations


Journal Article
TL;DR: A property common to these shallow learning models is the simple architecture that consists of only one layer responsible for transforming the raw input signals or features into a problem-specific feature space, which may be unobservable.
Abstract: Today, signal processing research has a significantly widened its scope compared with just a few years ago [4], and machine learning has been an important technical area of the signal processing society. Since 2006, deep learning—a new area of machine learning research—has emerged [7], impacting a wide range of signal and information processing work within the traditional and the new, widened scopes. Various workshops, such as the 2009 ICML Workshop on Learning Feature Hierarchies; the 2008 NIPS Deep Learning Workshop: Foundations and Future Directions; and the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications as well as an upcoming special issue on deep learning for speech and language processing in IEEE Transactions on Audio, Speech, and Language Processing (2010) have been devoted exclusively to deep learning and its applications to classical signal processing areas. We have also seen the government sponsor research on deep learning.

Book ChapterDOI
05 Oct 2011
TL;DR: Some of the theoretical motivations for deep architectures, as well as some of their practical successes, are reviewed, and directions of investigations to address some of the remaining challenges are proposed.
Abstract: Deep architectures are families of functions corresponding to deep circuits. Deep Learning algorithms are based on parametrizing such circuits and tuning their parameters so as to approximately optimize some training objective. Whereas it was thought too difficult to train deep architectures, several successful algorithms have been proposed in recent years. We review some of the theoretical motivations for deep architectures, as well as some of their practical successes, and propose directions of investigations to address some of the remaining challenges.

Proceedings Article
12 Dec 2011
TL;DR: This paper proposes a fast method to choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric, and produces results showing how this method allows even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets.
Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded "local receptive fields" that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively.

Proceedings ArticleDOI
24 Oct 2011
TL;DR: A deep neural network is utilized to build a unified discriminative framework that allows for estimating the parameters of the latent space as well as the classification function with a bias for the target classification task at hand.
Abstract: In this paper, we propose an efficient embedding for modeling higher-order (n-gram) phrases that projects the n-grams to low-dimensional latent semantic space, where a classification function can be defined We utilize a deep neural network to build a unified discriminative framework that allows for estimating the parameters of the latent space as well as the classification function with a bias for the target classification task at hand We apply the framework to large-scale sentimental classification task We present comparative evaluation of the proposed method on two (large) benchmark data sets for online product reviews The proposed method achieves superior performance in comparison to the state of the art

Proceedings Article
14 Jun 2011
TL;DR: A new fast purely discriminative algorithm for natural language parsing, based on a “deep” recurrent convolutional graph transformer network (GTN) and assuming a decomposition of a parse tree into a stack of "levels”, the network predicts a level of the tree taking into account predictions of previous levels.
Abstract: We propose a new fast purely discriminative algorithm for natural language parsing, based on a “deep” recurrent convolutional graph transformer network (GTN). Assuming a decomposition of a parse tree into a stack of “levels”, the network predicts a level of the tree taking into account predictions of previous levels. Using only few basic text features which leverage word representations from Collobert and Weston (2008), we show similar performance (in F1 score) to existing pure discriminative parsers and existing “benchmark” parsers (like Collins parser, probabilistic context-free grammars based), with a huge speed advantage.

Proceedings ArticleDOI
22 May 2011
TL;DR: A DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models even though it currently uses an impoverished representation of the input.
Abstract: This paper considers application of Deep Belief Nets (DBNs) to natural language call routing. DBNs have been successfully applied to a number of tasks, including image, audio and speech classification, thanks to the recent discovery of an efficient learning technique. DBNs learn a multi-layer generative model from unlabeled data and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms; Support Vector machines (SVM), Boosting and Maximum Entropy (MaxEnt). The DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models even though it currently uses an impoverished representation of the input.

Journal ArticleDOI
TL;DR: Some weak and strong convergence results for the learning methods are presented, indicating that the gradient of the error function goes to zero and the weight sequence goes to a fixed point, respectively.

Journal ArticleDOI
TL;DR: This paper uses more than a unique neural network to reduce the possibility of decision with error in the prediction of Parkinson's Disease and demonstrates that the designed system, to some extent, deals with the problems of imbalanced data sets.
Abstract: Recently the neural network based diagnosis of medical diseases has taken a great deal of attention. In this paper a parallel feed-forward neural network structure is used in the prediction of Parkinson's Disease. The main idea of this paper is using more than a unique neural network to reduce the possibility of decision with error. The output of each neural network is evaluated by using a rule-based system for the final decision. Another important point in this paper is that during the training process, unlearned data of each neural network is collected and used in the training set of the next neural network. The designed parallel network system significantly increased the robustness of the prediction. A set of nine parallel neural networks yielded an improvement of 8.4% on the prediction of Parkinson's Disease compared to a single unique network. Furthermore, it is demonstrated that the designed system, to some extent, deals with the problems of imbalanced data sets.

Journal ArticleDOI
TL;DR: A deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document is described, which allows more accurate and much faster retrieval than latent semantic analysis.
Abstract: We describe a deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the generative model form an undirected associative memory and the remaining layers form a belief net with directed, top-down connections. We present efficient learning and inference procedures for this type of generative model and show that it allows more accurate and much faster retrieval than latent semantic analysis. By using our method as a filter for a much slower method called TF-IDF we achieve higher accuracy than TF-IDF alone and save several orders of magnitude in retrieval time. By using short binary codes as addresses, we can perform retrieval on very large document sets in a time that is independent of the size of the document set using only one word of memory to describe each document.

Journal ArticleDOI
TL;DR: A large-scale comparison of performances of the neural network training methods is examined on the data classification datasets and shows that the real-coded genetic algorithm may offer efficient alternative to traditional training methods for the classification problem.
Abstract: Artificial neural networks (ANN) have a wide ranging usage area in the data classification problems. Backpropagation algorithm is classical technique used in the training of the artificial neural networks. Since this algorithm has many disadvantages, the training of the neural networks has been implemented with the binary and real-coded genetic algorithms. These algorithms can be used for the solutions of the classification problems. The real-coded genetic algorithm has been compared with other training methods in the few works. It is known that the comparison of the approaches is as important as proposing a new classification approach. For this reason, in this study, a large-scale comparison of performances of the neural network training methods is examined on the data classification datasets. The experimental comparison contains different real classification data taken from the literature and a simulation study. A comparative analysis on the real data sets and simulation data shows that the real-coded genetic algorithm may offer efficient alternative to traditional training methods for the classification problem.

Journal ArticleDOI
TL;DR: Results show that ReNN can be trained more effectively and efficiently compared to the common neural networks and the proposed regularization measure is an effective indicator of how a network would perform in terms of generalization.
Abstract: Feedforward neural network is one of the most commonly used function approximation techniques and has been applied to a wide variety of problems arising from various disciplines. However, neural networks are black-box models having multiple challenges/difficulties associated with training and generalization. This paper initially looks into the internal behavior of neural networks and develops a detailed interpretation of the neural network functional geometry. Based on this geometrical interpretation, a new set of variables describing neural networks is proposed as a more effective and geometrically interpretable alternative to the traditional set of network weights and biases. Then, this paper develops a new formulation for neural networks with respect to the newly defined variables; this reformulated neural network (ReNN) is equivalent to the common feedforward neural network but has a less complex error response surface. To demonstrate the learning ability of ReNN, in this paper, two training methods involving a derivative-based (a variation of backpropagation) and a derivative-free optimization algorithms are employed. Moreover, a new measure of regularization on the basis of the developed geometrical interpretation is proposed to evaluate and improve the generalization ability of neural networks. The value of the proposed geometrical interpretation, the ReNN approach, and the new regularization measure are demonstrated across multiple test problems. Results show that ReNN can be trained more effectively and efficiently compared to the common neural networks and the proposed regularization measure is an effective indicator of how a network would perform in terms of generalization.

Journal ArticleDOI
TL;DR: This paper proposes a novel semi-supervised classifier called discriminative deep belief networks (DDBN), which utilizes a new deep architecture to integrate the abstraction ability of deep belief nets (DBN) and discrim inative ability of backpropagation strategy.

Journal ArticleDOI
TL;DR: In this article, a deep neural architecture (DNA) was proposed for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and speaker recognition, which results in a speakerspecific overcomplete representation.
Abstract: Speech signals convey various yet mixed information ranging from linguistic to speaker-specific information. However, most of acoustic representations characterize all different kinds of information as whole, which could hinder either a speech or a speaker recognition (SR) system from producing a better performance. In this paper, we propose a novel deep neural architecture (DNA) especially for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we come up with an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning parameters of the deep neural networks: i.e., local yet greedy layerwise unsupervised pretraining for initialization and global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, no matter whether their utterances have been used in training our DNA, and highly insensitive to text and languages spoken. Extensive comparative studies suggest that our approach yields favorite results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.

Proceedings ArticleDOI
28 Nov 2011
TL;DR: This paper proposes a novel deep learning model called bilinear deep belief network (BDBN), which aims to provide human-like judgment by referencing the architecture of the human visual system and the procedure of intelligent perception, and develops BDBN under a semi-supervised learning framework.
Abstract: Image classification is a well-known classical problem in multimedia content analysis. This paper proposes a novel deep learning model called bilinear deep belief network (BDBN) for image classification. Unlike previous image classification models, BDBN aims to provide human-like judgment by referencing the architecture of the human visual system and the procedure of intelligent perception. Therefore, the multi-layer structure of the cortex and the propagation of information in the visual areas of the brain are realized faithfully. Unlike most existing deep models, BDBN utilizes a bilinear discriminant strategy to simulate the "initial guess" in human object recognition, and at the same time to avoid falling into a bad local optimum. To preserve the natural tensor structure of the image data, a novel deep architecture with greedy layer-wise reconstruction and global fine-tuning is proposed. To adapt real-world image classification tasks, we develop BDBN under a semi-supervised learning framework, which makes the deep model work well when labeled images are insufficient. Comparative experiments on three standard datasets show that the proposed algorithm outperforms both representative classification models and existing deep learning techniques. More interestingly, our demonstrations show that the proposed BDBN works consistently with the visual perception of humans.

Book ChapterDOI
05 Oct 2011
TL;DR: Some of the theoretical motivations for deep architectures, as well as some of their practical successes, are reviewed, and directions of investigations to address some of the remaining challenges are proposed.
Abstract: Deep architectures are families of functions corresponding to deep circuits. Deep Learning algorithms are based on parametrizing such circuits and tuning their parameters so as to approximately optimize some training objective. Whereas it was thought too difficult to train deep architectures, several successful algorithms have been proposed in recent years. We review some of the theoretical motivations for deep architectures, as well as some of their practical successes, and propose directions of investigations to address some of the remaining challenges.

Proceedings Article
27 Apr 2011
TL;DR: The present tutorial introducing the ESANN deep learning special session details the state-of-the-art models and summarizes the current understanding of this learning approach which is a reference for many difficult classification tasks.
Abstract: The deep learning paradigm tackles problems on which shallow architectures (e.g. SVM) are affected by the curse of dimensionality. As part of a two-stage learning scheme involving multiple layers of non-linear processing a set of statistically robust features is automatically extracted from the data. The present tutorial introducing the ESANN deep learning special session details the state-of-the-art models and summarizes the current understanding of this learning approach which is a reference for many difficult classification tasks.

Proceedings Article
28 Jun 2011
TL;DR: A convolutional factor-analysis model is developed, with the number of filters (factors) inferred via the beta process (BP) and hierarchical BP, for single-task and multi-task learning, respectively.
Abstract: A convolutional factor-analysis model is developed, with the number of filters (factors) inferred via the beta process (BP) and hierarchical BP, for single-task and multi-task learning, respectively. The computation of the model parameters is implemented within a Bayesian setting, employing Gibbs sampling; we explicitly exploit the convolutional nature of the expansion to accelerate computations. The model is used in a multi-level ("deep") analysis of general data, with specific results presented for image-processing data sets, e.g., classification.

Journal ArticleDOI
TL;DR: This paper trains the neural networks by constructive learning and presents the analysis of the convergence rate of the error in a neural network with and without threshold which have been learnt by a constructive method to obtain the simple structure of the network.

Book ChapterDOI
15 Nov 2011
TL;DR: A deep learning model for off-line handwritten signature recognition which is able to extract high-level representations is presented and a two-step hybrid model for signature identification and verification is proposed improving the misclassification rate in the well-known GPDS database.
Abstract: Reliable identification and verification of off-line handwritten signatures from images is a difficult problem with many practical applications. This task is a difficult vision problem within the field of biometrics because a signature may change depending on psychological factors of the individual. Motivated by advances in brain science which describe how objects are represented in the visual cortex, advanced research on deep neural networks has been shown to work reliably on large image data sets. In this paper, we present a deep learning model for off-line handwritten signature recognition which is able to extract high-level representations. We also propose a two-step hybrid model for signature identification and verification improving the misclassification rate in the well-known GPDS database.

Journal ArticleDOI
TL;DR: A new feature extraction method called regularized deep Fisher mapping (RDFM), which learns an explicit mapping from the sample space to the feature space using a deep neural network to enhance the separability of features according to the Fisher criterion is proposed.
Abstract: For classification tasks, it is always desirable to extract features that are most effective for preserving class separability. In this brief, we propose a new feature extraction method called regularized deep Fisher mapping (RDFM), which learns an explicit mapping from the sample space to the feature space using a deep neural network to enhance the separability of features according to the Fisher criterion. Compared to kernel methods, the deep neural network is a deep and nonlocal learning architecture, and therefore exhibits more powerful ability to learn the nature of highly variable datasets from fewer samples. To eliminate the side effects of overfitting brought about by the large capacity of powerful learners, regularizers are applied in the learning procedure of RDFM. RDFM is evaluated in various types of datasets, and the results reveal that it is necessary to apply unsupervised regularization in the fine-tuning phase of deep learning. Thus, for very flexible models, the optimal Fisher feature extractor may be a balance between discriminative ability and descriptive ability.