Showing papers on "Softmax function published in 2013"


Proceedings Article
Tomas Mikolov1, Ilya Sutskever1, Kai Chen1, Greg S. Corrado1, Jeffrey Dean1 
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

24,012 citations
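A minimal NumPy sketch of the negative-sampling objective described in the abstract above, evaluated for a single (center, context) pair. The vocabulary size, embedding dimension, initialization, and uniform noise distribution are illustrative assumptions, not the authors' implementation (the paper draws negatives from the unigram distribution raised to the 3/4 power).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5                     # vocabulary size, embedding dim, negatives per pair
W_in = rng.normal(scale=0.1, size=(V, d))    # "input" (center word) vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # "output" (context word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, noise_dist):
    """Negative-sampling loss for one (center, context) pair:
    -log sigma(v_c . v_o) - sum_i log sigma(-v_c . v_ni) over k sampled noise words."""
    negatives = rng.choice(V, size=k, p=noise_dist)
    v_c, v_o = W_in[center], W_out[context]
    pos = np.log(sigmoid(v_c @ v_o))
    neg = np.sum(np.log(sigmoid(-W_out[negatives] @ v_c)))
    return -(pos + neg)

# Uniform noise distribution here purely for illustration.
noise = np.full(V, 1.0 / V)
print(neg_sampling_loss(center=42, context=7, noise_dist=noise))
```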


Posted Content
Tomas Mikolov1, Ilya Sutskever1, Kai Chen1, Greg S. Corrado1, Jeffrey Dean1 
TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships and improve both the quality of the vectors and the training speed.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

11,343 citations


Posted Content
TL;DR: In this article, the authors introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier, and perform an ablation study to discover the performance contribution from different model layers.
Abstract: Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

2,982 citations


Posted Content
TL;DR: The results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on the popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.
Abstract: Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.

760 citations
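A short sketch contrasting the two top-layer losses compared in the paper above: softmax cross-entropy versus the squared hinge (L2-SVM) loss applied one-vs-rest to the final-layer scores. Array shapes and example numbers are illustrative; the paper trains these losses end-to-end through the network rather than on fixed scores.

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Standard softmax + cross-entropy over class scores of shape (n, K)."""
    z = scores - scores.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def l2_svm_loss(scores, y, K):
    """Squared hinge (L2-SVM) loss, one-vs-rest, with targets in {-1, +1}."""
    t = -np.ones((len(y), K))
    t[np.arange(len(y)), y] = 1.0
    margins = np.maximum(0.0, 1.0 - t * scores)
    return (margins ** 2).sum(axis=1).mean()

scores = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
y = np.array([0, 1])
print(softmax_cross_entropy(scores, y), l2_svm_loss(scores, y, K=3))
```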


Posted Content
12 Nov 2013
TL;DR: In this paper, a novel visualization technique was introduced to give insight into the function of intermediate feature layers and the operation of the classifier, which enabled the authors to find model architectures that outperformed Krizhevsky et al. on the ImageNet classification benchmark.
Abstract: Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

513 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper investigates several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the size of the pooling layer can be automatically learned.
Abstract: Recently, convolutional neural networks (CNNs) have been shown to outperform the standard fully connected deep neural networks within the hybrid deep neural network / hidden Markov model (DNN/HMM) framework on the phone recognition task. In this paper, we extend the earlier basic form of the CNN and explore it in multiple ways. We first investigate several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers. We then develop a novel weighted softmax pooling layer so that the size of the pooling layer can be automatically learned. Further, we evaluate the effect of CNN pretraining, which is achieved by using a convolutional version of the RBM. We show that all CNN architectures we have investigated outperform the earlier basic form of the DNN on both the phone recognition and large vocabulary speech recognition tasks. The architecture with limited weight sharing provides additional gains over the full weight sharing architecture. The softmax pooling layer performs as well as the best CNN with a manually tuned, fixed pooling size, and has potential for further improvement. Finally, we show that CNN pretraining produces significantly better results on a large vocabulary speech recognition task.

378 citations
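One plausible reading of the weighted softmax pooling idea above is pooling with softmax-normalized learned weights: a sharp softmax behaves like max-pooling and a flat one like average-pooling, so the effective pooling behaviour (and hence size) is learned rather than fixed. A minimal sketch under that reading; the paper's exact parameterization may differ.

```python
import numpy as np

def weighted_softmax_pool(activations, logits):
    """Pool a window of activations with softmax-normalized learned weights."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return np.dot(w, activations)

window = np.array([0.2, 1.5, 0.9, 0.1])   # activations in one pooling window (illustrative)
print(weighted_softmax_pool(window, logits=np.array([0.0, 3.0, 1.0, 0.0])))  # max-like
print(weighted_softmax_pool(window, logits=np.zeros(4)))                      # average-like
```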


Journal ArticleDOI
TL;DR: Variational Bayes is considered as an alternative scheme that provides formal constraints on the computational anatomy of inference and action—constraints that are remarkably consistent with neuroanatomy.
Abstract: This paper considers agency in the setting of embodied or active inference. In brief, we associate a sense of agency with prior beliefs about action and ask what sorts of beliefs underlie optimal behaviour. In particular, we consider prior beliefs that action minimises the Kullback-Leibler divergence between desired states and attainable states in the future. This allows one to formulate bounded rationality as approximate Bayesian inference that optimises a free energy bound on model evidence. We show that constructs like expected utility, exploration bonuses, softmax choice rules and optimism bias emerge as natural consequences of this formulation. Previous accounts of active inference have focused on predictive coding and Bayesian filtering schemes for minimising free energy. Here, we consider variational Bayes as an alternative scheme that provides formal constraints on the computational anatomy of inference and action – constraints that are remarkably consistent with neuroanatomy. Furthermore, this scheme contextualises optimal decision theory and economic (utilitarian) formulations as pure inference problems. For example, expected utility theory emerges as a special case of free energy minimisation, where the sensitivity or inverse temperature (of softmax functions and quantal response equilibria) has a unique and Bayes-optimal solution – that minimises free energy. This sensitivity corresponds to the precision of beliefs about behaviour, such that attainable goals are afforded a higher precision or confidence. In turn, this means that optimal behaviour entails a representation of confidence about outcomes that are under an agent's control.

270 citations
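A minimal sketch of the softmax (quantal response) choice rule discussed above, where the sensitivity or inverse temperature plays the role the paper assigns to the precision of beliefs about behaviour. The action values below are illustrative.

```python
import numpy as np

def softmax_choice(values, beta):
    """Softmax choice rule: P(a) proportional to exp(beta * value(a)).
    beta is the sensitivity / inverse temperature ("precision")."""
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

q = [1.0, 0.5, 0.2]                    # expected utilities of three actions (illustrative)
print(softmax_choice(q, beta=1.0))     # low precision: near-uniform, exploratory
print(softmax_choice(q, beta=10.0))    # high precision: near-deterministic choice
```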


Journal ArticleDOI
TL;DR: A sufficient depth of the T-DSN, a symmetric two-hidden-layer structure in each T-DSN block, the model parameter learning algorithm, and a softmax layer on top of the T-DSN are all shown to have contributed to the low error rates observed in the experiments for all three tasks.
Abstract: A novel deep architecture, the tensor deep stacking network (T-DSN), is presented. The T-DSN consists of multiple, stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher-order statistics of the hidden binary ([0,1]) features. A learning algorithm for the T-DSN's weight matrices and tensors is developed and described in which the main parameter estimation burden is shifted to a convex subproblem with a closed-form solution. Using an efficient and scalable parallel implementation for CPU clusters, we train sets of T-DSNs in three popular tasks in increasing order of the data size: handwritten digit recognition using MNIST (60k), isolated state/phone classification and continuous phone recognition using TIMIT (1.1M), and isolated phone classification using WSJ0 (5.2M). Experimental results in all three tasks demonstrate the effectiveness of the T-DSN and the associated learning methods in a consistent manner. In particular, a sufficient depth of the T-DSN, a symmetric two-hidden-layer structure in each T-DSN block, our model parameter learning algorithm, and a softmax layer on top of the T-DSN are all shown to have contributed to the low error rates observed in the experiments for all three tasks.

164 citations


Proceedings Article
11 Aug 2013
TL;DR: A type of Deep Boltzmann Machine that is suitable for extracting distributed semantic representations from a large unstructured collection of documents is introduced and it is shown that the model assigns better log probability to unseen data than the Replicated Softmax model.
Abstract: We introduce a type of Deep Boltzmann Machine (DBM) that is suitable for extracting distributed semantic representations from a large unstructured collection of documents. We overcome the apparent difficulty of training a DBM with judicious parameter tying. This enables an efficient pretraining algorithm and a state initialization scheme for fast inference. The model can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

123 citations


Posted Content
TL;DR: A Deep Boltzmann Machine model suitable for modeling and extracting latent semantic representations from a large unstructured collection of documents is introduced and it is shown that the model assigns better log probability to unseen data than the Replicated Softmax model.
Abstract: We introduce a Deep Boltzmann Machine model suitable for modeling and extracting latent semantic representations from a large unstructured collection of documents. We overcome the apparent difficulty of training a DBM with judicious parameter tying. This parameter tying enables an efficient pretraining algorithm and a state initialization scheme that aids inference. The model can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.

108 citations


Journal ArticleDOI
TL;DR: This paper proposes a different mixture of KL divergences, which is a scaled version of the generalized Jensen-Shannon divergence, and shows experimentally that this divergence produces embeddings that better preserve small K-ary neighborhoods, as compared to both the single KL divergence used in SNE and t-SNE and the mixture used in NeRV.

Proceedings ArticleDOI
01 Nov 2013
TL;DR: A segment-based object detection approach using laser range data that combines multiple softmax regression classifiers learned on specific bag-of-word representations using different parameterizations of a descriptor to detect pedestrians, cars, and cyclists.
Abstract: In this paper, we propose a segment-based object detection approach using laser range data. Our detection approach is built up of three stages: First, a hierarchical segmentation approach generates a hierarchy of coarse-to-fine segments to reduce the impact of over- and under-segmentation in later stages. Next, we employ a learned mixture model to classify all segments. The model combines multiple softmax regression classifiers learned on specific bag-of-word representations using different parameterizations of a descriptor. In the final stage, we filter irrelevant and duplicate detections using a greedy method in consideration of the segment hierarchy. We experimentally evaluate our approach on recently published real-world datasets to detect pedestrians, cars, and cyclists.

Journal ArticleDOI
TL;DR: A probability-weighted autoregressive exogenous (PrARX) model, wherein multiple ARX models are combined through probabilistic weighting functions, that can represent both the motion-control and decision-making aspects of driving behavior.
Abstract: This paper proposes a probability-weighted autoregressive exogenous (PrARX) model, wherein multiple ARX models are combined through probabilistic weighting functions. This model can represent both the motion-control and decision-making aspects of the driving behavior. As the probabilistic weighting function, a “softmax” function is introduced. Then, the parameter estimation problem for the proposed model is formulated as a single optimization problem. The “soft” partition defined by the PrARX model can represent the decision-making characteristics of the driver with vagueness. This vagueness can be quantified by introducing the “decision entropy.” In addition, it can be easily extended to the online estimation scheme due to its small computational cost. Finally, the proposed model is applied to the modeling of the vehicle-following task, and the usefulness of the model is verified and discussed.
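A minimal sketch of the probability-weighted ARX prediction described above: each ARX mode's output is weighted by a softmax gating function of the same regressor vector. The regressor contents and the parameter values are illustrative assumptions, not the paper's identified model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def prarx_predict(phi, arx_params, gate_params):
    """Probability-weighted ARX prediction: softmax gate over modes, then a
    weighted sum of the per-mode ARX outputs (the "soft" partition)."""
    weights = softmax(gate_params @ phi)   # soft partition over driving modes
    outputs = arx_params @ phi             # one ARX prediction per mode
    return weights @ outputs, weights

phi = np.array([1.0, 0.4, -0.2])           # regressor (e.g., lagged I/O terms + bias), illustrative
arx_params = np.array([[0.8, 0.1, 0.0],    # two ARX modes with illustrative coefficients
                       [0.2, 0.5, 0.3]])
gate_params = np.array([[1.0, -2.0, 0.0],
                        [-1.0, 2.0, 0.0]])
y_hat, w = prarx_predict(phi, arx_params, gate_params)
print(y_hat, w)
```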

Journal ArticleDOI
TL;DR: It is shown how a new version of the IA model called the multinomial interactive activation (MIA) model can sample correctly from the joint posterior of a proposed generative model for perception of letters in words, indicating that interactive processing is fully consistent with principled probabilistic computation.
Abstract: This article seeks to establish a rapprochement between explicitly Bayesian models of contextual effects in perception and neural network models of such effects, particularly the connectionist interactive activation (IA) model of perception. The article is in part an historical review and in part a tutorial, reviewing the probabilistic Bayesian approach to understanding perception and how it may be shaped by context, and also reviewing ideas about how such probabilistic computations may be carried out in neural networks, focusing on the role of context in interactive neural networks, in which both bottom-up and top-down signals affect the interpretation of sensory inputs. It is pointed out that connectionist units that use the logistic or softmax activation functions can exactly compute Bayesian posterior probabilities when the bias terms and connection weights affecting such units are set to the logarithms of appropriate probabilistic quantities. Bayesian concepts such as the prior, likelihood, (joint and marginal) posterior, probability matching and maximizing, and calculating vs. sampling from the posterior are all reviewed and linked to neural network computations. Probabilistic and neural network models are explicitly linked to the concept of a probabilistic generative model that describes the relationship between the underlying target of perception (e.g., the word intended by a speaker or other source of sensory stimuli) and the sensory input that reaches the perceiver for use in inferring the underlying target. It is shown how a new version of the IA model called the multinomial interactive activation (MIA) model can sample correctly from the joint posterior of a proposed generative model for perception of letters in words, indicating that interactive processing is fully consistent with principled probabilistic computation. Ways in which these computations might be realized in real neural systems are also considered.
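The claim above, that softmax units compute exact Bayesian posteriors when biases and weights are set to the logs of the appropriate probabilities, can be checked directly. The toy prior and likelihood below are illustrative, not the article's letter/word model.

```python
import numpy as np

# Illustrative generative model: 3 hypotheses, 4 possible observations.
prior = np.array([0.5, 0.3, 0.2])                   # P(h)
likelihood = np.array([[0.70, 0.10, 0.10, 0.10],    # P(obs | h), rows sum to 1
                       [0.25, 0.25, 0.25, 0.25],
                       [0.10, 0.10, 0.10, 0.70]])
obs = 0

# Direct Bayes rule.
posterior = prior * likelihood[:, obs]
posterior /= posterior.sum()

# A softmax unit whose bias is log P(h) and whose input is log P(obs | h)
# produces exactly the same posterior.
net_input = np.log(prior) + np.log(likelihood[:, obs])
softmax_out = np.exp(net_input) / np.exp(net_input).sum()

print(np.allclose(posterior, softmax_out))   # True
```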

Journal ArticleDOI
TL;DR: This work proposes a novel EANN approach, where a weighted n-fold validation fitness scheme is used to build an ensemble of neural networks, under four different combination methods: mean, median, softmax and rank-based.

Journal ArticleDOI
TL;DR: Experiments for hardware-based multitarget search missions with a cooperative human-autonomous robot team show that humans can serve as highly informative sensors through proper data modeling and fusion, and that VBIS provides reliable and scalable Bayesian fusion estimates via GMs.
Abstract: This paper considers Bayesian data fusion of conventional robot sensor information with ambiguous human-generated categorical information about continuous world states of interest. First, it is shown that such soft information can be generally modeled via hybrid continuous-to-discrete likelihoods that are based on the softmax function. A new hybrid fusion procedure, called variational Bayesian importance sampling (VBIS), is then introduced to combine the strengths of variational Bayes approximations and fast Monte Carlo methods to produce reliable posterior estimates for Gaussian priors and softmax likelihoods. VBIS is then extended to more general fusion problems that involve complex Gaussian mixture (GM) priors and multimodal softmax likelihoods, leading to accurate GM approximations of highly non-Gaussian fusion posteriors for a wide range of robot sensor data and soft human data. Experiments for hardware-based multitarget search missions with a cooperative human-autonomous robot team show that humans can serve as highly informative sensors through proper data modeling and fusion, and that VBIS provides reliable and scalable Bayesian fusion estimates via GMs.
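A minimal sketch of the hybrid continuous-to-discrete softmax likelihood described above: the probability of a categorical human report given a continuous state is a softmax over linear functions of that state. The state dimension, labels, and parameter values are illustrative assumptions.

```python
import numpy as np

def softmax_label_likelihood(x, W, b):
    """P(categorical label | continuous state x) as a softmax over linear
    functions of x, i.e. a soft partition of the continuous state space."""
    z = W @ x + b
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

# Illustrative 2-D state with three verbal report labels, e.g. "near", "next to", "far".
W = np.array([[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]])
b = np.array([1.0, 0.5, -3.0])
print(softmax_label_likelihood(np.array([0.2, 0.1]), W, b))
```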

Patent
Jui-Ting Huang1, Jinyu Li1, Dong Yu1, Li Deng1, Yifan Gong1 
11 Mar 2013
TL;DR: In this article, various technologies pertaining to a multilingual deep neural network (MDNN) are described, wherein values for weight parameters of the plurality of hidden layers are learned during a training phase based upon training data in terms of acoustic raw features for multiple languages.
Abstract: Described herein are various technologies pertaining to a multilingual deep neural network (MDNN). The MDNN includes a plurality of hidden layers, wherein values for weight parameters of the plurality of hidden layers are learned during a training phase based upon training data in terms of acoustic raw features for multiple languages. The MDNN further includes softmax layers that are trained for each target language separately, making use of the hidden layer values trained jointly with multiple source languages. The MDNN is adaptable, such that a new softmax layer may be added on top of the existing hidden layers, where the new softmax layer corresponds to a new target language.
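A sketch of the architecture the patent describes: hidden layers shared across languages, with a separate softmax output layer per target language, so supporting a new language only requires adding a new softmax head on top of the existing hidden layers. Layer sizes, feature dimensions, and output counts below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Shared hidden layers (jointly trained across languages in the patent's scheme).
shared = [rng.normal(scale=0.1, size=(40, 256)),
          rng.normal(scale=0.1, size=(256, 256))]

# One softmax output layer per target language; adding a language adds a new head.
heads = {"en": rng.normal(scale=0.1, size=(256, 1500)),
         "fr": rng.normal(scale=0.1, size=(256, 1200))}

def forward(features, language):
    h = features
    for W in shared:                          # language-independent feature extraction
        h = relu(h @ W)
    return softmax(h @ heads[language])       # language-specific output posteriors

x = rng.normal(size=(1, 40))                  # one frame of acoustic features (illustrative)
print(forward(x, "en").shape, forward(x, "fr").shape)
```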

Journal ArticleDOI
TL;DR: In this article, the authors introduce a model-based approach to distributed computing for multinomial logistic (softmax) regression, treating counts for each response category as independent Poisson regressions via plug-in estimates for fixed effects shared across categories.
Abstract: This article introduces a model-based approach to distributed computing for multinomial logistic (softmax) regression. We treat counts for each response category as independent Poisson regressions via plug-in estimates for fixed effects shared across categories. The work is driven by the high-dimensional-response multinomial models that are used in analysis of a large number of random counts. Our motivating applications are in text analysis, where documents are tokenized and the token counts are modeled as arising from a multinomial dependent upon document attributes. We estimate such models for a publicly available data set of reviews from Yelp, with text regressed onto a large set of explanatory variables (user, business, and rating information). The fitted models serve as a basis for exploring the connection between words and variables of interest, for reducing dimension into supervised factor scores, and for prediction. We argue that the approach herein provides an attractive option for social scientists and other text analysts who wish to bring familiar regression tools to bear on text data.
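A toy sketch of the distributed strategy described above: each response category's counts are fit as an independent Poisson regression with a plug-in offset standing in for the shared document-level effect, and normalizing the fitted rates yields approximate multinomial (softmax) probabilities. The simulated data, sizes, and the plain gradient-ascent fitter are illustrative assumptions, not the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_covs = 200, 50, 3

X = rng.normal(size=(n_docs, n_covs))                      # document attributes
true_B = rng.normal(scale=0.5, size=(n_covs, n_words))
m = rng.integers(50, 200, size=n_docs)                     # document lengths
P = np.exp(X @ true_B); P /= P.sum(axis=1, keepdims=True)
counts = np.vstack([rng.multinomial(m[i], P[i]) for i in range(n_docs)])

Z = np.column_stack([np.ones(n_docs), X])                  # intercept + attributes

def fit_poisson(Z, y, offset, iters=500, lr=0.05):
    """Poisson regression for one category's counts with a fixed plug-in offset.
    Plain gradient ascent for brevity; a real implementation would use a GLM solver."""
    b = np.zeros(Z.shape[1])
    for _ in range(iters):
        mu = np.exp(np.clip(offset + Z @ b, -30, 30))
        b += lr * Z.T @ (y - mu) / len(y)
    return b

# Each word's regression is independent given the offset, so this loop can be distributed.
offset = np.log(m)                                         # plug-in document-level effect
B_hat = np.column_stack([fit_poisson(Z, counts[:, j], offset) for j in range(n_words)])

# Renormalizing the fitted Poisson rates gives approximate multinomial probabilities.
fitted = np.exp(Z @ B_hat)
fitted /= fitted.sum(axis=1, keepdims=True)
print(np.abs(fitted - P).mean())
```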

Journal ArticleDOI
TL;DR: This study proposes a scaled version of free-energy based reinforcement learning to achieve more robust and more efficient learning performance and tests the method's robustness with respect to different exploration schedules.
Abstract: Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces, which cannot be handled by standard function approximation methods. In this study, we propose a scaled version of free-energy based reinforcement learning to achieve more robust and more efficient learning performance. The action-value function is approximated by the negative free-energy of a restricted Boltzmann machine, divided by a constant scaling factor that is related to the size of the Boltzmann machine (the square root of the number of state nodes in this study). Our first task is a digit floor gridworld task, where the states are represented by images of handwritten digits from the MNIST data set. The purpose of the task is to investigate the proposed method's ability, through the extraction of task-relevant features in the hidden layer, to cluster images of the same digit and to cluster images of different digits that correspond to states with the same optimal action. We also test the method's robustness with respect to different exploration schedules, i.e., different settings of the initial temperature and the temperature discount rate in softmax action selection. Our second task is a robot visual navigation task, where the robot can learn its position by the different colors of the lower part of four landmarks and it can infer the correct corner goal area by the color of the upper part of the landmarks. The state space consists of binarized camera images with, at most, nine different colors, which is equal to 6642 binary states. For both tasks, the learning performance is compared with standard FERL and with function approximation where the action-value function is approximated by a two-layered feedforward neural network.

Proceedings Article
05 Dec 2013
TL;DR: A "relevance topic model" is proposed for jointly learning meaningful mid-level representations upon bag-of-words (BoW) video representations and a classifier with sparse weights that achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy.
Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a "relevance topic model" for jointly learning meaningful mid-level representations upon bag-of-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos.

Posted Content
TL;DR: In this paper, the authors propose the first exact inference algorithm for augmented conditional linear Gaussian (CLG) networks, which also allow discrete children of continuous parents. The algorithm is exact in the sense that it computes the exact distributions over the discrete nodes, and the exact first and second moments of the continuous ones, up to the accuracy obtained by numerical integration.
Abstract: Many real life domains contain a mixture of discrete and continuous variables and can be modeled as hybrid Bayesian Networks. An important subclass of hybrid BNs are conditional linear Gaussian (CLG) networks, where the conditional distribution of the continuous variables given an assignment to the discrete variables is a multivariate Gaussian. Lauritzen's extension to the clique tree algorithm can be used for exact inference in CLG networks. However, many domains also include discrete variables that depend on continuous ones, and CLG networks do not allow such dependencies to be represented. No exact inference algorithm has been proposed for these enhanced CLG networks. In this paper, we generalize Lauritzen's algorithm, providing the first "exact" inference algorithm for augmented CLG networks - networks where continuous nodes are conditional linear Gaussians but that also allow discrete children of continuous parents. Our algorithm is exact in the sense that it computes the exact distributions over the discrete nodes, and the exact first and second moments of the continuous ones, up to the accuracy obtained by numerical integration used within the algorithm. When the discrete children are modeled with softmax CPDs (as is the case in many real world domains) the approximation of the continuous distributions using the first two moments is particularly accurate. Our algorithm is simple to implement and often comparable in its complexity to Lauritzen's algorithm. We show empirically that it achieves substantially higher accuracy than previous approximate algorithms.

Book ChapterDOI
09 Sep 2013
TL;DR: A reconstruction rule based on softmax regression which considers the reconstruction task as a new classification problem and uses both the crisp labels and the reliabilities of binary decisions as second-order features.
Abstract: Classification by binary decomposition is a well-known method to solve multiclass classification tasks since a large number of algorithms were designed for binary classification. Once the polychotomy has been decomposed into several dichotomies, the decisions of binary learners on a test sample are aggregated by a reconstruction rule to set the final multiclass label. In this context, this paper presents a reconstruction rule based on softmax regression which considers the reconstruction task as a new classification problem. To this aim, as second-order features we use both the crisp labels and the reliabilities of binary decisions. Six heterogeneous datasets and three different classification architectures have been used to test our method, whose performance compares favorably with that of three other reconstruction rules, both in terms of global accuracy and geometric mean of accuracies.

Proceedings ArticleDOI
20 Jun 2013
TL;DR: A reconstruction rule based on softmax regression, where the features of the new classification task are the crisp labels and the reliabilities of the dichotomizers' classifications; its performance compares favorably with that of two other well-known reconstruction rules, both in terms of global accuracy and accuracy per class.
Abstract: Several medical and biological applications face multiclass recognition problems. Such polychotomies can be addressed by decomposition techniques, which reduce the polychotomy into a series of dichotomies and then provide the final multiclass label using a reconstruction rule. Within this framework, we present a reconstruction rule based on softmax regression, where the features of the new classification task are the crisp labels and the reliabilities of the dichotomizers' classifications. The approach has been tested on six medical and biological datasets, decomposing the polychotomies via the Error-Correcting Output Code. Its performance compares favorably with that of two other well-known reconstruction rules, both in terms of global accuracy and accuracy per class.

Posted Content
TL;DR: The normalized variables module generates normalized variables using z-score, min-max, softmax and sigmoid techniques and supports multiple variables and panel dataset.
Abstract: norm generates normalized variables using z-score, min-max, softmax and sigmoid techniques. The module supports multiple variables and panel dataset.
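For reference, common definitions of the four normalization techniques the module lists, sketched in Python. The exact formulas used by the module may differ, in particular the lambda scaling assumed in the softmax variant below.

```python
import numpy as np

def zscore(x):
    """Z-score normalization: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

def minmax(x):
    """Min-max normalization onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def sigmoid_norm(x):
    """Sigmoid normalization: logistic squashing of the z-score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-zscore(x)))

def softmax_norm(x, lam=1.0):
    """Softmax scaling (Pyle-style, assumed form): logistic squashing with a tunable
    central range lam over which the mapping stays roughly linear before squashing outliers."""
    z = (x - x.mean()) / (lam * x.std() / (2 * np.pi))
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([3.0, 7.5, 8.1, 12.0, 55.0])    # note the outlier
for f in (zscore, minmax, sigmoid_norm, softmax_norm):
    print(f.__name__, np.round(f(x), 3))
```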

Journal ArticleDOI
TL;DR: A sparse auto-encoder model was trained to extract codes for different facial expressions; the model comprises four encoder layers and three decoder layers, and the representation in the fourth layer (the code layer) provides the desired features.
Abstract: A sparse auto-encoder model was trained to extract codes for different facial expressions. The model comprises four encoder layers and three decoder layers; the representation in the fourth layer (the code layer) provides the desired features. Using large numbers of patches randomly selected from training faces, the model was first trained via backpropagation to minimize an unsupervised sparse reconstruction error, and then a softmax classifier was learned for supervised classification. The input vector for classification is the facial-image feature induced by the learned sparse auto-encoder and two key operations (convolution and pooling). Using a small number of hidden units per layer and a relatively small training set, the proposed model achieves excellent performance in the experiments.
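A minimal sketch of the final supervised stage described above: a softmax classifier trained by gradient descent on fixed features, standing in for the convolved and pooled auto-encoder codes. The feature dimension, class count, and random placeholder data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_classifier(features, labels, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic regression on fixed features, trained with
    batch gradient descent on the cross-entropy loss."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[labels]                 # one-hot targets
    for _ in range(epochs):
        P = softmax(features @ W)
        W -= lr * features.T @ (P - Y) / n        # gradient of mean cross-entropy
    return W

X = rng.normal(size=(300, 64))                    # placeholder for learned expression features
y = rng.integers(0, 7, size=300)                  # 7 expression classes (illustrative)
W = train_softmax_classifier(X, y, n_classes=7)
print(softmax(X @ W)[:2].round(3))
```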

26 Apr 2013
TL;DR: This work introduces a type of Deep Boltzmann Machine that is suitable for extracting distributed semantic representations from a large unstructured collection of documents and proposes an approximate inference method that interacts with learning in a way that makes it possible to train the DBM more efficiently than previously proposed methods.
Abstract: We introduce a type of Deep Boltzmann Machine (DBM) that is suitable for extracting distributed semantic representations from a large unstructured collection of documents. We propose an approximate inference method that interacts with learning in a way that makes it possible to train the DBM more efficiently than previously proposed methods. Even though the model has two hidden layers, it can be trained just as efficiently as a standard Restricted Boltzmann Machine. Our experiments show that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from our model outperform LDA, Replicated Softmax, and DocNADE models on document retrieval and document classification tasks.