
Showing papers by "Chris Pal" published in 2017


Journal ArticleDOI
TL;DR: A fast and accurate fully automatic method for brain tumor segmentation that is competitive with the state of the art in both accuracy and speed, and that introduces a novel cascaded architecture allowing the system to more accurately model local label dependencies.

2,538 citations
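The cascaded idea can be sketched concisely: a first CNN produces class probabilities that are concatenated as extra input channels to a second CNN, so the second network can condition on nearby label estimates. The layer sizes and names below are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class CascadedSegmenter(nn.Module):
        """First CNN predicts class probabilities; second CNN sees the
        image plus those probabilities as extra input channels."""
        def __init__(self, in_ch=4, n_classes=5):  # e.g. 4 MRI modalities
            super().__init__()
            self.cnn1 = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, n_classes, 1))
            self.cnn2 = nn.Sequential(
                nn.Conv2d(in_ch + n_classes, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, n_classes, 1))

        def forward(self, x):
            p1 = torch.softmax(self.cnn1(x), dim=1)      # first-stage beliefs
            return self.cnn2(torch.cat([x, p1], dim=1))  # refined prediction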


Journal ArticleDOI
TL;DR: The key concepts of deep learning for clinical radiologists are reviewed, technical requirements are discussed, emerging applications in clinical radiology are described, and limitations and future directions in this field are outlined.
Abstract: Deep learning is a class of machine learning methods that are gaining success and attracting interest in many domains, including computer vision, speech recognition, natural language processing, and playing games. Deep learning methods produce a mapping from raw inputs to desired outputs (eg, image classes). Unlike traditional machine learning methods, which require hand-engineered feature extraction from inputs, deep learning methods learn these features directly from data. With the advent of large datasets and increased computing power, these methods can produce models with exceptional performance. These models are multilayer artificial neural networks, loosely inspired by biologic neural systems. Weighted connections between nodes (neurons) in the network are iteratively adjusted based on example pairs of inputs and target outputs by back-propagating a corrective error signal through the network. For computer vision tasks, convolutional neural networks (CNNs) have proven to be effective. Recently, several clinical applications of CNNs have been proposed and studied in radiology for classification, detection, and segmentation tasks. This article reviews the key concepts of deep learning for clinical radiologists, discusses technical requirements, describes emerging applications in clinical radiology, and outlines limitations and future directions in this field. Radiologists should become familiar with the principles and potential applications of deep learning in medical imaging. ©RSNA, 2017.

687 citations
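The iterative weight adjustment the review describes is, in code, an ordinary gradient-descent training loop. A minimal PyTorch sketch with an assumed toy network and random stand-in data:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(128, 64)            # stand-in inputs (e.g. image features)
    y = torch.randint(0, 2, (128,))     # stand-in target labels

    for step in range(100):
        opt.zero_grad()
        loss = loss_fn(model(x), y)     # compare outputs to targets
        loss.backward()                 # back-propagate the error signal
        opt.step()                      # adjust the weighted connections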


Journal ArticleDOI
TL;DR: This paper proposes a common evaluation framework for automatic stroke lesion segmentation from MRI, describes the publicly available datasets, and presents the results of the two sub‐challenges: Sub‐Acute Stroke Lesion Segmentation (SISS) and Stroke Perfusion Estimation (SPES).

417 citations


Posted Content
TL;DR: This work relies on complex convolutions, presents algorithms for complex batch normalization and complex weight initialization strategies for complex-valued neural nets, uses them in experiments with end-to-end training schemes, and demonstrates that such complex-valued models are competitive with their real-valued counterparts.
Abstract: At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks and convolutional LSTMs. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech Spectrum Prediction using the TIMIT dataset. We achieve state-of-the-art performance on these audio-related tasks.

371 citations
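The core building block is the complex convolution, which can be assembled from four real convolutions using (a + ib)(x + iy) = (ax − by) + i(ay + bx). A minimal sketch under assumed shapes and names:

    import torch
    import torch.nn as nn

    class ComplexConv2d(nn.Module):
        """Complex convolution from real ops:
        (W_r + i W_i)*(x_r + i x_i) = (W_r*x_r - W_i*x_i) + i(W_r*x_i + W_i*x_r)."""
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
            self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

        def forward(self, x_r, x_i):
            real = self.conv_r(x_r) - self.conv_i(x_i)
            imag = self.conv_r(x_i) + self.conv_i(x_r)
            return real, imag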


Proceedings Article
27 May 2017
TL;DR: In this paper, the authors provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks, and demonstrate that such complexvalued models are competitive with their real-valued counterparts.
Abstract: At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech spectrum prediction using TIMIT. We achieve state-of-the-art performance on these audio-related tasks.

288 citations


Journal ArticleDOI
TL;DR: The Large Scale Movie Description Challenge (LSMDC), as discussed by the authors, is a dataset of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total).
Abstract: Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.

212 citations


Posted Content
TL;DR: In this paper, a simple yet powerful pipeline for medical image segmentation that combines Fully Convolutional Networks (FCNs) with Fully Convolutional Residual Networks (FC-ResNets) is presented.
Abstract: In this paper, we introduce a simple, yet powerful pipeline for medical image segmentation that combines Fully Convolutional Networks (FCNs) with Fully Convolutional Residual Networks (FC-ResNets). We propose and examine a design that takes particular advantage of recent advances in the understanding of both Convolutional Neural Networks as well as ResNets. Our approach focuses upon the importance of a trainable pre-processing when using FC-ResNets and we show that a low-capacity FCN model can serve as a pre-processor to normalize medical input data. In our image segmentation pipeline, we use FCNs to obtain normalized images, which are then iteratively refined by means of a FC-ResNet to generate a segmentation prediction. As in other fully convolutional approaches, our pipeline can be used off-the-shelf on different image modalities. We show that using this pipeline, we exhibit state-of-the-art performance on the challenging Electron Microscopy benchmark, when compared to other 2D methods. We improve segmentation results on CT images of liver lesions, when contrasting with standard FCN methods. Moreover, when applying our 2D pipeline on a challenging 3D MRI prostate segmentation challenge we reach results that are competitive even when compared to 3D methods. The obtained results illustrate the strong potential and versatility of the pipeline by achieving highly accurate results on multi-modality images from different anatomical regions and organs.

160 citations
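The pipeline's structure is a low-capacity FCN acting as a trainable pre-processor whose normalized output is refined by an FC-ResNet, with both parts trained together end-to-end. A schematic PyTorch sketch in which both sub-networks stand in for the paper's actual architectures:

    import torch.nn as nn

    class SegmentationPipeline(nn.Module):
        """FCN normalizes the raw input; FC-ResNet refines it into a mask."""
        def __init__(self, fcn: nn.Module, fc_resnet: nn.Module):
            super().__init__()
            self.fcn = fcn               # low-capacity, trainable pre-processor
            self.fc_resnet = fc_resnet   # iterative refinement network

        def forward(self, x):
            normalized = self.fcn(x)            # modality-agnostic normalization
            return self.fc_resnet(normalized)   # segmentation prediction

Because the two networks are trained jointly, the FCN learns exactly the normalization that most helps the downstream FC-ResNet, rather than a hand-designed pre-processing step.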


Proceedings Article
06 Aug 2017
TL;DR: This paper proposes a weight matrix factorization and parameterization strategy through which the degree of expansivity induced during backpropagation can be controlled, and finds that hard constraints on orthogonality can negatively affect the speed of convergence and model performance.
Abstract: It is well known that it is challenging to train deep neural networks and recurrent neural networks for tasks that exhibit long term dependencies. The vanishing or exploding gradient problem is a well known issue associated with these challenges. One approach to addressing vanishing and exploding gradients is to use either soft or hard constraints on weight matrices so as to encourage or enforce orthogonality. Orthogonal matrices preserve gradient norm during back-propagation and may therefore be a desirable property. This paper explores issues with optimization convergence, speed and gradient stability when encouraging or enforcing orthogonality. To perform this analysis, we propose a weight matrix factorization and parameterization strategy through which we can bound matrix norms and therein control the degree of expansivity induced during backpropagation. We find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance.

158 citations
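The factorization keeps a weight matrix W = U diag(s) Vᵀ with U, V orthogonal and each singular value confined to [1 − m, 1 + m], so the margin m interpolates between hard orthogonality (m = 0) and an unconstrained spectrum. The paper optimizes U and V with geodesic updates on the Stiefel manifold; the sketch below substitutes PyTorch's built-in orthogonal parametrization as a modern stand-in, so the module layout is an assumption rather than the paper's code.

    import torch
    import torch.nn as nn
    from torch.nn.utils.parametrizations import orthogonal

    class BoundedSpectrumLinear(nn.Module):
        """W = U diag(s) V^T with singular values s in [1 - m, 1 + m]."""
        def __init__(self, n, margin=0.1):
            super().__init__()
            self.U = orthogonal(nn.Linear(n, n, bias=False))
            self.V = orthogonal(nn.Linear(n, n, bias=False))
            self.p = nn.Parameter(torch.zeros(n))  # free spectrum parameters
            self.margin = margin

        def forward(self, x):
            s = 1.0 + self.margin * (2.0 * torch.sigmoid(self.p) - 1.0)
            W = self.U.weight @ torch.diag(s) @ self.V.weight.t()
            return x @ W.t()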


Posted Content
TL;DR: The authors use GANs to generate sentences from context-free and probabilistic context-free grammars and present qualitative language modeling results, noting that advances in adversarial text generation are not commensurate with the progress made in generating images and still lag far behind likelihood-based methods.
Abstract: Generative Adversarial Networks (GANs) have gathered a lot of attention from the computer vision community, yielding impressive results for image generation. Advances in the adversarial generation of natural language from noise however are not commensurate with the progress made in generating images, and still lag far behind likelihood based methods. In this paper, we take a step towards generating natural language with a GAN objective alone. We introduce a simple baseline that addresses the discrete output space problem without relying on gradient estimators and show that it is able to achieve state-of-the-art results on a Chinese poem generation dataset. We present quantitative results on generating sentences from context-free and probabilistic context-free grammars, and qualitative language modeling results. A conditional version is also described that can generate sequences conditioned on sentence characteristics.

155 citations


Proceedings Article
01 Jan 2017
TL;DR: This work presents a multichannel spatiotemporal CNN architecture for semi-supervised bounding box prediction and exploratory data analysis, and demonstrates that this approach is able to leverage temporal information and unlabeled data to improve the localization of extreme weather events.
Abstract: The detection and identification of extreme weather events in large-scale climate simulations is an important problem for risk management, informing governmental policy decisions and advancing our basic understanding of the climate system. Recent work has shown that fully supervised convolutional neural networks (CNNs) can yield acceptable accuracy for classifying well-known types of extreme weather events when large amounts of labeled data are available. However, many different types of spatially localized climate patterns are of interest, including hurricanes, extra-tropical cyclones, weather fronts, and blocking events among others. Existing labeled data for these patterns can be incomplete in various ways, such as covering only certain years or geographic areas and having false negatives. This type of climate data therefore poses a number of interesting machine learning challenges. We present a multichannel spatiotemporal CNN architecture for semi-supervised bounding box prediction and exploratory data analysis. We demonstrate that our approach is able to leverage temporal information and unlabeled data to improve the localization of extreme weather events. Further, we explore the representations learned by our model in order to better understand this important data. We present a dataset, ExtremeWeather, to encourage machine learning research in this area and to help facilitate further work in understanding and mitigating the effects of climate change. The dataset is available at extremeweatherdataset.github.io and the code is available at https://github.com/eracah/hur-detect.

118 citations


Posted Content
TL;DR: In this article, a weight matrix factorization and parameterization strategy is proposed to control the degree of expansivity induced during backpropagation, and the authors find that hard constraints on orthogonality can negatively affect convergence and model performance.
Abstract: It is well known that it is challenging to train deep neural networks and recurrent neural networks for tasks that exhibit long term dependencies. The vanishing or exploding gradient problem is a well known issue associated with these challenges. One approach to addressing vanishing and exploding gradients is to use either soft or hard constraints on weight matrices so as to encourage or enforce orthogonality. Orthogonal matrices preserve gradient norm during backpropagation and may therefore be a desirable property. This paper explores issues with optimization convergence, speed and gradient stability when encouraging or enforcing orthogonality. To perform this analysis, we propose a weight matrix factorization and parameterization strategy through which we can bound matrix norms and therein control the degree of expansivity induced during backpropagation. We find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance.

Proceedings ArticleDOI
12 Apr 2017
TL;DR: In this article, an attention-based modular neural framework for computer vision is proposed, which consists of three modules: a recurrent attention module controlling where to look in an image or video frame, a feature-extraction module providing a representation of what is seen, and an objective module formalizing why the model learns its attentive behavior.
Abstract: We present an attention-based modular neural framework for computer vision. The framework uses a soft attention mechanism allowing models to be trained with gradient descent. It consists of three modules: a recurrent attention module controlling where to look in an image or video frame, a feature-extraction module providing a representation of what is seen, and an objective module formalizing why the model learns its attentive behavior. The attention module allows the model to focus computation on task-related information in the input. We apply the framework to several object tracking tasks and explore various design choices. We experiment with three datasets: bouncing ball, moving digits, and the real-world KTH dataset. The proposed RATM (Recurrent Attentive Tracking Model) performs well on all three tasks and can generalize to related but previously unseen sequences from a challenging tracking dataset.
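The "where to look" module relies on a soft, differentiable attention window; a common construction for such windows, and roughly the one used in this line of work, is a separable grid of Gaussian filters over the image, as in DRAW-style read operations. A simplified sketch with assumed parameter names (the attention module would emit cx, cy, stride, and sigma):

    import torch

    def gaussian_glimpse(img, cx, cy, stride, sigma, n=8):
        """Extract an n-by-n soft glimpse from img (H, W) around (cx, cy).
        Filter banks Fy (n, H) and Fx (n, W) hold Gaussian responses."""
        H, W = img.shape
        k = torch.arange(n, dtype=torch.float32) - (n - 1) / 2.0
        mu_x = cx + k * stride                 # glimpse grid centres, x
        mu_y = cy + k * stride                 # glimpse grid centres, y
        xs = torch.arange(W, dtype=torch.float32)
        ys = torch.arange(H, dtype=torch.float32)
        Fx = torch.exp(-(xs[None, :] - mu_x[:, None]) ** 2 / (2 * sigma ** 2))
        Fy = torch.exp(-(ys[None, :] - mu_y[:, None]) ** 2 / (2 * sigma ** 2))
        Fx = Fx / (Fx.sum(dim=1, keepdim=True) + 1e-8)   # normalize each filter
        Fy = Fy / (Fy.sum(dim=1, keepdim=True) + 1e-8)
        return Fy @ img @ Fx.t()               # (n, n) glimpse, differentiable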

Proceedings ArticleDOI
31 May 2017
TL;DR: A simple baseline is introduced that addresses the discrete output space problem without relying on gradient estimators and is able to achieve state-of-the-art results on a Chinese poem generation dataset.
Abstract: Generative Adversarial Networks (GANs) have gathered a lot of attention from the computer vision community, yielding impressive results for image generation. Advances in the adversarial generation of natural language from noise however are not commensurate with the progress made in generating images, and still lag far behind likelihood based methods. In this paper, we take a step towards generating natural language with a GAN objective alone. We introduce a simple baseline that addresses the discrete output space problem without relying on gradient estimators and show that it is able to achieve state-of-the-art results on a Chinese poem generation dataset. We present quantitative results on generating sentences from context-free and probabilistic context-free grammars, and qualitative language modeling results. A conditional version is also described that can generate sequences conditioned on sentence characteristics.
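The baseline sidesteps the discrete output problem by letting the generator emit a softmax distribution over the vocabulary at every position and feeding those continuous vectors straight to the discriminator, while real sentences enter as one-hot vectors; ordinary gradients then flow with no REINFORCE-style estimator. A schematic sketch under those assumptions (G, D, and the optimizer are placeholders):

    import torch
    import torch.nn.functional as F

    def generator_step(G, D, opt_G, z):
        """G(z) -> (batch, seq_len, vocab) logits; D scores sequences of
        continuous token distributions. Real data would enter D one-hot:
        F.one_hot(token_ids, vocab).float()."""
        opt_G.zero_grad()
        soft_tokens = torch.softmax(G(z), dim=-1)       # differentiable "words"
        target = torch.ones(z.size(0), 1)               # label: "real"
        g_loss = F.binary_cross_entropy_with_logits(D(soft_tokens), target)
        g_loss.backward()                               # plain gradients suffice
        opt_G.step()
        return g_loss.item()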

Proceedings ArticleDOI
21 Jul 2017
TL;DR: The MovieFIB dataset as mentioned in this paper is a large-scale question-answering dataset with over 300,000 examples based on descriptive video annotations for the visually impaired, which is used as a benchmark for video understanding.
Abstract: While deep convolutional neural networks frequently approach or exceed human-level performance in benchmark tasks involving static images, extending this success to moving images is not straightforward. Video understanding is of interest for many applications, including content recommendation, prediction, summarization, event/object detection, and understanding human visual perception. However, many domains lack sufficient data to explore and perfect video models. In order to address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of five different models' predictions, and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features, and the effects of increasing dataset size, the number of frames sampled, and vocabulary size. We illustrate that this task is not solvable by a language model alone; that our model combining 2D and 3D visual information indeed provides the best result; and that all models perform significantly worse than human level. We provide human evaluation for responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgment. We suggest avenues for improving video models, and hope that the MovieFIB challenge can be useful for measuring and encouraging progress in this very interesting field.

Proceedings Article
17 Jul 2017
TL;DR: This work proposes a straightforward technique to constrain discrete ordinal probability distributions to be unimodal via a combination of the Poisson probability mass function and the softmax nonlinearity.
Abstract: Probability distributions produced by the cross-entropy loss for ordinal classification problems can possess undesired properties. We propose a straightforward technique to constrain discrete ordinal probability distributions to be unimodal via the use of the Poisson and binomial probability distributions. We evaluate this approach in the context of deep learning on two large ordinal image datasets, obtaining promising results.
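For K ordinal classes, the construction lets the network output a single positive rate λ, takes the Poisson log-mass log p(k|λ) = k·log λ − λ − log k! for k = 0..K−1, and pushes those scores through a softmax; since softmax preserves ordering, the result is guaranteed unimodal in k. A minimal sketch (the temperature and names are assumptions):

    import torch

    def unimodal_poisson_probs(lam, K, tau=1.0):
        """Map a positive rate lam (batch,) to unimodal probabilities
        over K ordinal classes via the Poisson log-pmf + softmax."""
        k = torch.arange(K, dtype=torch.float32)          # classes 0..K-1
        log_pmf = (k * torch.log(lam[:, None])
                   - lam[:, None]
                   - torch.lgamma(k + 1.0))               # log k! = lgamma(k+1)
        return torch.softmax(log_pmf / tau, dim=1)

    lam = torch.nn.functional.softplus(torch.randn(4)) + 1e-6  # positive rates
    print(unimodal_poisson_probs(lam, K=8).sum(dim=1))         # rows sum to 1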

Posted Content
TL;DR: In this paper, the authors leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly available.
Abstract: We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly available. First, we propose the framework of sequential multitasking and explore it here through an architecture for landmark localization where training with class labels acts as an auxiliary signal to guide the landmark localization on unlabeled data. A key aspect of our approach is that errors can be backpropagated through a complete landmark localization model. Second, we propose and explore an unsupervised learning technique for landmark localization based on having a model predict equivariant landmarks with respect to transformations applied to the image. We show that these techniques improve landmark prediction considerably and can learn effective detectors even when only a small fraction of the dataset has landmark labels. We present results on two toy datasets and four real datasets, with hands and faces, and report new state-of-the-art on two datasets in the wild; e.g., with only 5% of labeled images we outperform previous state-of-the-art trained on the AFLW dataset.
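The unsupervised equivariance idea: predict landmarks on an image and on a transformed copy, then require the first prediction, mapped through the same transformation, to match the second — no landmark labels needed. A sketch using a horizontal flip as the (assumed) transformation:

    import torch

    def equivariance_loss(model, images):
        """model(images) -> (batch, n_landmarks, 2) with (x, y) in [0, 1].
        Uses a horizontal flip as the transformation T."""
        flipped = torch.flip(images, dims=[-1])            # T applied to image
        pts = model(images)
        pts_flipped = model(flipped)
        x_mapped = 1.0 - pts[..., 0:1]                     # T applied to x-coords
        pts_mapped = torch.cat([x_mapped, pts[..., 1:]], dim=-1)
        return ((pts_mapped - pts_flipped) ** 2).mean()

For bilaterally symmetric objects such as faces, a flip also swaps left/right landmark identities, so a real implementation would additionally permute landmark indices; that detail is omitted here for brevity.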

Posted Content
TL;DR: In this paper, a model for joint segmentation of the liver and liver lesions in computed tomography (CT) volumes is proposed, where two fully convolutional networks are connected in tandem and trained together end-to-end.
Abstract: We propose a model for the joint segmentation of the liver and liver lesions in computed tomography (CT) volumes. We build the model from two fully convolutional networks, connected in tandem and trained together end-to-end. We evaluate our approach on the 2017 MICCAI Liver Tumour Segmentation Challenge, attaining competitive liver and liver lesion detection and segmentation scores across a wide range of metrics. Unlike other top performing methods, our model output post-processing is trivial, we do not use data external to the challenge, and we propose a simple single-stage model that is trained end-to-end. However, our method nearly matches the top lesion segmentation performance and achieves the second highest precision for lesion detection while maintaining high recall.

Journal ArticleDOI
TL;DR: A deformable model that uses a voxel classifier based on a multilayer perceptron (MLP) to interpret the CT image and considers vertex displacement towards apparent tumour boundaries and regularization that promotes surface smoothness is proposed.
Abstract: The segmentation of liver tumours in CT images is useful for the diagnosis and treatment of liver cancer. Furthermore, an accurate assessment of tumour volume aids in the diagnosis and evaluation of treatment response. Currently, segmentation is performed manually by an expert, and because of the time required, a rough estimate of tumour volume is often done instead. We propose a semi-automatic segmentation method that makes use of machine learning within a deformable surface model. Specifically, we propose a deformable model that uses a voxel classifier based on a multilayer perceptron (MLP) to interpret the CT image. The new deformable model considers vertex displacement towards apparent tumour boundaries and regularization that promotes surface smoothness. During operation, a user identifies the target tumour and the mesh then automatically delineates the tumour from the MLP processed image. The method was tested on a dataset of 40 abdominal CT scans with a total of 95 colorectal metastases collected from a variety of scanners with variable spatial resolution. The segmentation results are encouraging, with a Dice similarity metric of [Formula: see text], and demonstrate that the proposed method can deal with highly variable data. This work motivates further research into tumour segmentation using machine learning with more data and deeper neural networks.
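The deformable model balances two forces on each mesh vertex: displacement along its normal toward boundary evidence in the MLP-processed image, and a smoothing term pulling it toward the centroid of its neighbours. A toy per-vertex update sketch, where the weights and the boundary-response function are assumptions:

    import numpy as np

    def deform_step(verts, normals, neighbors, boundary_response,
                    alpha=0.5, beta=0.2):
        """verts: (N, 3); normals: (N, 3); neighbors: list of index lists.
        boundary_response(v) > 0 pushes outward, < 0 pulls inward, as
        derived from the MLP-processed image."""
        new_verts = verts.copy()
        for i, v in enumerate(verts):
            image_force = alpha * boundary_response(v) * normals[i]
            centroid = verts[neighbors[i]].mean(axis=0)
            smooth_force = beta * (centroid - v)       # surface regularization
            new_verts[i] = v + image_force + smooth_force
        return new_verts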

Journal ArticleDOI
TL;DR: Generalized arc-consistency Expectation Maximization Message-Passing (GEM-MP), a novel message-passing approach to inference in an extended factor graph that combines constraint programming techniques with variational methods, is introduced.
Abstract: Many important real-world applications of machine learning, statistical physics, constraint programming and information theory can be formulated using graphical models that involve determinism and cycles. Accurate and efficient inference and training of such graphical models remains a key challenge. Markov logic networks (MLNs) have recently emerged as a popular framework for expressing a number of problems which exhibit these properties. While loopy belief propagation (LBP) can be an effective solution in some cases, when both determinism and cycles are present it frequently fails to converge or converges to inaccurate results. As such, sampling based algorithms have been found to be more effective and are more popular for general inference tasks in MLNs. In this paper, we introduce Generalized arc-consistency Expectation Maximization Message-Passing (GEM-MP), a novel message-passing approach to inference in an extended factor graph that combines constraint programming techniques with variational methods. We focus our experiments on Markov logic and Ising models but the method is applicable to graphical models in general. In contrast to LBP, GEM-MP formulates the message-passing structure as steps of variational expectation maximization. Moreover, in the algorithm we leverage the local structures in the factor graph by using generalized arc consistency when performing a variational mean-field approximation. Thus each such update increases a lower bound on the model evidence. Our experiments on Ising grids, entity resolution and link prediction problems demonstrate the accuracy and convergence of GEM-MP over existing state-of-the-art inference algorithms such as MC-SAT, LBP, and Gibbs sampling, as well as convergent message passing algorithms such as the concave-convex procedure, residual BP, and the L2-convex method.

Posted Content
TL;DR: This work proposes a first step toward the learning and synthesis of these using recent advances in deep generative modelling with openly available satellite imagery from NASA.
Abstract: Procedural terrain generation for video games has traditionally been done with smartly designed but handcrafted algorithms that generate heightmaps. We propose a first step toward the learning and synthesis of these using recent advances in deep generative modelling with openly available satellite imagery from NASA.

Posted Content
TL;DR: This work proposes an alternative algorithm, Sparse Attentive Backtracking, which might also be related to principles used by brains to learn long-term dependencies, and learns an attention mechanism over the hidden states of the past and selectively backpropagates through paths with high attention weights.
Abstract: A major drawback of backpropagation through time (BPTT) is the difficulty of learning long-term dependencies, coming from having to propagate credit information backwards through every single step of the forward computation. This makes BPTT both computationally impractical and biologically implausible. For this reason, full backpropagation through time is rarely used on long sequences, and truncated backpropagation through time is used as a heuristic. However, this usually leads to biased estimates of the gradient in which longer term dependencies are ignored. Addressing this issue, we propose an alternative algorithm, Sparse Attentive Backtracking, which might also be related to principles used by brains to learn long-term dependencies. Sparse Attentive Backtracking learns an attention mechanism over the hidden states of the past and selectively backpropagates through paths with high attention weights. This allows the model to learn long term dependencies while only backtracking for a small number of time steps, not just from the recent past but also from attended relevant past states.
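The mechanism keeps a memory of past hidden states, attends over them at each step, and lets gradients flow only through the few most-attended states (the rest are detached), so credit travels along sparse skip paths instead of through every timestep. A schematic sketch with assumed names and a simple bilinear attention:

    import torch

    def sparse_attentive_summary(h_t, memory, W_att, k=3):
        """h_t: (batch, d) current state; memory: list of past (batch, d)
        states. Backprop reaches only the k states with highest attention."""
        M = torch.stack(memory, dim=1)                    # (batch, T, d)
        scores = torch.einsum('bd,de,bte->bt', h_t, W_att, M)
        weights = torch.softmax(scores, dim=1)            # (batch, T)
        topk = torch.topk(weights, min(k, M.size(1)), dim=1).indices
        keep = torch.zeros_like(weights).scatter(1, topk, 1.0)
        # Detach the low-attention paths; keep gradients on the top-k only.
        M_sparse = keep.unsqueeze(-1) * M + (1 - keep).unsqueeze(-1) * M.detach()
        return (weights.unsqueeze(-1) * M_sparse).sum(dim=1)  # (batch, d)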

Posted Content
24 Apr 2017
TL;DR: The approach is able to leverage temporal information and unlabelled data to improve localization of extreme weather events and explore the representations learned by the model in order to better understand this important data, and facilitate further work in understanding and mitigating the effects of climate change.
Abstract: The detection and identification of extreme weather events in large scale climate simulations is an important problem for risk management, informing governmental policy decisions and advancing our basic understanding of the climate system. Recent work has shown that fully supervised convolutional neural networks (CNNs) can yield acceptable accuracy for classifying well-known types of extreme weather events when large amounts of labeled data are available. However, there are many different types of spatially localized climate patterns of interest (including hurricanes, extra-tropical cyclones, weather fronts, blocking events, etc.) found in simulation data for which labeled data is not available at large scale for all simulations of interest. We present a multichannel spatiotemporal encoder-decoder CNN architecture for semi-supervised bounding box prediction and exploratory data analysis. This architecture is designed to fully model multi-channel simulation data, temporal dynamics and unlabelled data within a reconstruction and prediction framework so as to improve the detection of a wide range of extreme weather events. Our architecture can be viewed as a 3D convolutional autoencoder with an additional modified one-pass bounding box regression loss. We demonstrate that our approach is able to leverage temporal information and unlabelled data to improve localization of extreme weather events. Further, we explore the representations learned by our model in order to better understand this important data, and facilitate further work in understanding and mitigating the effects of climate change.
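The semi-supervised objective combines a reconstruction loss, computed on every simulation frame, with a bounding-box regression loss computed only where labels exist, so the autoencoder also learns from unlabelled data. A schematic sketch of the combined loss, with the weighting and names assumed:

    import torch.nn.functional as F

    def semi_supervised_loss(model, frames, boxes=None, lam=1.0):
        """model returns (reconstruction, box_prediction) for a batch of
        multi-channel spatiotemporal frames. boxes is None when unlabelled."""
        recon, box_pred = model(frames)
        loss = F.mse_loss(recon, frames)          # unsupervised: all data
        if boxes is not None:                     # supervised: labelled subset
            loss = loss + lam * F.smooth_l1_loss(box_pred, boxes)
        return loss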

Posted Content
22 Aug 2017
TL;DR: This paper introduces a simple way of encouraging RNNs to plan for the future: an additional neural network is trained to generate the sequence in reverse order, and closeness is required between the states of the forward RNN and backward RNN that predict the same token.
Abstract: Being able to model long-term dependencies in sequential data, such as text, has been among the long-standing challenges of recurrent neural networks (RNNs). This issue is strictly related to the absence of explicit planning in current RNN architectures. More explicitly, the RNNs are trained to predict only the next token given previous ones. In this paper, we introduce a simple way of encouraging the RNNs to plan for the future. In order to accomplish this, we introduce an additional neural network which is trained to generate the sequence in reverse order, and we require closeness between the states of the forward RNN and backward RNN that predict the same token. At each step, the states of the forward RNN are required to match the future information contained in the backward states. We hypothesize that the approach eases modeling of long-term dependencies thus helping in generating more globally consistent samples. The model trained with conditional generation for a speech recognition task achieved 12% relative improvement (CER of 6.7 compared to a baseline of 7.6).
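Concretely, both RNNs are trained with their usual next-token losses, plus an L2 penalty tying each forward state to (a projection of) the backward state at the position that predicts the same token; the backward network is used only during training and discarded afterwards. A sketch of the combined objective, where the projection g and the weighting alpha are assumptions:

    import torch.nn as nn

    d = 256
    g = nn.Linear(d, d)   # learned projection of backward states

    def twin_loss(fwd_states, bwd_states, nll_fwd, nll_bwd, alpha=1.0):
        """fwd_states, bwd_states: (batch, T, d), aligned by the caller so
        that position t in both sequences predicts the same token."""
        match = ((fwd_states - g(bwd_states)) ** 2).sum(dim=-1).mean()
        return nll_fwd + nll_bwd + alpha * match   # backward net: train-time only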

Posted Content
TL;DR: In this article, a straightforward technique to constrain discrete ordinal probability distributions to be unimodal via the use of the Poisson and binomial probability distributions has been proposed.
Abstract: Probability distributions produced by the cross-entropy loss for ordinal classification problems can possess undesired properties. We propose a straightforward technique to constrain discrete ordinal probability distributions to be unimodal via the use of the Poisson and binomial probability distributions. We evaluate this approach in the context of deep learning on two large ordinal image datasets, obtaining promising results.

Posted Content
TL;DR: The GAN framework is reframed so that the generator is no longer trained using gradients through the discriminator, but is instead trained using a learned critic in the actor-critic framework with a Temporal Difference (TD) objective.
Abstract: Generative Adversarial Networks (GANs) are a powerful framework for deep generative modeling. Posed as a two-player minimax problem, GANs are typically trained end-to-end on real-valued data and can be used to train a generator of high-dimensional and realistic images. However, a major limitation of GANs is that training relies on passing gradients from the discriminator through the generator via back-propagation. This makes it fundamentally difficult to train GANs with discrete data, as generation in this case typically involves a non-differentiable function. These difficulties extend to the reinforcement learning setting when the action space is composed of discrete decisions. We address these issues by reframing the GAN framework so that the generator is no longer trained using gradients through the discriminator, but is instead trained using a learned critic in the actor-critic framework with a Temporal Difference (TD) objective. This is a natural fit for sequence modeling and we use it to achieve improvements on language modeling tasks over the standard Teacher-Forcing methods.
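Schematically, the generator becomes a policy emitting one token at a time, the discriminator's score of a finished sequence supplies the reward, and a critic is regressed onto temporal-difference targets so per-step values can update the generator without backpropagating through the discrete sampling step. A heavily simplified sketch of the critic's TD(0) update, with all names assumed:

    import torch.nn.functional as F

    def critic_td_update(critic, states, rewards, opt, gamma=1.0):
        """states: (batch, T, d) summaries of partial sequences;
        rewards: (batch,) discriminator scores, given only at the end."""
        values = critic(states).squeeze(-1)            # V(s_t): (batch, T)
        targets = gamma * values[:, 1:].detach()       # bootstrap next value
        targets[:, -1] = rewards                       # terminal reward from D
        loss = F.mse_loss(values[:, :-1], targets)     # TD(0) regression
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()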

Posted Content
TL;DR: This work proposes a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies that is capable of learning word clusters with clear syntactical and semantic meaning during the language model training process.
Abstract: We propose a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies. Instead of using a predefined hierarchical structure, our approach is capable of learning word clusters with clear syntactical and semantic meaning during the language model training process. We provide experiments on standard benchmarks for language modeling and sentence compression tasks. We find that this approach is as fast as other efficient softmax approximations, while achieving comparable or even better performance relative to similar full softmax models.
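A hierarchical softmax factors p(w) = p(c(w)) · p(w | c(w)) over word clusters c, cutting the per-step cost from O(|V|) toward O(√|V|). In the paper the cluster assignments themselves are reorganized during training; the sketch below assumes a fixed assignment for brevity, and computes all cluster blocks at once (a real implementation would evaluate only the target cluster's block to get the speedup):

    import torch
    import torch.nn as nn

    class TwoLevelSoftmax(nn.Module):
        """log p(w|h) = log p(cluster(w)|h) + log p(w | cluster(w), h)."""
        def __init__(self, d, n_clusters, cluster_size):
            super().__init__()
            self.cluster_head = nn.Linear(d, n_clusters)
            self.word_heads = nn.Linear(d, n_clusters * cluster_size)
            self.n_clusters, self.cluster_size = n_clusters, cluster_size

        def log_prob(self, h, word_ids):
            c = word_ids // self.cluster_size          # fixed cluster assignment
            j = word_ids % self.cluster_size           # index within cluster
            log_pc = torch.log_softmax(self.cluster_head(h), dim=-1)
            logits_w = self.word_heads(h).view(-1, self.n_clusters,
                                               self.cluster_size)
            log_pw = torch.log_softmax(logits_w, dim=-1)
            b = torch.arange(h.size(0))
            return log_pc[b, c] + log_pw[b, c, j]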

Posted Content
TL;DR: The authors propose to train a backward recurrent network to generate a given sequence in reverse order, and encourage states of the forward model to predict cotemporal states of a backward model.
Abstract: We propose a simple technique for encouraging generative RNNs to plan ahead. We train a "backward" recurrent network to generate a given sequence in reverse order, and we encourage states of the forward model to predict cotemporal states of the backward model. The backward network is used only during training, and plays no role during sampling or inference. We hypothesize that our approach eases modeling of long-term dependencies by implicitly forcing the forward states to hold information about the longer-term future (as contained in the backward states). We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task.