Showing papers by "Chris Pal" published in 2016

Posted Content•

Theano: A Python framework for fast computation of mathematical expressions

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul F. Christiano, Tim Cooijmans, Marc-Alexandre Côté, Myriam Côté, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien Demouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, Mélanie Ducoffe, Vincent Dumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain, Xavier Glorot, Ian Goodfellow, Matthew M. Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet, Jean-Philippe Heng, Balázs Hidasi, Sina Honari, Arjun Jain, Sébastien Jean, Kai Jia, Mikhail Korobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, César Laurent, Sean Lee, Simon Lefrancois, Simon Lemieux, Nicholas Léonard, Zhouhan Lin, Jesse A. Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Chris Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, Ying Zhang

09 May 2016-arXiv: Symbolic Computation

TL;DR: The performance of Theano is compared against Torch7 and TensorFlow on several machine learning models and recently-introduced functionalities and improvements are discussed.

Abstract: Theano is a Python library that allows users to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it, and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.

2,194 citations

Book Chapter•DOI•

The Importance of Skip Connections in Biomedical Image Segmentation

Michal Drozdzal¹, Eugene Vorontsov¹, Gabriel Chartrand², Samuel Kadoury¹, Chris Pal¹

École Polytechnique de Montréal¹, Université de Montréal²

21 Oct 2016

TL;DR: This paper extends Fully Convolutional Networks by adding short skip connections, similar to those introduced in residual networks, in order to build very deep FCNs (of hundreds of layers).

Abstract: In this paper, we study the influence of both long and short skip connections on Fully Convolutional Networks (FCN) for biomedical image segmentation. In standard FCNs, only long skip connections are used to skip features from the contracting path to the expanding path in order to recover spatial information lost during downsampling. We extend FCNs by adding short skip connections, similar to those introduced in residual networks, in order to build very deep FCNs (of hundreds of layers). A review of the gradient flow confirms that for a very deep FCN it is beneficial to have both long and short skip connections. Finally, we show that a very deep FCN can achieve near-to-state-of-the-art results on the EM dataset without any further post-processing.

663 citations
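
The difference between the two kinds of skip connection in the abstract above can be illustrated with a toy 1-D example. This is only a sketch under stated assumptions, not the authors' architecture: `conv_like`, `residual_block`, and `tiny_fcn` are hypothetical names, the "layer" is a scalar scale followed by ReLU, downsampling keeps every second value, and the long skip is merged by summation.

```python
def conv_like(features, weight):
    """Stand-in for a learned layer (assumption): scale each value, then apply ReLU."""
    return [max(0.0, weight * v) for v in features]


def residual_block(features, weight):
    """Short skip connection: the block's input is added back to its output."""
    transformed = conv_like(features, weight)
    return [x + t for x, t in zip(features, transformed)]


def tiny_fcn(features, weight=0.5):
    """Toy contracting/expanding path with one long skip connection."""
    # Contracting path: process and remember the high-resolution features.
    high_res = residual_block(features, weight)
    # Downsample by keeping every second value (stand-in for pooling).
    low_res = residual_block(high_res[::2], weight)
    # Expanding path: nearest-neighbour upsampling back to the original length.
    upsampled = [v for v in low_res for _ in range(2)][:len(high_res)]
    # Long skip connection: merge the stored high-resolution features
    # (summation is an assumption made for this sketch).
    return [u + h for u, h in zip(upsampled, high_res)]


print(tiny_fcn([1.0, 2.0, 3.0, 4.0]))
```

The short skip keeps an identity path inside each block, while the long skip carries spatial detail from before downsampling to after upsampling.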

Posted Content•

The Importance of Skip Connections in Biomedical Image Segmentation

Michal Drozdzal¹, Eugene Vorontsov¹, Gabriel Chartrand², Samuel Kadoury¹, Chris Pal¹

École Polytechnique de Montréal¹, Université de Montréal²

14 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This article studies the influence of both long and short skip connections on Fully Convolutional Networks (FCN) for biomedical image segmentation, and shows that for a very deep FCN it is beneficial to have both long and short skip connections.

451 citations

Journal Article•DOI•

EmoNets: Multimodal deep learning approaches for emotion recognition in video

Samira Ebrahimi Kahou¹, Xavier Bouthillier¹, Pascal Lamblin¹, Caglar Gulcehre¹, Vincent Michalski², Kishore Konda², Sébastien Jean¹, Pierre Froumenty¹, Yann N. Dauphin¹, Nicolas Boulanger-Lewandowski¹, Raul Chandias Ferrari¹, Mehdi Mirza¹, David Warde-Farley¹, Aaron Courville¹, Pascal Vincent¹, Roland Memisevic¹, Chris Pal¹, Yoshua Bengio¹

Université de Montréal¹, Goethe University Frankfurt²

01 Jun 2016-Journal on Multimodal User Interfaces

TL;DR: The authors present an approach to learning several specialist models using deep learning techniques, each focusing on one modality: a convolutional neural network, a deep belief net, a K-Means based bag-of-mouths model, and a relational autoencoder.

Abstract: The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces; a deep belief net, focusing on the representation of the audio stream; a K-Means based “bag-of-mouths” model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.

357 citations
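
The abstract above mentions combining cues from several single-modality models into one classifier without saying how. One simple possibility, shown here purely as an illustration and not as the method used in the paper, is a weighted average of each model's class probabilities. The function name `fuse_predictions`, the modality names, and the weights are all hypothetical.

```python
def fuse_predictions(modality_probs, weights):
    """Weighted average of per-modality class probabilities.

    modality_probs: dict mapping a modality name to a list of class probabilities.
    weights: dict mapping the same names to non-negative weights (assumed given).
    """
    total = sum(weights[name] for name in modality_probs)
    n_classes = len(next(iter(modality_probs.values())))
    fused = [0.0] * n_classes
    for name, probs in modality_probs.items():
        share = weights[name] / total  # normalise so the shares sum to 1
        for i, p in enumerate(probs):
            fused[i] += share * p
    return fused


# Hypothetical two-class example with a face model and an audio model.
fused = fuse_predictions(
    {"face_cnn": [1.0, 0.0], "audio_dbn": [0.0, 1.0]},
    {"face_cnn": 3.0, "audio_dbn": 1.0},
)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

Because the shares are normalised, the fused vector remains a probability distribution whenever each input is one.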

Posted Content•

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

David Krueger¹, Tegan Maharaj², János Kramár¹, Mohammad Pezeshki¹, Nicolas Ballas¹, Nan Rosemary Ke¹, Anirudh Goyal¹, Yoshua Bengio¹, Aaron Courville¹, Chris Pal¹

Université de Montréal¹, École Polytechnique de Montréal²

03 Jun 2016-arXiv: Neural and Evolutionary Computing

TL;DR: This work proposes zoneout, a novel method for regularizing RNNs that, like dropout, uses random noise to train a pseudo-ensemble and improve generalization. An empirical investigation of various RNN regularizers finds that zoneout gives significant performance improvements across tasks.

Abstract: We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.

263 citations
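
The per-timestep rule described in the abstract above (each hidden unit either keeps its previous value or takes its newly computed one) can be sketched in a few lines. This is a minimal illustration, not the authors' code: `zoneout_step` and its parameters are hypothetical names, the candidate state is assumed to come from an ordinary RNN transition computed elsewhere, and the test-time behaviour (using the expected value, by analogy with dropout) is an assumption not stated in the abstract.

```python
import random


def zoneout_step(h_prev, h_candidate, zoneout_rate, rng, training=True):
    """Apply zoneout to one timestep of a hidden state (lists of floats)."""
    if training:
        # Each unit independently keeps its previous value with
        # probability zoneout_rate; otherwise it takes the new value.
        return [
            hp if rng.random() < zoneout_rate else hc
            for hp, hc in zip(h_prev, h_candidate)
        ]
    # Assumed test-time rule: use the expectation of the random mask.
    return [
        zoneout_rate * hp + (1.0 - zoneout_rate) * hc
        for hp, hc in zip(h_prev, h_candidate)
    ]


rng = random.Random(0)  # seeded only to make the sketch repeatable
h_prev = [0.0, 2.0, -1.0]
h_candidate = [4.0, 4.0, 4.0]
print(zoneout_step(h_prev, h_candidate, zoneout_rate=0.5, rng=rng))
```

Unlike dropout, a "zoned-out" unit is copied forward rather than set to zero, which is what lets state and gradient information pass through that timestep unchanged.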

Proceedings Article•

Delving Deeper into Convolutional Networks for Learning Video Representations

Nicolas Ballas¹, Li Yao¹, Chris Pal², Aaron Courville¹

Université de Montréal¹, École Polytechnique de Montréal²

01 Jan 2016

TL;DR: Gated-Recurrent-Unit Recurrent Networks (GRUs) are used to learn spatio-temporal features in videos from intermediate visual representations called "percepts", which are extracted from all levels of a deep convolutional network trained on the ImageNet dataset.

Abstract: We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset. While high-level percepts contain highly discriminative information, they tend to have a low spatial resolution. Low-level percepts, on the other hand, preserve a higher spatial resolution from which we can model finer motion patterns. Using low-level percepts can lead to high-dimensional video representations. To mitigate this effect and control the number of model parameters, we introduce a variant of the GRU model that leverages convolution operations to enforce sparse connectivity of the model units and share parameters across the input spatial locations. We empirically validate our approach on both Human Action Recognition and Video Captioning tasks. In particular, we achieve results equivalent to state-of-the-art on the YouTube2Text dataset using a simpler text-decoder model and without extra 3D CNN features.

259 citations
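
The idea of replacing a GRU's fully connected products with convolutions, so that parameters are shared across spatial locations, can be sketched on a single-channel 1-D signal. This is an illustrative simplification rather than the paper's model: the function names and the `params` dictionary are hypothetical, kernels are assumed to have odd length, biases are omitted, and the update rule follows the common GRU convention h_new = (1 - z) * h + z * candidate.

```python
import math


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv_gru_step(x, h, params):
    """One convolutional GRU update over a 1-D spatial axis."""
    # Update gate z and reset gate r, each computed with small shared kernels.
    z = [sigmoid(a + b) for a, b in
         zip(conv1d_same(x, params["wz"]), conv1d_same(h, params["uz"]))]
    r = [sigmoid(a + b) for a, b in
         zip(conv1d_same(x, params["wr"]), conv1d_same(h, params["ur"]))]
    # Candidate state uses the reset-gated previous state.
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in
            zip(conv1d_same(x, params["wh"]), conv1d_same(gated_h, params["uh"]))]
    # Blend previous state and candidate at every spatial location.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# Hypothetical kernels of width 3: six kernels in total, regardless of input length.
params = {name: [0.1, 0.2, 0.1] for name in ("wz", "uz", "wr", "ur", "wh", "uh")}
print(conv_gru_step([1.0, 0.0, -1.0, 0.5], [0.0, 0.0, 0.0, 0.0], params))
```

The parameter count here depends only on the kernel width, not on the spatial size of the input, which is the saving the abstract refers to.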

Proceedings Article•DOI•

Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation

Sina Honari¹, Jason Yosinski², Pascal Vincent¹, Chris Pal³

Université de Montréal¹, Cornell University², École Polytechnique³

27 Jun 2016

TL;DR: Recombinator Networks let coarse features inform finer features early in their formation, so that finer features can make use of several layers of computation in deciding how to use coarse features when producing robust pixel-level predictions.

Abstract: Deep neural networks with alternating convolutional, max-pooling and decimation layers are widely used in state-of-the-art architectures for computer vision. Max-pooling purposefully discards precise spatial information in order to create features that are more robust, and typically organized as lower resolution spatial feature maps. On some tasks, such as whole-image classification, max-pooling derived features are well suited; however, for tasks requiring precise localization, such as pixel-level prediction and segmentation, max-pooling destroys exactly the information required to perform well. Precise localization may be preserved by shallow convnets without pooling but at the expense of robustness. Can we have our max-pooled multilayered cake and eat it too? Several papers have proposed summation and concatenation based methods for combining upsampled coarse, abstract features with finer features to produce robust pixel-level predictions. Here we introduce another model, dubbed Recombinator Networks, where coarse features inform finer features early in their formation such that finer features can make use of several layers of computation in deciding how to use coarse features. The model is trained once, end-to-end, and performs better than summation-based architectures, reducing the error from the previous state of the art on two facial keypoint datasets, AFW and AFLW, by 30% and beating the current state of the art on 300W without using extra data. We improve performance even further by adding a denoising prediction model based on a novel convnet formulation.

101 citations

Posted Content•

ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events

Evan Racah¹, Christopher Beckham², Tegan Maharaj², Samira Ebrahimi Kahou², Prabhat¹, Chris Pal²

Lawrence Berkeley National Laboratory¹, École Polytechnique de Montréal²

07 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a multichannel spatio-temporal CNN architecture for semi-supervised bounding box prediction and exploratory data analysis is proposed to detect extreme weather events in large-scale climate simulations.

Abstract: The detection and identification of extreme weather events in large-scale climate simulations is an important problem for risk management, informing governmental policy decisions and advancing our basic understanding of the climate system. Recent work has shown that fully supervised convolutional neural networks (CNNs) can yield acceptable accuracy for classifying well-known types of extreme weather events when large amounts of labeled data are available. However, many different types of spatially localized climate patterns are of interest, including hurricanes, extra-tropical cyclones, weather fronts, and blocking events, among others. Existing labeled data for these patterns can be incomplete in various ways, such as covering only certain years or geographic areas and having false negatives. This type of climate data therefore poses a number of interesting machine learning challenges. We present a multichannel spatiotemporal CNN architecture for semi-supervised bounding box prediction and exploratory data analysis. We demonstrate that our approach is able to leverage temporal information and unlabeled data to improve the localization of extreme weather events. Further, we explore the representations learned by our model in order to better understand this important data. We present a dataset, ExtremeWeather, to encourage machine learning research in this area and to help facilitate further work in understanding and mitigating the effects of climate change. The dataset is available at this http URL and the code is available at this https URL.

94 citations

Proceedings Article•

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.

Université de Montréal¹, École Polytechnique de Montréal²

03 Jun 2016

TL;DR: Zoneout uses random noise to train a pseudo-ensemble, improving generalization; by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks.

Abstract: We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.

91 citations

Posted Content•

A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

Tegan Maharaj¹, Nicolas Ballas², Anna Rohrbach³, Aaron Courville², Chris Pal¹

École Polytechnique de Montréal¹, Université de Montréal², Max Planck Society³

23 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents MovieFIB, a fill-in-the-blank video question-answering dataset, and shows that the task is not solvable by a language model alone, that a model combining 2D and 3D visual information provides the best result, and that all models perform significantly worse than human level.

Abstract: While deep convolutional neural networks frequently approach or exceed human-level performance at benchmark tasks involving static images, extending this success to moving images is not straightforward. Having models which can learn to understand video is of interest for many applications, including content recommendation, prediction, summarization, event/object detection and understanding human visual perception, but many domains lack sufficient data to explore and perfect video models. In order to address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features, as well as the effects of increasing dataset size, of the number of frames sampled, and of vocabulary size. We illustrate that: this task is not solvable by a language model alone; our model combining 2D and 3D visual information indeed provides the best result; all models perform significantly worse than human-level. We provide human evaluations for responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgement. We suggest avenues for improving video models, and hope that the proposed dataset can be useful for measuring and encouraging progress in this very interesting field.

36 citations

Posted Content•

A simple squared-error reformulation for ordinal classification.

Christopher Beckham, Chris Pal

02 Dec 2016-arXiv: Machine Learning

TL;DR: This paper explores ordinal classification (in the context of deep neural networks) through a simple modification of the squared error loss which not only allows it to be sensitive to class ordering, but also allows the possibility of having a discrete probability distribution over the classes.

Abstract: In this paper, we explore ordinal classification (in the context of deep neural networks) through a simple modification of the squared error loss which not only allows it to be sensitive to class ordering, but also allows the possibility of having a discrete probability distribution over the classes. Our formulation is based on the use of a softmax hidden layer, which has received relatively little attention in the literature. We empirically evaluate its performance on the Kaggle diabetic retinopathy dataset, an ordinal and high-resolution dataset, and show that it outperforms all of the baselines employed.
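
One way to read the abstract above is that a softmax layer produces a distribution over the ordered classes, a single scalar prediction is derived from that distribution, and squared error is applied to the scalar. The sketch below takes the scalar to be the expected class index; that choice, the class values 0..K-1, and the name `ordinal_squared_error` are assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordinal_squared_error(logits, target_class):
    """Squared error between the expected class index and the integer target.

    Returns (loss, class probabilities, expected class index).
    """
    probs = softmax(logits)  # discrete distribution over the classes
    expected = sum(i * p for i, p in enumerate(probs))  # assumed class values 0..K-1
    loss = (expected - target_class) ** 2
    return loss, probs, expected


# Hypothetical logits strongly favouring class 2 out of {0, 1, 2}.
logits = [0.0, 0.0, 10.0]
for target in (0, 1, 2):
    loss, _, _ = ordinal_squared_error(logits, target)
    print(target, round(loss, 4))
```

Under these assumptions the loss grows with the distance between the predicted and true class, which a plain cross-entropy loss would not capture, while the softmax output still provides a probability for each class.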

Posted Content•

Convolutional Residual Memory Networks.

Joel Ruben Antony Moniz, Chris Pal

16 Jun 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes and evaluates a memory-mechanism-enhanced convolutional neural network architecture, based on augmenting convolutional residual networks with a long short-term memory mechanism, which can yield state-of-the-art performance on the CIFAR-100 benchmark and compares well with other state-of-the-art techniques on the CIFAR-10 and SVHN benchmarks.

Abstract: Very deep convolutional neural networks (CNNs) yield state-of-the-art results on a wide variety of visual recognition problems. A number of state-of-the-art methods for image recognition are based on networks with well over 100 layers, and the performance vs. depth trend is moving towards networks in excess of 1000 layers. In such extremely deep architectures the vanishing or exploding gradient problem becomes a key issue. Recent evidence also indicates that convolutional networks could benefit from an interface to explicitly constructed memory mechanisms interacting with a CNN feature processing hierarchy. Correspondingly, we propose and evaluate a memory-mechanism-enhanced convolutional neural network architecture based on augmenting convolutional residual networks with a long short-term memory mechanism. We refer to this as a convolutional residual memory network. To the best of our knowledge this approach can yield state-of-the-art performance on the CIFAR-100 benchmark and compares well with other state-of-the-art techniques on the CIFAR-10 and SVHN benchmarks. This is achieved using networks with more breadth, much less depth and much less overall computation relative to comparable deep ResNets without the memory mechanism. Our experiments and analysis explore the importance of the memory mechanism, network depth, breadth, and predictive performance.

Journal Article•DOI•

Fully automatic person segmentation in unconstrained video using spatio-temporal conditional random fields

Chetan Bhole¹, Chris Pal²

University of Rochester¹, Université de Montréal²

01 Jul 2016-Image and Vision Computing

TL;DR: This paper proposes a method for automatic segmentation of people that yields state of the art qualitative and quantitative performance compared to prior work and more heuristic alternative approaches and provides an extensive evaluation of the approach.

Journal Article•DOI•

Improving facial analysis and performance driven animation through disentangling identity and expression

David Rim¹, Sina Honari², Kamrul Hasan¹, Chris Pal¹

École Polytechnique de Montréal¹, Université de Montréal²

01 Aug 2016-Image and Vision Computing

TL;DR: This paper uses a weakly-supervised approach in which identity labels are used to learn the different factors of variation linked to identity separately from factors related to expression to improve performance on emotion recognition, markerless performance-driven facial animation and facial key-point tracking.

Posted Content•

Movie Description

Anna Rohrbach¹, Atousa Torabi², Marcus Rohrbach³, Niket Tandon¹, Chris Pal⁴, Hugo Larochelle⁵, Aaron Courville⁶, Bernt Schiele¹

Max Planck Society¹, Disney Research², University of California, Berkeley³, École Polytechnique de Montréal⁴, Université de Sherbrooke⁵, Université de Montréal⁶

12 May 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: A novel dataset of transcribed audio descriptions (ADs), temporally aligned to full-length movies, is proposed; the authors find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production.

Abstract: Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full-length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015.