Showing papers by &quot;Luc Van Gool published in 2016&quot;

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

08 Oct 2016

TL;DR: Temporal Segment Networks (TSN) as discussed by the authors combine a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video, which obtains the state-of-the-art performance on the datasets of HMDB51 and UCF101.

...read moreread less

Abstract: Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition. which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-the-of-art performance on the datasets of HMDB51 ($ 69.4\,\% $) and UCF101 ($ 94.2\,\% $). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices (Models and code at https://github.com/yjxiong/temporal-segment-networks).

...read moreread less

2,778 citations

Posted Content•

[...]

Limin Wang¹, Yuanjun Xiong², Zhe Wang, Yu Qiao, Dahua Lin², Xiaoou Tang², Luc Van Gool¹ - Show less +3 more•Institutions (2)

One-Shot Video Object Segmentation

02 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Temporal Segment Network (TSN) as discussed by the authors is based on the idea of long-range temporal structure modeling and combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video.

...read moreread less

Abstract: Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition. which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains the state-the-of-art performance on the datasets of HMDB51 ( $ 69.4\% $) and UCF101 ($ 94.2\% $). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices.

...read moreread less

958 citations

Proceedings Article•

Dynamic Filter Networks

[...]

Xu Jia¹, Bert De Brabandere¹, Tinne Tuytelaars¹, Luc Van Gool²•Institutions (2)

Katholieke Universiteit Leuven¹, ETH Zurich²

01 Jan 2016

TL;DR: In this article, the Dynamic Filter Network (DFN) is proposed, where filters are generated dynamically conditioned on an input, and a wide variety of filtering operation can be learned this way, including local spatial transformations, selective (de)blurring or adaptive feature extraction.

...read moreread less

Abstract: In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operation can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach state-of-the-art performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation.

...read moreread less

819 citations

Posted Content•

[...]

Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool - Show less +2 more

16 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: One-shot video object segmentation (OSVOS) as mentioned in this paper is based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence.

...read moreread less

Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).

...read moreread less

523 citations

Book Chapter•DOI•

Deep Retinal Image Understanding

[...]

Kevis-Kokitsi Maninis¹, Jordi Pont-Tuset¹, Pablo Arbeláez², Luc Van Gool¹, Luc Van Gool³ - Show less +1 more•Institutions (3)

ETH Zurich¹, University of Los Andes², Katholieke Universiteit Leuven³

17 Oct 2016

TL;DR: Deep Retinal Image Understanding is presented, a unified framework of retinal image analysis that provides both retinal vessel and optic disc segmentation and shows super-human performance, that is, it shows results more consistent with a gold standard than a second human annotator used as control.

...read moreread less

Abstract: This paper presents Deep Retinal Image Understanding (DRIU), a unified framework of retinal image analysis that provides both retinal vessel and optic disc segmentation. We make use of deep Convolutional Neural Networks (CNNs), which have proven revolutionary in other fields of computer vision such as object detection and image classification, and we bring their power to the study of eye fundus images. DRIU uses a base network architecture on which two set of specialized layers are trained to solve both the retinal vessel and optic disc segmentation. We present experimental validation, both qualitative and quantitative, in four public datasets for these tasks. In all of them, DRIU presents super-human performance, that is, it shows results more consistent with a gold standard than a second human annotator used as control.

...read moreread less

416 citations

Proceedings Article•DOI•

Seven Ways to Improve Example-Based Single Image Super Resolution

[...]

Radu Timofte¹, Rasmus Rothe¹, Luc Van Gool¹•Institutions (1)

ETH Zurich¹

27 Jun 2016

TL;DR: In this article, the authors present seven techniques that everybody should know to improve example-based single image super resolution (SR): augmentation of data, use of large dictionaries with efficient search structures, cascading, image self-similarities, back projection refinement, enhanced prediction by consistency check, and context reasoning.

...read moreread less

Abstract: In this paper we present seven techniques that everybody should know to improve example-based single image super resolution (SR): 1) augmentation of data, 2) use of large dictionaries with efficient search structures, 3) cascading, 4) image self-similarities, 5) back projection refinement, 6) enhanced prediction by consistency check, and 7) context reasoning. We validate our seven techniques on standard SR benchmarks (i.e. Set5, Set14, B100) and methods (i.e. A+, SRCNN, ANR, Zeyde, Yang) and achieve substantial improvements. The techniques are widely applicable and require no changes or only minor adjustments of the SR methods. Moreover, our Improved A+ (IA) method sets new stateof-the-art results outperforming A+ by up to 0.9dB on average PSNR whilst maintaining a low time complexity.

...read moreread less

366 citations

Posted Content•

A Riemannian Network for SPD Matrix Learning

[...]

Zhiwu Huang¹, Luc Van Gool²•Institutions (2)

Fast Optical Flow using Dense Inverse Search

15 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a Riemannian network architecture is proposed for symmetric positive definite (SPD) matrix learning, where bilinear mapping layers are used to transform the input SPD matrices to more desirable SPD matrix matrices, eigenvalue rectification layers are exploited to apply a non-linear activation function to the new non-regular activation function, and an eigen value logarithm layer is designed to perform Riemanian computing on the resulting SPD matures for regular output layers.

...read moreread less

Abstract: Symmetric Positive Definite (SPD) matrix learning methods have become popular in many image and video processing tasks, thanks to their ability to learn appropriate statistical representations while respecting Riemannian geometry of underlying SPD manifolds. In this paper we build a Riemannian network architecture to open up a new direction of SPD matrix non-linear learning in a deep model. In particular, we devise bilinear mapping layers to transform input SPD matrices to more desirable SPD matrices, exploit eigenvalue rectification layers to apply a non-linear activation function to the new SPD matrices, and design an eigenvalue logarithm layer to perform Riemannian computing on the resulting SPD matrices for regular output layers. For training the proposed deep network, we exploit a new backpropagation with a variant of stochastic gradient descent on Stiefel manifolds to update the structured connection weights and the involved SPD matrix data. We show through experiments that the proposed SPD matrix network can be simply trained and outperform existing SPD matrix learning and state-of-the-art methods in three typical visual classification tasks.

...read moreread less

223 citations

Posted Content•

[...]

Till Kroeger¹, Radu Timofte¹, Dengxin Dai¹, Luc Van Gool¹, Luc Van Gool² - Show less +1 more•Institutions (2)

Weakly Supervised Cascaded Convolutional Networks

11 Mar 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the Dense Inverse Search-based method (DIS) is proposed to find correspondences inspired by the inverse compositional image alignment proposed by Baker and Matthews in 2001.

...read moreread less

Abstract: Most recent works in optical flow extraction focus on the accuracy and neglect the time complexity. However, in real-life visual applications, such as tracking, activity detection and recognition, the time complexity is critical. We propose a solution with very low time complexity and competitive accuracy for the computation of dense optical flow. It consists of three parts: 1) inverse search for patch correspondences; 2) dense displacement field creation through patch aggregation along multiple scales; 3) variational refinement. At the core of our Dense Inverse Search-based method (DIS) is the efficient search of correspondences inspired by the inverse compositional image alignment proposed by Baker and Matthews in 2001. DIS is competitive on standard optical flow benchmarks with large displacements. DIS runs at 300Hz up to 600Hz on a single CPU core, reaching the temporal resolution of human's biological vision system. It is order(s) of magnitude faster than state-of-the-art methods in the same range of accuracy, making DIS ideal for visual applications.

...read moreread less

218 citations

Posted Content•

[...]

Ali Diba¹, Vivek Sharma², Ali Mohammad Pazandeh³, Hamed Pirsiavash⁴, Luc Van Gool¹ - Show less +1 more•Institutions (4)

Katholieke Universiteit Leuven¹, Karlsruhe Institute of Technology², Sharif University of Technology³, University of Maryland, Baltimore County⁴

24 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces two new architecture of cascaded networks, with either two cascade stages or three which are trained in an end-to-end pipeline to learn a convolutional neural network (CNN) under such conditions.

...read moreread less

Abstract: Object detection is a challenging task in visual understanding domain, and even more so if the supervision is to be weak. Recently, few efforts to handle the task without expensive human annotations is established by promising deep neural network. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) under such conditions. We introduce two such architectures, with either two cascade stages or three which are trained in an end-to-end pipeline. The first stage of both architectures extracts best candidate of class specific region proposals by training a fully convolutional network. In the case of the three stage architecture, the middle stage provides object segmentation, using the output of the activation maps of first stage. The final stage of both architectures is a part of a convolutional neural network that performs multiple instance learning on proposals extracted in the previous stage(s). Our experiments on the PASCAL VOC 2007, 2010, 2012 and large scale object datasets, ILSVRC 2013, 2014 datasets show improvements in the areas of weakly-supervised object detection, classification and localization.

...read moreread less

211 citations

Book Chapter•DOI•

Fast Optical Flow Using Dense Inverse Search

[...]

Till Kroeger¹, Radu Timofte¹, Dengxin Dai¹, Luc Van Gool², Luc Van Gool¹ - Show less +1 more•Institutions (2)

Sub-Markov Random Walk for Image Segmentation

08 Oct 2016

TL;DR: The Dense Inverse Search-based method (DIS) is the efficient search of correspondences inspired by the inverse compositional image alignment proposed by Baker and Matthews (2001, 2004), making DIS ideal for real-time applications.

...read moreread less

Abstract: Most recent works in optical flow extraction focus on the accuracy and neglect the time complexity. However, in real-life visual applications, such as tracking, activity detection and recognition, the time complexity is critical. We propose a solution with very low time complexity and competitive accuracy for the computation of dense optical flow. It consists of three parts: (1) inverse search for patch correspondences; (2) dense displacement field creation through patch aggregation along multiple scales; (3) variational refinement. At the core of our Dense Inverse Search-based method (DIS) is the efficient search of correspondences inspired by the inverse compositional image alignment proposed by Baker and Matthews (2001, 2004). DIS is competitive on standard optical flow benchmarks. DIS runs at 300 Hz up to 600 Hz on a single CPU core (1024 $\times $ 436 resolution. 42 Hz/46 Hz when including preprocessing: disk access, image re-scaling, gradient computation. More details in Sect. 3.1.), reaching the temporal resolution of human’s biological vision system. It is order(s) of magnitude faster than state-of-the-art methods in the same range of accuracy, making DIS ideal for real-time applications.

...read moreread less

210 citations

Journal Article•DOI•

[...]

Xingping Dong¹, Jianbing Shen¹, Ling Shao², Luc Van Gool³•Institutions (3)

Beijing Institute of Technology¹, Northumbria University², ETH Zurich³

01 Feb 2016-IEEE Transactions on Image Processing

TL;DR: The experimental results demonstrate that the proposed subRW method outperforms previous RW algorithms for seeded image segmentation, and designs a new subRW algorithm with label prior to solve the segmentation problem of objects with thin and elongated parts.

...read moreread less

Abstract: A novel sub-Markov random walk (subRW) algorithm with label prior is proposed for seeded image segmentation, which can be interpreted as a traditional random walker on a graph with added auxiliary nodes. Under this explanation, we unify the proposed subRW and other popular random walk (RW) algorithms. This unifying view will make it possible for transferring intrinsic findings between different RW algorithms, and offer new ideas for designing novel RW algorithms by adding or changing auxiliary nodes. To verify the second benefit, we design a new subRW algorithm with label prior to solve the segmentation problem of objects with thin and elongated parts. The experimental results on both synthetic and natural images with twigs demonstrate that the proposed subRW method outperforms previous RW algorithms for seeded image segmentation.

...read moreread less

Posted Content•

Deep Learning on Lie Groups for Skeleton-based Action Recognition

[...]

Zhiwu Huang¹, Chengde Wan¹, Thomas Probst¹, Luc Van Gool²•Institutions (2)

Incremental Learning of Random Forests for Large-Scale Image Classification

18 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Li et al. as mentioned in this paper incorporated the Lie group structure into a deep network architecture to learn more appropriate Lie group features for skeleton-based action recognition, and designed rotation mapping layers to transform the input Lie group feature into desirable ones, which are aligned better in the temporal domain.

...read moreread less

Abstract: In recent years, skeleton-based action recognition has become a popular 3D classification problem. State-of-the-art methods typically first represent each motion sequence as a high-dimensional trajectory on a Lie group with an additional dynamic time warping, and then shallowly learn favorable Lie group features. In this paper we incorporate the Lie group structure into a deep network architecture to learn more appropriate Lie group features for 3D action recognition. Within the network structure, we design rotation mapping layers to transform the input Lie group features into desirable ones, which are aligned better in the temporal domain. To reduce the high feature dimensionality, the architecture is equipped with rotation pooling layers for the elements on the Lie group. Furthermore, we propose a logarithm mapping layer to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification. Evaluations of the proposed network for standard 3D human action recognition datasets clearly demonstrate its superiority over existing shallow Lie group feature learning methods as well as most conventional deep learning methods.

...read moreread less

Journal Article•DOI•

[...]

Marko Ristin¹, Matthieu Guillaumin¹, Juergen Gall², Luc Van Gool¹•Institutions (2)

ETH Zurich¹, University of Bonn²

01 Mar 2016-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: It is shown that RFs initially trained with just 10 classes can be extended to 1,000 classes with an acceptable loss of accuracy compared to training from the full data and with great computational savings compared to retraining for each new batch of classes.

...read moreread less

Abstract: Large image datasets such as ImageNet or open-ended photo websites like Flickr are revealing new challenges to image classification that were not apparent in smaller, fixed sets. In particular, the efficient handling of dynamically growing datasets, where not only the amount of training data but also the number of classes increases over time, is a relatively unexplored problem. In this challenging setting, we study how two variants of Random Forests (RF) perform under four strategies to incorporate new classes while avoiding to retrain the RFs from scratch. The various strategies account for different trade-offs between classification accuracy and computational efficiency. In our extensive experiments, we show that both RF variants, one based on Nearest Class Mean classifiers and the other on SVMs, outperform conventional RFs and are well suited for incrementally learning new classes. In particular, we show that RFs initially trained with just 10 classes can be extended to 1,000 classes with an acceptable loss of accuracy compared to training from the full data and with great computational savings compared to retraining for each new batch of classes.

...read moreread less

Posted Content•

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016.

[...]

Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, Xiaoou Tang - Show less +6 more

02 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper uses the latest deep model architecture, e.g., ResNet and Inception V3, and introduces new aggregation schemes (top-k and attention-weighted pooling) and incorporates the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms.

...read moreread less

Abstract: This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.

...read moreread less

Proceedings Article•DOI•

Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition.

[...]

Naoya Takahashi¹, Michael Gygli², Beat Pfister², Luc Van Gool²•Institutions (2)

Sony Broadcast & Professional Research Laboratories¹, ETH Zurich²

08 Sep 2016

Book Chapter•DOI•

Deep Retinal Image Understanding

[...]

Kevis-Kokitsi Maninis¹, Jordi Pont-Tuset¹, Pablo Arbeláez², Luc Van Gool¹, Luc Van Gool³ - Show less +1 more•Institutions (3)

ETH Zurich¹, University of Los Andes², Katholieke Universiteit Leuven³

05 Sep 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Deep Retinal Image Understanding (DRIU) as mentioned in this paper uses a base network architecture on which two set of specialized layers are trained to solve both the retinal vessel and optic disc segmentation.

...read moreread less

Book Chapter•DOI•

Convolutional Oriented Boundaries

[...]

Kevis-Kokitsi Maninis¹, Jordi Pont-Tuset¹, Pablo Arbeláez², Luc Van Gool³, Luc Van Gool¹ - Show less +1 more•Institutions (3)

ETH Zurich¹, University of Los Andes², Katholieke Universiteit Leuven³

08 Oct 2016

TL;DR: Convolutional Oriented Boundaries is presented, which produces multiscale oriented contours and region hierarchies starting from generic image classification Convolutional Neural Networks and it gives a significant leap in performance over the state-of-the-art.

...read moreread less

Abstract: We present Convolutional Oriented Boundaries (COB), which produces multiscale oriented contours and region hierarchies starting from generic image classification Convolutional Neural Networks (CNNs). COB is computationally efficient, because it requires a single CNN forward pass for contour detection and it uses a novel sparse boundary representation for hierarchical segmentation; it gives a significant leap in performance over the state-of-the-art, and it generalizes very well to unseen categories and datasets. Particularly, we show that learning to estimate not only contour strength but also orientation provides more accurate results. We perform extensive experiments on BSDS, PASCAL Context, PASCAL Segmentation, and MS-COCO, showing that COB provides state-of-the-art contours, region hierarchies, and object proposals in all datasets.

...read moreread less

Book Chapter•DOI•

Convolutional Oriented Boundaries

[...]

Kevis-Kokitsi Maninis¹, Jordi Pont-Tuset¹, Pablo Arbeláez², Luc Van Gool¹, Luc Van Gool³ - Show less +1 more•Institutions (3)

ETH Zurich¹, University of Los Andes², Katholieke Universiteit Leuven³

09 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Convolutional Oriented Boundaries (COB) as mentioned in this paper produces multiscale oriented contours and region hierarchies starting from generic image classification Convolutional Neural Networks (CNNs).

...read moreread less

Posted Content•

Building Deep Networks on Grassmann Manifolds

[...]

Zhiwu Huang¹, Jiqing Wu², Luc Van Gool¹•Institutions (2)

Katholieke Universiteit Leuven¹, ETH Zurich²

17 Nov 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a deep network architecture by generalizing the Euclidean network paradigm to Grassmann manifolds and designs full rank mapping layers to transform input Grassmannian data to more desirable ones, and exploits re-orthonormalization layers to normalize the resulting matrices.

...read moreread less

Abstract: Learning representations on Grassmann manifolds is popular in quite a few visual recognition tasks. In order to enable deep learning on Grassmann manifolds, this paper proposes a deep network architecture by generalizing the Euclidean network paradigm to Grassmann manifolds. In particular, we design full rank mapping layers to transform input Grassmannian data to more desirable ones, exploit re-orthonormalization layers to normalize the resulting matrices, study projection pooling layers to reduce the model complexity in the Grassmannian context, and devise projection mapping layers to respect Grassmannian geometry and meanwhile achieve Euclidean forms for regular output layers. To train the Grassmann networks, we exploit a stochastic gradient descent setting on manifolds of the connection weights, and study a matrix generalization of backpropagation to update the structured data. The evaluations on three visual recognition tasks show that our Grassmann networks have clear advantages over existing Grassmann learning methods, and achieve results comparable with state-of-the-art approaches.

...read moreread less

Proceedings Article•DOI•

Two-Stream SR-CNNs for Action Recognition in Videos.

[...]

Yifan Wang, Jie Song¹, Limin Wang¹, Luc Van Gool¹, Otmar Hilliges¹ - Show less +1 more•Institutions (1)

ETH Zurich¹

01 Jan 2016

TL;DR: This paper proposes a new deep architecture by incorporating human/object detection results into the framework, called two-stream semantic region based CNNs (SR-CNNs), which not only shares great modeling capacity with the original two- stream CNNs, but also exhibits the flexibility of leveraging semantic cues for action understanding.

...read moreread less

Abstract: Human action is a high-level concept in computer vision research and understanding it may benefit from different semantics, such as human pose, interacting objects, and scene context. In this paper, we explicitly exploit semantic cues with aid of existing human/object detectors for action recognition in videos, and thoroughly study their effect on the recognition performance for different types of actions. Specifically, we propose a new deep architecture by incorporating human/object detection results into the framework, called two-stream semantic region based CNNs (SR-CNNs). Our proposed architecture not only shares great modeling capacity with the original two-stream CNNs, but also exhibits the flexibility of leveraging semantic cues (e.g. scene, person, object) for action understanding. We perform experiments on the UCF101 dataset and demonstrate its superior performance to the original two-stream CNNs. In addition, we systematically study the effect of incorporating semantic cues on the recognition performance for different types of action classes, and try to provide some insights for building more reasonable action benchmarks and developing better recognition algorithms.

...read moreread less

Posted Content•

Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection

[...]

Naoya Takahashi, Michael Gygli, Beat Pfister, Luc Van Gool

25 Apr 2016-arXiv: Sound

TL;DR: This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state of the art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.

...read moreread less

Abstract: We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables to train audio event detection end-to-end. Our architecture is inspired by the success of VGGNet and uses small, 3x3 convolutions, but more depth than previous methods in AED. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly outperforms state of the art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.

...read moreread less

Proceedings Article•DOI•

Actionness Estimation Using Hybrid Fully Convolutional Networks

[...]

Limin Wang¹, Yu Qiao, Xiaoou Tang², Luc Van Gool•Institutions (2)

Hand Pose Estimation from Local Surface Normals

01 Jun 2016

TL;DR: Wang et al. as discussed by the authors proposed a hybrid fully convolutional network (HFCN), which is composed of appearance FCN (A-FCN) and motion FCN, to estimate actionness maps from the perspectives of static appearance and dynamic motion.

...read moreread less

Abstract: Actionness [3] was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (HFCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.

...read moreread less

Book Chapter•DOI•

[...]

Chengde Wan¹, Angela Yao², Luc Van Gool¹, Luc Van Gool³•Institutions (3)

ETH Zurich¹, University of Bonn², Katholieke Universiteit Leuven³

08 Oct 2016

TL;DR: A hierarchical regression framework for estimating hand joint positions from single depth images based on local surface normals and a conditional regression forest, i.e. the Frame Conditioned Regression Forest (FCRF) which uses a new normal difference feature.

...read moreread less

Abstract: We present a hierarchical regression framework for estimating hand joint positions from single depth images based on local surface normals. The hierarchical regression follows the tree structured topology of hand from wrist to finger tips. We propose a conditional regression forest, i.e. the Frame Conditioned Regression Forest (FCRF) which uses a new normal difference feature. At each stage of the regression, the frame of reference is established from either the local surface normal or previously estimated hand joints. By making the regression with respect to the local frame, the pose estimation is more robust to rigid transformations. We also introduce a new efficient approximation to estimate surface normals. We verify the effectiveness of our method by conducting experiments on two challenging real-world datasets and show consistent improvements over previous discriminative pose estimation methods.

...read moreread less

Posted Content•

Actionness Estimation Using Hybrid Fully Convolutional Networks

[...]

Limin Wang¹, Yu Qiao, Xiaoou Tang², Luc Van Gool•Institutions (2)

Energy-efficient ConvNets through approximate computing

25 Apr 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: A new deep architecture for actionness estimation is presented, called hybrid fully convolutional network (HFCN), which is composed of appearance FCN (A-FCN) and motionFCN (M-FCNs), which leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion.

...read moreread less

Abstract: Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.

...read moreread less

Posted Content•

Dynamic Filter Networks

[...]

Bert De Brabandere, Xu Jia, Tinne Tuytelaars, Luc Van Gool

31 May 2016-arXiv: Learning

TL;DR: The Dynamic Filter Network is introduced, where filters are generated dynamically conditioned on an input, and it is shown that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters.

...read moreread less

Abstract: In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operations can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach state-of-the-art performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation.

...read moreread less

Proceedings Article•DOI•

[...]

Bert Moons¹, Bert De Brabandere¹, Luc Van Gool¹, Marian Verhelst¹•Institutions (1)

Efficient Two-Stream Motion and Appearance 3D CNNs for Video Classification

07 Mar 2016

TL;DR: In this article, the authors proposed methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators by combining techniques both at the system and circuit level.

...read moreread less

Abstract: Recently convolutional neural networks (ConvNets) have come up as state-of-the-art classification and detection algorithms, achieving near-human performance in visual detection. However, ConvNet algorithms are typically very computation and memory intensive. In order to be able to embed ConvNet-based classification into wearable platforms and embedded systems such as smartphones or ubiquitous electronics for the internet-of-things, their energy consumption should be reduced drastically. This paper proposes methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators. By combining techniques both at the system- and circuit level, we can gain energy in the systems arithmetic: up to 30 X without losing classification accuracy and more than 100 X at 99% classification accuracy, compared to the commonly used 16-bit fixed point number format.

...read moreread less

Posted Content•

[...]

Ali Diba¹, Ali Mohammad Pazandeh, Luc Van Gool•Institutions (1)

ATLAS: A Three-Layered Approach to Facade Parsing

31 Aug 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents some novel deep CNNs using 3D architecture to model actions and motion representation in an efficient way to be accurate and also as fast as real-time.

...read moreread less

Abstract: The video and action classification have extremely evolved by deep neural networks specially with two stream CNN using RGB and optical flow as inputs and they present outstanding performance in terms of video analysis. One of the shortcoming of these methods is handling motion information extraction which is done out side of the CNNs and relatively time consuming also on GPUs. So proposing end-to-end methods which are exploring to learn motion representation, like 3D-CNN can achieve faster and accurate performance. We present some novel deep CNNs using 3D architecture to model actions and motion representation in an efficient way to be accurate and also as fast as real-time. Our new networks learn distinctive models to combine deep motion features into appearance model via learning optical flow features inside the network.

...read moreread less

Journal Article•DOI•

[...]

Markus Mathias¹, Andelo Martinovic¹, Luc Van Gool¹•Institutions (1)

Energy-Efficient ConvNets Through Approximate Computing

01 May 2016-International Journal of Computer Vision

TL;DR: A novel approach for semantic segmentation of building facades that incorporates additional meta-knowledge in the form of weak architectural principles, which enforces architectural plausibility and consistency on the final reconstruction.

...read moreread less

Abstract: We propose a novel approach for semantic segmentation of building facades. Our system consists of three distinct layers, representing different levels of abstraction in facade images: segments, objects and architectural elements. In the first layer, the facade is segmented into regions, each of which is assigned a probability distribution over semantic classes. We evaluate different state-of-the-art segmentation and classification strategies to obtain the initial probabilistic semantic labeling. In the second layer, we investigate the performance of different object detectors and show the benefit of using such detectors to improve our initial labeling. The generic approaches of the first two layers are then specialized for the task of facade labeling in the third layer. There, we incorporate additional meta-knowledge in the form of weak architectural principles, which enforces architectural plausibility and consistency on the final reconstruction. Rigorous tests performed on two existing datasets of building facades demonstrate that we outperform the current state of the art, even when using outputs from lower layers of the pipeline. Finally, we demonstrate how the output of the highest layer can be used to create a procedural building reconstruction.

...read moreread less

Proceedings Article•DOI•

[...]

Bert Moons¹, Bert De Brabandere¹, Luc Van Gool¹, Marian Verhelst¹•Institutions (1)

Some Like It Hot — Visual Guidance for Preference Prediction

22 Mar 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: Methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators are proposed and can gain energy in the systems arithmetic: up to 30× without losing classification accuracy and more than 100× at 99% classification accuracy, compared to the commonly used 16-bit fixed point number format.

...read moreread less

Abstract: Recently ConvNets or convolutional neural networks (CNN) have come up as state-of-the-art classification and detection algorithms, achieving near-human performance in visual detection. However, ConvNet algorithms are typically very computation and memory intensive. In order to be able to embed ConvNet-based classification into wearable platforms and embedded systems such as smartphones or ubiquitous electronics for the internet-of-things, their energy consumption should be reduced drastically. This paper proposes methods based on approximate computing to reduce energy consumption in state-of-the-art ConvNet accelerators. By combining techniques both at the system- and circuit level, we can gain energy in the systems arithmetic: up to 30x without losing classification accuracy and more than 100x at 99% classification accuracy, compared to the commonly used 16-bit fixed point number format.

...read moreread less

Proceedings Article•DOI•

[...]

Rasmus Rothe¹, Radu Timofte¹, Luc Van Gool²•Institutions (2)