Showing papers by "Luc Van Gool published in 2017"

PDF

Open Access

Proceedings Article•DOI•

NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results

[...]

Radu Timofte¹, Eirikur Agustsson¹, Luc Van Gool¹, Ming-Hsuan Yang², Lei Zhang³, Bee Oh Lim⁴, Sanghyun Son⁴, Heewon Kim⁴, Seungjun Nah⁴, Kyoung Mu Lee⁴, Xintao Wang⁵, Yapeng Tian⁶, Ke Yu⁵, Yulun Zhang⁶, Shixiang Wu⁶, Chao Dong, Liang Lin, Yu Qiao⁶, Chen Change Loy⁵, Woong Bae⁷, Jaejun Yoo⁷, Yoseob Han⁷, Jong Chul Ye⁷, Jae-Seok Choi⁷, Munchurl Kim⁷, Yuchen Fan⁸, Jiahui Yu⁸, Wei Han⁸, Ding Liu⁸, Haichao Yu⁸, Zhangyang Wang⁸, Honghui Shi⁸, Xinchao Wang⁸, Thomas S. Huang⁸, Yunjin Chen, Kai Zhang⁹, Wangmeng Zuo⁹, Zhimin Tang¹⁰, Linkai Luo¹⁰, Shaohui Li¹⁰, Min Fu¹⁰, Lei Cao¹⁰, Wen Heng¹¹, Giang Bui¹², Truc Le¹², Ye Duan¹², Dacheng Tao¹³, Ruxin Wang, Xu Lin, Jianxin Pang, Xu Jinchang¹⁴, Yu Zhao¹⁴, Xiangyu Xu², Jinshan Pan², Deqing Sun², Yujin Zhang², Xibin Song¹⁵, Yuchao Dai¹⁶, Xueying Qin¹⁵, Xuan-Phung Huynh¹⁷, Tiantong Guo¹⁸, Hojjat Seyed Mousavi¹⁸, Tiep H. Vu¹⁸, Vishal Monga¹⁸, Cristóvão Cruz¹⁹, Karen Egiazarian¹⁹, Vladimir Katkovnik¹⁹, Rakesh Mehta¹⁹, Arnav Kumar Jain²⁰, Abhinav Agarwalla²⁰, Ch V. Sai Praveen²⁰, Ruofan Zhou²¹, Hongdiao Wen²², Che Zhu²², Zhiqiang Xia²², Zhengtao Wang²², Qi Guo²² - Show less +73 more•Institutions (22)

21 Jul 2017

TL;DR: This paper reviews the first challenge on single image super-resolution (restoration of rich details in an low resolution image) with focus on proposed solutions and results and gauges the state-of-the-art in single imagesuper-resolution.

...read moreread less

Abstract: This paper reviews the first challenge on single image super-resolution (restoration of rich details in an low resolution image) with focus on proposed solutions and results. A new DIVerse 2K resolution image dataset (DIV2K) was employed. The challenge had 6 competitions divided into 2 tracks with 3 magnification factors each. Track 1 employed the standard bicubic downscaling setup, while Track 2 had unknown downscaling operators (blur kernel and decimation) but learnable through low and high res train images. Each competition had ∽100 registered participants and 20 teams competed in the final testing phase. They gauge the state-of-the-art in single image super-resolution.

...read moreread less

1,243 citations

Posted Content•

The 2017 DAVIS Challenge on Video Object Segmentation

[...]

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, Luc Van Gool - Show less +2 more

03 Apr 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: The scope of the benchmark, the main characteristics of the dataset, the evaluation metrics of the competition, and a detailed analysis of the results of the participants to the challenge are described.

...read moreread less

Abstract: We present the 2017 DAVIS Challenge on Video Object Segmentation, a public dataset, benchmark, and competition specifically designed for the task of video object segmentation. Following the footsteps of other successful initiatives, such as ILSVRC and PASCAL VOC, which established the avenue of research in the fields of scene classification and semantic segmentation, the DAVIS Challenge comprises a dataset, an evaluation methodology, and a public competition with a dedicated workshop co-located with CVPR 2017. The DAVIS Challenge follows up on the recent publication of DAVIS (Densely-Annotated VIdeo Segmentation), which has fostered the development of several novel state-of-the-art video object segmentation techniques. In this paper we describe the scope of the benchmark, highlight the main characteristics of the dataset, define the evaluation metrics of the competition, and present a detailed analysis of the results of the participants to the challenge.

...read moreread less

652 citations

Proceedings Article•

Pose Guided Person Image Generation

[...]

Liqian Ma¹, Xu Jia¹, Qianru Sun², Bernt Schiele², Tinne Tuytelaars¹, Luc Van Gool³ - Show less +2 more•Institutions (3)

Katholieke Universiteit Leuven¹, Max Planck Society², ETH Zurich³

25 May 2017

TL;DR: Zhang et al. as discussed by the authors proposed a pose guided person generation network (PG$^2$) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose.

...read moreread less

Abstract: This paper proposes the novel Pose Guided Person Generation Network (PG$^2$) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose. Our generation framework PG$^2$ utilizes the pose information explicitly and consists of two key stages: pose integration and image refinement. In the first stage the condition image and the target pose are fed into a U-Net-like network to generate an initial but coarse image of the person with the target pose. The second stage then refines the initial and blurry result by training a U-Net-like generator in an adversarial way. Extensive experimental results on both 128$\times$64 re-identification images and 256$\times$256 fashion photos show that our model generates high-quality person images with convincing details.

...read moreread less

554 citations

Proceedings Article•DOI•

UntrimmedNets for Weakly Supervised Action Recognition and Detection

[...]

Limin Wang¹, Yuanjun Xiong², Dahua Lin², Luc Van Gool¹•Institutions (2)

ETH Zurich¹, The Chinese University of Hong Kong²

01 Jul 2017

TL;DR: This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances.

...read moreread less

Abstract: Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of those strongly supervised approaches on these two datasets.

...read moreread less

464 citations

Posted Content•

Semantic Instance Segmentation with a Discriminative Loss Function

[...]

Bert De Brabandere, Davy Neven, Luc Van Gool

08 Aug 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes an approach of combining an off-the-shelf network with a principled loss function inspired by a metric learning objective that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step.

...read moreread less

Abstract: Semantic instance segmentation remains a challenging task. In this work we propose to tackle the problem with a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin. Our approach of combining an off-the-shelf network with a principled loss function inspired by a metric learning objective is conceptually simple and distinct from recent efforts in instance segmentation. In contrast to previous works, our method does not rely on object proposals or recurrent mechanisms. A key contribution of our work is to demonstrate that such a simple setup without bells and whistles is effective and can perform on par with more complex methods. Moreover, we show that it does not suffer from some of the limitations of the popular detect-and-segment approaches. We achieve competitive performance on the Cityscapes and CVPPP leaf segmentation benchmarks.

...read moreread less

449 citations

Proceedings Article•DOI•

DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks

[...]

Andrey Ignatov¹, Nikolay Kobyshev¹, Radu Timofte¹, Kenneth Vanhoey¹, Luc Van Gool¹ - Show less +1 more•Institutions (1)

ETH Zurich¹

01 Oct 2017

TL;DR: An end-to-end deep learning approach that bridges the gap by translating ordinary photos into DSLR-quality images by learning the translation function using a residual convolutional neural network that improves both color rendition and image sharpness.

...read moreread less

Abstract: Despite a rapid rise in the quality of built-in smartphone cameras, their physical limitations – small sensor size, compact lenses and the lack of specific hardware, – impede them to achieve the quality results of DSLR cameras. In this work we present an end-to-end deep learning approach that bridges this gap by translating ordinary photos into DSLR-quality images. We propose learning the translation function using a residual convolutional neural network that improves both color rendition and image sharpness. Since the standard mean squared loss is not well suited for measuring perceptual image quality, we introduce a composite perceptual error function that combines content, color and texture losses. The first two losses are defined analytically, while the texture loss is learned in an adversarial fashion. We also present DPED, a large-scale dataset that consists of real photos captured from three different phones and one high-end reflex camera. Our quantitative and qualitative assessments reveal that the enhanced image quality is comparable to that of DSLR-taken photos, while the methodology is generalized to any type of digital camera.

...read moreread less

423 citations

Journal Article•DOI•

Semantic Foggy Scene Understanding with Synthetic Data

[...]

Christos Sakaridis¹, Dengxin Dai¹, Luc Van Gool¹, Luc Van Gool²•Institutions (2)

ETH Zurich¹, Katholieke Universiteit Leuven²

25 Aug 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a semi-supervised learning strategy was proposed for semantic foggy scene understanding, which combines supervised learning with an unsupervised supervision transfer from clear-weather images to their synthetic foggy counterparts.

...read moreread less

Abstract: This work addresses the problem of semantic foggy scene understanding (SFSU). Although extensive research has been performed on image dehazing and on semantic scene understanding with clear-weather images, little attention has been paid to SFSU. Due to the difficulty of collecting and annotating foggy images, we choose to generate synthetic fog on real images that depict clear-weather outdoor scenes, and then leverage these partially synthetic data for SFSU by employing state-of-the-art convolutional neural networks (CNN). In particular, a complete pipeline to add synthetic fog to real, clear-weather images using incomplete depth information is developed. We apply our fog synthesis on the Cityscapes dataset and generate Foggy Cityscapes with 20550 images. SFSU is tackled in two ways: 1) with typical supervised learning, and 2) with a novel type of semi-supervised learning, which combines 1) with an unsupervised supervision transfer from clear-weather images to their synthetic foggy counterparts. In addition, we carefully study the usefulness of image dehazing for SFSU. For evaluation, we present Foggy Driving, a dataset with 101 real-world images depicting foggy driving scenes, which come with ground truth annotations for semantic segmentation and object detection. Extensive experiments show that 1) supervised learning with our synthetic data significantly improves the performance of state-of-the-art CNN for SFSU on Foggy Driving; 2) our semi-supervised learning strategy further improves performance; and 3) image dehazing marginally advances SFSU with our learning strategy. The datasets, models and code are made publicly available.

...read moreread less

416 citations

Posted Content•

WebVision Database: Visual Learning and Understanding from Web Data

[...]

Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, Luc Van Gool - Show less +1 more

09 Aug 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A new database called WebVision is built, which contains more than $2.4$ million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset, which means the dataset can be used as the largest benchmark dataset for visual domain adaptation.

...read moreread less

Abstract: In this paper, we present a study on learning visual recognition models from large scale noisy web data. We build a new database called WebVision, which contains more than $2.4$ million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information along with those web images (e.g., title, description, tags, etc.) are also crawled. A validation set and test set containing human annotated images are also provided to facilitate algorithmic development. Based on our new database, we obtain a few interesting observations: 1) the noisy web images are sufficient for training a good deep CNN model for visual recognition; 2) the model learnt from our WebVision database exhibits comparable or even better generalization ability than the one trained from the ILSVRC 2012 dataset when being transferred to new datasets and tasks; 3) a domain adaptation issue (a.k.a., dataset bias) is observed, which means the dataset can be used as the largest benchmark dataset for visual domain adaptation. Our new WebVision database and relevant studies in this work would benefit the advance of learning state-of-the-art visual models with minimum supervision based on web data.

...read moreread less

304 citations

Posted Content•

Disentangled Person Image Generation

[...]

Liqian Ma¹, Qianru Sun², Stamatios Georgoulis¹, Luc Van Gool³, Bernt Schiele², Mario Fritz² - Show less +2 more•Institutions (3)

Katholieke Universiteit Leuven¹, Max Planck Society², ETH Zurich³

07 Dec 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a multi-branched reconstruction network is proposed to disentangle and encode the three image factors into embedding features, which are then combined to re-compose the input image itself.

...read moreread less

Abstract: Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.

...read moreread less

292 citations

Posted Content•

Pose Guided Person Image Generation

[...]

Liqian Ma¹, Xu Jia¹, Qianru Sun², Bernt Schiele², Tinne Tuytelaars¹, Luc Van Gool³ - Show less +2 more•Institutions (3)

Katholieke Universiteit Leuven¹, Max Planck Society², ETH Zurich³

25 May 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, a pose guided person generation network (PG$^2$) is proposed to synthesize person images in arbitrary poses, based on an image of that person and a novel pose.

...read moreread less

Abstract: This paper proposes the novel Pose Guided Person Generation Network (PG$^2$) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose Our generation framework PG$^2$ utilizes the pose information explicitly and consists of two key stages: pose integration and image refinement In the first stage the condition image and the target pose are fed into a U-Net-like network to generate an initial but coarse image of the person with the target pose The second stage then refines the initial and blurry result by training a U-Net-like generator in an adversarial way Extensive experimental results on both 128$\times$64 re-identification images and 256$\times$256 fashion photos show that our model generates high-quality person images with convincing details

...read moreread less

285 citations

Proceedings Article•DOI•

Deep Temporal Linear Encoding Networks

[...]

Ali Diba¹, Vivek Sharma, Luc Van Gool²•Institutions (2)

Katholieke Universiteit Leuven¹, ETH Zurich²

01 Jul 2017

TL;DR: Temporal linear encoding (TLE) as discussed by the authors is proposed to encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space, which is applicable to all kinds of networks like 2D and 3D CNNs.

...read moreread less

Abstract: The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space, (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification, and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets: HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.

...read moreread less

Proceedings Article•DOI•

Deep Learning on Lie Groups for Skeleton-Based Action Recognition

[...]

Zhiwu Huang¹, Chengde Wan¹, Thomas Probst¹, Luc Van Gool²•Institutions (2)

ETH Zurich¹, Katholieke Universiteit Leuven²

01 Jul 2017

TL;DR: The Lie group structure is incorporated into a deep network architecture to learn more appropriate Lie group features for 3D action recognition and a logarithm mapping layer is proposed to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification.

...read moreread less

Abstract: In recent years, skeleton-based action recognition has become a popular 3D classification problem. State-of-the-art methods typically first represent each motion sequence as a high-dimensional trajectory on a Lie group with an additional dynamic time warping, and then shallowly learn favorable Lie group features. In this paper we incorporate the Lie group structure into a deep network architecture to learn more appropriate Lie group features for 3D action recognition. Within the network structure, we design rotation mapping layers to transform the input Lie group features into desirable ones, which are aligned better in the temporal domain. To reduce the high feature dimensionality, the architecture is equipped with rotation pooling layers for the elements on the Lie group. Furthermore, we propose a logarithm mapping layer to map the resulting manifold data into a tangent space that facilitates the application of regular output layers for the final classification. Evaluations of the proposed network for standard 3D human action recognition datasets clearly demonstrate its superiority over existing shallow Lie group feature learning methods as well as most conventional deep learning methods.

...read moreread less

Proceedings Article•DOI•

Weakly Supervised Cascaded Convolutional Networks

[...]

Ali Diba¹, Vivek Sharma², Ali Mohammad Pazandeh³, Hamed Pirsiavash⁴, Luc Van Gool¹ - Show less +1 more•Institutions (4)

Katholieke Universiteit Leuven¹, Karlsruhe Institute of Technology², Sharif University of Technology³, University of Maryland, Baltimore County⁴

01 Jul 2017

TL;DR: In this article, a new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) under such conditions, with either two cascade stages or three which are trained in an end-to-end pipeline.

...read moreread less

Abstract: Object detection is a challenging task in visual understanding domain, and even more so if the supervision is to be weak. Recently, few efforts to handle the task without expensive human annotations is established by promising deep neural network. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) under such conditions. We introduce two such architectures, with either two cascade stages or three which are trained in an end-to-end pipeline. The first stage of both architectures extracts best candidate of class specific region proposals by training a fully convolutional network. In the case of the three stage architecture, the middle stage provides object segmentation, using the output of the activation maps of first stage. The final stage of both architectures is a part of a convolutional neural network that performs multiple instance learning on proposals extracted in the previous stage(s). Our experiments on the PASCAL VOC 2007, 2010, 2012 and large scale object datasets, ILSVRC 2013, 2014 datasets show improvements in the areas of weakly-supervised object detection, classification and localization.

...read moreread less

Posted Content•

UntrimmedNets for Weakly Supervised Action Recognition and Detection

[...]

Limin Wang¹, Yuanjun Xiong², Dahua Lin², Luc Van Gool¹•Institutions (2)

ETH Zurich¹, The Chinese University of Hong Kong²

09 Mar 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a weakly supervised architecture, called UntrimmedNet, is proposed to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances.

...read moreread less

Proceedings Article•

Soft-to-hard vector quantization for end-to-end learning compressible representations

[...]

Eirikur Agustsson¹, Fabian Mentzer¹, Michael Tschannen¹, Lukas Cavigelli¹, Radu Timofte¹, Luca Benini², Luc Van Gool³ - Show less +3 more•Institutions (3)

ETH Zurich¹, University of Bologna², Katholieke Universiteit Leuven³

03 Apr 2017

TL;DR: In this article, a soft relaxation of quantization and entropy is proposed to learn compressible representations in deep architectures with an end-to-end training strategy, which achieves state-of-the-art performance for image compression and neural network compression.

...read moreread less

Abstract: We present a new approach to learn compressible representations in deep architectures with an end-to-end training strategy. Our method is based on a soft (continuous) relaxation of quantization and entropy, which we anneal to their discrete counterparts throughout training. We showcase this method for two challenging applications: Image compression and neural network compression. While these tasks have typically been approached with different methods, our soft-to-hard quantization approach gives results competitive with the state-of-the-art for both.

...read moreread less

Posted Content•

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification.

[...]

Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, Luc Van Gool - Show less +3 more

22 Nov 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: By finetuning this network, the proposed video convolutional network T3D outperforms the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, and finetuned on the target datasets, e.g. HMDB51/UCF101.

...read moreread less

Abstract: The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network `Temporal 3D ConvNet'~(T3D) and its new temporal layer `Temporal Transition Layer'~(TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D codes will be released

...read moreread less

Posted Content•

Appearance-and-Relation Networks for Video Classification

[...]

Limin Wang, Wei Li, Wen Li, Luc Van Gool

24 Nov 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as discussed by the authors proposed an Appearance and Relation Network (ARTNet) to simultaneously model appearance and relation from RGB input in a separate and explicit manner, which decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.

...read moreread less

Abstract: Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a new architecture, termed as Appearance-and-Relation Network (ARTNet), to learn video representation in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called as SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented based on the linear combination of pixels or filter responses in each frame, while the relation branch is designed based on the multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART blocks obtain an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets achieve superior performance on these three datasets to the existing state-of-the-art methods.

...read moreread less

Posted Content•

DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks

[...]

Andrey Ignatov¹, Nikolay Kobyshev¹, Radu Timofte¹, Kenneth Vanhoey¹, Luc Van Gool¹ - Show less +1 more•Institutions (1)

ETH Zurich¹

08 Apr 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a residual convolutional neural network was proposed to translate ordinary photos into DSLR-quality images by combining content, color, and texture losses, where the first two losses are defined analytically, while the texture loss is learned in an adversarial fashion.

...read moreread less

Abstract: Despite a rapid rise in the quality of built-in smartphone cameras, their physical limitations - small sensor size, compact lenses and the lack of specific hardware, - impede them to achieve the quality results of DSLR cameras. In this work we present an end-to-end deep learning approach that bridges this gap by translating ordinary photos into DSLR-quality images. We propose learning the translation function using a residual convolutional neural network that improves both color rendition and image sharpness. Since the standard mean squared loss is not well suited for measuring perceptual image quality, we introduce a composite perceptual error function that combines content, color and texture losses. The first two losses are defined analytically, while the texture loss is learned in an adversarial fashion. We also present DPED, a large-scale dataset that consists of real photos captured from three different phones and one high-end reflex camera. Our quantitative and qualitative assessments reveal that the enhanced image quality is comparable to that of DSLR-taken photos, while the methodology is generalized to any type of digital camera.

...read moreread less

Proceedings Article•DOI•

Semantic Instance Segmentation for Autonomous Driving

[...]

Bert De Brabandere¹, Davy Neven¹, Luc Van Gool¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Jul 2017

TL;DR: This work proposes a discriminative loss function, operating at pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step.

...read moreread less

Abstract: Semantic instance segmentation remains a challenge. We propose to tackle the problem with a discriminative loss function, operating at pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. Our approach of combining an offthe- shelf network with a principled loss function inspired by a metric learning objective is conceptually simple and distinct from recent efforts in instance segmentation and is well-suited for real-time applications. In contrast to previous works, our method does not rely on object proposals or recurrent mechanisms and is particularly well suited for tasks with complex occlusions. A key contribution of our work is to demonstrate that such a simple setup without bells and whistles is effective and can perform on-par with more complex methods. We achieve competitive performance on the Cityscapes segmentation benchmark.

...read moreread less

Posted Content•

Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations

[...]

Eirikur Agustsson¹, Fabian Mentzer¹, Michael Tschannen¹, Lukas Cavigelli¹, Radu Timofte¹, Luca Benini², Luc Van Gool³ - Show less +3 more•Institutions (3)

ETH Zurich¹, University of Bologna², Katholieke Universiteit Leuven³

03 Apr 2017-arXiv: Learning

TL;DR: This work presents a new approach to learn compressible representations in deep architectures with an end-to-end training strategy based on a soft (continuous) relaxation of quantization and entropy, which is anneal to their discrete counterparts throughout training.

...read moreread less

Posted Content•

Deep Extreme Cut: From Extreme Points to Object Segmentation

[...]

Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, Luc Van Gool

24 Nov 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors use extreme points in an object (leftmost, right-most, top, bottom pixels) as input to obtain precise object segmentation for images and videos by adding an extra channel to the image in the input of a convolutional neural network.

...read moreread less

Abstract: This paper explores the use of extreme points in an object (left-most, right-most, top, bottom pixels) as input to obtain precise object segmentation for images and videos. We do so by adding an extra channel to the image in the input of a convolutional neural network (CNN), which contains a Gaussian centered in each of the extreme points. The CNN learns to transform this information into a segmentation of an object that matches those extreme points. We demonstrate the usefulness of this approach for guided segmentation (grabcut-style), interactive segmentation, video object segmentation, and dense segmentation annotation. We show that we obtain the most precise results to date, also with less user input, in an extensive and varied selection of benchmarks and datasets. All our models and code are publicly available on this http URL.

...read moreread less

Proceedings Article•

A Riemannian Network for SPD Matrix Learning

[...]

Zhiwu Huang¹, Luc Van Gool²•Institutions (2)

ETH Zurich¹, Katholieke Universiteit Leuven²

01 Jul 2017

TL;DR: A Riemannian network architecture is built to open up a new direction of SPD matrix non-linear learning in a deep model and it is shown that the proposed SPD matrix network can be simply trained and outperform existing SPD matrix learning and state-of-the-art methods in three typical visual classification tasks.

...read moreread less

Abstract: Symmetric Positive Definite (SPD) matrix learning methods have become popular in many image and video processing tasks, thanks to their ability to learn appropriate statistical representations while respecting Riemannian geometry of underlying SPD manifolds. In this paper we build a Riemannian network architecture to open up a new direction of SPD matrix non-linear learning in a deep model. In particular, we devise bilinear mapping layers to transform input SPD matrices to more desirable SPD matrices, exploit eigenvalue rectification layers to apply a non-linear activation function to the new SPD matrices, and design an eigenvalue logarithm layer to perform Riemannian computing on the resulting SPD matrices for regular output layers. For training the proposed deep network, we exploit a new backpropagation with a variant of stochastic gradient descent on Stiefel manifolds to update the structured connection weights and the involved SPD matrix data. We show through experiments that the proposed SPD matrix network can be simply trained and outperform existing SPD matrix learning and state-of-the-art methods in three typical visual classification tasks.

...read moreread less

Proceedings Article•DOI•

Crossing Nets: Combining GANs and VAEs with a Shared Latent Space for Hand Pose Estimation

[...]

Chengde Wan¹, Thomas Probst¹, Luc Van Gool¹, Angela Yao²•Institutions (2)

ETH Zurich¹, University of Bonn²

09 Oct 2017

TL;DR: In this article, the authors propose a semi-supervised generative model for 3D hand pose estimation from depth images, where the generator is updated with the back-propagated gradient from the discriminator to synthesize realistic depth maps.

...read moreread less

Abstract: State-of-the-art methods for 3D hand pose estimation from depth images require large amounts of annotated training data. We propose modelling the statistical relationship of 3D hand poses and corresponding depth images using two deep generative models with a shared latent space. By design, our architecture allows for learning from unlabeled image data in a semi-supervised manner. Assuming a one-to-one mapping between a pose and a depth map, any given point in the shared latent space can be projected into both a hand pose or into a corresponding depth map. Regressing the hand pose can then be done by learning a discriminator to estimate the posterior of the latent pose given some depth map. To prevent over-fitting and to better exploit unlabeled depth maps, the generator and discriminator are trained jointly. At each iteration, the generator is updated with the back-propagated gradient from the discriminator to synthesize realistic depth maps of the articulated hand, while the discriminator benefits from an augmented training set of synthesized samples and unlabeled depth maps. The proposed discriminator network architecture is highly efficient and runs at 90fps on the CPU with accuracies comparable or better than state-of-art on 3 publicly available benchmarks.

...read moreread less

Proceedings Article•DOI•

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

[...]

Jie Song¹, Limin Wang¹, Luc Van Gool¹, Otmar Hilliges¹•Institutions (1)

ETH Zurich¹

21 Jul 2017

TL;DR: In this paper, a deep structured model is proposed to estimate a sequence of human poses in unconstrained videos, which can be efficiently trained in an end-to-end manner and is capable of representing the appearance of body joints and their spatio-temporal relationships simultaneously.

...read moreread less

Abstract: Deep ConvNets have been shown to be effective for the task of human pose estimation from single images. However, several challenging issues arise in the video-based case such as self-occlusion, motion blur, and uncommon poses with few or no examples in the training data. Temporal information can provide additional cues about the location of body joints and help to alleviate these issues. In this paper, we propose a deep structured model to estimate a sequence of human poses in unconstrained videos. This model can be efficiently trained in an end-to-end manner and is capable of representing the appearance of body joints and their spatio-temporal relationships simultaneously. Domain knowledge about the human body is explicitly incorporated into the network providing effective priors to regularize the skeletal structure and to enforce temporal consistency. The proposed end-to-end architecture is evaluated on two widely used benchmarks for video-based pose estimation (Penn Action and JHMDB datasets). Our approach outperforms several state-of-the-art methods.

...read moreread less

Posted Content•

Convolutional Oriented Boundaries: From Image Segmentation to High-Level Tasks

[...]

Kevis-Kokitsi Maninis¹, Jordi Pont-Tuset¹, Pablo Arbeláez², Luc Van Gool¹•Institutions (2)

ETH Zurich¹, University of Los Andes²

17 Jan 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: Convolutional Oriented Boundaries (COB) as mentioned in this paper produces multiscale oriented contours and region hierarchies starting from generic image classification Convolutional Neural Networks (CNNs).

...read moreread less

Abstract: We present Convolutional Oriented Boundaries (COB), which produces multiscale oriented contours and region hierarchies starting from generic image classification Convolutional Neural Networks (CNNs). COB is computationally efficient, because it requires a single CNN forward pass for multi-scale contour detection and it uses a novel sparse boundary representation for hierarchical segmentation; it gives a significant leap in performance over the state-of-the-art, and it generalizes very well to unseen categories and datasets. Particularly, we show that learning to estimate not only contour strength but also orientation provides more accurate results. We perform extensive experiments for low-level applications on BSDS, PASCAL Context, PASCAL Segmentation, and NYUD to evaluate boundary detection performance, showing that COB provides state-of-the-art contours and region hierarchies in all datasets. We also evaluate COB on high-level tasks when coupled with multiple pipelines for object proposals, semantic contours, semantic segmentation, and object detection on MS-COCO, SBD, and PASCAL; showing that COB also improves the results for all tasks.

...read moreread less

Posted Content•

Natural and Effective Obfuscation by Head Inpainting

[...]

Qianru Sun¹, Liqian Ma², Seong Joon Oh¹, Luc Van Gool², Bernt Schiele¹, Mario Fritz¹ - Show less +2 more•Institutions (2)

Max Planck Society¹, Katholieke Universiteit Leuven²

24 Nov 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a novel head inpainting obfuscation technique that generates realistic person images, while achieving superior obfuscation performance against automatic person recognizers.

...read moreread less

Abstract: As more and more personal photos are shared online, being able to obfuscate identities in such photos is becoming a necessity for privacy protection. People have largely resorted to blacking out or blurring head regions, but they result in poor user experience while being surprisingly ineffective against state of the art person recognizers. In this work, we propose a novel head inpainting obfuscation technique. Generating a realistic head inpainting in social media photos is challenging because subjects appear in diverse activities and head orientations. We thus split the task into two sub-tasks: (1) facial landmark generation from image context (e.g. body pose) for seamless hypothesis of sensible head pose, and (2) facial landmark conditioned head inpainting. We verify that our inpainting method generates realistic person images, while achieving superior obfuscation performance against automatic person recognizers.

...read moreread less

Posted Content•

ComboGAN: Unrestrained Scalability for Image Domain Translation

[...]

Asha Anoosheh, Eirikur Agustsson, Radu Timofte¹, Luc Van Gool²•Institutions (2)

ETH Zurich¹, Katholieke Universiteit Leuven²

19 Dec 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a multi-component image translation model and training scheme which scales linearly - both in resource consumption and time required - with the number of domains and demonstrates its capabilities on a dataset of paintings by 14 different artists.

...read moreread less

Abstract: This year alone has seen unprecedented leaps in the area of learning-based image translation, namely CycleGAN, by Zhu et al. But experiments so far have been tailored to merely two domains at a time, and scaling them to more would require an quadratic number of models to be trained. And with two-domain models taking days to train on current hardware, the number of domains quickly becomes limited by the time and resources required to process them. In this paper, we propose a multi-component image translation model and training scheme which scales linearly - both in resource consumption and time required - with the number of domains. We demonstrate its capabilities on a dataset of paintings by 14 different artists and on images of the four different seasons in the Alps. Note that 14 data groups would need (14 choose 2) = 91 different CycleGAN models: a total of 182 generator/discriminator pairs; whereas our model requires only 14 generator/discriminator pairs.

...read moreread less

Proceedings Article•DOI•

Query-adaptive Video Summarization via Quality-aware Relevance Estimation

[...]

Arun Balajee Vasudevan¹, Michael Gygli¹, Anna Volokitin¹, Luc Van Gool¹•Institutions (1)

ETH Zurich¹

19 Oct 2017

TL;DR: This work poses query-relevant summarization as a video frame subset selection problem, which lets it optimise for summaries which are simultaneously diverse, representative of the entire video, and relevant to a text query.

...read moreread less

Abstract: Although the problem of automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimise for summaries which are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms them on relevance prediction. Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. On this dataset, we train and test our complete model for video summarization and show that it outperforms standard baselines such as Maximal Marginal Relevance.

...read moreread less

Proceedings Article•DOI•

Learned Multi-patch Similarity

[...]

Wilfried Hartmann¹, Silvano Galliani¹, Michal Havlena, Luc Van Gool¹, Konrad Schindler¹ - Show less +1 more•Institutions (1)

ETH Zurich¹

01 Oct 2017

TL;DR: In this article, the authors propose to learn a matching function which directly maps multiple image patches to a scalar similarity score, which has advantages over methods based on pairwise patch similarity.

...read moreread less

Abstract: Estimating a depth map from multiple views of a scene is a fundamental task in computer vision. As soon as more than two viewpoints are available, one faces the very basic question how to measure similarity across >2 image patches. Surprisingly, no direct solution exists, instead it is common to fall back to more or less robust averaging of two-view similarities. Encouraged by the success of machine learning, and in particular convolutional neural networks, we propose to learn a matching function which directly maps multiple image patches to a scalar similarity score. Experiments on several multi-view datasets demonstrate that this approach has advantages over methods based on pairwise patch similarity.

...read moreread less

Posted Content•

WESPE: Weakly Supervised Photo Enhancer for Digital Cameras

[...]

Andrey Ignatov¹, Nikolay Kobyshev¹, Radu Timofte¹, Kenneth Vanhoey¹, Luc Van Gool¹ - Show less +1 more•Institutions (1)

ETH Zurich¹

04 Sep 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces a weakly supervised photo enhancer (WESPE) - a novel image-to-image Generative Adversarial Network-based architecture that produces comparable or improved qualitative results with state-of-the-art strongly supervised methods.

...read moreread less

Abstract: Low-end and compact mobile cameras demonstrate limited photo quality mainly due to space, hardware and budget constraints. In this work, we propose a deep learning solution that translates photos taken by cameras with limited capabilities into DSLR-quality photos automatically. We tackle this problem by introducing a weakly supervised photo enhancer (WESPE) - a novel image-to-image Generative Adversarial Network-based architecture. The proposed model is trained by under weak supervision: unlike previous works, there is no need for strong supervision in the form of a large annotated dataset of aligned original/enhanced photo pairs. The sole requirement is two distinct datasets: one from the source camera, and one composed of arbitrary high-quality images that can be generally crawled from the Internet - the visual content they exhibit may be unrelated. Hence, our solution is repeatable for any camera: collecting the data and training can be achieved in a couple of hours. In this work, we emphasize on extensive evaluation of obtained results. Besides standard objective metrics and subjective user study, we train a virtual rater in the form of a separate CNN that mimics human raters on Flickr data and use this network to get reference scores for both original and enhanced photos. Our experiments on the DPED, KITTI and Cityscapes datasets as well as pictures from several generations of smartphones demonstrate that WESPE produces comparable or improved qualitative results with state-of-the-art strongly supervised methods.

...read moreread less