
Showing papers on "Feature extraction published in 2017"


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.
Abstract: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But pyramid representations have been avoided in recent object detectors that are based on deep convolutional networks, partially because they are slow to compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
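
As a rough illustration of the top-down pathway with lateral connections described above (not the authors' released code), the following PyTorch sketch merges backbone feature maps into a pyramid; the class name FPNTopDown and the channel widths are illustrative assumptions.

```python
# Minimal sketch of an FPN-style top-down pathway with lateral connections.
# Backbone stage outputs (C2..C5) are assumed given; channel widths are
# illustrative, not necessarily those used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each backbone stage to a common width.
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):            # feats: [C2, C3, C4, C5], fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the coarser map and add the lateral connection.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]

if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
    for p in FPNTopDown()(feats):
        print(p.shape)
```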

16,727 citations


Journal ArticleDOI
TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures, including FCN and DeconvNet.
Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low-resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower-resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well-known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scene and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/ .
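
A toy sketch of the pooling-indices upsampling idea, using a single conv/pool stage rather than the full VGG16 encoder; TinySegNet and its layer widths are illustrative assumptions, not the published architecture.

```python
# Toy sketch of SegNet-style upsampling: the decoder reuses the max-pooling
# indices recorded by the encoder (nn.MaxUnpool2d), so no upsampling weights
# are learned. Layer widths here are illustrative.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=12):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(inplace=True),
                                 nn.Conv2d(64, num_classes, 1))

    def forward(self, x):
        f = self.enc(x)
        pooled, indices = self.pool(f)          # remember where the maxima were
        up = self.unpool(pooled, indices, output_size=f.size())  # sparse map
        return self.dec(up)                     # densify with trainable convs

if __name__ == "__main__":
    print(TinySegNet()(torch.randn(1, 3, 64, 64)).shape)  # (1, 12, 64, 64)
```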

13,468 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper designs a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input and provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
Abstract: Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
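
A minimal sketch of the permutation-invariance idea (a shared per-point MLP followed by symmetric max pooling); the input and feature transform networks of the full PointNet are omitted, and TinyPointNet with its widths is an illustrative assumption.

```python
# Minimal sketch of the PointNet idea: a shared per-point MLP followed by a
# symmetric max pooling, which makes the global feature invariant to the
# ordering of the input points. Transform nets are omitted.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        self.mlp = nn.Sequential(          # applied identically to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 1024, 1), nn.ReLU(inplace=True))
        self.head = nn.Linear(1024, num_classes)

    def forward(self, pts):                # pts: (B, 3, N) xyz coordinates
        per_point = self.mlp(pts)          # (B, 1024, N)
        global_feat = per_point.max(dim=2).values  # symmetric, order-invariant
        return self.head(global_feat)

if __name__ == "__main__":
    model = TinyPointNet().eval()
    pts = torch.randn(2, 3, 1024)
    perm = torch.randperm(1024)
    # Permuting the points leaves the prediction unchanged.
    print(torch.allclose(model(pts), model(pts[:, :, perm]), atol=1e-5))
```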

9,457 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: A unified implementation of the Faster R-CNN, R-FCN and SSD systems is presented and the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures is traced out.
Abstract: The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [30], R-FCN [6] and SSD [25] systems, which we view as meta-architectures, and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real-time speeds and can be deployed on a mobile device. On the opposite end, where accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.

2,484 citations


Proceedings ArticleDOI
26 Apr 2017
TL;DR: In this paper, the angular softmax (A-Softmax) loss was proposed to learn angularly discriminative features for deep face recognition under the open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space.
Abstract: This paper addresses the deep face recognition (FR) problem under the open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. However, few existing algorithms can effectively achieve this criterion. To this end, we propose the angular softmax (A-Softmax) loss that enables convolutional neural networks (CNNs) to learn angularly discriminative features. Geometrically, A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that faces also lie on a manifold. Moreover, the size of the angular margin can be quantitatively adjusted by a parameter m. We further derive specific m to approximate the ideal feature criterion. Extensive analysis and experiments on Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge 1 show the superiority of A-Softmax loss in FR tasks.
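
A simplified illustration of the angular-margin idea, assuming the margin is applied directly as cos(m·θ) for the target class; the paper's piecewise ψ(θ) formulation and training annealing are omitted, and the function name and shapes below are assumptions.

```python
# Simplified illustration of the angular margin behind A-Softmax: class weights
# are L2-normalized so logits become ||x|| * cos(theta), and the target-class
# angle is multiplied by m. The paper's piecewise psi(theta) and annealing are
# omitted; this only sketches the margin.
import torch
import torch.nn.functional as F

def a_softmax_logits(x, weight, labels, m=4):
    # x: (B, D) features, weight: (C, D) class weights, labels: (B,)
    w = F.normalize(weight, dim=1)                         # unit-norm class directions
    cos_theta = F.linear(F.normalize(x, dim=1), w).clamp(-1.0, 1.0)
    x_norm = x.norm(dim=1, keepdim=True)
    logits = x_norm * cos_theta                            # standard bias-free logits
    theta = torch.acos(cos_theta.gather(1, labels[:, None]))
    margin_logit = x_norm * torch.cos(m * theta)           # enlarged angular margin
    return logits.scatter(1, labels[:, None], margin_logit)

if __name__ == "__main__":
    x = torch.randn(8, 128)
    w = torch.randn(10, 128)
    y = torch.randint(0, 10, (8,))
    loss = F.cross_entropy(a_softmax_logits(x, w, y), y)
    print(loss.item())
```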

2,272 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel neural network architecture that integrates feature extraction, sequence modeling, and transcription into a unified framework and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks.
Abstract: Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior art. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.

2,184 citations


Proceedings ArticleDOI
Konstantinos Bousmalis1, Nathan Silberman1, David Dohan1, Dumitru Erhan1, Dilip Krishnan1 
01 Jul 2017
TL;DR: In this paper, a generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain by learning in an unsupervised manner a transformation in the pixel space from one domain to another.
Abstract: Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that have tried to either map representations between the two domains, or learn to extract features that are domain-invariant. In this work, we approach the problem in a new light by learning in an unsupervised manner a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.

1,549 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper introduces a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN that significantly outperforms state-of-the-art visual attention-based image captioning methods.
Abstract: Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism — a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.
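
A hedged sketch of channel-wise attention followed by spatial attention on a convolutional feature map, conditioned on a decoder hidden state h, in the spirit of the description above; the single-layer attention MLPs, layer sizes, and the class name ChannelSpatialAttention are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of channel-wise then spatial attention over a conv feature map,
# conditioned on a decoder hidden state h. Sizes and MLP depths are illustrative.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels=512, hidden=512, k=256):
        super().__init__()
        self.ch_mlp = nn.Sequential(nn.Linear(channels + hidden, k), nn.Tanh(),
                                    nn.Linear(k, channels))
        self.sp_mlp = nn.Sequential(nn.Linear(channels + hidden, k), nn.Tanh(),
                                    nn.Linear(k, 1))

    def forward(self, feat, h):              # feat: (B, C, H, W), h: (B, hidden)
        B, C, H, W = feat.shape
        # Channel-wise attention from spatially pooled channel descriptors.
        pooled = feat.mean(dim=(2, 3))                                 # (B, C)
        beta = torch.softmax(self.ch_mlp(torch.cat([pooled, h], 1)), dim=1)
        feat = feat * beta.view(B, C, 1, 1)
        # Spatial attention over the channel-reweighted map.
        flat = feat.view(B, C, H * W).transpose(1, 2)                  # (B, HW, C)
        h_rep = h.unsqueeze(1).expand(-1, H * W, -1)
        alpha = torch.softmax(self.sp_mlp(torch.cat([flat, h_rep], 2)).squeeze(2), dim=1)
        return (flat * alpha.unsqueeze(2)).sum(dim=1)                  # (B, C) attended feature

if __name__ == "__main__":
    att = ChannelSpatialAttention()
    print(att(torch.randn(2, 512, 7, 7), torch.randn(2, 512)).shape)
```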

1,527 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: The utility of the OctNet representation is demonstrated by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.
Abstract: We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks which are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows us to focus memory allocation and computation on the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.

1,280 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A novel method for semantic image inpainting, which generates the missing content by conditioning on the available data, and successfully predicts information in large missing regions and achieves pixel-level photorealism, significantly outperforming the state-of-the-art methods.
Abstract: Semantic image inpainting is a challenging task where large missing regions have to be filled based on the available visual data. Existing methods which extract information from only a single image generally produce unsatisfactory results due to the lack of high level context. In this paper, we propose a novel method for semantic image inpainting, which generates the missing content by conditioning on the available data. Given a trained generative model, we search for the closest encoding of the corrupted image in the latent image manifold using our context and prior losses. This encoding is then passed through the generative model to infer the missing content. In our method, inference is possible irrespective of how the missing content is structured, while the state-of-the-art learning based method requires specific information about the holes in the training phase. Experiments on three datasets show that our method successfully predicts information in large missing regions and achieves pixel-level photorealism, significantly outperforming the state-of-the-art methods.
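
A hedged sketch of the latent-space search: a latent code z is optimized so the generator output matches the known pixels (context loss) while a discriminator-based prior keeps the result on the learned image manifold. The generator G and discriminator D below are dummy stand-ins rather than pretrained models, and the non-saturating prior term and all names are assumptions, not the paper's exact losses.

```python
# Sketch of semantic inpainting via search in a GAN's latent space: minimize a
# masked context loss plus a discriminator-based prior loss w.r.t. z only.
import torch
import torch.nn.functional as F

def inpaint(G, D, corrupted, mask, steps=200, lam=0.1, lr=0.05):
    z = torch.randn(corrupted.shape[0], G.z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = G(z)
        context = ((img - corrupted).abs() * mask).mean()   # match known pixels only
        prior = F.softplus(-D(img)).mean()                   # realism penalty from D
        loss = context + lam * prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        img = G(z)
    return mask * corrupted + (1 - mask) * img               # paste generated content into the hole

if __name__ == "__main__":
    class DummyG(torch.nn.Module):
        z_dim = 64
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(64, 3 * 32 * 32)
        def forward(self, z):
            return torch.tanh(self.fc(z)).view(-1, 3, 32, 32)

    class DummyD(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(3 * 32 * 32, 1)
        def forward(self, x):
            return self.fc(x.flatten(1))

    corrupted = torch.rand(1, 3, 32, 32)
    mask = torch.ones_like(corrupted)
    mask[:, :, 8:24, 8:24] = 0        # a square hole of unknown pixels
    print(inpaint(DummyG(), DummyD(), corrupted, mask, steps=20).shape)
```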

1,258 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: Soft-NMS as mentioned in this paper decays the detection scores of all other objects as a continuous function of their overlap with M. As per the design of the algorithm, if an object lies within the predefined overlap threshold, it leads to a miss.
Abstract: Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes on the basis of their scores. The detection box M with the maximum score is selected and all other detection boxes with a significant overlap (using a pre-defined threshold) with M are suppressed. This process is recursively applied on the remaining boxes. As per the design of the algorithm, if an object lies within the predefined overlap threshold, it leads to a miss. To this end, we propose Soft-NMS, an algorithm which decays the detection scores of all other objects as a continuous function of their overlap with M. Hence, no object is eliminated in this process. Soft-NMS obtains consistent improvements for the coco-style mAP metric on standard datasets like PASCAL VOC2007 (1.7% for both R-FCN and Faster-RCNN) and MS-COCO (1.3% for R-FCN and 1.1% for Faster-RCNN) by just changing the NMS algorithm without any additional hyper-parameters. Using Deformable-RFCN, Soft-NMS improves state-of-the-art in object detection from 39.8% to 40.9% with a single model. Further, the computational complexity of Soft-NMS is the same as traditional NMS and hence it can be efficiently implemented. Since Soft-NMS does not require any extra training and is simple to implement, it can be easily integrated into any object detection pipeline. Code for Soft-NMS is publicly available on GitHub http://bit.ly/2nJLNMu.
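
A compact NumPy sketch of the idea with the Gaussian penalty: boxes that overlap the current top-scoring box have their scores decayed as a continuous function of IoU instead of being removed outright. Function names and the sigma value are illustrative assumptions, not the released implementation.

```python
# Soft-NMS sketch with Gaussian score decay: score *= exp(-iou^2 / sigma).
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    while scores.size > 0:
        i = scores.argmax()
        keep.append((boxes[i], scores[i]))
        ious = iou(boxes[i], np.delete(boxes, i, axis=0))
        scores = np.delete(scores, i) * np.exp(-(ious ** 2) / sigma)  # Gaussian decay
        boxes = np.delete(boxes, i, axis=0)
        alive = scores > score_thresh
        boxes, scores = boxes[alive], scores[alive]
    return keep

if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]])
    scores = np.array([0.9, 0.8, 0.7])
    for b, s in soft_nms(boxes, scores):
        print(b, round(s, 3))
```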

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A novel deep learning architecture for regressing disparity from a rectified pair of stereo images is proposed, leveraging knowledge of the problem’s geometry to form a cost volume using deep feature representations and incorporating contextual information using 3-D convolutions over this volume.
Abstract: We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem’s geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets and on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches.
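
A minimal sketch of the differentiable soft argmin: a softmax over (negated) matching costs along the disparity axis followed by the expectation of disparity values, which yields sub-pixel, fully differentiable estimates. Tensor shapes here are assumptions.

```python
# Soft argmin over a cost volume: expected disparity under softmax(-cost).
import torch

def soft_argmin(cost_volume):
    # cost_volume: (B, D, H, W), lower cost = better match
    prob = torch.softmax(-cost_volume, dim=1)                 # (B, D, H, W)
    disp_values = torch.arange(cost_volume.shape[1], dtype=prob.dtype,
                               device=prob.device).view(1, -1, 1, 1)
    return (prob * disp_values).sum(dim=1)                    # (B, H, W), sub-pixel

if __name__ == "__main__":
    cost = torch.randn(1, 64, 32, 32)
    disparity = soft_argmin(cost)
    print(disparity.shape, disparity.min().item(), disparity.max().item())
```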

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters in the spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions that construct temporal connections on adjacent feature maps in time.
Abstract: Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial to utilize a CNN for learning spatio-temporal video representations. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters in the spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each of them at different placements in the ResNet, following the philosophy that enhancing structural diversity while going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on the Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performance over several state-of-the-art techniques.
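
A sketch of one pseudo-3D bottleneck variant (a serial 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution inside a residual block); P3DBlock and its channel widths are illustrative assumptions rather than the paper's exact configuration.

```python
# One pseudo-3D residual bottleneck: spatial (1x3x3) then temporal (3x1x1) conv.
import torch
import torch.nn as nn

class P3DBlock(nn.Module):
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid, kernel_size=1)
        self.spatial = nn.Conv3d(mid, mid, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid, mid, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.expand = nn.Conv3d(mid, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                    # x: (B, C, T, H, W)
        out = self.relu(self.reduce(x))
        out = self.relu(self.spatial(out))   # 2D-like filtering within each frame
        out = self.relu(self.temporal(out))  # connects adjacent frames in time
        return self.relu(x + self.expand(out))

if __name__ == "__main__":
    print(P3DBlock()(torch.randn(1, 256, 8, 28, 28)).shape)
```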

Journal ArticleDOI
TL;DR: This work combines the autoencoder, deconvolution network, and shortcut connections into the residual encoder–decoder convolutional neural network (RED-CNN) for low-dose CT imaging and achieves competitive performance relative to state-of-the-art methods in both simulated and clinical cases.
Abstract: Given the potential risk of X-ray radiation to the patient, low-dose CT has attracted considerable interest in the medical imaging field. Currently, the mainstream low-dose CT methods include vendor-specific sinogram domain filtration and iterative reconstruction algorithms, but they need to access raw data, whose formats are not transparent to most users. Due to the difficulty of modeling the statistical characteristics in the image domain, the existing methods for directly processing reconstructed images cannot eliminate image noise very well while keeping structural details. Inspired by the idea of deep learning, here we combine the autoencoder, deconvolution network, and shortcut connections into the residual encoder–decoder convolutional neural network (RED-CNN) for low-dose CT imaging. After patch-based training, the proposed RED-CNN achieves competitive performance relative to state-of-the-art methods in both simulated and clinical cases. In particular, our method has been favorably evaluated in terms of noise suppression, structural preservation, and lesion detection.
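
A toy residual encoder-decoder denoiser in the spirit of the description above: stacked convolutions, mirrored transposed convolutions, and shortcuts added back before the output. The depth, widths, and the class name TinyREDCNN are illustrative assumptions, far shallower than the published network.

```python
# Toy residual encoder-decoder for image (patch) denoising with shortcuts.
import torch
import torch.nn as nn

class TinyREDCNN(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Conv2d(1, ch, 5, padding=2)
        self.enc2 = nn.Conv2d(ch, ch, 5, padding=2)
        self.dec2 = nn.ConvTranspose2d(ch, ch, 5, padding=2)
        self.dec1 = nn.ConvTranspose2d(ch, 1, 5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                    # x: noisy (low-dose) image patch
        e1 = self.relu(self.enc1(x))
        e2 = self.relu(self.enc2(e1))
        d2 = self.relu(self.dec2(e2) + e1)   # shortcut from the encoder
        return self.relu(self.dec1(d2) + x)  # residual connection to the input

if __name__ == "__main__":
    print(TinyREDCNN()(torch.rand(4, 1, 55, 55)).shape)  # patch-based training
```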

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, a regional multi-person pose estimation (RMPE) framework is proposed to facilitate pose estimation in the presence of inaccurate human bounding boxes, which achieves state-of-the-art performance on the MPII dataset.
Abstract: Multi-person pose estimation in the wild is challenging. Although state-of-the-art human detectors have demonstrated good performance, small errors in localization and recognition are inevitable. These errors can cause failures for a single-person pose estimator (SPPE), especially for methods that solely depend on human detection results. In this paper, we propose a novel regional multi-person pose estimation (RMPE) framework to facilitate pose estimation in the presence of inaccurate human bounding boxes. Our framework consists of three components: Symmetric Spatial Transformer Network (SSTN), Parametric Pose Non-Maximum-Suppression (NMS), and Pose-Guided Proposals Generator (PGPG). Our method is able to handle inaccurate bounding boxes and redundant detections, allowing it to achieve 76.7 mAP on the MPII (multi-person) dataset [3]. Our model and source code are made publicly available.

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper proposes a recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way.
Abstract: Recognizing fine-grained categories (e.g., bird species) is difficult due to the challenges of discriminative region localization and fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that region detection and fine-grained feature learning are mutually correlated and thus can reinforce each other. In this paper, we propose a novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way. The learning at each scale consists of a classification sub-network and an attention proposal sub-network (APN). The APN starts from full images, and iteratively generates region attention from coarse to fine by taking the previous prediction as a reference, while the finer scale network takes as input an amplified attended region from the previous scale in a recurrent way. The proposed RA-CNN is optimized by an intra-scale classification loss and an inter-scale ranking loss, to mutually learn accurate region attention and fine-grained representation. RA-CNN does not need bounding box/part annotations and can be trained end-to-end. We conduct comprehensive experiments and show that RA-CNN achieves the best performance in three fine-grained tasks, with relative accuracy gains of 3.3%, 3.7%, and 3.8% on CUB Birds, Stanford Dogs and Stanford Cars, respectively.
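
A hedged sketch of the inter-scale pairwise ranking loss described above: it penalizes the network unless the finer scale is more confident on the ground-truth class than the coarser scale by a margin. The function name and margin value are assumptions.

```python
# Pairwise inter-scale ranking loss: the finer scale should beat the coarser
# scale's probability on the true class by at least `margin`.
import torch
import torch.nn.functional as F

def interscale_ranking_loss(logits_coarse, logits_fine, labels, margin=0.05):
    p_coarse = F.softmax(logits_coarse, dim=1).gather(1, labels[:, None]).squeeze(1)
    p_fine = F.softmax(logits_fine, dim=1).gather(1, labels[:, None]).squeeze(1)
    # Hinge: zero loss once the finer scale is more confident by the margin.
    return torch.clamp(p_coarse - p_fine + margin, min=0).mean()

if __name__ == "__main__":
    logits_c, logits_f = torch.randn(8, 200), torch.randn(8, 200)
    y = torch.randint(0, 200, (8,))
    print(interscale_ranking_loss(logits_c, logits_f, y).item())
```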

Posted Content
Yin Zhou1, Oncel Tuzel1
TL;DR: VoxelNet is proposed, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network and learns an effective discriminative representation of objects with various geometries, leading to encouraging results in3D detection of pedestrians and cyclists.
Abstract: Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
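
A sketch of a single voxel feature encoding (VFE) layer applied to the points of one voxel: a shared fully connected layer per point, element-wise max pooling into a voxel-level aggregate, and concatenation of that aggregate back onto each point-wise feature. Feature sizes and the point-to-voxel grouping step (omitted here) are assumptions.

```python
# One VFE layer on the points of a single voxel.
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    def __init__(self, in_dim=7, out_dim=32):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, out_dim // 2),
                                nn.BatchNorm1d(out_dim // 2), nn.ReLU(inplace=True))

    def forward(self, points):                # points: (N, in_dim) inside one voxel
        pointwise = self.fc(points)           # (N, out_dim // 2) shared per-point FC
        aggregated = pointwise.max(dim=0, keepdim=True).values  # voxel-level context
        return torch.cat([pointwise, aggregated.expand_as(pointwise)], dim=1)

if __name__ == "__main__":
    pts = torch.randn(35, 7)      # e.g. x, y, z, reflectance plus offsets to the centroid
    print(VFELayer()(pts).shape)  # (35, 32)
```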

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This study proposes a novel Convolutional Neural Network, called Spindle Net, based on human body region guided multi-stage feature decomposition and tree-structured competitive feature fusion, which is the first time human body structure information is considered in a CNN framework to facilitate feature learning.
Abstract: Person re-identification (ReID) is an important task in video surveillance and has various applications. It is non-trivial due to complex background clutter, varying illumination conditions, and uncontrollable camera settings. Moreover, the person body misalignment caused by detectors or pose variations is sometimes too severe for feature matching across images. In this study, we propose a novel Convolutional Neural Network (CNN), called Spindle Net, based on human body region guided multi-stage feature decomposition and tree-structured competitive feature fusion. This is the first time human body structure information has been considered in a CNN framework to facilitate feature learning. The proposed Spindle Net brings unique advantages: 1) it separately captures semantic features from different body regions, so the macro- and micro-body features can be well aligned across images; 2) the learned features from different semantic regions are merged with a competitive scheme, and discriminative features can be well preserved. State-of-the-art performance can be achieved on multiple datasets by large margins. We further demonstrate the robustness and effectiveness of the proposed Spindle Net on our proposed dataset SenseReID without fine-tuning.

Journal ArticleDOI
10 Apr 2017-Sensors
TL;DR: This paper proposes a convolutional neural network (CNN)-based method that learns traffic as images and predicts large-scale, network-wide traffic speed with high accuracy.
Abstract: This paper proposes a convolutional neural network (CNN)-based method that learns traffic as images and predicts large-scale, network-wide traffic speed with high accuracy. Spatiotemporal traffic dynamics are converted to images describing the time and space relations of traffic flow via a two-dimensional time-space matrix. A CNN is applied to the image following two consecutive steps: abstract traffic feature extraction and network-wide traffic speed prediction. The effectiveness of the proposed method is evaluated by taking two real-world transportation networks, the second ring road and the north-east transportation network in Beijing, as examples, and comparing the method with four prevailing algorithms, namely, ordinary least squares, k-nearest neighbors, artificial neural network, and random forest, and three deep learning architectures, namely, stacked autoencoder, recurrent neural network, and long short-term memory network. The results show that the proposed method outperforms other algorithms by an average accuracy improvement of 42.91% within an acceptable execution time. The CNN can train the model in a reasonable time and, thus, is suitable for large-scale transportation networks.
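
An illustrative sketch of the "traffic as images" idea: speeds observed at S road segments over T time steps form a 2-D time-space matrix that a small CNN maps to next-step network-wide speeds. The architecture below is a stand-in with assumed layer sizes, not the paper's configuration.

```python
# Small CNN regressor over a time-space speed matrix.
import torch
import torch.nn as nn

class TrafficSpeedCNN(nn.Module):
    def __init__(self, segments=100, horizon=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)))
        self.regressor = nn.Linear(32 * 4 * 4, segments * horizon)

    def forward(self, x):                   # x: (B, 1, T, S) time-space speed matrix
        return self.regressor(self.features(x).flatten(1))

if __name__ == "__main__":
    matrix = torch.rand(8, 1, 40, 100)      # 40 time steps x 100 road segments
    print(TrafficSpeedCNN()(matrix).shape)  # (8, 100) predicted speeds
```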

Proceedings ArticleDOI
01 Jul 2017
TL;DR: A class of temporal models that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection, which are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks.
Abstract: The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
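
A minimal sketch of a dilated temporal convolution stack for frame-wise action labeling, in the spirit of the Dilated TCN variant: 1-D convolutions over time with exponentially growing dilation so the receptive field covers long-range context. Widths, depth, and the class name are illustrative assumptions.

```python
# Dilated temporal convolutions over per-frame features, frame-wise output.
import torch
import torch.nn as nn

class DilatedTCN(nn.Module):
    def __init__(self, in_dim=128, hidden=64, num_classes=10, levels=4):
        super().__init__()
        layers, ch = [], in_dim
        for i in range(levels):
            d = 2 ** i                      # dilation 1, 2, 4, 8 ...
            layers += [nn.Conv1d(ch, hidden, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
            ch = hidden
        self.tcn = nn.Sequential(*layers)
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):                   # x: (B, in_dim, T) per-frame feature sequence
        return self.classifier(self.tcn(x)) # (B, num_classes, T) frame-wise scores

if __name__ == "__main__":
    feats = torch.randn(2, 128, 200)   # 200 video frames
    print(DilatedTCN()(feats).shape)   # (2, 10, 200)
```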

Journal ArticleDOI
TL;DR: This study corroborates that very deep CNNs with effective training mechanisms can be employed to solve complicated medical image analysis tasks, even with limited training data.
Abstract: Automated melanoma recognition in dermoscopy images is a very challenging task due to the low contrast of skin lesions, the huge intraclass variation of melanomas, the high degree of visual similarity between melanoma and non-melanoma lesions, and the existence of many artifacts in the image. In order to meet these challenges, we propose a novel method for melanoma recognition by leveraging very deep convolutional neural networks (CNNs). Compared with existing methods employing either low-level hand-crafted features or CNNs with shallower architectures, our substantially deeper networks (more than 50 layers) can acquire richer and more discriminative features for more accurate recognition. To take full advantage of very deep networks, we propose a set of schemes to ensure effective training and learning under limited training data. First, we apply residual learning to cope with the degradation and overfitting problems when a network goes deeper. This technique can ensure that our networks benefit from the performance gains achieved by increasing network depth. Then, we construct a fully convolutional residual network (FCRN) for accurate skin lesion segmentation, and further enhance its capability by incorporating a multi-scale contextual information integration scheme. Finally, we seamlessly integrate the proposed FCRN (for segmentation) and other very deep residual networks (for classification) to form a two-stage framework. This framework enables the classification network to extract more representative and specific features based on segmented results instead of the whole dermoscopy images, further alleviating the insufficiency of training data. The proposed framework is extensively evaluated on the ISBI 2016 Skin Lesion Analysis Towards Melanoma Detection Challenge dataset. Experimental results demonstrate the significant performance gains of the proposed framework, ranking first in classification and second in segmentation among 25 and 28 teams, respectively. This study corroborates that very deep CNNs with effective training mechanisms can be employed to solve complicated medical image analysis tasks, even with limited training data.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper proposes a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture, which takes full advantage of multi-level and multi-scale features extracted from FCNs, providing more advanced representations at each layer, a property that is critically needed to perform segment detection.
Abstract: Recent progress on saliency detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs). Semantic segmentation and saliency detection algorithms developed lately have been mostly based on Fully Convolutional Neural Networks (FCNs). There is still large room for improvement over the generic FCN models that do not explicitly deal with the scale-space problem. The Holistically-Nested Edge Detector (HED) provides a skip-layer structure with deep supervision for edge and boundary detection, but the performance gain of HED on saliency detection is not obvious. In this paper, we propose a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture. Our framework provides rich multi-scale feature maps at each layer, a property that is critically needed to perform segment detection. Our method produces state-of-the-art results on 5 widely tested salient object detection benchmarks, with advantages in terms of efficiency (0.08 seconds per image), effectiveness, and simplicity over the existing algorithms.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work proposes a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network.
Abstract: Recent advances in deep learning have shown exciting promise in filling large holes in natural images with semantically plausible and context aware details, impacting fundamental image manipulation tasks such as object removal. While these learning-based methods are significantly more effective in capturing high-level features than prior techniques, they can only handle very low-resolution inputs due to memory limitations and difficulty in training. Even for slightly larger images, the inpainted regions would appear blurry and unpleasant boundaries become visible. We propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. We evaluate our method on the ImageNet and Paris Streetview datasets and achieved state-of-the-art inpainting accuracy. We show our approach produces sharper and more coherent results than prior methods, especially for high-resolution images.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: Amulet is presented, a generic aggregating multi-level convolutional feature framework for salient object detection that provides accurate salient object labeling and performs favorably against state-of-the-art approaches on nearly all compared evaluation metrics.
Abstract: Fully convolutional neural networks (FCNs) have shown outstanding performance in many dense labeling problems. One key pillar of these successes is mining relevant information from features in convolutional layers. However, how to better aggregate multi-level convolutional feature maps for salient object detection is underexplored. In this work, we present Amulet, a generic aggregating multi-level convolutional feature framework for salient object detection. Our framework first integrates multi-level feature maps into multiple resolutions, which simultaneously incorporate coarse semantics and fine details. Then it adaptively learns to combine these feature maps at each resolution and predict saliency maps with the combined features. Finally, the predicted results are efficiently fused to generate the final saliency map. In addition, to achieve accurate boundary inference and semantic enhancement, edge-aware feature maps in low-level layers and the predicted results of low resolution features are recursively embedded into the learning framework. By aggregating multi-level convolutional features in this efficient and flexible manner, the proposed saliency model provides accurate salient object labeling. Comprehensive experiments demonstrate that our method performs favorably against state-of-the-art approaches on nearly all compared evaluation metrics.

Journal ArticleDOI
21 Jul 2017
TL;DR: RCF, as proposed in this paper, encapsulates all convolutional features into a more discriminative representation, which makes good use of rich feature hierarchies and is amenable to training via backpropagation.
Abstract: Edge detection is a fundamental problem in computer vision. Recently, convolutional neural networks (CNNs) have pushed forward this field significantly. Existing methods which adopt specific layers of deep CNNs may fail to capture complex data structures caused by variations of scales and aspect ratios. In this paper, we propose an accurate edge detector using richer convolutional features (RCF). RCF encapsulates all convolutional features into a more discriminative representation, which makes good use of rich feature hierarchies, and is amenable to training via backpropagation. RCF fully exploits multiscale and multilevel information of objects to perform the image-to-image prediction holistically. Using the VGG16 network, we achieve state-of-the-art performance on several available datasets. When evaluating on the well-known BSDS500 benchmark, we achieve an ODS F-measure of 0.811 while retaining a fast speed (8 FPS). Besides, our fast version of RCF achieves an ODS F-measure of 0.806 with 30 FPS. We also demonstrate the versatility of the proposed method by applying RCF edges to classical image segmentation.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This work presents a low-shot learning benchmark on complex images that mimics challenges faced by recognition systems in the wild, and proposes representation regularization techniques and techniques to hallucinate additional training examples for data-starved classes.
Abstract: Low-shot visual learning, the ability to recognize novel object categories from very few examples, is a hallmark of human visual intelligence. Existing machine learning approaches fail to generalize in the same way. To make progress on this foundational problem, we present a low-shot learning benchmark on complex images that mimics challenges faced by recognition systems in the wild. We then propose (1) representation regularization techniques, and (2) techniques to hallucinate additional training examples for data-starved classes. Together, our methods improve the effectiveness of convolutional networks in low-shot learning, improving the one-shot accuracy on novel classes by 2.3× on the challenging ImageNet dataset.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a novel part learning approach by a multi-attention convolutional neural network (MA-CNN), where part generation and feature learning can reinforce each other, and shows the best performances on three challenging published fine-grained datasets.
Abstract: Recognizing fine-grained categories (e.g., bird species) highly relies on discriminative part localization and part-based fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that part localization (e.g., head of a bird) and fine-grained feature learning (e.g., head shape) are mutually correlated. In this paper, we propose a novel part learning approach by a multi-attention convolutional neural network (MA-CNN), where part generation and feature learning can reinforce each other. MA-CNN consists of convolution, channel grouping and part classification sub-networks. The channel grouping network takes as input feature channels from convolutional layers, and generates multiple parts by clustering, weighting and pooling from spatially-correlated channels. The part classification network further classifies an image by each individual part, through which more discriminative fine-grained features can be learned. Two losses are proposed to guide the multi-task learning of channel grouping and part classification, which encourages MA-CNN to generate more discriminative parts from feature channels and learn better fine-grained features from parts in a mutual reinforced way. MA-CNN does not need bounding box/part annotation and can be trained end-to-end. We incorporate the learned parts from MA-CNN with part-CNN for recognition, and show the best performances on three challenging published fine-grained datasets, e.g., CUB-Birds, FGVC-Aircraft and Stanford-Cars.

Journal ArticleDOI
TL;DR: A semidirect VO that uses direct methods to track and triangulate pixels that are characterized by high image gradients, but relies on proven feature-based methods for joint optimization of structure and motion is proposed.
Abstract: Direct methods for visual odometry (VO) have gained popularity for their capability to exploit information from all intensity gradients in the image. However, low computational speed as well as missing guarantees for optimality and consistency are limiting factors of direct methods, in which established feature-based methods succeed instead. Based on these considerations, we propose a semidirect VO (SVO) that uses direct methods to track and triangulate pixels that are characterized by high image gradients, but relies on proven feature-based methods for joint optimization of structure and motion. Together with a robust probabilistic depth estimation algorithm, this enables us to efficiently track pixels lying on weak corners and edges in environments with little or high-frequency texture. We further demonstrate that the algorithm can easily be extended to multiple cameras, to track edges, to include motion priors, and to enable the use of very large field of view cameras, such as fisheye and catadioptric ones. Experimental evaluation on benchmark datasets shows that the algorithm is significantly faster than the state of the art while achieving highly competitive accuracy.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: 3DMatch is presented, a data-driven model that learns a local volumetric patch descriptor for establishing correspondences between partial 3D data that consistently outperforms other state-of-the-art approaches by a significant margin.
Abstract: Matching local geometric features on real-world depth images is a challenging task due to the noisy, low-resolution, and incomplete nature of 3D scan data. These difficulties limit the performance of current state-of-art methods, which are typically based on histograms over geometric properties. In this paper, we present 3DMatch, a data-driven model that learns a local volumetric patch descriptor for establishing correspondences between partial 3D data. To amass training data for our model, we propose a self-supervised feature learning method that leverages the millions of correspondence labels found in existing RGB-D reconstructions. Experiments show that our descriptor is not only able to match local geometry in new scenes for reconstruction, but also generalize to different tasks and spatial scales (e.g. instance-level object model alignment for the Amazon Picking Challenge, and mesh surface correspondence). Results show that 3DMatch consistently outperforms other state-of-the-art approaches by a significant margin. Code, data, benchmarks, and pre-trained models are available online at http://3dmatch.cs.princeton.edu.