Showing papers on "Feature (computer vision) published in 2018"

PDF

Open Access

Posted Content•

CBAM: Convolutional Block Attention Module

[...]

Sanghyun Woo¹, Jongchan Park, Joon-Young Lee², In So Kweon¹•Institutions (2)

17 Jul 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: The proposed Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks, can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs.

...read moreread less

Abstract: We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS~COCO detection, and VOC~2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

...read moreread less

5,757 citations

Book Chapter•DOI•

CBAM: Convolutional Block Attention Module

[...]

Sanghyun Woo¹, Jongchan Park, Joon-Young Lee², In So Kweon¹•Institutions (2)

KAIST¹, Adobe Systems²

08 Sep 2018

TL;DR: Convolutional Block Attention Module (CBAM) as discussed by the authors is a simple yet effective attention module for feed-forward convolutional neural networks, given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement.

...read moreread less

Abstract: We propose Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architectures seamlessly with negligible overheads and is end-to-end trainable along with base CNNs. We validate our CBAM through extensive experiments on ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show consistent improvements in classification and detection performances with various models, demonstrating the wide applicability of CBAM. The code and models will be publicly available.

...read moreread less

5,335 citations

Proceedings Article•DOI•

Learning Transferable Architectures for Scalable Image Recognition

[...]

Barret Zoph¹, Vijay K. Vasudevan¹, Jonathon Shlens¹, Quoc V. Le¹•Institutions (1)

Google¹

18 Jun 2018

TL;DR: NASNet as discussed by the authors proposes to search for an architectural building block on a small dataset and then transfer the block to a larger dataset, which enables transferability and achieves state-of-the-art performance.

...read moreread less

Abstract: Developing neural network image classification models often requires significant architecture engineering. In this paper, we study a method to learn the model architectures directly on the dataset of interest. As this approach is expensive when the dataset is large, we propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset. The key contribution of this work is the design of a new search space (which we call the "NASNet search space") which enables transferability. In our experiments, we search for the best convolutional layer (or "cell") on the CIFAR-10 dataset and then apply this cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters to design a convolutional architecture, which we name a "NASNet architecture". We also introduce a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models. On CIFAR-10 itself, a NASNet found by our method achieves 2.4% error rate, which is state-of-the-art. Although the cell is not searched for directly on ImageNet, a NASNet constructed from the best cell achieves, among the published works, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5 on ImageNet. Our model is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS - a reduction of 28% in computational demand from the previous state-of-the-art model. When evaluated at different levels of computational cost, accuracies of NASNets exceed those of the state-of-the-art human-designed models. For instance, a small version of NASNet also achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. Finally, the image features learned from image classification are generically useful and can be transferred to other computer vision problems. On the task of object detection, the learned features by NASNet used with the Faster-RCNN framework surpass state-of-the-art by 4.0% achieving 43.1% mAP on the COCO dataset.

...read moreread less

4,384 citations

Proceedings Article•DOI•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

[...]

Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang² - Show less +3 more•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

18 Jun 2018

TL;DR: In this paper, a bottom-up and top-down attention mechanism was proposed to enable attention to be calculated at the level of objects and other salient image regions, which achieved state-of-the-art results on the MSCOCO test server.

...read moreread less

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

...read moreread less

2,904 citations

Posted Content•

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

[...]

Zongwei Zhou¹, Mahfuzur Rahman Siddiquee¹, Nima Tajbakhsh¹, Jianming Liang¹•Institutions (1)

Arizona State University¹

18 Jul 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents UNet++, a new, more powerful architecture for medical image segmentation where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways, and argues that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar.

...read moreread less

Abstract: In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in the low-dose CT scans of chest, nuclei segmentation in the microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

...read moreread less

2,254 citations

Book Chapter•DOI•

Unet++: A nested u-net architecture for medical image segmentation

[...]

Zongwei Zhou¹, Mahfuzur Rahman Siddiquee¹, Nima Tajbakhsh¹, Jianming Liang¹•Institutions (1)

Arizona State University¹

20 Sep 2018

TL;DR: UNet++ as discussed by the authors is a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways.

...read moreread less

2,067 citations

Proceedings Article•DOI•

VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

[...]

Yin Zhou¹, Oncel Tuzel¹•Institutions (1)

Apple Inc.¹

18 Jun 2018

TL;DR: Zhou et al. as mentioned in this paper propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network.

...read moreread less

Abstract: Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.

...read moreread less

1,948 citations

Proceedings Article•DOI•

Generative Image Inpainting with Contextual Attention

[...]

Jiahui Yu¹, Zhe Lin², Jimei Yang², Xiaohui Shen², Xin Lu², Thomas S. Huang¹ - Show less +2 more•Institutions (2)

University of Illinois at Urbana–Champaign¹, Adobe Systems²

18 Jun 2018

TL;DR: Yu et al. as discussed by the authors proposed a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.

...read moreread less

Abstract: Recent deep learning based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when it needs to borrow textures from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feedforward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes during the test time. Experiments on multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) and natural images (ImageNet, Places2) demonstrate that our proposed approach generates higher-quality inpainting results than existing ones. Code, demo and models are available at: https://github.com/JiahuiYu/generative_inpainting.

...read moreread less

1,397 citations

Posted Content•

Generative Image Inpainting with Contextual Attention

[...]

Jiahui Yu¹, Zhe Lin², Jimei Yang², Xiaohui Shen², Xin Lu², Thomas S. Huang¹ - Show less +2 more•Institutions (2)

University of Illinois at Urbana–Champaign¹, Adobe Systems²

24 Jan 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a new deep generative model-based approach is proposed which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.

...read moreread less

Abstract: Recent deep learning based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when it needs to borrow textures from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes during the test time. Experiments on multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) and natural images (ImageNet, Places2) demonstrate that our proposed approach generates higher-quality inpainting results than existing ones. Code, demo and models are available at: this https URL.

...read moreread less

1,333 citations

Proceedings Article•DOI•

Learning Discriminative Features with Multiple Granularities for Person Re-Identification

[...]

Guanshuo Wang¹, Yufeng Yuan, Xiong Chen, Jiwei Li, Xi Zhou¹ - Show less +1 more•Institutions (1)

Shanghai Jiao Tong University¹

15 Oct 2018

TL;DR: Comprehensive experiments implemented on the mainstream evaluation datasets including Market-1501, DukeMTMC-reid and CUHK03 indicate that the proposed end-to-end feature learning strategy robustly achieves state-of-the-art performances and outperforms any existing approaches by a large margin.

...read moreread less

Abstract: The combination of global and partial features has been an essential solution to improve discriminative performances in person re-identification (Re-ID) tasks. Previous part-based methods mainly focus on locating regions with specific pre-defined semantics to learn local representations, which increases learning difficulty but not efficient or robust to scenarios with large variances. In this paper, we propose an end-to-end feature learning strategy integrating discriminative information with various granularities. We carefully design the Multiple Granularity Network (MGN), a multi-branch deep network architecture consisting of one branch for global feature representations and two branches for local feature representations. Instead of learning on semantic regions, we uniformly partition the images into several stripes, and vary the number of parts in different local branches to obtain local feature representations with multiple granularities. Comprehensive experiments implemented on the mainstream evaluation datasets including Market-1501, DukeMTMC-reid and CUHK03 indicate that our method robustly achieves state-of-the-art performances and outperforms any existing approaches by a large margin. For example, on Market-1501 dataset in single query mode, we obtain a top result of Rank-1/mAP=96.6%/94.2% with this method after re-ranking.

...read moreread less

1,050 citations

Journal Article•DOI•

When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs

[...]

Gong Cheng¹, Ceyuan Yang¹, Xiwen Yao¹, Lei Guo¹, Junwei Han¹ - Show less +1 more•Institutions (1)

Northwestern Polytechnical University¹

09 Jan 2018-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: This paper proposes a simple but effective method to learn discriminative CNNs (D-CNNs) to boost the performance of remote sensing image scene classification and comprehensively evaluates the proposed method on three publicly available benchmark data sets using three off-the-shelf CNN models.

...read moreread less

Abstract: Remote sensing image scene classification is an active and challenging task driven by many applications. More recently, with the advances of deep learning models especially convolutional neural networks (CNNs), the performance of remote sensing image scene classification has been significantly improved due to the powerful feature representations learnt through CNNs. Although great success has been obtained so far, the problems of within-class diversity and between-class similarity are still two big challenges. To address these problems, in this paper, we propose a simple but effective method to learn discriminative CNNs (D-CNNs) to boost the performance of remote sensing image scene classification. Different from the traditional CNN models that minimize only the cross entropy loss, our proposed D-CNN models are trained by optimizing a new discriminative objective function. To this end, apart from minimizing the classification error, we also explicitly impose a metric learning regularization term on the CNN features. The metric learning regularization enforces the D-CNN models to be more discriminative so that, in the new D-CNN feature spaces, the images from the same scene class are mapped closely to each other and the images of different classes are mapped as farther apart as possible. In the experiments, we comprehensively evaluate the proposed method on three publicly available benchmark data sets using three off-the-shelf CNN models. Experimental results demonstrate that our proposed D-CNN methods outperform the existing baseline methods and achieve state-of-the-art results on all three data sets.

...read moreread less

Proceedings Article•DOI•

Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks.

[...]

Weilin Xu¹, David Evans¹, Yanjun Qi¹•Institutions (1)

University of Virginia¹

01 Jan 2018

Abstract: Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by \emph{adversarial examples} that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, \emph{feature squeezing}, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.

...read moreread less

Proceedings Article•DOI•

Joint 3D Proposal Generation and Object Detection from View Aggregation

[...]

Jason Ku¹, Melissa Mozifian¹, Jungwook Lee¹, Ali Harakeh¹, Steven L. Waslander¹ - Show less +1 more•Institutions (1)

University of Waterloo¹

01 Oct 2018

TL;DR: This work presents AVOD, an Aggregate View Object Detection network for autonomous driving scenarios that uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network.

...read moreread less

Abstract: We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark [1] while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is available at

...read moreread less

Proceedings Article•DOI•

Relation Networks for Object Detection

[...]

Han Hu¹, Jiayuan Gu², Zheng Zhang¹, Jifeng Dai¹, Yichen Wei¹ - Show less +1 more•Institutions (2)

Microsoft¹, Peking University²

01 Jun 2018

TL;DR: In this article, the authors propose an object relation module to model relations between objects, which is shown effective on improving object recognition and duplicate removal steps in the modern object detection pipeline.

...read moreread less

Abstract: Although it is well believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea is working in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning. This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance feature and geometry, thus allowing modeling of their relations. It is lightweight and in-place. It does not require additional supervision and is easy to embed in existing networks. It is shown effective on improving object recognition and duplicate removal steps in the modern object detection pipeline. It verifies the efficacy of modeling object relations in CNN based detection. It gives rise to the first fully end-to-end object detector.

...read moreread less

Proceedings Article•DOI•

Domain Generalization with Adversarial Feature Learning

[...]

Haoliang Li¹, Sinno Jialin Pan¹, Shiqi Wang², Alex C. Kot¹•Institutions (2)

Nanyang Technological University¹, City University of Hong Kong²

01 Jun 2018

TL;DR: This paper presents a novel framework based on adversarial autoencoders to learn a generalized latent feature representation across domains for domain generalization, and proposed an algorithm to jointly train different components of the proposed framework.

...read moreread less

Abstract: In this paper, we tackle the problem of domain generalization: how to learn a generalized feature representation for an "unseen" target domain by taking the advantage of multiple seen source-domain data. We present a novel framework based on adversarial autoencoders to learn a generalized latent feature representation across domains for domain generalization. To be specific, we extend adversarial autoencoders by imposing the Maximum Mean Discrepancy (MMD) measure to align the distributions among different domains, and matching the aligned distribution to an arbitrary prior distribution via adversarial feature learning. In this way, the learned feature representation is supposed to be universal to the seen source domains because of the MMD regularization, and is expected to generalize well on the target domain because of the introduction of the prior distribution. We proposed an algorithm to jointly train different components of our proposed framework. Extensive experiments on various vision tasks demonstrate that our proposed framework can learn better generalized features for the unseen target domain compared with state-of-the-art domain generalization methods.

...read moreread less

Book Chapter•DOI•

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

[...]

Saining Xie¹, Chen Sun¹, Jonathan Huang¹, Zhuowen Tu¹, Kevin Murphy¹ - Show less +1 more•Institutions (1)

Google¹

08 Sep 2018

TL;DR: In this article, it was shown that it is possible to replace many of the expensive 3D convolutions by low-cost 2D convolution, and the best result was achieved when replacing the 3D CNNs at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful.

...read moreread less

Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic as that in 2D static image classification Three main challenges exist including spatial (image) feature representation, temporal information representation, and model/computation complexity It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions Rather surprisingly, best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level “semantic” features is more useful Our conclusion generalizes to datasets with very different properties When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24)

...read moreread less

Book Chapter•DOI•

MVSNet: Depth inference for unstructured multi-view stereo

[...]

Yao Yao¹, Zixin Luo¹, Shiwei Li¹, Tian Fang, Long Quan¹ - Show less +1 more•Institutions (1)

Hong Kong University of Science and Technology¹

08 Sep 2018

TL;DR: This work presents an end-to-end deep learning architecture for depth map inference from multi-view images that flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature.

...read moreread less

Abstract: We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset. With simple post-processing, our method not only significantly outperforms previous state-of-the-arts, but also is several times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranks first before April 18, 2018 without any fine-tuning, showing the strong generalization ability of MVSNet.

...read moreread less

Book Chapter•DOI•

Deep Continuous Fusion for Multi-Sensor 3D Object Detection

[...]

Ming Liang¹, Bin Yang¹, Shenlong Wang¹, Raquel Urtasun¹•Institutions (1)

Uber ¹

08 Sep 2018

TL;DR: This paper proposes a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization and designs an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDar feature maps at different levels of resolution.

...read moreread less

Abstract: In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encode both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.

...read moreread less

Posted Content•

Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation.

[...]

Md. Zahangir Alom, Mahmudul Hasan, Chris Yakopcic, Tarek M. Taha, Vijayan K. Asari - Show less +1 more

20 Feb 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: A Recurrent Convolutional Neural Network (RCNN) based on U-Net as well as a Recurrent Residual convolutional neural Network (RRCNN), which are named RU-Net and R2U-Net respectively are proposed, which show superior performance on segmentation tasks compared to equivalent models including U-nets and residual U- net.

...read moreread less

Abstract: Deep learning (DL) based semantic segmentation methods have been providing state-of-the-art performance in the last few years. More specifically, these techniques have been successfully applied to medical image classification, segmentation, and detection tasks. One deep learning technique, U-Net, has become one of the most popular for these applications. In this paper, we propose a Recurrent Convolutional Neural Network (RCNN) based on U-Net as well as a Recurrent Residual Convolutional Neural Network (RRCNN) based on U-Net models, which are named RU-Net and R2U-Net respectively. The proposed models utilize the power of U-Net, Residual Network, as well as RCNN. There are several advantages of these proposed architectures for segmentation tasks. First, a residual unit helps when training deep architecture. Second, feature accumulation with recurrent residual convolutional layers ensures better feature representation for segmentation tasks. Third, it allows us to design better U-Net architecture with same number of network parameters with better performance for medical image segmentation. The proposed models are tested on three benchmark datasets such as blood vessel segmentation in retina images, skin cancer segmentation, and lung lesion segmentation. The experimental results show superior performance on segmentation tasks compared to equivalent models including U-Net and residual U-Net (ResU-Net).

...read moreread less

Proceedings Article•DOI•

Recovering Realistic Texture in Image Super-Resolution by Deep Spatial Feature Transform

[...]

Xintao Wang¹, Ke Yu¹, Chao Dong², Chen Change Loy¹•Institutions (2)

The Chinese University of Hong Kong¹, SenseTime²

18 Jun 2018

TL;DR: In this paper, a spatial feature transform (SFT) layer was proposed to generate affine transformation parameters for spatial-wise feature modulation in a single-image super-resolution network.

...read moreread less

Abstract: Despite that convolutional neural networks (CNN) have recently demonstrated high-quality reconstruction for single-image super-resolution (SR), recovering natural and realistic texture remains a challenging problem. In this paper, we show that it is possible to recover textures faithful to semantic classes. In particular, we only need to modulate features of a few intermediate layers in a single network conditioned on semantic segmentation probability maps. This is made possible through a novel Spatial Feature Transform (SFT) layer that generates affine transformation parameters for spatial-wise feature modulation. SFT layers can be trained end-to-end together with the SR network using the same loss function. During testing, it accepts an input image of arbitrary size and generates a high-resolution image with just a single forward pass conditioned on the categorical priors. Our final results show that an SR network equipped with SFT can generate more realistic and visually pleasing textures in comparison to state-of-the-art SRGAN [27] and EnhanceNet [38].

...read moreread less

Proceedings Article•DOI•

The iNaturalist Species Classification and Detection Dataset

[...]

Grant Van Horn¹, Oisin Mac Aodha¹, Yang Song², Yin Cui³, Chen Sun², Alex Shepard, Hartwig Adam², Pietro Perona¹, Serge Belongie³ - Show less +5 more•Institutions (3)

California Institute of Technology¹, Google², Cornell University³

18 Jun 2018

TL;DR: The iNaturalist dataset as discussed by the authors contains 859,000 images from over 5,000 different species of plants and animals captured in a wide variety of situations from all over the world.

...read moreread less

Abstract: Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real world conditions we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, from all over the world. Images were collected with different camera types, have varying image quality, feature a large class imbalance, and have been verified by multiple citizen scientists. We discuss the collection of the dataset and present extensive baseline experiments using state-of-the-art computer vision classification and detection models. Results show that current non-ensemble based methods achieve only 67% top one classification accuracy, illustrating the difficulty of the dataset. Specifically, we observe poor results for classes with small numbers of training examples suggesting more attention is needed in low-shot learning.

...read moreread less

Book Chapter•DOI•

Multi-scale Residual Network for Image Super-Resolution

[...]

Juncheng Li¹, Faming Fang¹, Kangfu Mei², Guixu Zhang¹•Institutions (2)

East China Normal University¹, Jiangxi Normal University²

08 Sep 2018

TL;DR: A novel multi-scale residual network (MSRN) to fully exploit the image features, which outperform most of the state-of-the-art methods.

...read moreread less

Abstract: Recent studies have shown that deep neural networks can significantly improve the quality of single-image super-resolution. Current researches tend to use deeper convolutional neural networks to enhance performance. However, blindly increasing the depth of the network cannot ameliorate the network effectively. Worse still, with the depth of the network increases, more problems occurred in the training process and more training tricks are needed. In this paper, we propose a novel multi-scale residual network (MSRN) to fully exploit the image features, which outperform most of the state-of-the-art methods. Based on the residual block, we introduce convolution kernels of different sizes to adaptively detect the image features in different scales. Meanwhile, we let these features interact with each other to get the most efficacious image information, we call this structure Multi-scale Residual Block (MSRB). Furthermore, the outputs of each MSRB are used as the hierarchical features for global feature fusion. Finally, all these features are sent to the reconstruction module for recovering the high-quality image.

...read moreread less

Journal Article•DOI•

Machine Health Monitoring Using Local Feature-Based Gated Recurrent Unit Networks

[...]

Rui Zhao¹, Dongzhe Wang¹, Ruqiang Yan², Kezhi Mao¹, Fei Shen², Jinjiang Wang³ - Show less +2 more•Institutions (3)

Nanyang Technological University¹, Southeast University², China University of Petroleum³

01 Feb 2018-IEEE Transactions on Industrial Electronics

TL;DR: Inspired by the success of deep learning methods that redefine representation learning from raw data, this work proposes local feature-based gated recurrent unit (LFGRU) networks, a hybrid approach that combines handcrafted feature design with automatic feature learning for machine health monitoring.

...read moreread less

Abstract: In modern industries, machine health monitoring systems (MHMS) have been applied wildly with the goal of realizing predictive maintenance including failures tracking, downtime reduction, and assets preservation. In the era of big machinery data, data-driven MHMS have achieved remarkable results in the detection of faults after the occurrence of certain failures (diagnosis) and prediction of the future working conditions and the remaining useful life (prognosis). The numerical representation for raw sensory data is the key stone for various successful MHMS. Conventional methods are the labor-extensive as they usually depend on handcrafted features, which require expert knowledge. Inspired by the success of deep learning methods that redefine representation learning from raw data, we propose local feature-based gated recurrent unit (LFGRU) networks. It is a hybrid approach that combines handcrafted feature design with automatic feature learning for machine health monitoring. First, features from windows of input time series are extracted. Then, an enhanced bidirectional GRU network is designed and applied on the generated sequence of local features to learn the representation. A supervised learning layer is finally trained to predict machine condition. Experiments on three machine health monitoring tasks: tool wear prediction, gearbox fault diagnosis, and incipient bearing fault detection verify the effectiveness and generalization of the proposed LFGRU.

...read moreread less

Journal Article•DOI•

Whale optimization approaches for wrapper feature selection

[...]

Majdi Mafarja¹, Seyedali Mirjalili²•Institutions (2)

Birzeit University¹, Griffith University²

01 Jan 2018-Applied Soft Computing

TL;DR: A new wrapper feature selection approach is proposed based on Whale Optimization Algorithm based on the influence of using the Tournament and Roulette Wheel selection mechanisms instead of using a random operator in the searching process to search the optimal feature subsets for classification purposes.

...read moreread less

Book Chapter•DOI•

Triplet Loss in Siamese Network for Object Tracking

[...]

Xingping Dong¹, Jianbing Shen¹•Institutions (1)

Beijing Institute of Technology¹

08 Sep 2018

TL;DR: A novel triplet loss is proposed to extract expressive deep feature for object tracking by adding it into Siamese network framework instead of pairwise loss for training.

...read moreread less

Abstract: Object tracking is still a critical and challenging problem with many applications in computer vision. For this challenge, more and more researchers pay attention to applying deep learning to get powerful feature for better tracking accuracy. In this paper, a novel triplet loss is proposed to extract expressive deep feature for object tracking by adding it into Siamese network framework instead of pairwise loss for training. Without adding any inputs, our approach is able to utilize more elements for training to achieve more powerful feature via the combination of original samples. Furthermore, we propose a theoretical analysis by combining comparison of gradients and back-propagation, to prove the effectiveness of our method. In experiments, we apply the proposed triplet loss for three real-time trackers based on Siamese network. And the results on several popular tracking benchmarks show our variants operate at almost the same frame-rate with baseline trackers and achieve superior tracking performance than them, as well as the comparable accuracy with recent state-of-the-art real-time trackers.

...read moreread less

Proceedings Article•DOI•

Multi-level Wavelet-CNN for Image Restoration

[...]

Pengju Liu¹, Hongzhi Zhang¹, Kai Zhang¹, Liang Lin², Wangmeng Zuo¹ - Show less +1 more•Institutions (2)

Harbin Institute of Technology¹, Sun Yat-sen University²

18 Jun 2018

TL;DR: In this article, a multi-level wavelet CNN (MWCNN) model is proposed for image denoising, single image super-resolution, and JPEG image artifacts removal, which can be applied to many image restoration tasks.

...read moreread less

Abstract: The tradeoff between receptive field size and efficiency is a crucial issue in low level vision. Plain convolutional networks (CNNs) generally enlarge the receptive field at the expense of computational cost. Recently, dilated filtering has been adopted to address this issue. But it suffers from gridding effect, and the resulting receptive field is only a sparse sampling of input image with checkerboard patterns. In this paper, we present a novel multi-level wavelet CNN (MWCNN) model for better tradeoff between receptive field size and computational efficiency. With the modified U-Net architecture, wavelet transform is introduced to reduce the size of feature maps in the contracting subnetwork. Furthermore, another convolutional layer is further used to decrease the channels of feature maps. In the expanding subnetwork, inverse wavelet transform is then deployed to reconstruct the high resolution feature maps. Our MWCNN can also be explained as the generalization of dilated filtering and subsampling, and can be applied to many image restoration tasks. The experimental results clearly show the effectiveness of MWCNN for image denoising, single image super-resolution, and JPEG image artifacts removal.

...read moreread less

Journal Article•DOI•

Unsupervised Person Re-identification: Clustering and Fine-tuning

[...]

Hehe Fan¹, Liang Zheng¹, Chenggang Yan², Yi Yang¹•Institutions (2)

University of Technology, Sydney¹, Hangzhou Dianzi University²

10 Oct 2018-ACM Transactions on Multimedia Computing, Communications, and Applications

TL;DR: A progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains and demonstrates that PUL outputs discriminative features that improve the re-ID accuracy.

...read moreread less

Abstract: The superiority of deeply learned pedestrian representations has been reported in very recent literature of person re-identification (re-ID). In this article, we consider the more pragmatic issue of learning a deep feature with no or only a few labels. We propose a progressive unsupervised learning (PUL) method to transfer pretrained deep representations to unseen domains. Our method is easy to implement and can be viewed as an effective baseline for unsupervised re-ID feature learning. Specifically, PUL iterates between (1) pedestrian clustering and (2) fine-tuning of the convolutional neural network (CNN) to improve the initialization model trained on the irrelevant labeled dataset. Since the clustering results can be very noisy, we add a selection operation between the clustering and fine-tuning. At the beginning, when the model is weak, CNN is fine-tuned on a small amount of reliable examples that locate near to cluster centroids in the feature space. As the model becomes stronger, in subsequent iterations, more images are being adaptively selected as CNN training samples. Progressively, pedestrian clustering and the CNN model are improved simultaneously until algorithm convergence. This process is naturally formulated as self-paced learning. We then point out promising directions that may lead to further improvement. Extensive experiments on three large-scale re-ID datasets demonstrate that PUL outputs discriminative features that improve the re-ID accuracy. Our code has been released at https://github.com/hehefan/Unsupervised-Person-Re-identification-Clustering-and-Fine-tuning.

...read moreread less

Journal Article•DOI•

Face detection using deep learning: An improved faster RCNN approach

[...]

Xudong Sun, Pengcheng Wu, Steven C. H. Hoi¹•Institutions (1)

Singapore Management University¹

19 Jul 2018-Neurocomputing

TL;DR: This work improves the state-of-the-art Faster RCNN framework by combining a number of strategies, including feature concatenation, hard negative mining, multi-scale training, model pre-training, and proper calibration of key parameters.

...read moreread less

Journal Article•DOI•

Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering

[...]

Zhou Yu¹, Jun Yu¹, Chenchao Xiang¹, Jianping Fan², Dacheng Tao³ - Show less +1 more•Institutions (3)

Hangzhou Dianzi University¹, University of North Carolina at Charlotte², University of Sydney³

09 Apr 2018-IEEE Transactions on Neural Networks

TL;DR: Zhang et al. as mentioned in this paper proposed a coattention mechanism using a deep neural network (DNN) architecture to jointly learn the attentions for both the image and the question, which can reduce the irrelevant features effectively and obtain more discriminative features for image and question representations.

...read moreread less

Abstract: Visual question answering (VQA) is challenging, because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multimodal feature fusion that is able to capture the complex interactions between multimodal features; and 3) automatic answer prediction that is able to consider the complex correlations between multiple diverse answers for the same question. For fine-grained image and question representations, a “coattention” mechanism is developed using a deep neural network (DNN) architecture to jointly learn the attentions for both the image and the question, which can allow us to reduce the irrelevant features effectively and obtain more discriminative features for image and question representations. For multimodal feature fusion, a generalized multimodal factorized high-order pooling approach (MFH) is developed to achieve more effective fusion of multimodal features by exploiting their correlations sufficiently, which can further result in superior VQA performance as compared with the state-of-the-art approaches. For answer prediction, the Kullback–Leibler divergence is used as the loss function to achieve precise characterization of the complex correlations between multiple diverse answers with the same or similar meaning, which can allow us to achieve faster convergence rate and obtain slightly better accuracy on answer prediction. A DNN architecture is designed to integrate all these aforementioned modules into a unified model for achieving superior VQA performance. With an ensemble of our MFH models, we achieve the state-of-the-art performance on the large-scale VQA data sets and win the runner-up in VQA Challenge 2017.

...read moreread less

Posted Content•

M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network.

[...]

Qijie Zhao¹, Tao Sheng¹, Yongtao Wang¹, Zhi Tang¹, Ying Chen², Ling Cai², Haibin Ling³ - Show less +3 more•Institutions (3)

Peking University¹, Alibaba Group², Temple University³

12 Nov 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as mentioned in this paper proposed a multi-level feature pyramid network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales, which achieved state-of-the-art results among one-stage detectors.

...read moreread less

Abstract: Feature pyramids are widely exploited by both the state-of-the-art one-stage object detectors (e.g., DSSD, RetinaNet, RefineDet) and the two-stage object detectors (e.g., Mask R-CNN, DetNet) to alleviate the problem arising from scale variation across object instances. Although these object detectors with feature pyramids achieve encouraging results, they have some limitations due to that they only simply construct the feature pyramid according to the inherent multi-scale, pyramidal architecture of the backbones which are actually designed for object classification task. Newly, in this work, we present a method called Multi-Level Feature Pyramid Network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales. First, we fuse multi-level features (i.e. multiple layers) extracted by backbone as the base feature. Second, we feed the base feature into a block of alternating joint Thinned U-shape Modules and Feature Fusion Modules and exploit the decoder layers of each u-shape module as the features for detecting objects. Finally, we gather up the decoder layers with equivalent scales (sizes) to develop a feature pyramid for object detection, in which every feature map consists of the layers (features) from multiple levels. To evaluate the effectiveness of the proposed MLFPN, we design and train a powerful end-to-end one-stage object detector we call M2Det by integrating it into the architecture of SSD, which gets better detection performance than state-of-the-art one-stage detectors. Specifically, on MS-COCO benchmark, M2Det achieves AP of 41.0 at speed of 11.8 FPS with single-scale inference strategy and AP of 44.2 with multi-scale inference strategy, which is the new state-of-the-art results among one-stage detectors. The code will be made available on \url{this https URL.

...read moreread less

Collapse