Showing papers by "Jian Sun" published in 2019


Proceedings ArticleDOI
03 Apr 2019
TL;DR: This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints that substantially reduces the number of parameters, but still obtains sufficient receptive field and enhances the model learning ability, which strikes a balance between the speed and segmentation performance.
Abstract: This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints. Our proposed network starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascade respectively. Based on the multi-scale feature propagation, DFANet substantially reduces the number of parameters, but still obtains sufficient receptive field and enhances the model learning ability, which strikes a balance between the speed and segmentation performance. Experiments on Cityscapes and CamVid datasets demonstrate the superior performance of DFANet with 8$\times$ less FLOPs and 2$\times$ faster than the existing state-of-the-art real-time semantic segmentation methods while providing comparable accuracy. Specifically, it achieves 70.3% Mean IOU on the Cityscapes test dataset with only 1.7 GFLOPs and a speed of 160 FPS on one NVIDIA Titan X card, and 71.3% Mean IOU with 3.4 GFLOPs while inferring on a higher resolution image.

409 citations
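
The DFANet abstract above reports accuracy as Mean IOU. As a reminder of what that metric measures, here is a minimal sketch that computes it from a confusion matrix. The function name `mean_iou` and the toy two-class matrix are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    Rows are ground-truth classes, columns are predicted classes.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                                   # correctly labeled pixels
        fn = sum(confusion[c]) - tp                            # class c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp       # pixels wrongly labeled as c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical pixel counts for two classes).
confusion = [
    [8, 2],
    [1, 9],
]
score = mean_iou(confusion)  # average of 8/11 and 9/12
```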


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Objects365 can serve as a better feature learning dataset for localization-sensitive tasks like object detection and semantic segmentation, and its better generalization ability has been verified on CityPersons, VOC segmentation, and ADE tasks.
Abstract: In this paper, we introduce a new large-scale object detection dataset, Objects365, which has 365 object categories over 600K training images. More than 10 million high-quality bounding boxes are manually labeled through a three-step, carefully designed annotation pipeline. It is the largest object detection dataset (with full annotation) so far and establishes a more challenging benchmark for the community. Objects365 can serve as a better feature learning dataset for localization-sensitive tasks like object detection and semantic segmentation. The Objects365 pre-trained models significantly outperform ImageNet pre-trained models with a 5.6-point gain (42 vs 36.4) based on the standard setting of 90K iterations on the COCO benchmark. Even compared with a much longer training schedule of 540K iterations, our Objects365 pre-trained model with 90K iterations still has a 2.7-point gain (42 vs 39.3). Meanwhile, the finetuning time can be greatly reduced (up to 10 times) when reaching the same accuracy. The better generalization ability of Objects365 has also been verified on CityPersons, VOC segmentation, and ADE tasks. The dataset as well as the pre-trained models have been released at www.objects365.org.

331 citations


Posted Content
TL;DR: In this article, a meta learning approach for channel pruning of deep neural networks is proposed, where the weights are directly generated by the trained PruningNet and do not need any finetuning at search time.
Abstract: In this paper, we propose a novel meta learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for good-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and we do not need any finetuning at search time. With a single PruningNet trained for the target network, we can search for various Pruned Networks under different constraints with little human participation. Compared to the state-of-the-art pruning methods, we have demonstrated superior performances on MobileNet V1/V2 and ResNet. Codes are available on this https URL.

291 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: A novel meta learning approach for automatic channel pruning of very deep neural networks by training a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network.
Abstract: In this paper, we propose a novel meta learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for good-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and we do not need any finetuning at search time. With a single PruningNet trained for the target network, we can search for various Pruned Networks under different constraints with little human participation. Compared to the state-of-the-art pruning methods, we have demonstrated superior performances on MobileNet V1/V2 and ResNet. Codes are available on https://github.com/liuzechun/MetaPruning.

286 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Benefiting from its highly efficient backbone and detection part design, ThunderNet surpasses previous lightweight one-stage detectors with only 40% of the computational cost on PASCAL VOC and COCO benchmarks.
Abstract: Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. Prior lightweight CNN-based detectors tend to use a one-stage pipeline. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representation, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. Finally, we investigate the balance between the input resolution, the backbone, and the detection head. Benefiting from the highly efficient backbone and detection part design, ThunderNet surpasses previous lightweight one-stage detectors with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, ThunderNet runs at 24.1 fps on an ARM-based device with 19.2 AP on COCO. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Code will be released for paper reproduction.

179 citations


Journal ArticleDOI
TL;DR: Experimental results obtained on clinical patient datasets demonstrate that the proposed deep learning-based strategy for MBIR can achieve promising gains over existing algorithms for LdCT image reconstruction in terms of noise-induced artifact suppression and edge detail preservation.
Abstract: Reducing the exposure to X-ray radiation while maintaining a clinically acceptable image quality is desirable in various CT applications. To realize low-dose CT (LdCT) imaging, model-based iterative reconstruction (MBIR) algorithms are widely adopted, but they require proper prior knowledge assumptions in the sinogram and/or image domains and involve tedious manual optimization of multiple parameters. In this paper, we propose a deep learning (DL)-based strategy for MBIR to simultaneously address prior knowledge design and MBIR parameter selection in one optimization framework. Specifically, a parameterized plug-and-play alternating direction method of multipliers (3pADMM) is proposed for the general penalized weighted least-squares model, and then, by adopting the basic idea of DL, the parameterized plug-and-play (3p) prior and the related parameters are optimized simultaneously in a single framework using a large number of training data. The main contribution of this paper is that the 3p prior and the related parameters in the proposed 3pADMM framework can be supervised and optimized simultaneously to achieve robust LdCT reconstruction performance. Experimental results obtained on clinical patient datasets demonstrate that the proposed method can achieve promising gains over existing algorithms for LdCT image reconstruction in terms of noise-induced artifact suppression and edge detail preservation.

137 citations


Posted Content
TL;DR: DFANet is an efficient CNN architecture based on multi-scale feature propagation that substantially reduces the number of parameters while still obtaining a sufficient receptive field and enhancing the model's learning ability.
Abstract: This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints. Our proposed network starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascade respectively. Based on the multi-scale feature propagation, DFANet substantially reduces the number of parameters, but still obtains sufficient receptive field and enhances the model learning ability, which strikes a balance between the speed and segmentation performance. Experiments on Cityscapes and CamVid datasets demonstrate the superior performance of DFANet with 8$\times$ less FLOPs and 2$\times$ faster than the existing state-of-the-art real-time semantic segmentation methods while providing comparable accuracy. Specifically, it achieves 70.3% Mean IOU on the Cityscapes test dataset with only 1.7 GFLOPs and a speed of 160 FPS on one NVIDIA Titan X card, and 71.3% Mean IOU with 3.4 GFLOPs while inferring on a higher resolution image.

118 citations


Posted Content
26 Mar 2019
TL;DR: This paper proposes DetNAS to automatically search neural architectures for the backbones of object detectors; the search space is formulated as a supernet and the search method relies on an evolutionary algorithm (EA).
Abstract: Object detectors are usually equipped with networks designed for image classification as backbones, e.g., ResNet. Although it is publicly known that there is a gap between the task of image classification and object detection, designing a suitable detector backbone is still manually exhaustive. In this paper, we propose DetNAS to automatically search neural architectures for the backbones of object detectors. In DetNAS, the search space is formulated as a supernet and the search method relies on an evolutionary algorithm (EA). In experiments, we show the effectiveness of DetNAS on various detectors: the one-stage detector, RetinaNet, and the two-stage detector, FPN. For each case, we search in both the training-from-scratch scheme and the ImageNet pre-training scheme. There is a consistent superiority compared to the architectures searched on ImageNet classification. Our main result architecture achieves better performance than ResNet-101 on COCO with the FPN detector. In addition, we illustrate the architectures searched by DetNAS and find some meaningful patterns.

112 citations


Journal ArticleDOI
TL;DR: A graph-based semisupervised deep learning model for PolSAR image classification that enforces the category label constraints on the human-labeled pixels and encourages class label smoothness and the alignment of class label boundaries with the image edges.
Abstract: Aiming at improving the classification accuracy with limited numbers of labeled pixels in polarimetric synthetic aperture radar (PolSAR) image classification task, this paper presents a graph-based semisupervised deep learning model for PolSAR image classification. It models the PolSAR image as an undirected graph, where the nodes correspond to the labeled and unlabeled pixels, and the weighted edges represent similarities between the pixels. Upon the graph, we design an energy function incorporating a semisupervision term, a convolutional neural network (CNN) term, and a pairwise smoothness term. The employed CNN extracts abstract and data-driven polarimetric features and outputs class label predictions to the graph model. The semisupervision term enforces the category label constraints on the human-labeled pixels. The pairwise smoothness term encourages class label smoothness and the alignment of class label boundaries with the image edges. Starting from an initialized class label map generated based on $K$-Wishart distribution hypothesis or superpixel segmentation of PauliRGB images, we iteratively and alternately optimize the defined energy function until it converges. We conducted experiments on two real benchmark PolSAR images, and extensive experiments demonstrated that our approach achieved the state-of-the-art results for PolSAR image classification.

102 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: AdaMatting disentangles the matting problem into two sub-tasks: trimap adaptation, a pixel-wise classification problem that infers the global structure of the input image, and alpha estimation, a regression problem that calculates the opacity of each blended pixel.
Abstract: Most previous image matting methods require a roughly specified trimap as input, and estimate fractional alpha values for all pixels that are in the unknown region of the trimap. In this paper, we argue that directly estimating the alpha matte from a coarse trimap is a major limitation of previous methods, as this practice tries to address two difficult and inherently different problems at the same time: identifying true blending pixels inside the trimap region, and estimating accurate alpha values for them. We propose AdaMatting, a new end-to-end matting framework that disentangles this problem into two sub-tasks: trimap adaptation and alpha estimation. Trimap adaptation is a pixel-wise classification problem that infers the global structure of the input image by identifying definite foreground, background, and semi-transparent image regions. Alpha estimation is a regression problem that calculates the opacity value of each blended pixel. Our method separately handles these two sub-tasks within a single deep convolutional neural network (CNN). Extensive experiments show that AdaMatting has additional structure awareness and trimap fault-tolerance. Our method achieves the state-of-the-art performance on Adobe Composition-1k dataset both qualitatively and quantitatively. It is also the current best-performing method on the alphamatting.com online evaluation for all commonly-used metrics.

77 citations
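
To make the two sub-tasks in the AdaMatting abstract above concrete, the sketch below uses the standard compositing model, in which an observed pixel is `alpha * foreground + (1 - alpha) * background`. It is only an illustration of what an alpha value and a three-way region label mean; the function names, the `eps` tolerance, and the sample colors are assumptions, and nothing here reproduces the AdaMatting network.

```python
def composite(alpha, fg, bg):
    """Blend one pixel per channel: I = alpha * F + (1 - alpha) * B."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def region_label(alpha, eps=1e-6):
    """Map an opacity value to one of the three regions named in the abstract.

    `eps` is a hypothetical tolerance chosen for this sketch.
    """
    if alpha >= 1.0 - eps:
        return "foreground"
    if alpha <= eps:
        return "background"
    return "semi-transparent"


# A half-transparent red foreground pixel over a blue background.
pixel = composite(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
labels = [region_label(a) for a in (1.0, 0.0, 0.3)]
```

In these terms, trimap adaptation predicts the label for every pixel, while alpha estimation regresses the opacity of the pixels labeled semi-transparent.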


Posted Content
TL;DR: This paper proposes AdaMatting, a new end-to-end matting framework that disentangles this problem into two sub-tasks: trimap adaptation and alpha estimation, which achieves the state-of-the-art performance on Adobe Composition-1k dataset both qualitatively and quantitatively.
Abstract: Most previous image matting methods require a roughly specified trimap as input, and estimate fractional alpha values for all pixels that are in the unknown region of the trimap. In this paper, we argue that directly estimating the alpha matte from a coarse trimap is a major limitation of previous methods, as this practice tries to address two difficult and inherently different problems at the same time: identifying true blending pixels inside the trimap region, and estimating accurate alpha values for them. We propose AdaMatting, a new end-to-end matting framework that disentangles this problem into two sub-tasks: trimap adaptation and alpha estimation. Trimap adaptation is a pixel-wise classification problem that infers the global structure of the input image by identifying definite foreground, background, and semi-transparent image regions. Alpha estimation is a regression problem that calculates the opacity value of each blended pixel. Our method separately handles these two sub-tasks within a single deep convolutional neural network (CNN). Extensive experiments show that AdaMatting has additional structure awareness and trimap fault-tolerance. Our method achieves the state-of-the-art performance on Adobe Composition-1k dataset both qualitatively and quantitatively. It is also the current best-performing method on the this http URL online evaluation for all commonly-used metrics.

Posted Content
TL;DR: This paper investigates the effectiveness of two-stage detectors in real-time generic detection and proposes a lightweight two-stage detector named ThunderNet, which achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks.
Abstract: Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representation, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. Finally, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Code will be released for paper reproduction.

Posted Content
TL;DR: This work proposes an unsupervised deep homography method with a new architecture design that outperforms the state-of-the-art including deep solutions and feature-based solutions.
Abstract: Homography estimation is a basic image alignment method in many applications. It is usually conducted by extracting and matching sparse feature points, which are error-prone in low-light and low-texture images. On the other hand, previous deep homography approaches use either synthetic images for supervised learning or aerial images for unsupervised learning, both ignoring the importance of handling depth disparities and moving objects in real world applications. To overcome these problems, in this work we propose an unsupervised deep homography method with a new architecture design. In the spirit of the RANSAC procedure in traditional methods, we specifically learn an outlier mask to only select reliable regions for homography estimation. We calculate loss with respect to our learned deep features instead of directly comparing image content as was done previously. To achieve the unsupervised training, we also formulate a novel triplet loss customized for our network. We verify our method by conducting comprehensive comparisons on a new dataset that covers a wide range of scenes with varying degrees of difficulty for the task. Experimental results reveal that our method outperforms the state-of-the-art including deep solutions and feature-based solutions.

Journal ArticleDOI
17 Jul 2019
TL;DR: HyperAdam is a learned optimizer that combines the idea of learning to optimize with the traditional Adam optimizer; the authors report state-of-the-art results for training various networks, such as multilayer perceptrons, CNNs and LSTMs.
Abstract: Deep neural networks are traditionally trained using human-designed stochastic optimization algorithms, such as SGD and Adam. Recently, the approach of learning to optimize network parameters has emerged as a promising research topic. However, these learned black-box optimizers sometimes do not fully utilize the experience in human-designed optimizers, and therefore have limited generalization ability. In this paper, a new optimizer, dubbed as HyperAdam, is proposed that combines the idea of “learning to optimize” and traditional Adam optimizer. Given a network for training, its parameter update in each iteration generated by HyperAdam is an adaptive combination of multiple updates generated by Adam with varying decay rates. The combination weights and decay rates in HyperAdam are adaptively learned depending on the task. HyperAdam is modeled as a recurrent neural network with AdamCell, WeightCell and StateCell. It is justified to be state-of-the-art for various network training, such as multilayer perceptron, CNN and LSTM.
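
The HyperAdam abstract above describes a parameter update that is a combination of several Adam updates with different decay rates. The sketch below shows that idea for a single scalar parameter. It is a simplification under stated assumptions: the decay-rate pairs, the combination weights, and the learning rate are fixed constants chosen for illustration, whereas in the paper they are produced adaptively by a recurrent network; the function names are hypothetical.

```python
import math


def adam_direction(grads, beta1, beta2, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) for a gradient history."""
    m, v, t = 0.0, 0.0, 0
    for g in grads:
        t += 1
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    if t == 0:
        return 0.0
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)


def combined_update(grads, decay_pairs, weights, lr=0.01):
    """Weighted combination of Adam directions, one per (beta1, beta2) pair."""
    directions = [adam_direction(grads, b1, b2) for b1, b2 in decay_pairs]
    return -lr * sum(w * d for w, d in zip(weights, directions))


# Assumed values for illustration only (not learned, not from the paper).
grads = [2.0] * 5
decay_pairs = [(0.9, 0.999), (0.5, 0.9), (0.0, 0.5)]
weights = [0.5, 0.3, 0.2]
step = combined_update(grads, decay_pairs, weights)
```

With a constant positive gradient every bias-corrected Adam direction is close to 1, so the combined step is close to `-lr` whenever the weights sum to 1.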

Posted Content
TL;DR: A deep Hough voting network is proposed to detect 3D keypoints of objects and then estimate the 6D pose parameters by least-squares fitting, which is a natural extension of 2D-keypoint approaches that successfully work on RGB-based 6DoF estimation.
Abstract: In this work, we present a novel data-driven method for robust 6DoF object pose estimation from a single RGBD image. Unlike previous methods that directly regress pose parameters, we tackle this challenging task with a keypoint-based approach. Specifically, we propose a deep Hough voting network to detect 3D keypoints of objects and then estimate the 6D pose parameters in a least-squares fitting manner. Our method is a natural extension of 2D-keypoint approaches that successfully work on RGB-based 6DoF estimation. It allows us to fully utilize the geometric constraint of rigid objects with the extra depth information and is easy for a network to learn and optimize. Extensive experiments were conducted to demonstrate the effectiveness of 3D-keypoint detection in the 6D pose estimation task. Experimental results also show our method outperforms the state-of-the-art methods by large margins on several benchmarks. Code and video are available at this https URL.

Book ChapterDOI
13 Oct 2019
TL;DR: A novel deep network is proposed, dubbed as Blind-PMRI-Net, to simultaneously reconstruct the MR image and sensitivity maps in a blind setting for parallel imaging, which naturally combines the physical constraint of parallel imaging and prior learning in a single deep architecture.
Abstract: Parallel imaging is a fast magnetic resonance imaging technique through spatial sensitivity coding using multi-coils. To reconstruct a high quality MR image from under-sampled k-space data, we propose a novel deep network, dubbed as Blind-PMRI-Net, to simultaneously reconstruct the MR image and sensitivity maps in a blind setting for parallel imaging. The Blind-PMRI-Net is a novel deep architecture inspired by the iterative algorithm optimizing a novel energy model for joint image and sensitivity estimation based on image and sensitivity priors. The network is designed to be able to automatically learn these two priors by learning their corresponding proximal operators using convolutional neural networks. Blind-PMRI-Net naturally combines the physical constraint of parallel imaging and prior learning in a single deep architecture. Experiments on a knee MRI dataset show that our network can effectively reconstruct MR images with higher accuracy than previous methods, with fast computational speed. For example, Blind-PMRI-Net takes 0.72 s on GPU to reconstruct 15-channel sensitivity maps and a complex-valued MR image of size \(320\times 320\).

Proceedings Article
01 Jan 2019
TL;DR: With the learned diffusion distance, a hierarchical image segmentation method outperforming previous segmentation methods is proposed, and a weakly supervised semantic segmentation network built on diffusion distance achieves promising results on the PASCAL VOC 2012 segmentation dataset.
Abstract: Diffusion distance is a spectral method for measuring distance among nodes on a graph considering global data structure. In this work, we propose a spec-diff-net for computing diffusion distance on a graph based on approximate spectral decomposition. The network is a differentiable deep architecture consisting of feature extraction and diffusion distance modules for computing diffusion distance on an image by end-to-end training. We design a low-resolution kernel matching loss and a high-resolution segment matching loss to enforce the network's output to be consistent with human-labeled image segments. To compute high-resolution diffusion distance or segmentation masks, we design an up-sampling strategy by feature-attentional interpolation which can be learned when training spec-diff-net. With the learned diffusion distance, we propose a hierarchical image segmentation method outperforming previous segmentation methods. Moreover, a weakly supervised semantic segmentation network is designed using diffusion distance and achieves promising results on the PASCAL VOC 2012 segmentation dataset.
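
For readers unfamiliar with the diffusion distance mentioned in the abstract above, the following sketch computes it on a tiny weighted graph using the classical definition: run a random walk for `t` steps from each node and compare the resulting distributions, weighting each coordinate by the inverse stationary probability. This is a hedged illustration that uses direct matrix powers rather than the approximate spectral decomposition of spec-diff-net; the graph, the choice `t=2`, and all function names are assumptions.

```python
import math


def transition_matrix(W):
    """Row-normalize a symmetric weight matrix into random-walk probabilities."""
    return [[w / sum(row) for w in row] for row in W]


def matmul(A, B):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def diffusion_distance(W, i, j, t=2):
    """Distance between the t-step walk distributions started at nodes i and j."""
    P = transition_matrix(W)
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)
    total = sum(sum(row) for row in W)
    stationary = [sum(row) / total for row in W]  # degree / total degree
    return math.sqrt(sum((Pt[i][k] - Pt[j][k]) ** 2 / stationary[k]
                         for k in range(len(W))))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by one weak edge between 2 and 3.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
within = diffusion_distance(W, 0, 1)   # same cluster
across = diffusion_distance(W, 0, 4)   # different clusters
```

Nodes in the same tightly connected cluster end up closer than nodes separated by the weak bridge, which is the property that makes the distance useful for grouping pixels into segments.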

Posted Content
TL;DR: This work proposes a novel feature aggregation network that fully investigates the relations among views, building a hierarchical relational graph embedding network (HRGE-Net) to aggregate the multi-view features extracted from 2D images into a global shape descriptor.
Abstract: The view-based approach, which recognizes a 3D shape through its projected 2D images, has achieved state-of-the-art performance for 3D shape recognition. One essential challenge for the view-based approach is how to aggregate the multi-view features extracted from 2D images into a global 3D shape descriptor. In this work, we propose a novel feature aggregation network by fully investigating the relations among views. We construct a relational graph with multi-view images as nodes, and design relational graph embedding by modeling pairwise and neighboring relations among views. By gradually coarsening the graph, we build a hierarchical relational graph embedding network (HRGE-Net) to aggregate the multi-view features into a global shape descriptor. Extensive experiments show that HRGE-Net achieves state-of-the-art performance for 3D shape classification and retrieval on benchmark datasets.