Showing papers by "Jian Sun" published in 2018


Proceedings ArticleDOI
18 Jun 2018
TL;DR: ShuffleNet uses two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy, and achieves an actual speedup over AlexNet at comparable accuracy.
Abstract: We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet [12] on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13× actual speedup over AlexNet while maintaining comparable accuracy.
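The channel shuffle operation named in the abstract can be illustrated with a small sketch. It assumes the usual formulation, in which the channels are viewed as a (groups, channels-per-group) grid, transposed, and flattened again, so that each group in the next layer receives channels from every group in the previous one. The function name `channel_shuffle` and the use of a plain list of channel labels are illustrative assumptions, not the authors' code.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (illustrative sketch).

    `channels` is a flat list of channel labels laid out group by group.
    The result is the same labels after a reshape to (groups, per_group),
    a transpose to (per_group, groups), and a flatten.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Element (g, i) of the (groups, per_group) view lives at g * per_group + i.
    # Reading the transposed view row by row yields the shuffled order.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Two groups of three channels: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, any two adjacent channels come from different groups, which is what lets information flow between groups in consecutive group convolutions.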

4,503 citations


Book ChapterDOI
08 Sep 2018
TL;DR: ShuffleNet V2 as discussed by the authors proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs, based on a series of controlled experiments, and derives several practical guidelines for efficient network design.
Abstract: Currently, the neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on the other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.

3,393 citations


Posted Content
TL;DR: The cross-dataset generalization results of CrowdHuman dataset demonstrate state-of-the-art performance on previous dataset including Caltech-USA, CityPersons, and Brainwash without bells and whistles.
Abstract: Human detection has witnessed impressive progress in recent years. However, the occlusion issue of detecting human in highly crowded environments is far from solved. To make matters worse, crowd scenarios are still under-represented in current human detection benchmarks. In this paper, we introduce a new dataset, called CrowdHuman, to better evaluate detectors in crowd scenarios. The CrowdHuman dataset is large, rich-annotated and contains high diversity. There are a total of 470K human instances from the train and validation subsets, and ~22.6 persons per image, with various kinds of occlusions in the dataset. Each human instance is annotated with a head bounding-box, human visible-region bounding-box and human full-body bounding-box. Baseline performance of state-of-the-art detection frameworks on CrowdHuman is presented. The cross-dataset generalization results of CrowdHuman dataset demonstrate state-of-the-art performance on previous dataset including Caltech-USA, CityPersons, and Brainwash without bells and whistles. We hope our dataset will serve as a solid baseline and help promote future research in human detection tasks.

386 citations


Book ChapterDOI
Zhenli Zhang1, Xiangyu Zhang, Chao Peng, Xiangyang Xue1, Jian Sun 
08 Sep 2018
TL;DR: A new framework, named ExFuse, is proposed to bridge the gap between low-level and high-level features and significantly improve the segmentation quality, which outperforms the previous state-of-the-art results.
Abstract: Modern semantic segmentation frameworks usually combine low-level and high-level features from pre-trained backbone convolutional models to boost performance. In this paper, we first point out that a simple fusion of low-level and high-level features could be less effective because of the gap in semantic levels and spatial resolution. We find that introducing semantic information into low-level features and high-resolution details into high-level features is more effective for the later fusion. Based on this observation, we propose a new framework, named ExFuse, to bridge the gap between low-level and high-level features thus significantly improve the segmentation quality by 4.0% in total. Furthermore, we evaluate our approach on the challenging PASCAL VOC 2012 segmentation benchmark and achieve 87.9% mean IoU, which outperforms the previous state-of-the-art results.

349 citations


Posted Content
TL;DR: A new framework, named ExFuse, is proposed to bridge the gap between low-level and high-level features and improve the segmentation quality.
Abstract: Modern semantic segmentation frameworks usually combine low-level and high-level features from pre-trained backbone convolutional models to boost performance. In this paper, we first point out that a simple fusion of low-level and high-level features could be less effective because of the gap in semantic levels and spatial resolution. We find that introducing semantic information into low-level features and high-resolution details into high-level features is more effective for the later fusion. Based on this observation, we propose a new framework, named ExFuse, to bridge the gap between low-level and high-level features thus significantly improve the segmentation quality by 4.0% in total. Furthermore, we evaluate our approach on the challenging PASCAL VOC 2012 segmentation benchmark and achieve 87.9% mean IoU, which outperforms the previous state-of-the-art results.

296 citations


Posted Content
TL;DR: State-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on the DetNet (4.8G FLOPs) backbone.
Abstract: Recent CNN based object detectors, no matter one-stage methods like YOLO, SSD, and RetinaNet or two-stage detectors like Faster R-CNN, R-FCN and FPN are usually trying to directly finetune from ImageNet pre-trained models designed for image classification. There has been little work discussing on the backbone feature extractor specifically designed for the object detection. More importantly, there are several differences between the tasks of image classification and object detection. 1. Recent object detectors like FPN and RetinaNet usually involve extra stages against the task of image classification to handle the objects with various scales. 2. Object detection not only needs to recognize the category of the object instances but also spatially locate the position. Large downsampling factor brings large valid receptive field, which is good for image classification but compromises the object location ability. Due to the gap between the image classification and object detection, we propose DetNet in this paper, which is a novel backbone network specifically designed for object detection. Moreover, DetNet includes the extra stages against traditional backbone network for image classification, while maintains high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. The code will be released for the reproduction.

238 citations


Book ChapterDOI
08 Sep 2018
TL;DR: A novel deep learning approach for single image dehazing by learning dark channel and transmission priors and incorporating haze-related prior learning into deep network is proposed.
Abstract: Photos taken in hazy weather are usually covered with white masks and often lose important details. In this paper, we propose a novel deep learning approach for single image dehazing by learning dark channel and transmission priors. First, we build an energy model for dehazing using dark channel and transmission priors and design an iterative optimization algorithm using proximal operators for these two priors. Second, we unfold the iterative algorithm to be a deep network, dubbed as proximal dehaze-net, by learning the proximal operators using convolutional neural networks. Our network combines the advantages of traditional prior-based dehazing methods and deep learning methods by incorporating haze-related prior learning into deep network. Experiments show that our method achieves state-of-the-art performance for single image dehazing.

234 citations


Book ChapterDOI
Zeming Li1, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng1, Jian Sun 
08 Sep 2018
TL;DR: DetNet is proposed, which is a novel backbone network specifically designed for object detection that includes the extra stages against traditional backbone network for image classification, while maintains high spatial resolution in deeper layers.
Abstract: Recent CNN based object detectors, either one-stage methods like YOLO, SSD, and RetinaNet, or two-stage detectors like Faster R-CNN, R-FCN and FPN, are usually trying to directly finetune from ImageNet pre-trained models designed for the task of image classification. However, there has been little work discussing the backbone feature extractor specifically designed for the task of object detection. More importantly, there are several differences between the tasks of image classification and object detection. (i) Recent object detectors like FPN and RetinaNet usually involve extra stages against the task of image classification to handle the objects with various scales. (ii) Object detection not only needs to recognize the category of the object instances but also spatially locate them. Large downsampling factors bring large valid receptive field, which is good for image classification, but compromises the object location ability. Due to the gap between the image classification and object detection, we propose DetNet in this paper, which is a novel backbone network specifically designed for object detection. Moreover, DetNet includes the extra stages against traditional backbone network for image classification, while maintains high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. Codes will be released (https://github.com/zengarden/DetNet).

233 citations


Posted Content
TL;DR: This work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs, derives several practical guidelines for efficient network design, and presents a new architecture called ShuffleNet V2.
Abstract: Currently, the neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on the other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.

157 citations


Journal ArticleDOI
TL;DR: This letter proposes a new convolutional neural network inspired by the classical BM3D algorithm, dubbed BM3D-Net, which unrolls the computational pipeline of BM3D into a convolutional neural network structure, with “extraction” and “aggregation” layers to model the block matching stage in BM3D.
Abstract: Denoising is a fundamental task in image processing with wide applications for enhancing image qualities. BM3D is considered as an effective baseline for image denoising. Although learning-based methods have been dominant in this area recently, the traditional methods are still valuable to inspire new ideas by combining with learning-based approaches. In this letter, we propose a new convolutional neural network inspired by the classical BM3D algorithm, dubbed as BM3D-Net. We unroll the computational pipeline of BM3D algorithm into a convolutional neural network structure, with “extraction” and “aggregation” layers to model block matching stage in BM3D. We apply our network to three denoising tasks: gray-scale image denoising, color image denoising, and depth map denoising. Experiments show that BM3D-Net significantly outperforms the basic BM3D method, and achieves competitive results compared with state of the art on these tasks.

139 citations


Journal ArticleDOI
TL;DR: In this paper, a notion of discrete conformality for hyperbolic polyhedral surfaces is introduced, which is shown to be computable and can be obtained using a discrete Yamabe flow with surgery.
Abstract: A notion of discrete conformality for hyperbolic polyhedral surfaces is introduced in this paper. This discrete conformality is shown to be computable. It is proved that each hyperbolic polyhedral metric on a closed surface is discrete conformal to a unique hyperbolic polyhedral metric with a given discrete curvature satisfying Gauss–Bonnet formula. Furthermore, the hyperbolic polyhedral metric with given curvature can be obtained using a discrete Yamabe flow with surgery. In particular, each hyperbolic polyhedral metric on a closed surface with negative Euler characteristic is discrete conformal to a unique hyperbolic metric.

Book ChapterDOI
20 Sep 2018
TL;DR: In this article, a structure-constrained cycleGAN was proposed for brain MR-to-CT synthesis using unpaired data that defines an extra structure-consistency loss based on the modality independent neighborhood descriptor to constrain structural consistency.
Abstract: The cycleGAN is becoming an influential method in medical image synthesis. However, due to a lack of direct constraints between input and synthetic images, the cycleGAN cannot guarantee structural consistency between these two images, and such consistency is of extreme importance in medical imaging. To overcome this, we propose a structure-constrained cycleGAN for brain MR-to-CT synthesis using unpaired data that defines an extra structure-consistency loss based on the modality independent neighborhood descriptor to constrain structural consistency. Additionally, we use a position-based selection strategy for selecting training images instead of a completely random selection scheme. Experimental results on synthesizing CT images from brain MR images demonstrate that our method is better than the conventional cycleGAN and approximates the cycleGAN trained with paired data.


Posted Content
TL;DR: A structure-constrained cycleGAN is proposed for brain MR-to-CT synthesis using unpaired data that defines an extra structure-consistency loss based on the modality independent neighborhood descriptor to constrain structural consistency.
Abstract: The cycleGAN is becoming an influential method in medical image synthesis. However, due to a lack of direct constraints between input and synthetic images, the cycleGAN cannot guarantee structural consistency between these two images, and such consistency is of extreme importance in medical imaging. To overcome this, we propose a structure-constrained cycleGAN for brain MR-to-CT synthesis using unpaired data that defines an extra structure-consistency loss based on the modality independent neighborhood descriptor to constrain structural consistency. Additionally, we use a position-based selection strategy for selecting training images instead of a completely random selection scheme. Experimental results on synthesizing CT images from brain MR images demonstrate that our method is better than the conventional cycleGAN and approximates the cycleGAN trained with paired data.

Proceedings ArticleDOI
15 May 2018
TL;DR: Experimental results demonstrate that the proposed unsupervised domain adaptation with regularized optimal transport for multimodal 2D+3D Facial Expression Recognition can achieve superior performance compared with the state-of-the-art methods.
Abstract: Since human expressions have strong flexibility and personality, subject-independent facial expression recognition is a typical data bias problem. To address this problem, we propose a novel approach, namely unsupervised domain adaptation with regularized optimal transport for multimodal 2D+3D Facial Expression Recognition (FER). In particular, Wasserstein distance is employed to measure the distribution inconsistency between the training samples (i.e. source domain) and test samples (i.e. target domain). Minimization of this Wasserstein distance is equivalent to finding an optimal transport mapping from training to test samples. Once we find this mapping, original training samples can be transformed into a new space in which the distributions of the mapped training samples and the test samples can be well-aligned. In this case, classifier learned from the transformed training samples can be well generalized to the test samples for expression prediction. In practice, approximate optimal transport can be effectively solved by adding entropy regularization. To fully explore the class label information of training samples, group sparsity regularizer is also used to enforce that the training samples from the same expression class can be mapped to the same group. Experimental results evaluated on the BU-3DFE and Bosphorus databases demonstrate that the proposed approach can achieve superior performance compared with the state-of-the-art methods.
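The remark that approximate optimal transport can be solved by adding entropy regularization can be made concrete with Sinkhorn iterations, the standard solver for that regularized problem. The sketch below is a generic illustration, not the authors' implementation: the function name `sinkhorn`, the toy cost matrix, and the parameters `reg` and `iters` are assumptions, and the group sparsity regularizer described in the abstract is deliberately left out.

```python
import math


def sinkhorn(a, b, cost, reg=0.5, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    a, b  : source and target weights (each sums to 1)
    cost  : cost[i][j] of moving mass from source i to target j
    reg   : entropy regularization strength (hypothetical default)
    Returns a transport plan whose column sums match b exactly and
    whose row sums approach a as the iterations converge.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low cost means large kernel value.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two training samples, two test samples, cheap "diagonal" moves.
plan = sinkhorn([0.5, 0.5], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
```

A smaller `reg` pushes the plan toward the unregularized optimal transport solution, at the price of slower and numerically more delicate iterations.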

Journal ArticleDOI
TL;DR: The proposed novel multi‐atlas segmentation method, dubbed deep fusion net (DFN), is a deep architecture that integrates a feature extraction subnet and a non‐local patch‐based label fusion (NL‐PLF) subnet in a single network.

Book ChapterDOI
08 Sep 2018
TL;DR: This paper proposes a method, called GridFace, to reduce facial geometric variations and improve the recognition performance, which rectifies the face by local homography transformations, which are estimated by a face rectification network.
Abstract: In this paper, we propose a method, called GridFace, to reduce facial geometric variations and improve the recognition performance. Our method rectifies the face by local homography transformations, which are estimated by a face rectification network. To encourage the image generation with canonical views, we apply a regularization based on the natural face distribution. We learn the rectification network and recognition network in an end-to-end manner. Extensive experiments show our method greatly reduces geometric variations, and gains significant improvements in unconstrained face recognition scenarios.

Journal ArticleDOI
Jian Sun1, Ting Liu1, Y. Yan, K. Huo1, Wanggang Zhang1, Hongli Liu1, Zhongqi Shi1 
TL;DR: The hypothesis that elastin exposure might serve as an antigen to initiate the stimulation of CD4 + Th1‐CXCR3 immune inflammation pathway is confirmed and the CD4+Th1‐specific conversion and activation may be an initiator of COPD immune inflammatory response.
Abstract: CD4 + Th1-CXCR3 signalling pathway may play a key role in chronic obstructive pulmonary disease (COPD). The aim of this study was to explore Th1/Th2 cytokines ratio differences in patients in different stages of COPD and to confirm the hypothesis that elastin exposure might serve as an antigen to initiate the stimulation of CD4 + Th1-CXCR3 immune inflammation pathway. Patients of COPD in different stages and normal individuals were enrolled. Ten millilitres of peripheral blood was drawn from patients. The concentration of CXCR3, IFN-γ, IL-2, IL-4 and IL-13 in plasma was detected by ELISA. The Naive CD4+ T cells were isolated from the peripheral blood mononuclear cells, which were stimulated by elastin and collagen before determining the level of IFN-γ secretion by ELISPOT. Compared with control group, the concentration of CXCR3 in the acute exacerbation COPD (AECOPD) group was higher (P < .05). The concentration of IFN-γ and IL-2 in AECOPD group was lower than that in remission (P < .05). The concentration of IFN-γ in the AECOPD and remission was higher than that in controls (P < .05), while IL-2 was opposite (P < .01). The concentration of IL-4 and IL-13 in AECOPD group was higher than that in the controls (P < .05). The CD4+ Th1 cells stimulated by the elastin as antigen secreted more IFN-γ than that by collagen (P < .01). CXCR3 was highly expressed in patients with COPD. There were different Th1/Th2 cytokines in different stages of COPD. The CD4+Th1-specific conversion and activation may be an initiator of COPD immune inflammatory response.

Proceedings ArticleDOI
09 Mar 2018
TL;DR: This work presents a low-dose CT image reconstruction strategy driven by a deep dual network (LdCT-Net) to yield high-quality CT images by incorporating both projection information and image information simultaneously.
Abstract: High radiation dose in CT imaging is a major concern, which could result in increased lifetime risk of cancers. Therefore, to reduce the radiation dose at the same time maintaining clinically acceptable CT image quality is desirable in CT application. One of the most successful strategies is to apply statistical iterative reconstruction (SIR) to obtain promising CT images at low dose. Although the SIR algorithms are effective, they usually have three disadvantages: 1) desired-image prior design; 2) optimal parameters selection; and 3) high computation burden. To address these three issues, in this work, inspired by the deep learning network for inverse problem, we present a low-dose CT image reconstruction strategy driven by a deep dual network (LdCT-Net) to yield high-quality CT images by incorporating both projection information and image information simultaneously. Specifically, the present LdCT-Net effectively reconstructs CT images by adequately taking into account the information learned in dual-domain, i.e., projection domain and image domain, simultaneously. The experiment results on patients data demonstrated the present LdCT-Net can achieve promising gains over other existing algorithms in terms of noise-induced artifacts suppression and edge details preservation.

Journal ArticleDOI
TL;DR: A surface reconstruction method that has excellent performance despite nonuniformly distributed, noisy, and sparse data is introduced and can be parallelized with small overhead and shows compelling performance in a GPU version by implementing this direct and simple approach.
Abstract: In this article, we introduce a surface reconstruction method that has excellent performance despite nonuniformly distributed, noisy, and sparse data. We reconstruct the surface by estimating an implicit function and then obtain a triangle mesh by extracting an iso-surface. Our implicit function takes advantage of both the indicator function and the signed distance function. The implicit function is dominated by the indicator function at the regions away from the surface and is approximated (up to scaling) by the signed distance function near the surface. On one hand, the implicit function is well defined over the entire space for the extracted iso-surface to remain near the underlying true surface. On the other hand, a smooth iso-surface can be extracted using the marching cubes algorithm with simple linear interpolations due to the properties of the signed distance function. Moreover, our implicit function can be estimated directly from an explicit integral formula without solving any linear system. An approach called disk integration is also incorporated to improve the accuracy of the implicit function. Our method can be parallelized with small overhead and shows compelling performance in a GPU version by implementing this direct and simple approach. We apply our method to synthetic and real-world scanned data to demonstrate the accuracy, noise resilience, and efficiency of this method. The performance of the proposed method is also compared with several state-of-the-art methods.

Journal ArticleDOI
TL;DR: The harmonic extension problem is considered, which is widely used in many applications of machine learning, and is formulated as solving a Laplace–Beltrami equation.
Abstract: In this paper, we consider the harmonic extension problem, which is widely used in many applications of machine learning. We formulate the harmonic extension as solving a Laplace–Beltrami equation...

Journal ArticleDOI
Huibin Li1, Yibao Li1, Ruixuan Yu1, Jian Sun1, Junseok Kim2 
TL;DR: A novel, efficient and fast method using l0 gradient minimization is proposed; it directly measures the sparsity of a solution, produces sharper surfaces, and is particularly effective for sharpening major edges and removing noise.

Posted Content
TL;DR: A new optimizer, dubbed as HyperAdam, is proposed that combines the idea of "learning to optimize" and traditional Adam optimizer and is justified to be state-of-the-art for various network training, such as multilayer perceptron, CNN and LSTM.
Abstract: Deep neural networks are traditionally trained using human-designed stochastic optimization algorithms, such as SGD and Adam. Recently, the approach of learning to optimize network parameters has emerged as a promising research topic. However, these learned black-box optimizers sometimes do not fully utilize the experience in human-designed optimizers, therefore have limitation in generalization ability. In this paper, a new optimizer, dubbed as HyperAdam, is proposed that combines the idea of "learning to optimize" and traditional Adam optimizer. Given a network for training, its parameter update in each iteration generated by HyperAdam is an adaptive combination of multiple updates generated by Adam with varying decay rates. The combination weights and decay rates in HyperAdam are adaptively learned depending on the task. HyperAdam is modeled as a recurrent neural network with AdamCell, WeightCell and StateCell. It is justified to be state-of-the-art for various network training, such as multilayer perceptron, CNN and LSTM.
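The idea of forming one parameter update as a weighted combination of several Adam updates with different decay rates can be sketched for a single scalar parameter. This is only an illustration of that combination step: in HyperAdam the weights and decay rates are produced by a learned recurrent network, whereas here they are fixed by hand. The names `AdamState` and `combined_update`, and all numeric values, are hypothetical.

```python
import math


class AdamState:
    """Running moments for one Adam variant with its own decay rates."""

    def __init__(self, beta1, beta2):
        self.beta1, self.beta2 = beta1, beta2
        self.m = 0.0  # first-moment estimate
        self.v = 0.0  # second-moment estimate
        self.t = 0    # step counter for bias correction

    def direction(self, grad, eps=1e-8):
        """Return the bias-corrected Adam direction for this gradient."""
        self.t += 1
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad * grad
        m_hat = self.m / (1.0 - self.beta1 ** self.t)
        v_hat = self.v / (1.0 - self.beta2 ** self.t)
        return m_hat / (math.sqrt(v_hat) + eps)


def combined_update(states, weights, grad, lr=0.01):
    """Weighted sum of the candidate Adam directions, scaled by -lr.

    The weights are fixed here; a learned optimizer would predict them.
    """
    directions = [s.direction(grad) for s in states]
    return -lr * sum(w * d for w, d in zip(weights, directions))


# Three Adam variants with different (beta1, beta2) decay rates.
states = [AdamState(0.5, 0.9), AdamState(0.9, 0.99), AdamState(0.9, 0.999)]
update = combined_update(states, [0.2, 0.3, 0.5], grad=2.0)
```

On the first step every bias-corrected direction is close to the sign of the gradient, so with weights that sum to 1 the combined step is approximately `-lr`; the variants only diverge once their moment estimates accumulate different histories.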

Posted Content
TL;DR: This paper proposes a method, called GridFace, to reduce facial geometric variations and improve the recognition performance, which rectifies the face by local homography transformations, which are estimated by a face rectification network.
Abstract: In this paper, we propose a method, called GridFace, to reduce facial geometric variations and improve the recognition performance. Our method rectifies the face by local homography transformations, which are estimated by a face rectification network. To encourage the image generation with canonical views, we apply a regularization based on the natural face distribution. We learn the rectification network and recognition network in an end-to-end manner. Extensive experiments show our method greatly reduces geometric variations, and gains significant improvements in unconstrained face recognition scenarios.

Journal ArticleDOI
TL;DR: Extensive experimental results demonstrate that the proposed regularizer is systematically superior over other competing local and nonlocal regularization approaches, both quantitatively and visually.

Book ChapterDOI
08 Sep 2018
TL;DR: This work proposes a novel spectral transform network on 3D surface to learn shape descriptors that achieved the highest accuracies on SHREC’14, 15 datasets as well as the “range” subset of SHREC'17 dataset.
Abstract: Designing a network on 3D surface for non-rigid shape analysis is a challenging task. In this work, we propose a novel spectral transform network on 3D surface to learn shape descriptors. The proposed network architecture consists of four stages: raw descriptor extraction, surface second-order pooling, mixture of power function-based spectral transform, and metric learning. The proposed network is simple and shallow. Quantitative experiments on challenging benchmarks show its effectiveness for non-rigid shape retrieval and classification, e.g., it achieved the highest accuracies on SHREC’14, 15 datasets as well as the “range” subset of SHREC’17 dataset.

Posted Content
TL;DR: This work proposes a spectral transform network on 3D surface to learn shape descriptors, which achieved the highest accuracies on SHREC14, 15 datasets as well as the Range subset of SHREC17 dataset.
Abstract: Designing a network on 3D surface for non-rigid shape analysis is a challenging task. In this work, we propose a novel spectral transform network on 3D surface to learn shape descriptors. The proposed network architecture consists of four stages: raw descriptor extraction, surface second-order pooling, mixture of power function-based spectral transform, and metric learning. The proposed network is simple and shallow. Quantitative experiments on challenging benchmarks show its effectiveness for non-rigid shape retrieval and classification, e.g., it achieved the highest accuracies on SHREC14, 15 datasets as well as the Range subset of SHREC17 dataset.