
Showing papers by "Jian Sun" published in 2020


Journal ArticleDOI
TL;DR: Two versions of a novel deep learning architecture, dubbed ADMM-CSNet, are proposed by combining the traditional model-based CS method and the data-driven deep learning method for image reconstruction from sparsely sampled measurements; they achieve favorable reconstruction accuracy at fast computational speed compared with traditional and other deep learning methods.
Abstract: Compressive sensing (CS) is an effective technique for reconstructing an image from a small amount of sampled data. It has been widely applied in medical imaging, remote sensing, image compression, etc. In this paper, we propose two versions of a novel deep learning architecture, dubbed ADMM-CSNet, by combining the traditional model-based CS method and the data-driven deep learning method for image reconstruction from sparsely sampled measurements. We first consider a generalized CS model for image reconstruction with undetermined regularizations in undetermined transform domains, and then propose two efficient solvers based on the Alternating Direction Method of Multipliers (ADMM) algorithm for optimizing the model. We further unroll and generalize the ADMM algorithm into two deep architectures, in which all parameters of the CS model and the ADMM algorithm are discriminatively learned by end-to-end training. For both fast CS complex-valued MR imaging and CS imaging of real-valued natural images, the proposed ADMM-CSNet achieves favorable reconstruction accuracy at fast computational speed compared with traditional and other deep learning methods.

470 citations
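
For a concrete picture of the unrolling idea described above, here is a minimal, hypothetical PyTorch sketch: each ADMM iteration becomes a network stage with learnable penalty and threshold parameters. The gradient-step x-update, the soft-threshold proximal operator, the fixed step size, and the stage count are illustrative stand-ins, not the authors' exact ADMM-CSNet design.

```python
import torch
import torch.nn as nn

class ADMMStage(nn.Module):
    def __init__(self):
        super().__init__()
        self.rho = nn.Parameter(torch.tensor(1.0))    # learned penalty weight
        self.theta = nn.Parameter(torch.tensor(0.1))  # learned threshold

    def forward(self, x, z, u, A, y):
        # x-update: gradient step on 0.5*||Ax - y||^2 + 0.5*rho*||x - z + u||^2
        grad = A.t() @ (A @ x - y) + self.rho * (x - z + u)
        x = x - 0.1 * grad
        # z-update: learned soft-thresholding as the proximal operator
        v = x + u
        z = torch.sign(v) * torch.clamp(v.abs() - self.theta, min=0.0)
        # dual variable update
        u = u + x - z
        return x, z, u

class UnrolledADMM(nn.Module):
    def __init__(self, n_stages=9):
        super().__init__()
        self.stages = nn.ModuleList(ADMMStage() for _ in range(n_stages))

    def forward(self, A, y):
        x = A.t() @ y                          # zero-filled initialization
        z, u = x.clone(), torch.zeros_like(x)
        for stage in self.stages:              # unrolled, end-to-end trainable
            x, z, u = stage(x, z, u, A, y)
        return x
```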


Proceedings ArticleDOI
14 Jun 2020
TL;DR: PVN3D proposes a deep Hough voting network to detect 3D keypoints of objects and then estimates the 6D pose parameters in a least-squares fitting manner.
Abstract: In this work, we present a novel data-driven method for robust 6DoF object pose estimation from a single RGBD image. Unlike previous methods that directly regress pose parameters, we tackle this challenging task with a keypoint-based approach. Specifically, we propose a deep Hough voting network to detect 3D keypoints of objects and then estimate the 6D pose parameters in a least-squares fitting manner. Our method is a natural extension of 2D-keypoint approaches that successfully work on RGB-based 6DoF estimation. It allows us to fully utilize the geometric constraint of rigid objects with the extra depth information and is easy for a network to learn and optimize. Extensive experiments were conducted to demonstrate the effectiveness of 3D-keypoint detection in the 6D pose estimation task. Experimental results also show our method outperforms the state-of-the-art methods by large margins on several benchmarks. Code and video are available at https://github.com/ethnhe/PVN3D.git.

323 citations
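
The least-squares fitting step mentioned above has a classic closed-form solution. A hedged NumPy sketch (the function name and inputs are illustrative, not PVN3D's actual API): given detected keypoints in the camera frame and their model-frame counterparts, the optimal rotation and translation follow from an SVD (the Kabsch solution).

```python
import numpy as np

def fit_pose(model_kps, pred_kps):
    """Solve min_{R,t} sum_i ||R @ model_kps[i] + t - pred_kps[i]||^2.

    model_kps, pred_kps: (N, 3) arrays of corresponding 3D keypoints.
    """
    mu_m = model_kps.mean(axis=0)
    mu_p = pred_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (pred_kps - mu_p)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_p - R @ mu_m
    return R, t
```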


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A novel view-based Graph Convolutional Neural Network, dubbed view-GCN, to recognize 3D shape based on graph representation of multiple views in flexible view configurations; it is a hierarchical network based on local and non-local graph convolution for feature transform, and selective view-sampling for graph coarsening.
Abstract: The view-based approach, which recognizes a 3D shape through its projected 2D images, has achieved state-of-the-art results for 3D shape recognition. The major challenge for this approach is how to aggregate multi-view features into a global shape descriptor. In this work, we propose a novel view-based Graph Convolutional Neural Network, dubbed view-GCN, to recognize 3D shape based on graph representation of multiple views in flexible view configurations. We first construct a view-graph with multiple views as graph nodes, then design a graph convolutional neural network over the view-graph to hierarchically learn a discriminative shape descriptor considering relations of multiple views. View-GCN is a hierarchical network based on local and non-local graph convolution for feature transform, and selective view-sampling for graph coarsening. Extensive experiments on benchmark datasets show that view-GCN achieves state-of-the-art results for 3D shape classification and retrieval.

162 citations


Posted Content
TL;DR: An anchor-free object detector with a fully differentiable label assignment strategy, named AutoAssign, that automatically determines positive/negative samples by generating positive and negative weight maps to modify each location's prediction dynamically.
Abstract: Determining positive/negative samples for object detection is known as label assignment. Here we present an anchor-free detector named AutoAssign. It requires little human knowledge and achieves appearance-aware label assignment through a fully differentiable weighting mechanism. During training, to both satisfy the prior distribution of the data and adapt to category characteristics, we present Center Weighting to adjust the category-specific prior distributions. To adapt to object appearances, Confidence Weighting is proposed to adjust the specific assignment strategy of each instance. The two weighting modules are then combined to generate positive and negative weights to adjust each location's confidence. Extensive experiments on MS COCO show that our method steadily surpasses other best sampling strategies by large margins with various backbones. Moreover, our best model achieves 52.1% AP, outperforming all existing one-stage detectors. Besides, experiments on other datasets, e.g., PASCAL VOC, Objects365, and WiderFace, demonstrate the broad applicability of AutoAssign.

129 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A novel adversarial DA approach completely defined in spherical feature space is proposed, in which a spherical classifier for label prediction and a spherical domain discriminator for discriminating domain labels are defined, and a robust pseudo-label loss is developed.
Abstract: Adversarial domain adaptation (DA) has been an effective approach for learning domain-invariant features by adversarial training. In this paper, we propose a novel adversarial DA approach completely defined in spherical feature space, in which we define a spherical classifier for label prediction and a spherical domain discriminator for discriminating domain labels. To utilize pseudo-labels robustly, we develop a robust pseudo-label loss in the spherical feature space, which weights the importance of estimated labels of target data by the posterior probability of correct labeling, modeled by a Gaussian-uniform mixture model in spherical feature space. Extensive experiments show that our method achieves state-of-the-art results, and also confirm the effectiveness of the spherical classifier, the spherical discriminator, and the spherical robust pseudo-label loss.

99 citations
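
To illustrate the robust pseudo-label weighting above, here is a minimal sketch of a Gaussian-uniform mixture posterior over 1D (angular) distances to a class center. The mixture parameters below are illustrative placeholders; the paper would instead estimate them (e.g., by EM) in the spherical feature space before weighting each target sample's loss.

```python
import numpy as np

def correctness_posterior(d, pi=0.7, mu=0.2, sigma=0.1, support=np.pi):
    """Posterior prob. that distance d comes from the Gaussian 'correct' component.

    d: array of angular distances; pi: mixing weight of the Gaussian component;
    support: length of the uniform component's support (angles in [0, pi]).
    """
    gauss = np.exp(-0.5 * ((d - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    unif = 1.0 / support
    return pi * gauss / (pi * gauss + (1 - pi) * unif)
```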


Journal ArticleDOI
TL;DR: This paper proposes a structure-constrained cycleGAN for unsupervised MR-to-CT synthesis by defining an extra structure-consistency loss based on the modality independent neighborhood descriptor, and utilizes a spectral normalization technique to stabilize the training process and a self-attention module to model the long-range spatial dependencies in the synthetic images.
Abstract: Synthesizing a CT image from an available MR image has recently emerged as a key goal in radiotherapy treatment planning for cancer patients. CycleGANs have achieved promising results on unsupervised MR-to-CT image synthesis; however, because they have no direct constraints between input and synthetic images, cycleGANs do not guarantee structural consistency between these two images. This means that anatomical geometry can be shifted in the synthetic CT images, clearly a highly undesirable outcome in the given application. In this paper, we propose a structure-constrained cycleGAN for unsupervised MR-to-CT synthesis by defining an extra structure-consistency loss based on the modality independent neighborhood descriptor. We also utilize a spectral normalization technique to stabilize the training process and a self-attention module to model the long-range spatial dependencies in the synthetic images. Results on unpaired brain and abdomen MR-to-CT image synthesis show that our method produces better synthetic CT images in both accuracy and visual quality as compared to other unsupervised synthesis methods. We also show that an approximate affine pre-registration for unpaired training data can improve synthesis results.

74 citations


Book ChapterDOI
23 Aug 2020
TL;DR: An unsupervised deep homography method with a new architecture design is proposed, which learns an outlier mask to select only reliable regions for homography estimation.
Abstract: Homography estimation is a basic image alignment method in many applications. It is usually conducted by extracting and matching sparse feature points, which are error-prone in low-light and low-texture images. On the other hand, previous deep homography approaches use either synthetic images for supervised learning or aerial images for unsupervised learning, both ignoring the importance of handling depth disparities and moving objects in real-world applications. To overcome these problems, in this work we propose an unsupervised deep homography method with a new architecture design. In the spirit of the RANSAC procedure in traditional methods, we specifically learn an outlier mask to select only reliable regions for homography estimation. We calculate loss with respect to our learned deep features instead of directly comparing image content as was done previously. To achieve unsupervised training, we also formulate a novel triplet loss customized for our network. We verify our method by conducting comprehensive comparisons on a new dataset that covers a wide range of scenes with varying degrees of difficulty for the task. Experimental results reveal that our method outperforms the state of the art, including deep solutions and feature-based solutions.

62 citations


Posted Content
TL;DR: This paper proposes a novel self-supervised representation learning method, Self-EMD, for object detection, trained directly on unlabeled non-iconic image datasets like COCO instead of the commonly used iconic-object image datasets like ImageNet.
Abstract: In this paper, we propose a novel self-supervised representation learning method, Self-EMD, for object detection. Our method is trained directly on unlabeled non-iconic image datasets like COCO, instead of the commonly used iconic-object image datasets like ImageNet. We keep the convolutional feature maps as the image embedding to preserve spatial structures and adopt Earth Mover's Distance (EMD) to compute the similarity between two embeddings. Our Faster R-CNN (ResNet50-FPN) baseline achieves 39.8% mAP on COCO, which is on par with state-of-the-art self-supervised methods pre-trained on ImageNet. More importantly, it can be further improved to 40.4% mAP with more unlabeled images, showing its great potential for leveraging more easily obtained unlabeled data. Code will be made available.

58 citations
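
As a concrete illustration of the similarity above: with uniform weights over spatial locations and equal set sizes, the Earth Mover's Distance between two sets of local features reduces to an optimal assignment problem. A hedged sketch under those assumptions, not Self-EMD's actual training code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_similarity(f1, f2):
    """f1, f2: (N, C) L2-normalized local features from two feature maps."""
    cost = 1.0 - f1 @ f2.T                   # pairwise cosine-distance matrix
    rows, cols = linear_sum_assignment(cost) # min-cost perfect matching
    return 1.0 - cost[rows, cols].mean()     # higher = more similar
```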


Book ChapterDOI
23 Aug 2020
TL;DR: A conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition.
Abstract: We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are \(y=\max(x,0)\) and \(y=\max(x,px)\), respectively, while FReLU is in the form of \(y=\max(x, \mathbb{T}(x))\), where \(\mathbb{T}(\cdot)\) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in the visual recognition tasks. Code is available at https://github.com/megvii-model/FunnelAct.

43 citations
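
Since the activation is fully specified by the formula, a short PyTorch sketch is easy to give: the funnel condition T(x) is realized as a depthwise convolution followed by batch normalization, per the paper's description; the exact module layout here is an assumption based on that description.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """y = max(x, T(x)), where T is a depthwise conv + batch norm."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.cond = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.max(x, self.cond(x))   # element-wise max with T(x)
```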


Posted Content
TL;DR: The WeightNet, composed entirely of (grouped) fully-connected layers, is used to directly output the convolutional weight, and outperforms existing approaches on both ImageNet and COCO detection tasks, achieving better Accuracy-FLOPs and Accuracy-Parameter trade-offs.
Abstract: We present a conceptually simple, flexible, and effective framework for weight-generating networks. Our approach is general in that it unifies two current distinct and extremely effective methods, SENet and CondConv, into the same framework on the weight space. The method, called WeightNet, generalizes the two methods by simply adding one more grouped fully-connected layer to the attention activation layer. We use the WeightNet, composed entirely of (grouped) fully-connected layers, to directly output the convolutional weight. WeightNet is easy and memory-conserving to train, on the kernel space instead of the feature space. Because of this flexibility, our method outperforms existing approaches on both ImageNet and COCO detection tasks, achieving better Accuracy-FLOPs and Accuracy-Parameter trade-offs. The framework on the flexible weight space has the potential to further improve the performance. Code is available at this https URL.

39 citations
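
A hedged sketch of the weight-generating idea (a simplification, not the paper's exact layer configuration): a pooled descriptor passes through a small fully-connected stack whose output is reshaped into a per-sample convolutional kernel; the grouped-convolution trick then applies the dynamic convolution across the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleWeightNetConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, reduction=16):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.attn = nn.Sequential(nn.Linear(in_ch, in_ch // reduction),
                                  nn.Sigmoid())
        self.gen = nn.Linear(in_ch // reduction, out_ch * in_ch * k * k)

    def forward(self, x):                       # x: (B, C, H, W)
        b = x.size(0)
        g = x.mean(dim=(2, 3))                  # global average pooling
        w = self.gen(self.attn(g))              # per-sample kernel weights
        w = w.view(b * self.out_ch, self.in_ch, self.k, self.k)
        # batched dynamic convolution: fold the batch into conv groups
        out = F.conv2d(x.view(1, b * self.in_ch, *x.shape[2:]),
                       w, padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, *x.shape[2:])
```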


Posted Content
26 Apr 2020
TL;DR: Stitcher is presented, a feedback-driven data provider, which aims to train object detectors in a balanced way, and steadily improves performance by a large margin in all settings, especially for small objects, with nearly no additional computation in both training and testing stages.
Abstract: Object detectors commonly vary in quality across scales, and performance on small objects is the least satisfying. In this paper, we investigate this phenomenon and discover that in the majority of training iterations, small objects contribute barely to the total loss, causing poor performance due to imbalanced optimization. Inspired by this finding, we present Stitcher, a feedback-driven data provider, which aims to train object detectors in a balanced way. In Stitcher, images are resized into smaller components and then stitched together to the same size as regular images. Stitched images inevitably contain smaller objects, which plays to our core idea: exploiting the loss statistics as feedback to guide the next-iteration update. Experiments have been conducted on various detectors, backbones, training periods, datasets, and even on instance segmentation. Stitcher steadily improves performance by a large margin in all settings, especially for small objects, with nearly no additional computation in both training and testing stages.
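
The stitching operation itself is simple; a minimal sketch assuming a 2x2 layout and an even output size (ground-truth boxes would be rescaled with the same factors):

```python
import torch
import torch.nn.functional as F

def stitch4(images, size):
    """images: list of 4 tensors (C, H, W); returns one (C, size, size).

    Assumes `size` is even so the 2x2 tiles fit exactly.
    """
    h, w = size // 2, size // 2
    tiles = [F.interpolate(img.unsqueeze(0), size=(h, w), mode='bilinear',
                           align_corners=False).squeeze(0) for img in images]
    top = torch.cat(tiles[:2], dim=2)       # concatenate along width
    bottom = torch.cat(tiles[2:], dim=2)
    return torch.cat([top, bottom], dim=1)  # concatenate along height
```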

Book ChapterDOI
23 Aug 2020
TL;DR: WeightNet as discussed by the authors unifies SENet and CondConv into the same framework on weight space, which is easy and memory-conserving to train, on the kernel space instead of the feature space.
Abstract: We present a conceptually simple, flexible, and effective framework for weight-generating networks. Our approach is general in that it unifies two current distinct and extremely effective methods, SENet and CondConv, into the same framework on the weight space. The method, called WeightNet, generalizes the two methods by simply adding one more grouped fully-connected layer to the attention activation layer. We use the WeightNet, composed entirely of (grouped) fully-connected layers, to directly output the convolutional weight. WeightNet is easy and memory-conserving to train, on the kernel space instead of the feature space. Because of this flexibility, our method outperforms existing approaches on both ImageNet and COCO detection tasks, achieving better Accuracy-FLOPs and Accuracy-Parameter trade-offs. The framework on the flexible weight space has the potential to further improve the performance. Code is available at https://github.com/megvii-model/WeightNet.

Posted Content
TL;DR: This work designs a self-guided upsample module to tackle the interpolation blur problem caused by bilinear upsampling between pyramid levels, and proposes a pyramid distillation loss to add supervision for intermediate levels via distilling the finest flow as pseudo labels.
Abstract: We present an unsupervised learning approach for optical flow estimation by improving the upsampling and learning of pyramid networks. We design a self-guided upsample module to tackle the interpolation blur problem caused by bilinear upsampling between pyramid levels. Moreover, we propose a pyramid distillation loss to add supervision for intermediate levels by distilling the finest flow as pseudo labels. By integrating these two components, our method achieves the best performance for unsupervised optical flow learning on multiple leading benchmarks, including MPI-Sintel, KITTI 2012, and KITTI 2015. In particular, we achieve EPE = 1.4 on KITTI 2012 and F1 = 9.38% on KITTI 2015, which outperform the previous state-of-the-art methods by 22.2% and 15.7%, respectively.
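
The pyramid distillation loss can be sketched in a few lines: the finest predicted flow is down-sampled, its vectors rescaled by the resolution ratio, and used as a stop-gradient pseudo label for an intermediate level. The L1 loss and the equal-aspect-ratio assumption are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pyramid_distillation_loss(finest_flow, mid_flow):
    """finest_flow: (B, 2, H, W); mid_flow: (B, 2, h, w) with h < H, w < W.

    Assumes H/h == W/w so a single scale factor applies to both flow channels.
    """
    scale = mid_flow.size(-1) / finest_flow.size(-1)
    pseudo = F.interpolate(finest_flow, size=mid_flow.shape[-2:],
                           mode='bilinear', align_corners=False) * scale
    return F.l1_loss(mid_flow, pseudo.detach())   # no gradient to the target
```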

Book ChapterDOI
23 Aug 2020
TL;DR: This work proposes to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels to allow extremely efficient parameter sharing across the image, and leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods.
Abstract: Non-blind image deblurring is typically formulated as a linear least-squares problem regularized by natural priors on the corresponding sharp picture’s gradients, which can be solved, for example, using a half-quadratic splitting method with Richardson fixed-point iterations for its least-squares updates and a proximal operator for the auxiliary variable updates. We propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels. Using convolutions instead of a generic linear preconditioner allows extremely efficient parameter sharing across the image, and leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods. More importantly, the proposed architecture is easily adapted to learning both the preconditioner and the proximal operator using CNN embeddings. This yields a simple and efficient algorithm for non-blind image deblurring which is fully interpretable, can be learned end to end, and whose accuracy matches or exceeds the state of the art, quite significantly, in the non-uniform case.
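
A hedged numerical sketch of the preconditioned Richardson scheme for the least-squares step: iterate x <- x + p * (b - Ax), where A is applied via convolutions with the blur kernel k and the prior filter g, and p is a small approximate-inverse filter. All filters below are illustrative stand-ins; the paper additionally learns the preconditioner and the proximal operator with CNN embeddings.

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_deblur(y, k, g, p, lam=0.01, n_iter=30):
    """Solve (K^T K + lam * G^T G) x = K^T y by preconditioned Richardson."""
    conv = lambda z, f: fftconvolve(z, f, mode='same')
    corr = lambda z, f: fftconvolve(z, f[::-1, ::-1], mode='same')  # adjoint
    A = lambda x: corr(conv(x, k), k) + lam * corr(conv(x, g), g)
    b = corr(y, k)                       # K^T y
    x = y.copy()
    for _ in range(n_iter):
        x = x + conv(b - A(x), p)        # convolutional preconditioner p
    return x
```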

Posted Content
TL;DR: In this paper, the authors propose a rotation-invariant deep network for point cloud analysis, which incorporates positional information of points as inputs while yielding rotation invariance; the network is hierarchical and relies on two modules: a positional feature embedding block and a relational feature embedding block.
Abstract: In this paper we propose a rotation-invariant deep network for point cloud analysis. Point-based deep networks are commonly designed to recognize roughly aligned 3D shapes based on point coordinates, but suffer from performance drops under shape rotations. Some geometric features, e.g., distances and angles of points as inputs of the network, are rotation-invariant but lose positional information of points. In this work, we propose a novel deep network for point clouds that incorporates positional information of points as inputs while yielding rotation invariance. The network is hierarchical and relies on two modules: a positional feature embedding block and a relational feature embedding block. Both modules and the whole network are proven to be rotation-invariant when processing point clouds as input. Experiments show state-of-the-art classification and segmentation performance on benchmark datasets, and ablation studies demonstrate the effectiveness of the network design.

Book ChapterDOI
23 Aug 2020
TL;DR: A novel deep network for point cloud analysis is proposed that incorporates positional information of points as inputs while yielding rotation invariance; experiments show state-of-the-art classification and segmentation performance on benchmark datasets.
Abstract: In this paper we propose a rotation-invariant deep network for point cloud analysis. Point-based deep networks are commonly designed to recognize roughly aligned 3D shapes based on point coordinates, but suffer from performance drops under shape rotations. Some geometric features, e.g., distances and angles of points as inputs of the network, are rotation-invariant but lose positional information of points. In this work, we propose a novel deep network for point clouds that incorporates positional information of points as inputs while yielding rotation invariance. The network is hierarchical and relies on two modules: a positional feature embedding block and a relational feature embedding block. Both modules and the whole network are proven to be rotation-invariant when processing point clouds as input. Experiments show state-of-the-art classification and segmentation performance on benchmark datasets, and ablation studies demonstrate the effectiveness of the network design.

Posted Content
15 Jun 2020
TL;DR: It is rigorously proved that the angular update is only determined by pre-defined hyper-parameters, and can perfectly match the empirical observation in complex and large scale computer vision tasks like ImageNet and COCO with standard training schemes.
Abstract: We comprehensively reveal the learning dynamics of deep neural networks (DNN) with batch normalization (BN) and weight decay (WD), named Spherical Motion Dynamics (SMD). Our theorem on SMD is based on the scale-invariance property of weights caused by BN and the regularization effect of WD. SMD shows that the optimization trajectory of the weights resembles a spherical motion, and a new indicator, the angular update, is proposed to measure the update efficiency of a DNN with BN and WD. We rigorously prove that the angular update is determined only by pre-defined hyper-parameters (i.e., learning rate, WD parameter, and momentum coefficient), and provide their quantitative relationship. Most importantly, the quantitative result of SMD perfectly matches the empirical observations in complex and large-scale computer vision tasks like ImageNet and COCO with standard training schemes. SMD can also yield reasonable interpretations of some phenomena about BN from an entirely new perspective, including avoidance of vanishing and exploding gradients, no risk of being trapped in sharp minima, and the sudden drop of loss when shrinking the learning rate. Further, to present the practical significance of SMD, we discuss the connection between SMD and the commonly used learning rate tuning scheme: the Linear Scaling Principle.
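
The angular update indicator is straightforward to measure during training; a minimal sketch (the function name and usage are illustrative): record a layer's weight tensor before and after an optimizer step and compute the angle between the two.

```python
import torch

def angular_update(w_prev, w_curr):
    """Angle (radians) between a weight tensor before/after one update."""
    cos = torch.dot(w_prev.flatten(), w_curr.flatten()) / (
        w_prev.norm() * w_curr.norm())
    return torch.acos(cos.clamp(-1.0, 1.0))
```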

Journal ArticleDOI
TL;DR: This paper tries to identify diverse abnormalities from a retinal test image by finely learning its individualized retinal background (IRB) on which retinal lesions superimpose, using a multi-scale sparse coding based learning (MSSCL) algorithm and a repeated learning strategy.

Book ChapterDOI
Yan Yang, Na Wang, Heran Yang, Jian Sun, Zongben Xu
04 Oct 2020
TL;DR: A model-driven deep attention network, dubbed MD-DAN, is proposed to reconstruct a highly under-sampled long-time sequence MR image with the guidance of a certain short-time sequence MR image to accelerate MRI.
Abstract: Speeding up Magnetic Resonance Imaging (MRI) is an inevitable task in capturing multi-contrast MR images for medical diagnosis. In MRI, some sequences, e.g., in T2 weighted imaging, require long scanning time, while T1 weighted images are captured by short-time sequences. To accelerate MRI, in this paper, we propose a model-driven deep attention network, dubbed MD-DAN, to reconstruct a highly under-sampled long-time sequence MR image with the guidance of a certain short-time sequence MR image. MD-DAN is a novel deep architecture inspired by the iterative algorithm optimizing a novel MRI reconstruction model regularized by a cross-contrast prior using a guided contrast image. The network is designed to automatically learn the cross-contrast prior by learning the corresponding proximal operator. The backbone network modeling the proximal operator is designed as a dual-path convolutional network with channel and spatial attention modules. Experimental results on a brain MRI dataset substantiate the superiority of our method with significantly improved accuracy. For example, MD-DAN achieves PSNR up to 35.04 dB at the ultra-fast 1/32 sampling rate.

Posted Content
TL;DR: This paper proposes instance-aware convolution, a simple and practical instance segmentation approach building upon dense one-stage detectors, which gives performance comparable to the region-based Mask R-CNN with faster inference.
Abstract: A single-point feature has shown its effectiveness in object detection. However, for instance segmentation, it does not lead to satisfactory results. The reasons are twofold. Firstly, it has limited representation capacity. Secondly, it could be misaligned with potential instances. To address the above issues, we propose a new point-based framework, namely PointINS, to segment instances from single points. The core module of our framework is instance-aware convolution, including the instance-agnostic feature and instance-aware weights. The instance-agnostic feature for each Point-of-Interest (PoI) serves as a template for potential instance masks. In this way, instance-aware features are computed by convolving this template with instance-aware weights for the following mask prediction. Given the independence of instance-aware convolution, PointINS is general and practical as a one-stage detector for anchor-based and anchor-free frameworks. In our extensive experiments, we show the effectiveness of our framework on RetinaNet and FCOS. With a ResNet101 backbone, PointINS achieves 38.3 mask mAP on the challenging COCO dataset, outperforming its competitors by a large margin. The code will be made publicly available.

Posted Content
TL;DR: FReLU as mentioned in this paper extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition, achieving a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions.
Abstract: We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form of y = max(x,T(x)), where T(x) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in the visual recognition tasks. Code is available at this https URL.

Journal ArticleDOI
TL;DR: A novel network block is proposed that generalizes second-order pooling to 3D surfaces by designing a learnable non-linear transform on the spectrum of the pooled descriptor, showing its superiority compared with traditional second-order pooling methods.
Abstract: In this paper, we propose a novel network block, dubbed second-order spectral transform block, for 3D shape retrieval and classification. This network block generalizes second-order pooling to 3D surfaces by designing a learnable non-linear transform on the spectrum of the pooled descriptor. The proposed block consists of the following two components. First, second-order average-pooling (SO-Avr) and max-pooling (SO-Max) operations are designed on 3D surfaces to aggregate local descriptors, which are shown to be more discriminative than the popular average-pooling or max-pooling. Second, a learnable spectral transform parameterized by a mixture of power functions is proposed to perform non-linear feature mapping in the space of pooled descriptors, i.e., the manifold of symmetric positive definite matrices for SO-Avr, and the space of symmetric matrices for SO-Max. The proposed block can be plugged into existing network architectures to aggregate local shape descriptors and boost their performance. We apply it to a shallow network for non-rigid 3D shape analysis and to existing networks for rigid shape analysis, where it improves the first-tier retrieval accuracy by 7.2% on the SHREC'14 Real dataset and achieves state-of-the-art classification accuracy on ModelNet40. As an extension, we apply our block to 2D image classification, showing its superiority compared with traditional second-order pooling methods. We also provide theoretical and experimental analysis of the stability of the proposed second-order spectral transform block.
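
To make the pooling concrete, here is a hedged NumPy sketch of second-order average pooling followed by a spectral transform; the learnable mixture of power functions is simplified to a single fixed exponent alpha.

```python
import numpy as np

def so_avg_spectral(desc, alpha=0.5):
    """desc: (N, D) local descriptors; returns a transformed (D, D) matrix."""
    M = desc.T @ desc / desc.shape[0]           # second-order average pooling
    evals, evecs = np.linalg.eigh(M)            # M is symmetric PSD
    evals = np.clip(evals, 0.0, None) ** alpha  # power transform on spectrum
    return evecs @ np.diag(evals) @ evecs.T
```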

Posted Content
TL;DR: This report presents an object detection/instance segmentation system, MegDetV2, which works in a two-pass fashion, first detecting instances and then obtaining segmentation; it achieved the best results in COCO Challenge 2019 and 2020.
Abstract: In this report, we present our object detection/instance segmentation system, MegDetV2, which works in a two-pass fashion: first detecting instances, then obtaining segmentation. Our baseline detector is mainly built on a newly designed RPN, called RPN++. On the COCO-2019 detection/instance-segmentation test-dev dataset, our system achieves 61.0/53.1 mAP, surpassing our 2018 winning results by 5.0/4.2, respectively. We achieved the best results in COCO Challenge 2019 and 2020.

Posted Content
TL;DR: Joint multi-dimension pruning is presented, a new perspective of pruning a network on three crucial aspects: spatial, depth and channel simultaneously, as this method is optimized collaboratively across the three dimensions in a single end-to-end training.
Abstract: We present joint multi-dimension pruning (named JointPruning), a new perspective on pruning a network along three crucial aspects simultaneously: spatial, depth, and channel. The joint strategy enables searching for a better status than previous studies that focused solely on an individual dimension, as our method is optimized collaboratively across the three dimensions in a single end-to-end training. Moreover, each dimension we consider can be promoted to better performance by colluding with the other two. Our method is realized by an adapted stochastic gradient estimation. Extensive experiments on the large-scale ImageNet dataset across a variety of network architectures (MobileNet V1&V2 and ResNet) demonstrate the effectiveness of our proposed method. For instance, we achieve significant margins of 2.5% and 2.6% improvement over the state-of-the-art approach on the already compact MobileNet V1&V2 under an extremely large compression ratio.

Posted Content
TL;DR: In this article, an implicit feature pyramid network (i-FPN) is proposed to use an implicit function, recently introduced in deep equilibrium model (DEQ), to model the transformation of FPN.
Abstract: In this paper, we present an implicit feature pyramid network (i-FPN) for object detection. Existing FPNs stack several cross-scale blocks to obtain a large receptive field. We propose to use an implicit function, recently introduced in the deep equilibrium model (DEQ), to model the transformation of FPN. We develop a residual-like iteration to update the hidden states efficiently. Experimental results on the MS COCO dataset show that i-FPN can significantly boost detection performance compared to baseline detectors with ResNet-50-FPN: +3.4, +3.2, +3.5, +4.2, and +3.2 mAP on RetinaNet, Faster-RCNN, FCOS, ATSS, and AutoAssign, respectively.
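
A hedged sketch of the residual-like iteration (a plain unrolled stand-in; actual DEQ implementations solve for the fixed point with root-finding and differentiate implicitly rather than unrolling):

```python
import torch
import torch.nn as nn

class ImplicitBlock(nn.Module):
    """Damped fixed-point iteration toward h* = f(h*, x)."""
    def __init__(self, f, n_iter=8, alpha=0.5):
        super().__init__()
        self.f = f                # an nn.Module taking (h, x)
        self.n_iter = n_iter
        self.alpha = alpha        # damping factor for stability

    def forward(self, x):
        h = torch.zeros_like(x)
        for _ in range(self.n_iter):
            h = h + self.alpha * (self.f(h, x) - h)  # residual-like update
        return h
```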

Posted Content
TL;DR: In this paper, the authors proposed a dynamic scale training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection by using feedback information from the optimization process to dynamically guide the data preparation.
Abstract: We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection. Previous strategies like image pyramids, multi-scale training, and their variants aim at preparing scale-invariant data for model optimization. However, the preparation procedure is unaware of the subsequent optimization process, which restricts their capability in handling scale variation. Instead, in our paradigm, we use feedback information from the optimization process to dynamically guide the data preparation. The proposed method is surprisingly simple yet obtains significant gains (2%+ Average Precision on the MS COCO dataset), outperforming previous methods. Experimental results demonstrate the efficacy of our proposed DST method for scale variation handling. It also generalizes to various backbones, benchmarks, and other challenging downstream tasks like instance segmentation. It does not introduce inference overhead and can serve as a free lunch for general detection configurations. Besides, it also facilitates efficient training due to fast convergence. Code and models are available at this http URL.

Posted Content
TL;DR: A penalized energy function regularized by a sum of terms measuring the distance between patches to be restored and clean patches from an external database gathered beforehand is optimized, with strong statistical guarantees leveraging local dependency properties of overlapping patches.
Abstract: We present a novel approach to image restoration that leverages ideas from localized structured prediction and non-linear multi-task learning. We optimize a penalized energy function regularized by a sum of terms measuring the distance between patches to be restored and clean patches from an external database gathered beforehand. The resulting estimator comes with strong statistical guarantees leveraging local dependency properties of overlapping patches. We derive the corresponding algorithms for energies based on the mean-squared and Euclidean norm errors. Finally, we demonstrate the practical effectiveness of our model on different image restoration problems using standard benchmarks.

Posted Content
TL;DR: In this paper, the authors propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels, which leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods.
Abstract: Non-blind image deblurring is typically formulated as a linear least-squares problem regularized by natural priors on the corresponding sharp picture's gradients, which can be solved, for example, using a half-quadratic splitting method with Richardson fixed-point iterations for its least-squares updates and a proximal operator for the auxiliary variable updates. We propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels. Using convolutions instead of a generic linear preconditioner allows extremely efficient parameter sharing across the image, and leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods. More importantly, the proposed architecture is easily adapted to learning both the preconditioner and the proximal operator using CNN embeddings. This yields a simple and efficient algorithm for non-blind image deblurring which is fully interpretable, can be learned end to end, and whose accuracy matches or exceeds the state of the art, quite significantly, in the non-uniform case.

Posted Content
TL;DR: An orthogonal attention block (OAB) and a second-order fusion unit (SFU) are proposed to learn multi-scale information from different layers and enhance it by encouraging diversity.
Abstract: In this paper, we propose an efficient human pose estimation network (DANet) by learning deeply aggregated representations. Most existing models explore multi-scale information mainly from features with different spatial sizes. Powerful multi-scale representations usually rely on the cascaded pyramid framework. This framework largely boosts the performance but meanwhile makes networks very deep and complex. Instead, we focus on exploiting multi-scale information from layers with different receptive-field sizes and then making full use of this information by improving the fusion method. Specifically, we propose an orthogonal attention block (OAB) and a second-order fusion unit (SFU). The OAB learns multi-scale information from different layers and enhances it by encouraging it to be diverse. The SFU adaptively selects and fuses diverse multi-scale information and suppresses the redundant parts. This could maximize the effective information in the final fused representations. With the help of OAB and SFU, our single pyramid network may be able to generate deeply aggregated representations that contain even richer multi-scale information and have a larger representing capacity than those of cascaded networks. Thus, our networks could achieve comparable or even better accuracy with much smaller model complexity. Specifically, our DANet-72 achieves 70.5 AP on the COCO test-dev set with only 1.0G FLOPs. Its speed on a CPU platform reaches 58 Persons-Per-Second (PPS).