
Showing papers by "Jian Sun" published in 2022


Proceedings ArticleDOI
13 Mar 2022
TL;DR: It is demonstrated that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm, and RepLKNet is proposed, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to the commonly used 3×3.
Abstract: We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm. We suggest five guidelines, e.g., applying re-parameterized large depthwise convolutions, to design efficient high-performance large-kernel CNNs. Following the guidelines, we propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to the commonly used 3×3. RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving results comparable or superior to Swin Transformer on ImageNet and a few typical downstream tasks, with lower latency. RepLKNet also shows nice scalability to big data and large models, obtaining 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K, which is very competitive among state-of-the-art models of similar size. Our study further reveals that, in contrast to small-kernel CNNs, large-kernel CNNs have much larger effective receptive fields and a higher shape bias rather than texture bias. Code & models at https://github.com/megvii-research/RepLKNet.
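The re-parameterized large depthwise convolution mentioned above can be sketched in a few lines of PyTorch. This is an illustrative, simplified reading of the idea (module and variable names are ours, and the BatchNorm folding used in the official RepLKNet code is omitted): a 31×31 depthwise kernel is trained together with a small parallel kernel, and the small kernel is folded into the large one for inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamLargeKernelDW(nn.Module):
    """Large depthwise conv trained with a parallel small-kernel branch.

    During training both branches run; for deployment the small kernel is
    zero-padded and folded into the large one (structural re-parameterization),
    so inference uses a single 31x31 depthwise conv.
    """

    def __init__(self, channels, large_k=31, small_k=5):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k, padding=large_k // 2,
                               groups=channels, bias=False)
        self.small = nn.Conv2d(channels, channels, small_k, padding=small_k // 2,
                               groups=channels, bias=False)

    def forward(self, x):
        y = self.large(x)
        if self.small is not None:
            y = y + self.small(x)
        return y

    @torch.no_grad()
    def merge(self):
        """Fold the small-kernel branch into the large kernel for inference."""
        pad = (self.large.kernel_size[0] - self.small.kernel_size[0]) // 2
        self.large.weight += F.pad(self.small.weight, [pad] * 4)
        self.small = None

block = ReparamLargeKernelDW(64)
x = torch.randn(1, 64, 56, 56)
y_train = block(x)
block.merge()
y_deploy = block(x)
print(torch.allclose(y_train, y_deploy, atol=1e-5))  # same function, single conv
```

Merging works because convolution is linear in its weights: zero-padding the 5×5 kernel to 31×31 and summing the kernels yields exactly the same function as running the two branches separately.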

213 citations


Proceedings ArticleDOI
10 Apr 2022
TL;DR: A simple baseline is proposed that exceeds the SOTA methods and is computationally efficient, and it is revealed that nonlinear activation functions, e.g., Sigmoid, ReLU, GELU, Softmax, are not necessary: they can be replaced by multiplication or removed.
Abstract: Although there have been significant advances in the field of image restoration recently, the system complexity of the state-of-the-art (SOTA) methods is increasing as well, which may hinder the convenient analysis and comparison of methods. In this paper, we propose a simple baseline that exceeds the SOTA methods and is computationally efficient. To further simplify the baseline, we reveal that the nonlinear activation functions, e.g., Sigmoid, ReLU, GELU, Softmax, are not necessary: they can be replaced by multiplication or removed. Thus, we derive a Nonlinear Activation Free Network, namely NAFNet, from the baseline. SOTA results are achieved on various challenging benchmarks, e.g., 33.69 dB PSNR on GoPro (for image deblurring), exceeding the previous SOTA by 0.38 dB with only 8.4% of its computational costs; 40.30 dB PSNR on SIDD (for image denoising), exceeding the previous SOTA by 0.28 dB with less than half of its computational costs. The code and the pre-trained models are released at https://github.com/megvii-research/NAFNet.
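A minimal sketch of the "replace the nonlinear activation by multiplication" idea: split the channels in half and multiply the halves element-wise (NAFNet's simple gate follows this pattern). The surrounding 1×1 convolutions and channel sizes below are illustrative, not the paper's exact block.

```python
import torch
import torch.nn as nn

class ChannelSplitGate(nn.Module):
    """Activation-free nonlinearity: split channels in half and multiply.

    The element-wise product of the two halves plays the role a GELU/ReLU
    would normally play, without any explicit activation function.
    """
    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        return a * b

block = nn.Sequential(
    nn.Conv2d(32, 64, 1),      # expand channels
    ChannelSplitGate(),        # 64 -> 32 via multiplicative gating
    nn.Conv2d(32, 32, 1),
)
y = block(torch.randn(1, 32, 128, 128))
print(y.shape)  # torch.Size([1, 32, 128, 128])
```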

131 citations


Proceedings ArticleDOI
10 Mar 2022
TL;DR: Position embedding transformation (PETR) encodes the position information of 3D coordinates into image features, producing 3D position-aware features, and achieves state-of-the-art performance on the standard nuScenes dataset.
Abstract: In this paper, we develop position embedding transformation (PETR) for multi-view 3D object detection. PETR encodes the position information of 3D coordinates into image features, producing 3D position-aware features. Object queries can perceive the 3D position-aware features and perform end-to-end object detection. PETR achieves state-of-the-art performance (50.4% NDS and 44.1% mAP) on the standard nuScenes dataset and ranks 1st on the benchmark. It can serve as a simple yet strong baseline for future research. Code is available at https://github.com/megvii-research/PETR.
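A rough sketch of how 3D coordinate information can be encoded into image features, in the spirit of the 3D position-aware features described above. The camera-frustum lifting that produces the per-pixel 3D coordinates is omitted, and the tensor layout and MLP are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Coord3DPositionEncoder(nn.Module):
    """Map per-pixel 3D coordinates to an embedding added to image features.

    `coords3d` would come from lifting each pixel along camera rays into a
    set of depth bins and transforming to the ego frame (omitted here); the
    two-layer 1x1-conv MLP is an illustrative choice.
    """
    def __init__(self, coord_dim, embed_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(coord_dim, embed_dim, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, 1),
        )

    def forward(self, img_feat, coords3d):
        # img_feat: (B, C, H, W), coords3d: (B, coord_dim, H, W)
        return img_feat + self.mlp(coords3d)

feat = torch.randn(2, 256, 20, 50)
coords = torch.rand(2, 3 * 64, 20, 50)   # assumed layout: 64 depth bins x (x, y, z)
out = Coord3DPositionEncoder(3 * 64, 256)(feat, coords)
print(out.shape)  # torch.Size([2, 256, 20, 50])
```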

131 citations


Journal ArticleDOI
TL;DR: The proposed detector, called Anchor DETR, achieves better performance and runs faster than DETR with 10× fewer training epochs; an attention variant is also designed that reduces the memory cost while achieving similar or better performance than the standard attention in DETR.
Abstract: In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding does not have an explicit physical meaning and we cannot explain where it will focus. It is difficult to optimize as the prediction slot of each object query does not have a specific mode. In other words, each object query will not focus on a specific region. To solve these problems, in our query design, object queries are based on anchor points, which are widely used in CNN-based detectors. So each object query focuses on the objects near the anchor point. Moreover, our query design can predict multiple objects at one position to solve the difficulty of "one region, multiple objects". In addition, we design an attention variant, which can reduce the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, called Anchor DETR, can achieve better performance and run faster than DETR with 10× fewer training epochs. For example, it achieves 44.2 AP with 19 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature and training for 50 epochs. Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. Code is available at https://github.com/megvii-research/AnchorDETR.
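The anchor-point query design can be illustrated as follows: instead of free learned embeddings, each query is derived from a 2D anchor point, and several "pattern" embeddings per point allow multiple objects to be predicted at one position. The anchor initialization and the small encoder below are illustrative choices, not the official implementation.

```python
import torch
import torch.nn as nn

def anchor_point_queries(num_points=300, patterns=3, embed_dim=256):
    """Build object queries from 2D anchor points instead of free embeddings.

    Each anchor point is encoded into an embedding; `patterns` copies per
    point allow several objects to be predicted at one position.
    """
    anchors = torch.rand(num_points, 2)               # normalized (x, y) anchor points
    encoder = nn.Sequential(nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
    point_embed = encoder(anchors)                     # (num_points, embed_dim)
    pattern_embed = nn.Embedding(patterns, embed_dim).weight  # one embedding per pattern
    # every anchor point gets `patterns` queries: point embedding + pattern embedding
    queries = (point_embed[:, None, :] + pattern_embed[None, :, :]).reshape(-1, embed_dim)
    return anchors, queries

anchors, queries = anchor_point_queries()
print(queries.shape)  # torch.Size([900, 256])
```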

113 citations


Journal ArticleDOI
TL;DR: PETRv2 is proposed, a unified framework for 3D perception from multi-view images based on PETR; it explores the effectiveness of temporal modeling, utilizing the temporal information of previous frames to boost 3D object detection and BEV segmentation.
Abstract: In this paper, we propose PETRv2, a unified framework for 3D perception from multi-view images. Based on PETR, PETRv2 explores the effectiveness of temporal modeling, which utilizes the temporal information of previous frames to boost 3D object detection. More specifically, we extend the 3D position embedding (3D PE) in PETR for temporal modeling. The 3D PE achieves temporal alignment of object positions across different frames. A feature-guided position encoder is further introduced to improve the data adaptability of 3D PE. To support multi-task learning (e.g., BEV segmentation and 3D lane detection), PETRv2 provides a simple yet effective solution by introducing task-specific queries, which are initialized under different spaces. PETRv2 achieves state-of-the-art performance on 3D object detection, BEV segmentation and 3D lane detection. A detailed robustness analysis is also conducted on the PETR framework. We hope PETRv2 can serve as a strong baseline for 3D perception. Code is available at https://github.com/megvii-research/PETR.

83 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: UVTR presents an early attempt to represent different modalities in a unified framework; it surpasses previous work in single- or multi-modality entries and achieves leading performance on the nuScenes test set for both object detection and the subsequent object tracking task.
Abstract: In this work, we present a unified framework for multi-modality 3D object detection, named UVTR. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. To this end, the modality-specific space is first designed to represent different inputs in the voxel feature space. Different from previous work, our approach preserves the voxel space without height compression to alleviate semantic ambiguity and enable spatial connections. To make full use of the inputs from different sensors, the cross-modality interaction is then proposed, including knowledge transfer and modality fusion. In this way, geometry-aware expressions in point clouds and context-rich features in images are well utilized for better performance and robustness. The transformer decoder is applied to efficiently sample features from the unified space with learnable positions, which facilitates object-level interactions. In general, UVTR presents an early attempt to represent different modalities in a unified framework. It surpasses previous work in single- or multi-modality entries. The proposed method achieves leading performance on the nuScenes test set for both object detection and the following object tracking task. Code is made publicly available at https://github.com/dvlab-research/UVTR.

63 citations


Proceedings ArticleDOI
26 Apr 2022
TL;DR: This paper introduces two new modules to enhance the capability of Sparse CNNs, both based on making feature sparsity learnable with position-wise importance prediction, and shows that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection.
Abstract: Non-uniform 3D sparse data, e.g., point clouds or voxels in different spatial positions, contribute to the task of 3D object detection in different ways. Existing basic components in sparse convolutional networks (Sparse CNNs) process all sparse data, regardless of regular or submanifold sparse convolution. In this paper, we introduce two new modules to enhance the capability of Sparse CNNs, both based on making feature sparsity learnable with position-wise importance prediction. They are focal sparse convolution (Focals Conv) and its multi-modal variant of focal sparse convolution with fusion, or Focals Conv-F for short. The new modules can readily substitute their plain counterparts in existing Sparse CNNs and be jointly trained in an end-to-end fashion. For the first time, we show that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection. Extensive experiments on the KITTI, nuScenes and Waymo benchmarks validate the effectiveness of our approach. Without bells and whistles, our results outperform all existing single-model entries on the nuScenes test benchmark. Code and models are at github.com/dvlab-research/FocalsConv.
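The core idea of making feature sparsity learnable with position-wise importance prediction can be sketched on dense tensors (the real modules operate on sparse voxel features via sparse convolution libraries). The 1×1 score predictor and the threshold below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionImportanceGate(nn.Module):
    """Predict a per-position importance score and drop unimportant features.

    A dense-tensor sketch of the 'learnable feature sparsity' idea: features
    whose predicted importance falls below a threshold are zeroed, so later
    (sparse) layers can skip them.
    """
    def __init__(self, channels, threshold=0.5):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)
        self.threshold = threshold

    def forward(self, x):
        importance = torch.sigmoid(self.score(x))          # (B, 1, H, W) in [0, 1]
        keep = (importance > self.threshold).float()
        # multiply by the soft score so the predictor receives gradients,
        # and by the hard mask so unimportant positions become exactly zero
        return x * importance * keep, keep

x = torch.randn(2, 64, 100, 100)
gate = PositionImportanceGate(64)
sparse_x, mask = gate(x)
print(mask.mean().item())  # fraction of positions kept
```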

51 citations


Journal ArticleDOI
TL;DR: The spatial-wise group convolution and its large-kernel module (SW-LK block) avoid the optimization and efficiency issues of naive 3D large kernels, showing for the first time that large kernels are feasible and essential for 3D networks.
Abstract: Recent advances in 2D CNNs and vision transformers (ViTs) reveal that large kernels are essential for enough receptive fields and high performance. Inspired by this literature, we examine the feasibility and challenges of 3D large-kernel designs. We demonstrate that applying large convolutional kernels in 3D CNNs has more difficulties in both performance and efficiency. Existing techniques that work well in 2D CNNs are ineffective in 3D networks, including the popular depth-wise convolutions. To overcome these obstacles, we present the spatial-wise group convolution and its large-kernel module (SW-LK block). It avoids the optimization and efficiency issues of naive 3D large kernels. Our large-kernel 3D CNN network, i.e., LargeKernel3D, yields non-trivial improvements on various 3D tasks, including semantic segmentation and object detection. Notably, it achieves 73.9% mIoU on the ScanNetv2 semantic segmentation benchmark and 72.8% NDS on the nuScenes object detection benchmark, ranking 1st on the nuScenes LIDAR leaderboard. It is further boosted to 74.2% NDS with a simple multi-modal fusion. LargeKernel3D attains results comparable or superior to its CNN and transformer counterparts. For the first time, we show that large kernels are feasible and essential for 3D networks. Code and models will be available at github.com/dvlab-research/LargeKernel3D.

22 citations


Proceedings ArticleDOI
28 Mar 2022
TL;DR: This work studies a new open-set problem, few-shot 6D object pose estimation: estimating the 6D pose of an unknown object from a few support views without extra training, and proposes a large-scale photorealistic RGBD dataset (ShapeNet6D) for network pre-training.
Abstract: 6D object pose estimation networks are limited in their capability to scale to large numbers of object instances due to the closed-set assumption and their reliance on high-fidelity object CAD models. In this work, we study a new open-set problem, few-shot 6D object pose estimation: estimating the 6D pose of an unknown object from a few support views without extra training. To tackle the problem, we point out the importance of fully exploring the appearance and geometric relationship between the given support views and query scene patches, and propose a dense prototype matching framework by extracting and matching dense RGBD prototypes with transformers. Moreover, we show that priors from diverse appearances and shapes are crucial to the generalization capability under this problem setting, and thus propose a large-scale RGBD photorealistic dataset (ShapeNet6D) for network pre-training. A simple and effective online texture blending approach is also introduced to eliminate the domain gap from the synthetic dataset, which enriches appearance diversity at a low cost. Finally, we discuss possible solutions to this problem and establish benchmarks on popular datasets to facilitate future research. [project page]

20 citations


Journal ArticleDOI
01 Jun 2022
TL;DR: The NTIRE 2022 challenge on burst super-resolution is reviewed, with the top performing methods establishing a new state-of-the-art on the burst super-resolution task.
Abstract: Burst super-resolution has received increased attention in recent years due to its applications in mobile photography. By merging information from multiple shifted images of a scene, burst super-resolution aims to recover details which otherwise cannot be obtained using a single input image. This paper reviews the NTIRE 2022 challenge on burst super-resolution. In the challenge, the participants were tasked with generating a clean RGB image with 4× higher resolution, given a RAW noisy burst as input. That is, the methods need to perform joint denoising, demosaicking, and super-resolution. The challenge consisted of 2 tracks. Track 1 employed synthetic data, where pixel-accurate high-resolution ground truths are available. Track 2, on the other hand, used real-world bursts captured from a handheld camera, along with approximately aligned reference images captured using a DSLR. 14 teams participated in the final testing phase. The top performing methods establish a new state-of-the-art on the burst super-resolution task.

17 citations


Proceedings ArticleDOI
15 Mar 2022
TL;DR: A progressive predicting method is proposed that first selects accepted queries prone to generate true-positive predictions, then refines the remaining noisy queries according to the previously accepted predictions, significantly boosting the performance of query-based detectors in crowded scenes.
Abstract: In this paper, we propose a new query-based detection framework for crowd detection. Previous query-based detectors suffer from two drawbacks: first, multiple predictions will be inferred for a single object, typically in crowded scenes; second, the performance saturates as the depth of the decoding stage increases. Benefiting from the nature of the one-to-one label assignment rule, we propose a progressive predicting method to address the above issues. Specifically, we first select accepted queries prone to generate true positive predictions, then refine the remaining noisy queries according to the previously accepted predictions. Experiments show that our method can significantly boost the performance of query-based detectors in crowded scenes. Equipped with our approach, Sparse RCNN achieves 92.0% AP, 41.4% MR−2 and 83.2% JI on the challenging CrowdHuman [35] dataset, outperforming the box-based method MIP [8] that specializes in handling crowded scenarios. Moreover, the proposed method, robust to crowdedness, can still obtain consistent improvements on moderately and slightly crowded datasets like CityPersons [47] and COCO [26]. Code will be made publicly available at https://github.com/megvii-model/Iter-E2EDET.
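The progressive predicting scheme can be sketched as a simple confidence-based partition of the queries; the subsequent refinement of the noisy queries conditioned on the accepted ones is the part the paper contributes and is omitted here. The threshold is an assumed value.

```python
import torch

def split_queries(scores, accept_thresh=0.7):
    """Partition query predictions into accepted and noisy sets by confidence.

    In the progressive scheme the accepted predictions are kept as-is, and
    only the remaining noisy queries are refined by the next decoder stage,
    conditioned on what was already accepted (refinement step omitted here).
    """
    accepted = scores > accept_thresh
    return accepted, ~accepted

scores = torch.rand(100)                  # per-query confidence from one decoder stage
boxes = torch.rand(100, 4)                # per-query boxes (cx, cy, w, h)
acc_mask, noisy_mask = split_queries(scores)
accepted_boxes, noisy_boxes = boxes[acc_mask], boxes[noisy_mask]
print(acc_mask.sum().item(), "accepted,", noisy_mask.sum().item(), "to refine")
```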

Proceedings ArticleDOI
23 Mar 2022
TL;DR: This paper builds a simple and effective framework for streaming perception that achieves competitive performance on the Argoverse-HD dataset and improves AP by 4.9% over a strong baseline, validating its effectiveness.
Abstract: Autonomous driving requires the model to perceive the environment and (re)act within a low latency for safety. While past works ignore the inevitable changes in the environment after processing, streaming perception is proposed to jointly evaluate the latency and accuracy into a single metric for video online perception. In this paper, instead of searching trade-offs between accuracy and speed like previous works, we point out that endowing real-time models with the ability to predict the future is the key to dealing with this problem. We build a simple and effective framework for streaming perception. It equips a novel Dual-Flow Perception module (DFP), which includes dynamic and static flows to capture the moving trend and basic detection feature for streaming prediction. Further, we introduce a Trend-Aware Loss (TAL) combined with a trend factor to generate adaptive weights for objects with different moving speeds. Our simple method achieves competitive performance on the Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline, validating its effectiveness. Our code will be made available at https://github.com/yancie-yjr/StreamYOLO.
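One way to realize a trend-aware weighting, as a hedged sketch of the Trend-Aware Loss idea: weight each object by how much its box moves between consecutive frames, so fast movers contribute more to the loss. The IoU-based trend factor and the exponent below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def trend_aware_weights(prev_boxes, next_boxes, gamma=1.0, eps=1e-6):
    """Per-object weights that grow with how fast an object moves.

    Objects whose boxes change more between consecutive frames (low IoU with
    their previous position) get larger loss weights, so fast movers dominate
    the streaming-prediction loss.
    """
    # boxes given as (x1, y1, x2, y2); one row per tracked object
    x1 = torch.maximum(prev_boxes[:, 0], next_boxes[:, 0])
    y1 = torch.maximum(prev_boxes[:, 1], next_boxes[:, 1])
    x2 = torch.minimum(prev_boxes[:, 2], next_boxes[:, 2])
    y2 = torch.minimum(prev_boxes[:, 3], next_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iou = inter / (area(prev_boxes) + area(next_boxes) - inter + eps)
    return (1.0 - iou) ** gamma + 1.0   # weight >= 1, larger for faster objects

prev = torch.tensor([[0., 0., 10., 10.], [20., 20., 30., 30.]])
nxt = torch.tensor([[0., 0., 10., 10.], [25., 25., 35., 35.]])
print(trend_aware_weights(prev, nxt))  # static object ~1, moving object > 1
```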

Proceedings ArticleDOI
21 Mar 2022
TL;DR: A novel tree energy loss for SASS is proposed that provides semantic guidance for unlabeled pixels; by sequentially applying low-level and high-level pairwise affinities to the network prediction, soft pseudo labels are generated for dynamic online self-training.
Abstract: Sparsely annotated semantic segmentation (SASS) aims to train a segmentation network with coarse-grained (i.e., point-, scribble-, and block-wise) supervisions, where only a small proportion of pixels are labeled in each image. In this paper, we propose a novel tree energy loss for SASS by providing semantic guidance for unlabeled pixels. The tree energy loss represents images as minimum spanning trees to model both low-level and high-level pairwise affinities. By sequentially applying these affinities to the network prediction, soft pseudo labels for unlabeled pixels are generated in a coarse-to-fine manner, achieving dynamic online self-training. The tree energy loss is effective and easy to incorporate into existing frameworks by combining it with a traditional segmentation loss. Compared with previous SASS methods, our method requires no multi-stage training strategies, alternating optimization procedures, additional supervised data, or time-consuming post-processing while outperforming them in all SASS settings. Code is available at https://github.com/megvii-research/TreeEnergyLoss.

Journal ArticleDOI
18 Apr 2022
TL;DR: This work combines optical flows and deformable convolutions so that BSRT can handle misalignment and aggregate potential texture information across multiple frames more efficiently; BSRT won the championship in the NTIRE 2022 Burst Super-Resolution Challenge.
Abstract: This work addresses the Burst Super-Resolution (BurstSR) task using a new architecture, which requires restoring a high-quality image from a sequence of noisy, misaligned, and low-resolution RAW bursts. To overcome the challenges in BurstSR, we propose a Burst Super-Resolution Transformer (BSRT), which can significantly improve the capability of extracting inter-frame information and reconstruction. To achieve this goal, we propose a Pyramid Flow-Guided Deformable Convolution Network (Pyramid FG-DCN) and incorporate Swin Transformer Blocks and Groups as our main backbone. More specifically, we combine optical flows and deformable convolutions, so our BSRT can handle misalignment and aggregate the potential texture information across multiple frames more efficiently. In addition, our Transformer-based structure can capture long-range dependency to further improve the performance. The evaluation on both synthetic and real-world tracks demonstrates that our approach achieves a new state-of-the-art in the BurstSR task. Further, our BSRT won the championship in the NTIRE 2022 Burst Super-Resolution Challenge.

Journal ArticleDOI
TL;DR: This work proposes a label-free method that learns to enforce geometric consistency between a category template mesh and the observed object point cloud in a self-supervised manner, and finds that it outperforms the simple traditional baseline by large margins while being competitive with some fully-supervised approaches.
Abstract: In this work, we tackle the challenging problem of category-level object pose and size estimation from a single depth image. Although previous fully-supervised works have demonstrated promising performance, collecting ground-truth pose labels is generally time-consuming and labor-intensive. Instead, we propose a label-free method that learns to enforce the geometric consistency between the category template mesh and the observed object point cloud in a self-supervised manner. Specifically, our method consists of three key components: differentiable shape deformation, registration, and rendering. In particular, shape deformation and registration are applied to the template mesh to eliminate the differences in shape, pose and scale. A differentiable renderer is then deployed to enforce geometric consistency between point clouds lifted from the rendered depth and the observed scene for self-supervision. We evaluate our approach on real-world datasets and find that it outperforms the simple traditional baseline by large margins while being competitive with some fully-supervised approaches.

Journal ArticleDOI
TL;DR: A novel adversarial domain adaptation approach defined in the spherical feature space is proposed, in which a spherical classifier for label prediction and a spherical domain discriminator for discriminating domain labels are defined, along with a robust pseudo-label loss to utilize pseudo-labels robustly.
Abstract: Adversarial domain adaptation has been an effective approach for learning domain-invariant features by adversarial training. In this paper, we propose a novel adversarial domain adaptation approach defined in the spherical feature space, in which we define a spherical classifier for label prediction and a spherical domain discriminator for discriminating domain labels. In the spherical feature space, we develop a robust pseudo-label loss to utilize pseudo-labels robustly, which weights the importance of the estimated labels of target data by the posterior probability of correct labeling, modeled by a Gaussian-uniform mixture model in the spherical space. Our proposed approach can be generally applied to both unsupervised and semi-supervised domain adaptation settings. In particular, to tackle the semi-supervised domain adaptation setting where a few labeled target data are available for training, we propose a novel reweighted adversarial training strategy for effectively reducing the intra-domain discrepancy within the target domain. We also present a theoretical analysis for the proposed method based on domain adaptation theory. Extensive experiments are conducted on benchmarks for multiple applications, including object recognition, digit recognition, and face recognition. The results show that our method either surpasses or is competitive compared with recent methods for both unsupervised and semi-supervised domain adaptation.

Journal ArticleDOI
TL;DR: This work proposes Scale-aware AutoAug to learn data augmentation policies for object detection, and defines a new scale-aware search space, where both image- and instance-level augmentations are designed to maintain scale-robust feature learning.
Abstract: Data augmentation is a critical technique in object detection, especially augmentations targeting scale-invariance training (scale-aware augmentation). However, there has been little systematic investigation of how to design scale-aware data augmentation for object detection. We propose Scale-aware AutoAug to learn data augmentation policies for object detection. We define a new scale-aware search space, where both image- and instance-level augmentations are designed to maintain scale-robust feature learning. Upon this search space, we propose a new search metric, termed Pareto Scale Balance, to facilitate efficient augmentation policy search. In experiments, Scale-aware AutoAug yields significant and consistent improvement on various object detectors (e.g., RetinaNet, Faster R-CNN, Mask R-CNN, and FCOS), even compared with strong multi-scale training baselines. Our searched augmentation policies generalize well to other datasets and instance-level tasks beyond object detection, e.g., instance segmentation. The search cost is much less than previous automated augmentation approaches for object detection, i.e., 8 GPUs across 2.5 days vs. 800 TPU-days. In addition, meaningful patterns can be summarized from our searched policies, which intuitively provide valuable knowledge for hand-crafted data augmentation design. Based on the searched scale-aware augmentation policies, we further introduce a dynamic training paradigm to adaptively determine the specific augmentation policy usage during training. The dynamic paradigm consists of a heuristic scheme for image-level augmentations and a differentiable copy-paste-based method for instance-level augmentations. The dynamic paradigm achieves further performance improvements over Scale-aware AutoAug without any additional burden on the long-tailed LVIS benchmark. We also demonstrate its ability to prevent over-fitting for large models, e.g., the Swin Transformer large model. Code and models are available at https://github.com/dvlab-research/SA-AutoAug.

Proceedings ArticleDOI
08 Jan 2022
TL;DR: This paper proposes a novel Pairwise Class Balance (PCB) method, built upon a confusion matrix which is updated during training to accumulate the ongoing prediction preferences, and generates fightback soft labels for regularization during training.
Abstract: Long-tailed instance segmentation is a challenging task due to the extreme imbalance of training samples among classes. It causes severe biases of the head classes (with majority samples) against the tailed ones. This renders "how to appropriately define and alleviate the bias" one of the most important issues. Prior works mainly use label distribution or mean score information to indicate a coarse-grained bias. In this paper, we excavate the confusion matrix, which carries fine-grained misclassification details, to relieve the pairwise biases, generalizing the coarse one. To this end, we propose a novel Pairwise Class Balance (PCB) method, built upon a confusion matrix which is updated during training to accumulate the ongoing prediction preferences. PCB generates fightback soft labels for regularization during training. Besides, an iterative learning paradigm is developed to support a progressive and smooth regularization in such debiasing. PCB can be plugged into any existing method as a complement. Experimental results on LVIS demonstrate that our method achieves state-of-the-art performance without bells and whistles. Superior results across various architectures show the generalization ability. The code and trained models are available at https://github.com/megvii-research/PCB.
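A sketch of how a running confusion matrix can be turned into "fightback" soft labels: for each ground-truth class, some label mass is shifted toward the classes it is most often confused with. The row-normalization and mixing coefficient below are illustrative; the paper additionally uses an iterative learning paradigm.

```python
import torch

def pcb_soft_labels(confusion, targets, alpha=0.1):
    """Turn an accumulated confusion matrix into per-sample soft labels.

    For a sample of class c, a fraction `alpha` of the label mass is moved
    toward the classes c is most often *mis*classified as, encouraging the
    model to fight back against its own pairwise bias.
    """
    num_classes = confusion.size(0)
    conf = confusion.clone().float()
    conf.fill_diagonal_(0)                               # keep only misclassifications
    conf = conf / conf.sum(dim=1, keepdim=True).clamp(min=1e-6)
    one_hot = torch.nn.functional.one_hot(targets, num_classes).float()
    return (1 - alpha) * one_hot + alpha * conf[targets]

confusion = torch.tensor([[50., 3., 2.], [1., 40., 9.], [0., 5., 45.]])
labels = torch.tensor([1, 2])
print(pcb_soft_labels(confusion, labels))
```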

Journal ArticleDOI
TL;DR: A fast inter-robot loop closure selection method is proposed that integrates the consistency and topology relationships of inter-robot measurements, both of which conform to the continuity characteristics of similar scenes and to spatiotemporal consistency.
Abstract: This paper presents a robust method based on graph topology to find the topologically correct and consistent subset of inter-robot relative pose measurements for multi-robot map fusion. The absence of a good prior on relative pose makes it severely challenging to distinguish inliers from outliers, and wrong inter-robot loop closures used to optimize the pose graph can seriously corrupt the fused global map. Existing works mainly rely on the consistency of the spatial dimension to select inter-robot measurements, which does not always hold. In this paper, we propose a fast inter-robot loop closure selection method that integrates the consistency and topology relationships of inter-robot measurements, both of which conform to the continuity characteristics of similar scenes and to spatiotemporal consistency. Firstly, a clustering method integrating the topological correctness of inter-robot loop closures is proposed to split the entire measurement set into multiple clusters. Then, our method decomposes the traditional high-dimensional consistency matrix into sub-matrix blocks corresponding to the overlapping trajectory regions. Finally, we define a weight function to find the topologically correct and consistent subset with the maximum cardinality, convert this selection to the maximum clique problem from graph theory, and solve it. We evaluate the performance of our method in simulation and in a real-world experiment. Compared to state-of-the-art methods, the results show that our method achieves competitive accuracy while reducing computation time by 75%.
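The final selection step, finding the consistent subset with maximum cardinality via a maximum clique, can be sketched as below, assuming a precomputed pairwise consistency residual between measurements; the clustering and topology-weighting stages of the method are omitted, and networkx's maximal-clique enumeration is used for brevity.

```python
import itertools
import networkx as nx
import numpy as np

def select_consistent_measurements(residual, tau=1.0):
    """Pick the largest mutually-consistent subset of inter-robot loop closures.

    `residual[i, j]` is a pairwise consistency error between measurements i and
    j (small = consistent). Edges connect consistent pairs and the maximum
    clique gives the selected subset.
    """
    n = residual.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i, j in itertools.combinations(range(n), 2):
        if residual[i, j] < tau:
            g.add_edge(i, j)
    return max(nx.find_cliques(g), key=len)

residual = np.array([[0.0, 0.2, 0.3, 5.0],
                     [0.2, 0.0, 0.1, 4.0],
                     [0.3, 0.1, 0.0, 6.0],
                     [5.0, 4.0, 6.0, 0.0]])
print(select_consistent_measurements(residual))  # [0, 1, 2]: the outlier 3 is rejected
```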


Journal ArticleDOI
TL;DR: This work proposes an unsupervised deep homography method with a new architecture design that outperforms the state of the art, including both deep and feature-based solutions.
Abstract: Homography estimation is a basic image alignment method in many applications. It is usually done by extracting and matching sparse feature points, which are error-prone in low-light and low-texture images. On the other hand, previous deep homography approaches use either synthetic images for supervised learning or aerial images for unsupervised learning, both ignoring the importance of handling depth disparities and moving objects in real-world applications. To overcome these problems, in this work, we propose an unsupervised deep homography method with a new architecture design. In the spirit of the RANSAC procedure in traditional methods, we specifically learn an outlier mask to only select reliable regions for homography estimation. We calculate loss with respect to our learned deep features instead of directly comparing image content as was done previously. To achieve the unsupervised training, we also formulate a novel triplet loss customized for our network. We verify our method by conducting comprehensive comparisons on a new dataset that covers a wide range of scenes with varying degrees of difficulty for the task. Experimental results reveal that our method outperforms the state of the art, including both deep and feature-based solutions.

Journal ArticleDOI
TL;DR: It is claimed that supervised contrastive learning suffers from a dual class-imbalance problem at both the original batch and Siamese batch levels, which is more serious than in long-tailed classification learning.
Abstract: Deep neural networks perform poorly on heavily class-imbalanced datasets. Given the promising performance of contrastive learning, we propose Rebalanced Siamese Contrastive Mining (ResCom) to tackle imbalanced recognition. Based on mathematical analysis and simulation results, we claim that supervised contrastive learning suffers from a dual class-imbalance problem at both the original batch and Siamese batch levels, which is more serious than in long-tailed classification learning. In this paper, at the original batch level, we introduce a class-balanced supervised contrastive loss to assign adaptive weights for different classes. At the Siamese batch level, we present a class-balanced queue, which maintains the same number of keys for all classes. Furthermore, we note that the imbalanced contrastive loss gradient with respect to the contrastive logits can be decoupled into the positives and negatives, and easy positives and easy negatives will make the contrastive gradient vanish. We propose supervised hard positive and negative pair mining to pick up informative pairs for contrastive computation and improve representation learning. Finally, to approximately maximize the mutual information between the two views, we propose Siamese Balanced Softmax and combine it with the contrastive loss for one-stage training. Extensive experiments demonstrate that ResCom outperforms the previous methods by large margins on multiple long-tailed recognition benchmarks. Our code and models are made publicly available at: https://github.com/dvlab-research/ResCom.
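The class-balanced queue at the Siamese batch level can be sketched as one bounded FIFO per class, so every class contributes the same maximum number of contrastive keys regardless of how imbalanced the input stream is. The sizes and storage format below are illustrative.

```python
import torch
from collections import deque

class ClassBalancedQueue:
    """Keep at most `per_class` keys for every class, so no class dominates.

    New keys evict only the oldest key of the *same* class, keeping the
    contrastive key set balanced even under a long-tailed input stream.
    """
    def __init__(self, num_classes, per_class=64):
        self.queues = [deque(maxlen=per_class) for _ in range(num_classes)]

    def enqueue(self, keys, labels):
        for k, y in zip(keys, labels):
            self.queues[int(y)].append(k.detach())

    def get(self):
        feats, labels = [], []
        for c, q in enumerate(self.queues):
            feats.extend(q)
            labels.extend([c] * len(q))
        return torch.stack(feats), torch.tensor(labels)

queue = ClassBalancedQueue(num_classes=10)
queue.enqueue(torch.randn(32, 128), torch.randint(0, 10, (32,)))
feats, labels = queue.get()
print(feats.shape, labels.shape)
```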

Journal ArticleDOI
TL;DR: In situ magnesium isotope compositions of carbonates play an important role in tracing geological and biological processes, as discussed by the authors, and matrix effects of carbonates with distinct chemical and physical properties are the main...
Abstract: In situ magnesium isotope compositions of carbonates play an important role in tracing geological and biological processes. Matrix effects of carbonates with distinct chemical and physical properties are the main...


Journal ArticleDOI
TL;DR: Wang et al. propose a view-based graph convolutional network (GCN) that recognizes 3D shapes from a graph representation of multiple views, achieving state-of-the-art results for 3D shape classification and retrieval.
Abstract: The view-based approach, which recognizes a 3D shape through its projected 2D images, has achieved state-of-the-art results for 3D shape recognition. The major challenges are how to aggregate multi-view features and how to deal with 3D shapes in arbitrary poses. We propose two versions of a novel view-based Graph Convolutional Network, dubbed view-GCN and view-GCN++, to recognize 3D shapes based on a graph representation of multiple views. We first construct a view-graph with multiple views as graph nodes, then design two graph convolutional networks over the view-graph to hierarchically learn a discriminative shape descriptor considering the relations of multiple views. Specifically, view-GCN is a hierarchical network based on two pivotal operations, i.e., feature transform based on local positional and non-local graph convolution, and graph coarsening based on a selective view-sampling operation. To deal with rotation sensitivity, we further propose view-GCN++ with a local attentional graph convolution operation and a rotation-robust view-sampling operation for graph coarsening. By these designs, view-GCN++ achieves invariance to transformations under the finite subgroup of the rotation group SO(3). Extensive experiments on benchmark datasets (i.e., ModelNet40, ScanObjectNN, RGBD and ShapeNet Core55) show that view-GCN and view-GCN++ achieve state-of-the-art results for 3D shape classification and retrieval tasks under aligned and rotated settings.

Journal ArticleDOI
TL;DR: Zhang et al. propose a new setup of differentiable architecture search (DARTS) in which only BatchNorm is trained; random features dilute the auxiliary connection role of skip-connections in supernet optimization and enable the search algorithm to focus on fairer operation selection.
Abstract: Differentiable architecture search (DARTS) has significantly promoted the development of NAS techniques because of its high search efficiency and effectiveness, but it suffers from performance collapse. In this paper, we make efforts to alleviate the performance collapse problem for DARTS from two aspects. First, we investigate the expressive power of the supernet in DARTS and then derive a new setup of the DARTS paradigm with only BatchNorm trained. Second, we theoretically find that random features dilute the auxiliary connection role of skip-connection in supernet optimization and enable the search algorithm to focus on fairer operation selection, thereby solving the performance collapse problem. We instantiate DARTS and PC-DARTS with random features to build an improved version for each, named RF-DARTS and RF-PCDARTS respectively. Experimental results show that RF-DARTS obtains 94.36% test accuracy on CIFAR-10 (which is the nearest optimal result in NAS-Bench-201), and achieves a new state-of-the-art top-1 test error of 24.0% on ImageNet when transferring from CIFAR-10. Moreover, RF-DARTS performs robustly across three datasets (CIFAR-10, CIFAR-100, and SVHN) and four search spaces (S1-S4). Besides, RF-PCDARTS achieves even better results on ImageNet, that is, 23.9% top-1 and 7.1% top-5 test error, surpassing representative methods like single-path, training-free, and partial-channel paradigms directly searched on ImageNet.
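The "train only BatchNorm" supernet setup can be reproduced in a few lines: freeze every parameter except the BatchNorm affine weights, leaving the convolutions as random features. The toy supernet below is illustrative; the actual DARTS search space is much larger.

```python
import torch.nn as nn

def freeze_all_but_batchnorm(model):
    """Keep only BatchNorm affine parameters trainable; everything else stays
    at its random initialization (random features)."""
    for module in model.modules():
        is_bn = isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
        for p in module.parameters(recurse=False):
            p.requires_grad_(is_bn)

supernet = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
)
freeze_all_but_batchnorm(supernet)
trainable = [n for n, p in supernet.named_parameters() if p.requires_grad]
print(trainable)  # only the BatchNorm weight/bias tensors remain trainable
```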

TL;DR: The "effective learning rate" is proposed as a substitute for the learning rate to measure the update efficiency of a normalized neural network trained with stochastic gradient descent (SGD).
Abstract: Since Batch Normalization (Ioffe & Szegedy, 2015) became an indispensable module of popular network structures, the scale of the weight norm does not affect the output of a unit at all, and the Euclidean distance defined in weight space completely fails to measure the evolution of a DNN during the learning process. As a result, the learning rate η cannot properly measure the update efficiency of a normalized DNN. To deal with this issue, van Laarhoven (2017), Hoffer et al. (2018), and Zhang et al. (2019b) propose the "effective learning rate" as a substitute for the learning rate to measure the update efficiency of a normalized neural network trained with stochastic gradient descent (SGD), defined as the learning rate divided by the squared ℓ2 norm of the weights, η_eff = η/‖w‖².
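Assuming the usual definition for scale-invariant (BatchNorm-normalized) weights, the effective learning rate can be computed as the learning rate divided by the squared weight norm:

```python
import torch

def effective_learning_rate(weight, lr):
    """Effective learning rate of a scale-invariant (e.g. BN-normalized) weight.

    Because rescaling w does not change the layer's output, the meaningful
    step size is the learning rate relative to the squared L2 weight norm:
    eta_eff = eta / ||w||^2.
    """
    return lr / weight.norm().pow(2).item()

w = torch.randn(64, 3, 3, 3)   # e.g. a conv weight followed by BatchNorm
print(effective_learning_rate(w, lr=0.1))
```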

Journal ArticleDOI
11 Apr 2022
TL;DR: A new NAS method called TNAS (NAS with trees), which improves search efficiency by exploring only a small number of architectures while also achieving a higher search accuracy, is proposed.
Abstract: The key challenge in neural architecture search (NAS) is designing how to explore wisely in the huge search space. We propose a new NAS method called TNAS (NAS with trees), which improves search efficiency by exploring only a small number of architectures while also achieving a higher search accuracy. TNAS introduces an architecture tree and a binary operation tree, to factorize the search space and substantially reduce the exploration size. TNAS performs a modified bi-level Breadth-First Search in the proposed trees to discover a high-performance architecture. Impressively, TNAS finds the global optimal architecture on CIFAR-10 with test accuracy of 94.37% in four GPU hours in NAS-Bench-201. The average test accuracy is 94.35%, which outperforms the state-of-the-art. Code is available at: https://github.com/guochengqian/TNAS.

Journal ArticleDOI
23 Jun 2022-Sensors
TL;DR: A lightweight structure is designed that extracts local features by explicitly supplementing the distribution information of the input features, yielding distinctive features for point cloud analysis and achieving on-par performance with previous state-of-the-art (SOTA) methods.
Abstract: Effectively integrating local features and their spatial distribution information for more effective point cloud analysis is a subject that has been explored for a long time. Inspired by convolutional neural networks (CNNs), this paper studies the relationship between local features and their spatial characteristics and proposes a concise architecture to effectively integrate them instead of designing more sophisticated feature extraction modules. Different positions in the feature map of a 2D image correspond to different weights in the convolution kernel, making the obtained features sensitive to local distribution characteristics. Thus, the spatial distribution of the input features of the point cloud within the receptive field is critical for capturing abstract regional aggregated features. We design a lightweight structure to extract local features by explicitly supplementing the distribution information of the input features to obtain distinctive features for point cloud analysis. Compared with the baseline, our model shows improvements in accuracy and convergence speed, and these advantages facilitate the introduction of the snapshot ensemble. Aiming at the shortcomings of the commonly used cosine annealing learning schedule, we design a new annealing schedule that can be flexibly adjusted for the snapshot ensemble technique, which improves the performance by a large margin. Extensive experiments on typical benchmarks verify that, although it adopts basic shared multi-layer perceptrons (MLPs) as feature extractors, the proposed model with a lightweight structure achieves on-par performance with previous state-of-the-art (SOTA) methods (e.g., ModelNet40 classification, 0.98 million parameters and 93.5% accuracy; S3DIS segmentation, 1.4 million parameters and 68.7% mIoU).
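For reference, the standard cyclic cosine annealing schedule that snapshot ensembles build on looks as follows; the paper proposes a modified, more flexible schedule, which is not reproduced here.

```python
import math

def cyclic_cosine_lr(step, total_steps, cycles, base_lr):
    """Standard cyclic cosine annealing used for snapshot ensembles.

    The learning rate restarts `cycles` times; a model snapshot is typically
    saved at each minimum. This is the common baseline schedule, not the
    paper's modified one.
    """
    steps_per_cycle = total_steps // cycles
    t = step % steps_per_cycle
    return base_lr * 0.5 * (math.cos(math.pi * t / steps_per_cycle) + 1.0)

lrs = [cyclic_cosine_lr(s, total_steps=300, cycles=3, base_lr=0.1) for s in range(300)]
print(max(lrs), min(lrs))  # restarts at 0.1 at the start of each of the 3 cycles
```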