
Showing papers by "Yulan Guo published in 2021"


Journal ArticleDOI
TL;DR: This paper presents a comprehensive review of recent progress in deep learning methods for point clouds, covering three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation.
Abstract: Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.

1,021 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Wang et al. proposed an unsupervised degradation representation learning scheme for blind super-resolution without explicit degradation estimation, which can extract discriminative representations to obtain accurate degradation information.
Abstract: Most existing CNN-based super-resolution (SR) methods are developed based on an assumption that the degradation is fixed and known (e.g., bicubic downsampling). However, these methods suffer a severe performance drop when the real degradation is different from their assumption. To handle various unknown degradations in real-world applications, previous methods rely on degradation estimation to reconstruct the SR image. Nevertheless, degradation estimation methods are usually time-consuming and may lead to SR failure due to large estimation errors. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. Moreover, we introduce a Degradation-Aware SR (DASR) network with flexible adaption to various degradations based on the learned representations. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on both synthetic and real images show that our network achieves state-of-the-art performance for the blind SR task. Code is available at: https://github.com/LongguangWang/DASR.

178 citations
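
To make the degradation-representation idea above concrete, here is a minimal PyTorch sketch (module names, sizes, and the sigmoid modulation are illustrative assumptions, not the authors' DASR code): an encoder maps an LR patch to a degradation embedding, and the embedding modulates the channels of an SR feature block.

```python
import torch
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Map an LR patch to a compact degradation embedding (sketch)."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(1),              # global pooling -> (B, dim, 1, 1)
        )
        self.mlp = nn.Sequential(nn.Flatten(), nn.Linear(dim, dim))

    def forward(self, lr_patch):
        return self.mlp(self.body(lr_patch))      # (B, dim)

class DegradationAwareBlock(nn.Module):
    """Use the embedding to predict per-channel modulation of SR features (sketch)."""
    def __init__(self, channels=64, dim=64):
        super().__init__()
        self.to_scale = nn.Linear(dim, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, degradation_emb):
        scale = torch.sigmoid(self.to_scale(degradation_emb))[:, :, None, None]
        return self.conv(feat * scale)            # degradation-conditioned features

# In the paper the encoder is trained so that patches sharing a degradation map to
# nearby representations (a contrastive-style objective); that loss is omitted here.
lr = torch.rand(4, 3, 48, 48)
feat = torch.rand(4, 64, 48, 48)
emb = DegradationEncoder()(lr)
out = DegradationAwareBlock()(feat, emb)
print(out.shape)  # torch.Size([4, 64, 48, 48])
```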


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Wang et al. explored the sparsity in image SR to improve the inference efficiency of SR networks and developed a Sparse Mask SR (SMSR) network that learns sparse masks to prune redundant computation.
Abstract: Current CNN-based super-resolution (SR) methods process all locations equally with computational resources being uniformly assigned in space. However, since missing details in low-resolution (LR) images mainly exist in regions of edges and textures, less computational resources are required for those flat regions. Therefore, existing CNN-based methods involve redundant computation in flat regions, which increases their computational cost and limits their applications on mobile devices. In this paper, we explore the sparsity in image SR to improve inference efficiency of SR networks. Specifically, we develop a Sparse Mask SR (SMSR) network to learn sparse masks to prune redundant computation. Within our SMSR, spatial masks learn to identify "important" regions while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately localized and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41%/33%/27% FLOPs being reduced for ×2/3/4 SR. Code is available at: https://github.com/LongguangWang/SMSR.

123 citations
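
A hedged sketch of the spatial/channel masking idea in PyTorch (a hypothetical module; the masks are applied as dense multiplications for clarity, whereas the real FLOP savings require sparse kernels and the paper's Gumbel-softmax binarization):

```python
import torch
import torch.nn as nn

class SparseMaskBlock(nn.Module):
    """Sketch: a spatial mask marks 'important' pixels, a channel mask marks the
    channels kept in 'unimportant' (flat) regions."""
    def __init__(self, channels=64):
        super().__init__()
        self.spatial_head = nn.Conv2d(channels, 1, 3, padding=1)
        self.channel_head = nn.Linear(channels, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, tau=1.0):
        # Soft masks for illustration; the paper makes them (nearly) binary.
        spatial = torch.sigmoid(self.spatial_head(x) / tau)             # (B, 1, H, W)
        channel = torch.sigmoid(self.channel_head(x.mean(dim=(2, 3))))  # (B, C)
        channel = channel[:, :, None, None]
        feat = self.conv(x)
        # Important locations keep all channels; flat regions keep only marked channels.
        return feat * spatial + feat * channel * (1.0 - spatial)

x = torch.rand(2, 64, 32, 32)
print(SparseMaskBlock()(x).shape)  # torch.Size([2, 64, 32, 32])
```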


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Hu et al. proposed SpinNet to extract robust and general 3D local features which are rotationally invariant whilst sufficiently informative to enable accurate point cloud registration and reconstruction, by mapping the input local surface into a carefully designed cylindrical space that enables end-to-end optimization with an SO(2) equivariant representation.
Abstract: Extracting robust and general 3D local features is key to downstream tasks such as point cloud registration and reconstruction. Existing learning-based local descriptors are either sensitive to rotation transformations, or rely on classical handcrafted features which are neither general nor representative. In this paper, we introduce a new, yet conceptually simple, neural architecture, termed SpinNet, to extract local features which are rotationally invariant whilst sufficiently informative to enable accurate registration. A Spatial Point Transformer is first introduced to map the input local surface into a carefully designed cylindrical space, enabling end-to-end optimization with SO(2) equivariant representation. A Neural Feature Extractor which leverages the powerful point-based and 3D cylindrical convolutional neural layers is then utilized to derive a compact and representative descriptor for matching. Extensive experiments on both indoor and outdoor datasets demonstrate that SpinNet outperforms existing state-of-the-art techniques by a large margin. More critically, it has the best generalization ability across unseen scenarios with different sensor modalities. The code is available at https://github.com/QingyongHu/SpinNet.

75 citations
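
The cylindrical mapping can be illustrated with a few lines of NumPy; this sketch covers only the coordinate transform (the reference axis and in-plane basis are assumptions), not the paper's Spatial Point Transformer or its 3D cylindrical convolution layers:

```python
import numpy as np

def to_cylindrical(neighbors, center, z_axis):
    """Map neighboring points of a local patch into a cylindrical frame (sketch).

    neighbors: (N, 3) points around `center`; z_axis: (3,) reference axis
    (e.g. an estimated normal). Rotations about z_axis only shift theta,
    which is what SO(2)-equivariant layers are designed to absorb."""
    z_axis = z_axis / np.linalg.norm(z_axis)
    local = neighbors - center                      # translate to the local origin
    z = local @ z_axis                              # height along the reference axis
    planar = local - np.outer(z, z_axis)            # projection onto the x-y plane
    rho = np.linalg.norm(planar, axis=1)            # radius
    ref = np.array([1.0, 0.0, 0.0])                 # arbitrary in-plane reference
    if abs(ref @ z_axis) > 0.9:
        ref = np.array([0.0, 1.0, 0.0])
    x_dir = np.cross(z_axis, ref); x_dir /= np.linalg.norm(x_dir)
    y_dir = np.cross(z_axis, x_dir)
    theta = np.arctan2(planar @ y_dir, planar @ x_dir)
    return np.stack([rho, theta, z], axis=1)        # (N, 3) cylindrical coordinates

pts = np.random.rand(128, 3)
cyl = to_cylindrical(pts, pts.mean(0), np.array([0.0, 0.0, 1.0]))
print(cyl.shape)  # (128, 3)
```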


Journal ArticleDOI
TL;DR: This paper presents an end-to-end trainable convolution neural network to fully use cost volumes for stereo matching, and investigates the problem of developing a robust model to perform well across multiple datasets with different characteristics.
Abstract: For CNN-based stereo matching methods, cost volumes play an important role in achieving good matching accuracy. In this paper, we present an end-to-end trainable convolution neural network to fully use cost volumes for stereo matching. Our network consists of three sub-modules, i.e., shared feature extraction, initial disparity estimation, and disparity refinement. Cost volumes are calculated at multiple levels using the shared features, and are used in both initial disparity estimation and disparity refinement sub-modules. To improve the efficiency of disparity refinement, multi-scale feature constancy is introduced to measure the correctness of the initial disparity in feature space. These sub-modules of our network are tightly-coupled, making it compact and easy to train. Moreover, we investigate the problem of developing a robust model to perform well across multiple datasets with different characteristics. We achieve this by introducing a two-stage finetuning scheme to gently transfer the model to target datasets. Specifically, in the first stage, the model is finetuned using both a large synthetic dataset and the target datasets with a relatively large learning rate, while in the second stage the model is trained using only the target datasets with a small learning rate. The proposed method is tested on several benchmarks including the Middlebury 2014, KITTI 2015, ETH3D 2017, and SceneFlow datasets. Experimental results show that our method achieves the state-of-the-art performance on all the datasets. The proposed method also won the 1st prize on the Stereo task of Robust Vision Challenge 2018.

74 citations
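
As background for the cost volumes discussed above, here is a common correlation-style construction in PyTorch (a generic sketch; the paper's multi-level volumes and feature-constancy refinement are not reproduced):

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (B, max_disp, H, W) cost volume from (B, C, H, W) features (sketch).

    The cost at disparity d compares the left pixel (x, y) with the right pixel
    (x - d, y); out-of-view positions are left at zero."""
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_left[:, :, :, d:] *
                                   feat_right[:, :, :, :-d]).mean(dim=1)
    return volume

left = torch.rand(1, 32, 64, 128)
right = torch.rand(1, 32, 64, 128)
cost = correlation_cost_volume(left, right, max_disp=48)
prob = F.softmax(cost, dim=1)  # turn per-disparity correlation into a distribution
print(cost.shape)  # torch.Size([1, 48, 64, 128])
```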


Journal ArticleDOI
TL;DR: In this article, a local feature aggregation module is introduced to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details for large-scale point cloud semantic segmentation.
Abstract: We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Comparative experiments show that our RandLA-Net can process 1 million points in a single pass up to 200× faster than existing approaches. Moreover, extensive experiments on several large-scale point cloud datasets, including Semantic3D, SemanticKITTI, Toronto3D, S3DIS and NPM3D, demonstrate the state-of-the-art semantic segmentation performance of our RandLA-Net.

70 citations
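
The sampling step the paper builds on is deliberately simple; the sketch below (illustrative function name and tensor shapes) shows the random downsampling of a point cloud that the local feature aggregation module then has to compensate for:

```python
import torch

def random_downsample(points, features, ratio=0.25):
    """Randomly keep a fraction of points (the cheap sampling RandLA-Net builds on).

    points: (N, 3), features: (N, C). Random sampling is O(1) per point, unlike
    farthest point sampling, but may drop informative points by chance; the paper
    compensates by enlarging each remaining point's receptive field."""
    n = points.shape[0]
    keep = torch.randperm(n)[: max(1, int(n * ratio))]
    return points[keep], features[keep]

pts = torch.rand(100_000, 3)
feats = torch.rand(100_000, 32)
sub_pts, sub_feats = random_downsample(pts, feats)
print(sub_pts.shape, sub_feats.shape)  # torch.Size([25000, 3]) torch.Size([25000, 32])
```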


Journal ArticleDOI
TL;DR: A deformable convolution network (LF-DFnet) is proposed to handle the disparity problem for LF image SR; it uses an angular deformable alignment module (ADAM) for feature-level alignment, so that angular information can be well incorporated and encoded into the features of each view, which benefits the SR reconstruction of all LF images.
Abstract: Light field (LF) cameras can record scenes from multiple perspectives, and thus introduce beneficial angular information for image super-resolution (SR). However, it is challenging to incorporate angular information due to disparities among LF images. In this paper, we propose a deformable convolution network (i.e., LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) for feature-level alignment. Based on ADAM, we further propose a collect-and-distribute approach to perform bidirectional alignment between the center-view feature and each side-view feature. Using our approach, angular information can be well incorporated and encoded into features of each view, which benefits the SR reconstruction of all LF images. Moreover, we develop a baseline-adjustable LF dataset to evaluate SR performance under different disparity variations. Experiments on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet can generate high-resolution images with more faithful details and achieve state-of-the-art reconstruction accuracy. Besides, our LF-DFnet is more robust to disparity variations, which has not been well addressed in literature.

45 citations
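
A minimal sketch of deformable feature alignment between a side view and the center view, using torchvision's DeformConv2d (the offset predictor and channel sizes are assumptions; the paper's ADAM and its collect-and-distribute scheme are more elaborate):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AngularAlignSketch(nn.Module):
    """Sketch of deformable alignment: offsets are predicted from the concatenated
    features, then a deformable convolution samples the side-view feature at
    disparity-shifted locations to align it towards the center view."""
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.align = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, side_feat, center_feat):
        offsets = self.offset_pred(torch.cat([side_feat, center_feat], dim=1))
        return self.align(side_feat, offsets)   # side view aligned towards the center view

side = torch.rand(1, 32, 64, 64)
center = torch.rand(1, 32, 64, 64)
print(AngularAlignSketch()(side, center).shape)  # torch.Size([1, 32, 64, 64])
```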


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Xu et al. presented a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid, which can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet.
Abstract: Real-time performance of stereo matching networks is important for many applications, such as automatic driving, robot navigation and augmented reality (AR). Although significant progress has been made in stereo matching networks in recent years, it is still challenging to balance real-time performance and accuracy. In this paper, we present a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid. The slicing layer is parameter-free, which allows us to obtain a high quality cost volume of high resolution from a low-resolution cost volume under the guide of the learned guidance map efficiently. The proposed cost volume upsampling module can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet. The resulting networks are accelerated several times while maintaining comparable accuracy. Furthermore, we design a real-time network (named BGNet) based on this module, which outperforms existing published real-time deep stereo matching networks, as well as some complex networks on the KITTI stereo datasets. The code is available at https://github.com/YuhuaXu/BGNet.

37 citations
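
The parameter-free slicing operation can be written directly with grid_sample; the sketch below (tensor layout and names are assumptions, not the authors' code) upsamples a low-resolution grid to full resolution under a guidance map via trilinear interpolation:

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guidance):
    """Slice a low-resolution bilateral grid with a full-resolution guidance map (sketch).

    grid:     (B, C, D, Hg, Wg) coarse grid (e.g. C = disparity planes, D = guidance bins)
    guidance: (B, 1, H, W) learned guidance map with values in [0, 1]
    returns:  (B, C, H, W), obtained by parameter-free trilinear interpolation."""
    b = grid.shape[0]
    h, w = guidance.shape[-2:]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=grid.device),
        torch.linspace(-1, 1, w, device=grid.device),
        indexing="ij",
    )
    xs = xs.expand(b, h, w)
    ys = ys.expand(b, h, w)
    zs = guidance.squeeze(1) * 2 - 1                         # map [0, 1] -> [-1, 1]
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)  # (B, 1, H, W, 3) as (x, y, z)
    sliced = F.grid_sample(grid, coords, align_corners=True) # (B, C, 1, H, W)
    return sliced.squeeze(2)

grid = torch.rand(1, 48, 8, 32, 64)        # e.g. 48 disparity planes on a 32x64 grid
guidance = torch.rand(1, 1, 256, 512)      # full-resolution guidance map
print(slice_bilateral_grid(grid, guidance).shape)  # torch.Size([1, 48, 256, 512])
```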


Journal ArticleDOI
TL;DR: A gated recurrent multiattention neural network (GRMA-Net) that uses multilevel attention modules to focus on informative regions and extract more discriminative features, which are then arranged as spatial sequences and fed into a deep gated recurrent unit (GRU) to capture long-range dependency and contextual relationships.
Abstract: With the advances of deep learning, many recent CNN-based methods have yielded promising results for image classification. In very high-resolution (VHR) remote sensing images, the contributions of different regions to image classification can vary significantly, because informative areas are generally limited and scattered throughout the whole image. Therefore, how to pay more attention to these informative areas and better incorporate them over long distances are two main challenges to be addressed. In this article, we propose a gated recurrent multiattention neural network (GRMA-Net) to address these problems. Because informative features generally occur at multiple stages in a network (i.e., local texture features at shallow layers and global profile features at deep layers), we use multilevel attention modules to focus on informative regions to extract more discriminative features. Then, these features are arranged as spatial sequences and fed into a deep-gated recurrent unit (GRU) to capture long-range dependency and contextual relationship. We evaluate our method on the UC Merced (UCM), Aerial Image dataset (AID), NWPU-RESISC (NWPU), and Optimal-31 (Optimal) datasets. Experimental results have demonstrated the superior performance of our method as compared to other state-of-the-art methods.

36 citations


Journal ArticleDOI
TL;DR: This paper proposes a simple yet effective Point Context Encoding (PointCE) module to capture semantic contexts of a point cloud and adaptively highlight intermediate feature maps, and introduces a Semantic Context Encoding loss (SCE-loss) to supervise the network to learn rich semantic context features.
Abstract: Semantic context plays a significant role in image segmentation. However, few prior works have explored semantic contexts for 3D point cloud segmentation. In this paper, we propose a simple yet effective Point Context Encoding (PointCE) module to capture semantic contexts of a point cloud and adaptively highlight intermediate feature maps. We also introduce a Semantic Context Encoding loss (SCE-loss) to supervise the network to learn rich semantic context features. To avoid hyperparameter tuning and achieve better convergence performance, we further propose a geometric mean loss to integrate both SCE-loss and segmentation loss. Our PointCE module is general and lightweight, and can be integrated into any point cloud segmentation architecture to improve its segmentation performance with only marginal extra overheads. Experimental results on the ScanNet, S3DIS and Semantic3D datasets show that consistent and significant improvement can be achieved for several different networks by integrating our PointCE module.

35 citations
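
The geometric mean combination of the two losses is simple enough to show directly; a sketch, assuming a plain two-term geometric mean (the paper's exact formulation may differ):

```python
import torch

def geometric_mean_loss(sce_loss, seg_loss, eps=1e-8):
    """Combine the semantic-context loss and the segmentation loss by their geometric
    mean, so neither term needs a hand-tuned weighting factor (sketch of the idea)."""
    return torch.sqrt((sce_loss + eps) * (seg_loss + eps))

sce = torch.tensor(0.7, requires_grad=True)
seg = torch.tensor(1.3, requires_grad=True)
loss = geometric_mean_loss(sce, seg)
loss.backward()
print(float(loss))  # ~0.954
```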


Proceedings Article
01 Jan 2021
TL;DR: Wang et al. proposed a scale-aware knowledge transfer paradigm to transfer knowledge from scale-specific networks to a scale-arbitrary network, which can achieve promising results for non-integer and asymmetric image super-resolution.
Abstract: Recently, the performance of single image super-resolution (SR) has been significantly improved with powerful networks. However, these networks are developed for image SR with a single specific integer scale (e.g., ×2, ×3, ×4), and cannot be used for non-integer and asymmetric SR. In this paper, we propose to learn a scale-arbitrary image SR network from scale-specific networks. Specifically, we propose a plug-in module for existing SR networks to perform scale-arbitrary SR, which consists of multiple scale-aware feature adaption blocks and a scale-aware upsampling layer. Moreover, we introduce a scale-aware knowledge transfer paradigm to transfer knowledge from scale-specific networks to the scale-arbitrary network. Our plug-in module can be easily adapted to existing networks to achieve scale-arbitrary SR. These networks plugged with our module can achieve promising results for non-integer and asymmetric SR while maintaining state-of-the-art performance for SR with integer scale factors. Besides, the additional computational and memory cost of our module is very small.
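
A hedged sketch of a scale-conditioned upsampling layer for arbitrary and asymmetric scale factors (module structure and the sigmoid modulation are illustrative, not the paper's plug-in module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareUpsample(nn.Module):
    """Sketch: features are conditioned on the scale factors, then resampled to any
    output size with bilinear interpolation."""
    def __init__(self, channels=64):
        super().__init__()
        self.scale_mlp = nn.Sequential(nn.Linear(2, channels), nn.ReLU(),
                                       nn.Linear(channels, channels))
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feat, scale_h, scale_w):
        b, c, h, w = feat.shape
        scale = torch.tensor([[scale_h, scale_w]], dtype=feat.dtype, device=feat.device)
        mod = self.scale_mlp(scale).view(1, c, 1, 1)   # scale-conditioned modulation
        feat = feat * torch.sigmoid(mod)
        out_h, out_w = int(round(h * scale_h)), int(round(w * scale_w))
        feat = F.interpolate(feat, size=(out_h, out_w), mode="bilinear", align_corners=False)
        return self.to_rgb(feat)

feat = torch.rand(1, 64, 48, 48)
sr = ScaleAwareUpsample()(feat, scale_h=2.5, scale_w=3.2)  # non-integer, asymmetric
print(sr.shape)  # torch.Size([1, 3, 120, 154])
```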

Posted Content
TL;DR: A dense nested interactive module (DNIM) is proposed to achieve progressive interaction among high-level and low-level features, together with a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features.
Abstract: Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied for infrared small targets since pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNANet) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction among high-level and low-level features. With the repeated interaction in DNIM, infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNANet, contextual information of small targets can be well incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
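
A CBAM-style sketch of cascaded channel-then-spatial attention, in the spirit of the CSAM described above (the paper's exact block may differ):

```python
import torch
import torch.nn as nn

class CascadedAttention(nn.Module):
    """Channel attention re-weights feature channels, then spatial attention
    re-weights locations, so faint small-target responses can be emphasized."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                               # re-weight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial(pooled)                       # re-weight locations

x = torch.rand(2, 64, 32, 32)
print(CascadedAttention()(x).shape)  # torch.Size([2, 64, 32, 32])
```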

Posted ContentDOI
TL;DR: Hu et al. built a large-scale satellite video dataset with rich annotations for the task of moving object detection and tracking, established the first public benchmark for moving object detection and tracking in satellite videos, and extensively evaluated the performance of several representative approaches on the dataset.
Abstract: Satellite video cameras can provide continuous observation for a large-scale area, which is important for many remote sensing applications. However, achieving moving object detection and tracking in satellite videos remains challenging due to the insufficient appearance information of objects and lack of high-quality datasets. In this paper, we first build a large-scale satellite video dataset with rich annotations for the task of moving object detection and tracking. This dataset is collected by the Jilin-1 satellite constellation and composed of 47 high-quality videos with 1,646,038 instances of interest for object detection and 3,711 trajectories for object tracking. We then introduce a motion modeling baseline to improve the detection rate and reduce false alarms based on accumulative multi-frame differencing and robust matrix completion. Finally, we establish the first public benchmark for moving object detection and tracking in satellite videos, and extensively evaluate the performance of several representative approaches on our dataset. Comprehensive experimental analyses and insightful conclusions are also provided. The dataset is available at https://github.com/QingyongHu/VISO.
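
A NumPy sketch of the accumulative multi-frame differencing part of the baseline (the threshold and the median background are illustrative assumptions; the robust matrix completion step is not shown):

```python
import numpy as np

def accumulative_frame_difference(frames, threshold=15.0):
    """Sketch of accumulative multi-frame differencing for tiny moving objects.

    frames: (T, H, W) registered grayscale satellite frames. Per-frame differences
    against a median background are accumulated, so a small mover that is barely
    visible in any single difference image becomes salient over time."""
    frames = frames.astype(np.float32)
    background = np.median(frames, axis=0)              # crude static background
    acc = np.abs(frames - background).mean(axis=0)      # accumulated (averaged) difference
    return acc > threshold                              # binary candidate mask

frames = np.random.randint(0, 255, size=(20, 128, 128)).astype(np.uint8)
mask = accumulative_frame_difference(frames)
print(mask.shape, mask.dtype)  # (128, 128) bool
```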

Proceedings ArticleDOI
01 Jun 2021
TL;DR: Wang et al. proposed a symmetric bi-directional parallax attention module (biPAM) and an inline occlusion handling scheme to effectively interact cross-view information.
Abstract: Although recent years have witnessed the great advances in stereo image super-resolution (SR), the beneficial information provided by binocular systems has not been fully used. Since stereo images are highly symmetric under epipolar constraint, in this paper, we improve the performance of stereo image SR by exploiting symmetry cues in stereo image pairs. Specifically, we propose a symmetric bi-directional parallax attention module (biPAM) and an inline occlusion handling scheme to effectively interact cross-view information. Then, we design a Siamese network equipped with a biPAM to super-resolve both sides of views in a highly symmetric manner. Finally, we design several illuminance-robust losses to enhance stereo consistency. Experiments on four public datasets demonstrate the superior performance of our method. Source code is available at https://github.com/YingqianWang/iPASSR.
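
A sketch of row-wise (epipolar) cross-view attention, the core mechanism behind parallax attention; this single-direction version with hypothetical layer names omits the paper's bi-directional design and occlusion handling:

```python
import torch
import torch.nn as nn

class ParallaxAttentionSketch(nn.Module):
    """Attention is restricted to pixels on the same row, i.e. along the epipolar line
    of a rectified stereo pair, to warp right-view features towards the left view."""
    def __init__(self, channels=64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_left, feat_right):
        q = self.q(feat_left)                     # (B, C, H, W)
        k = self.k(feat_right)
        v = self.v(feat_right)
        scores = torch.einsum("bchw,bchv->bhwv", q, k) / q.shape[1] ** 0.5  # (B, H, W, W)
        attn = torch.softmax(scores, dim=-1)
        warped = torch.einsum("bhwv,bchv->bchw", attn, v)  # right features warped to left
        return warped

left = torch.rand(1, 64, 30, 90)
right = torch.rand(1, 64, 30, 90)
print(ParallaxAttentionSketch()(left, right).shape)  # torch.Size([1, 64, 30, 90])
```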

Posted Content
TL;DR: A comprehensive survey of recent achievements in scene classification using deep learning covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods is provided.
Abstract: Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision. The rise of large-scale datasets, which constitute the corresponding dense sampling of diverse real-world scenes, and the renaissance of deep learning techniques, which learn powerful feature representations directly from big raw data, have been bringing remarkable progress in the field of scene representation and classification. To help researchers master needed advances in this field, the goal of this paper is to provide a comprehensive survey of recent achievements in scene classification using deep learning. More than 200 major publications are included in this survey covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods. In retrospect of what has been achieved so far, this paper is also concluded with a list of promising research opportunities.

Journal ArticleDOI
TL;DR: Zhang et al. leverage global distribution differences by introducing an adversarial loss into the training stage of self-supervised depth estimation, which can be back-propagated to the depth estimation module to improve its performance.
Abstract: Loss function plays a key role in self-supervised monocular depth estimation methods. Current reprojection loss functions are hand-designed and mainly focus on local patch similarity but overlook the global distribution differences between a synthetic image and a target image. In this paper, we leverage global distribution differences by introducing an adversarial loss into the training stage of self-supervised depth estimation. Specifically, we formulate this task as a novel view synthesis problem. We use a depth estimation module and a pose estimation module to form a generator, and then design a discriminator to learn the global distribution differences between real and synthetic images. With the learned global distribution differences, the adversarial loss can be back-propagated to the depth estimation module to improve its performance. Experiments on the KITTI dataset have demonstrated the effectiveness of the adversarial loss. The adversarial loss is further combined with the reprojection loss to achieve the state-of-the-art performance on the KITTI dataset.
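
A minimal sketch of combining a photometric reprojection loss with an adversarial term for the depth/pose generator (the non-saturating GAN formulation and the weighting factor are assumptions, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def generator_loss(synth_img, target_img, disc_logits_on_synth, weight_adv=0.01):
    """synth_img is the view synthesised from the predicted depth and pose; the
    adversarial term pushes its global distribution towards real images, while the
    reprojection term keeps local patch similarity."""
    reprojection = F.l1_loss(synth_img, target_img)
    adversarial = F.binary_cross_entropy_with_logits(
        disc_logits_on_synth, torch.ones_like(disc_logits_on_synth))
    return reprojection + weight_adv * adversarial

synth = torch.rand(2, 3, 96, 320)
target = torch.rand(2, 3, 96, 320)
logits = torch.randn(2, 1)   # discriminator output on the synthesised images
print(float(generator_loss(synth, target, logits)))
```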

Journal ArticleDOI
TL;DR: Li et al. proposed a self-supervised framework for multi-task learning on depth, camera motion and semantics from panoramic videos, which is based on differentiable warping of adjacent views to the target.
Abstract: With the advent of virtual reality and augmented reality applications, omnidirectional imaging and 360° cameras become increasingly popular in many scenarios such as entertainment and autonomous systems. In this paper, we propose a self-supervised framework for multi-task learning on depth, camera motion and semantics from panoramic videos. Specifically, our method is based on differentiable warping of adjacent views to the target. Two improvements are provided. First, we introduce a view synthesis module based on equirectangular projection to enable direct optimization on panoramic images. Second, we introduce a self-supervised segmentation branch to involve the constraint of semantic consistency for further improvement. Extensive experiments on two 360° video and two 360° image datasets demonstrate that our method outperforms the state-of-the-art and achieves favorable cross-modality performance.

Journal ArticleDOI
TL;DR: A Neural Architecture Search method is proposed to search a Feature Pyramid Network (FPN) module for 3D indoor point cloud semantic segmentation; the searched module is generic and effective and can be added to existing segmentation networks to augment their segmentation performance.
Abstract: Semantic segmentation of 3D Light Detection and Ranging (LiDAR) indoor point clouds using deep learning has been an active topic in recent years. However, most deep neural networks on point clouds conduct multi-level feature fusion via a simple U-shape architecture, which lacks enough capacity on both classification and localization in the segmentation task. In this paper, we propose a Neural Architecture Search (NAS) method to search a Feature Pyramid Network (FPN) module for 3D indoor point cloud semantic segmentation. Specifically, we aim to automatically find an effective feature pyramid architecture as a feature fusion neck in a designed novel pyramidal search space covering all information communication paths for multi-level features. The searched FPN module, named SFPN, contains the most important connections among all the potential paths to fuse representations at different levels. Our proposed SFPN is generic and effective, and can be added to existing segmentation networks to augment the segmentation performance. Extensive experiments on ScanNet and S3DIS show that consistent and remarkable gains of segmentation performance can be achieved by different classical networks combined with SFPN. In particular, PointNet++-SFPN achieves mIoU gains of 7.8% on ScanNet v2 and 4.7% on S3DIS, and PointConv-SFPN achieves 4.5% and 3.7% improvement respectively on the above datasets.

Journal ArticleDOI
Qian Yin, Ting Liu, Zaiping Lin, Wei An, Yulan Guo 
TL;DR: This letter proposes a novel object detection method based on a spatial–temporal tensor data structure to exploit the inner spatial and temporal correlation within a satellite video and extends the decomposition formulation with bounded noise to achieve robust performance under complex backgrounds.
Abstract: Low-rank matrix decomposition approaches have achieved significant progress in small and dim object detection in satellite videos. However, it is still challenging to achieve robust performance and fast processing under complex and highly heterogeneous backgrounds since satellite video data can neither adequately fit the foreground structure nor the background model in the existing matrix decomposition models. In this letter, we propose a novel object detection method based on a spatial–temporal tensor data structure. First, we construct a tensor data structure to exploit the inner spatial and temporal correlation within a satellite video. Second, we extend the decomposition formulation with bounded noise to achieve robust performance under complex backgrounds. This formulation integrates low-rank background, structured sparse foreground, and their noises into a tensor decomposition problem. For background separation, a weighted Schatten p-norm is incorporated to provide an adaptive threshold for the singular values of the background tensor. Finally, the proposed model is solved using the alternating direction method of multipliers (ADMM) scheme. Experimental results on various real scenes demonstrate the superiority of the proposed method against the compared approaches.

Proceedings Article
01 Jan 2021
TL;DR: Li et al. proposed Dynamic sparse-to-dense Cross Modal Learning (DsCML) to increase the sufficiency of multi-modality information interaction for domain adaptation.
Abstract: Domain adaptation is critical for success when confronted with the lack of annotations in a new domain. Given the huge time consumption of the labeling process on 3D point clouds, domain adaptation for 3D semantic segmentation is highly desirable. With the rise of multi-modal datasets, large amounts of 2D images are accessible besides 3D point clouds. In light of this, we propose to further leverage 2D data for 3D domain adaptation by intra and inter domain cross modal learning. As for intra-domain cross modal learning, most existing works sample the dense 2D pixel-wise features into the same size with sparse 3D point-wise features, resulting in the abandonment of numerous useful 2D features. To address this problem, we propose Dynamic sparse-to-dense Cross Modal Learning (DsCML) to increase the sufficiency of multi-modality information interaction for domain adaptation. For inter-domain cross modal learning, we further advance Cross Modal Adversarial Learning (CMAL) on 2D and 3D data which contain different semantic content, aiming to promote high-level modal complementarity. We evaluate our model under various multi-modality domain adaptation settings including day-to-night, country-to-country and dataset-to-dataset, bringing large improvements over both uni-modal and multi-modal domain adaptation methods on all settings.

Proceedings ArticleDOI
17 Oct 2021
TL;DR: MSLD introduces a multi-spectral feature compressor based on the two-dimensional (2D) discrete cosine transform (DCT) to compress features while preserving diversity information, and outperforms state-of-the-art methods (including LaneATT and UFLD).
Abstract: It is desirable to maintain both high accuracy and runtime efficiency in lane detection. State-of-the-art methods mainly address the efficiency problem by direct compression of high-dimensional features. These methods usually suffer from information loss and cannot achieve satisfactory accuracy performance. To ensure the diversity of features and subsequently maintain information as much as possible, we introduce multi-frequency analysis into lane detection. Specifically, we propose a multi-spectral feature compressor (MSFC) based on two-dimensional (2D) discrete cosine transform (DCT) to compress features while preserving diversity information. We group features and associate each group with an individual frequency component, which incurs only 1/7 overhead of one-dimensional convolution operation but preserves more information. Moreover, to further enhance the discriminability of features, we design a multi-spectral lane feature aggregator (MSFA) based on one-dimensional (1D) DCT to aggregate features from each lane according to their corresponding frequency components. The proposed method outperforms the state-of-the-art methods (including LaneATT and UFLD) on TuSimple, CULane, and LLAMAS benchmarks. For example, our method achieves 76.32% F1 at 237 FPS and 76.98% F1 at 164 FPS on CULane, which is 1.23% and 0.30% higher than LaneATT. Our code and models are available at https://github.com/harrylin-hyl/MSLD.
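
A sketch of DCT-based feature compression in PyTorch: channels are split into groups and each group is pooled with a different 2D DCT basis instead of plain average pooling (frequency choices and shapes are assumptions, not the paper's MSFC configuration):

```python
import math
import torch
import torch.nn as nn

def dct2d_basis(u, v, h, w):
    """One 2D DCT-II basis function of size (h, w) for frequency indices (u, v)."""
    ys = torch.arange(h, dtype=torch.float32)
    xs = torch.arange(w, dtype=torch.float32)
    cos_y = torch.cos(math.pi * (ys + 0.5) * u / h)
    cos_x = torch.cos(math.pi * (xs + 0.5) * v / w)
    return cos_y[:, None] * cos_x[None, :]                # (h, w)

class MultiSpectralPool(nn.Module):
    """Each channel group is compressed with its own frequency component, so the
    pooled descriptor preserves more diversity than global average pooling."""
    def __init__(self, channels, h, w, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        assert channels % len(freqs) == 0
        basis = torch.stack([dct2d_basis(u, v, h, w) for u, v in freqs])  # (G, h, w)
        self.register_buffer("basis", basis)
        self.group = channels // len(freqs)

    def forward(self, x):                                  # x: (B, C, h, w)
        b, c, h, w = x.shape
        x = x.view(b, self.basis.shape[0], self.group, h, w)
        pooled = (x * self.basis[None, :, None]).sum(dim=(3, 4))  # (B, G, group)
        return pooled.reshape(b, c)                        # compressed per-channel descriptor

x = torch.rand(2, 64, 10, 25)
print(MultiSpectralPool(64, 10, 25)(x).shape)  # torch.Size([2, 64])
```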

Journal ArticleDOI
TL;DR: A novel framework is proposed for 3D human trajectory reconstruction in an indoor scene using monocular surveillance videos and static point clouds without any initiative cooperation; it achieves accurate trajectory reconstruction results on real-world videos.

Posted ContentDOI
05 Mar 2021-medRxiv
TL;DR: In this article, a computer-aided pituitary microadenoma (PM) diagnosis (PM-CAD) system based on deep learning was employed to assist radiologists in clinical workflow.
Abstract: Pituitary microadenoma (PM) is often difficult to detect by MR imaging alone. We employed a computer-aided PM diagnosis (PM-CAD) system based on deep learning to assist radiologists in clinical workflow. We enrolled 1,228 participants and stratified them into 3 non-overlapping cohorts for training, validation and testing purposes. Our PM-CAD system outperformed 6 existing established convolutional neural network models for detection of PM. In the test dataset, the diagnostic accuracy of the PM-CAD system was comparable to that of radiologists with > 10 years of professional expertise (94% versus 95%). The diagnostic accuracy in the internal and external datasets was 94% and 90%, respectively. Importantly, the PM-CAD system detected the presence of PM that had been previously misdiagnosed by radiologists. This is the first report showing that a PM-CAD system is a viable tool for detecting PM. Our results suggest that the PM-CAD system is applicable to radiology departments, especially in primary health care institutions.

Journal ArticleDOI
08 Jul 2021
TL;DR: SiFi is a self-updating system for indoor semantic floorplans; it uses a crowdsourcing-based task model to attract users to contribute semantic-rich videos and uses the maximum likelihood estimation method to solve the text inference problem.
Abstract: Due to the rapid development of indoor location-based services, automatically deriving an indoor semantic floorplan becomes a highly promising technique for ubiquitous applications. To make an indoor semantic floorplan fully practical, it is essential to handle the dynamics of semantic information. Despite several methods proposed for automatic construction and semantic labeling of indoor floorplans, this problem has not been well studied and remains open. In this article, we present a system called SiFi to provide accurate and automatic self-updating service. It updates semantics with instant videos acquired by mobile devices in indoor scenes. First, a crowdsourced-based task model is designed to attract users to contribute semantic-rich videos. Second, we use the maximum likelihood estimation method to solve the text inferring problem as the sequential relationship of texts provides additional geometrical constraints. Finally, we formulate the semantic update as an inference problem to accurately label semantics at correct locations on the indoor floorplans. Extensive experiments have been conducted across 9 weeks in a shopping mall with more than 250 stores. Experimental results show that SiFi achieves 84.5% accuracy of semantic update.

Journal ArticleDOI
TL;DR: A distortion-aware monocular omnidirectional (DAMO) network is proposed to estimate dense depth maps from indoor panoramas; it exploits deformable convolution to adjust its sampling grids to geometric distortions.
Abstract: Image distortion is a main challenge for tasks on panoramas. In this work, we propose a Distortion-Aware Monocular Omnidirectional (DAMO) network to estimate dense depth maps from indoor panoramas. First, we introduce a distortion-aware module to extract semantic features from omnidirectional images. Specifically, we exploit deformable convolution to adjust its sampling grids to geometric distortions on panoramas. We also utilize a strip pooling module to sample against horizontal distortion introduced by inverse gnomonic projection. Second, we introduce a plug-and-play spherical-aware weight matrix for our loss function to handle the uneven distribution of areas projected from a sphere. Experiments on the 360D dataset show that the proposed method can effectively extract semantic features from distorted panoramas and alleviate the supervision bias caused by distortion. It achieves the state-of-the-art performance on the 360D dataset with high efficiency.
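
The spherical-aware weighting can be sketched directly: weight each equirectangular row by the cosine of its latitude so that over-represented polar rows contribute less to the loss (an illustrative form, not necessarily the paper's exact weight matrix):

```python
import math
import torch

def spherical_weights(height, width):
    """Per-pixel weights for an equirectangular image, proportional to cos(latitude),
    so rows near the poles, which cover tiny spherical areas, contribute less (sketch)."""
    lat = (torch.arange(height, dtype=torch.float32) + 0.5) / height * math.pi - math.pi / 2
    w = torch.cos(lat).clamp(min=0.0)                     # (H,)
    return w[:, None].expand(height, width)               # (H, W)

def weighted_depth_loss(pred, gt):
    """L1 depth loss re-weighted by spherical area (illustrative, not the paper's exact loss)."""
    weights = spherical_weights(*pred.shape[-2:]).to(pred.device)
    return (weights * (pred - gt).abs()).mean() / weights.mean()

pred = torch.rand(1, 1, 256, 512)
gt = torch.rand(1, 1, 256, 512)
print(float(weighted_depth_loss(pred, gt)))
```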

Posted Content
TL;DR: In this article, a weak supervision method is proposed to augment the total amount of available supervision signals by leveraging the semantic similarity between neighboring points, achieving state-of-the-art performance on six large-scale open datasets under weak supervision schemes.
Abstract: We study the problem of labelling effort for semantic segmentation of large-scale 3D point clouds. Existing works usually rely on densely annotated point-level semantic labels to provide supervision for network training. However, in real-world scenarios that contain billions of points, it is impractical and extremely costly to manually annotate every single point. In this paper, we first investigate whether dense 3D labels are truly required for learning meaningful semantic representations. Interestingly, we find that the segmentation performance of existing works only drops slightly given as few as 1% of the annotations. However, beyond this point (e.g. 1 per thousand and below) existing techniques fail catastrophically. To this end, we propose a new weak supervision method to implicitly augment the total amount of available supervision signals, by leveraging the semantic similarity between neighboring points. Extensive experiments demonstrate that the proposed Semantic Query Network (SQN) achieves state-of-the-art performance on six large-scale open datasets under weak supervision schemes, while requiring only 1000x fewer labeled points for training. The code is available at this https URL.

Posted Content
TL;DR: Wang et al. proposed an unsupervised degradation representation learning scheme for blind super-resolution without explicit degradation estimation, which learns abstract representations to distinguish various degradations in the representation space rather than performing explicit estimation in the pixel space.
Abstract: Most existing CNN-based super-resolution (SR) methods are developed based on an assumption that the degradation is fixed and known (e.g., bicubic downsampling). However, these methods suffer a severe performance drop when the real degradation is different from their assumption. To handle various unknown degradations in real-world applications, previous methods rely on degradation estimation to reconstruct the SR image. Nevertheless, degradation estimation methods are usually time-consuming and may lead to SR failure due to large estimation errors. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. Moreover, we introduce a Degradation-Aware SR (DASR) network with flexible adaption to various degradations based on the learned representations. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on both synthetic and real images show that our network achieves state-of-the-art performance for the blind SR task. Code is available at: this https URL.

Journal ArticleDOI
TL;DR: In this paper, a light field refocusing method is proposed to improve the imaging quality of camera arrays by first estimating the disparity and then rendering the unfocused region (bokeh) with a depth-based anisotropic filter.
Abstract: Camera arrays provide spatial and angular information within a single snapshot. With refocusing methods, focal planes can be altered after exposure. In this letter, we propose a light field refocusing method to improve the imaging quality of camera arrays. In our method, the disparity is first estimated. Then, the unfocused region (bokeh) is rendered by using a depth-based anisotropic filter. Finally, the refocused image is produced by a reconstruction-based superresolution approach where the bokeh image is used as a regularization term. Our method can selectively refocus images with focused region being superresolved and bokeh being aesthetically rendered. Our method also enables postadjustment of depth of field. We conduct experiments on both public and self-developed datasets. Our method achieves superior visual performance with acceptable computational cost as compared to other state-of-the-art methods. Code is available at this https URL.

Posted Content
TL;DR: Point Spatial-Temporal Transformer (PST2) is proposed to learn spatial-temporal representations from dynamic 3D point cloud sequences; it consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module.
Abstract: Effective learning of spatial-temporal information within a point cloud sequence is highly important for many down-stream tasks such as 4D semantic segmentation and 3D action recognition. In this paper, we propose a novel framework named Point Spatial-Temporal Transformer (PST2) to learn spatial-temporal representations from dynamic 3D point cloud sequences. Our PST2 consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module. Our STSA module is introduced to capture the spatial-temporal context information across adjacent frames, while the RE module is proposed to aggregate features across neighbors to enhance the resolution of feature maps. We test the effectiveness of our PST2 with two different tasks on point cloud sequences, i.e., 4D semantic segmentation and 3D action recognition. Extensive experiments on three benchmarks show that our PST2 outperforms existing methods on all datasets. The effectiveness of our STSA and RE modules has also been justified with ablation experiments.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: CGAN-Net as mentioned in this paper proposes to calculate the dense similarity matrix in coarse semantic prediction maps, instead of the high-dimensional latent feature map, which is not only computationally and memory efficient but helps to learn query-dependent global context.
Abstract: By introducing various non-local blocks to capture the long-range dependencies, remarkable progress has been achieved in semantic segmentation recently. However, the improvement in segmentation accuracy usually comes at the price of significant reductions in network efficiency, as non-local blocks usually require expensive computation and memory cost for dense pixel-to-pixel correlation. In this paper, we introduce a Class-Guided Asymmetric Non-local Network (CGAN-Net) to enhance the class-discriminability in the learned feature map, while maintaining real-time efficiency. The key to our approach is to calculate the dense similarity matrix in coarse semantic prediction maps, instead of the high-dimensional latent feature map. This is not only computationally and memory efficient, but helps to learn query-dependent global context. Experiments conducted on Cityscapes and CamVid demonstrate the compelling performance of our CGAN-Net. In particular, our network achieves 76.8% mean IoU on the Cityscapes test set with a speed of 38 FPS for 1024×2048 images on a single Tesla V100 GPU.