
Showing papers by "Yulan Guo published in 2021"


Journal ArticleDOI
TL;DR: This paper presents a comprehensive review of recent progress in deep learning methods for point clouds, covering three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation.
Abstract: Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.

1,021 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Wang et al. proposed an unsupervised degradation representation learning scheme for blind super-resolution without explicit degradation estimation, which can extract discriminative representations to obtain accurate degradation information.
Abstract: Most existing CNN-based super-resolution (SR) methods are developed based on an assumption that the degradation is fixed and known (e.g., bicubic downsampling). However, these methods suffer a severe performance drop when the real degradation is different from their assumption. To handle various unknown degradations in real-world applications, previous methods rely on degradation estimation to reconstruct the SR image. Nevertheless, degradation estimation methods are usually time-consuming and may lead to SR failure due to large estimation errors. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. Moreover, we introduce a Degradation-Aware SR (DASR) network with flexible adaption to various degradations based on the learned representations. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on both synthetic and real images show that our network achieves state-of-the-art performance for the blind SR task. Code is available at: https://github.com/LongguangWang/DASR.

178 citations
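
To make the degradation-representation idea above concrete, here is a minimal PyTorch sketch (module names, sizes, and the sigmoid modulation are illustrative assumptions, not the authors' DASR code): an encoder maps an LR patch to a degradation embedding, and the embedding modulates the channels of an SR feature block.

```python
import torch
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Map an LR patch to a compact degradation embedding (sketch)."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(1),              # global pooling -> (B, dim, 1, 1)
        )
        self.mlp = nn.Sequential(nn.Flatten(), nn.Linear(dim, dim))

    def forward(self, lr_patch):
        return self.mlp(self.body(lr_patch))      # (B, dim)

class DegradationAwareBlock(nn.Module):
    """Use the embedding to predict per-channel modulation of SR features (sketch)."""
    def __init__(self, channels=64, dim=64):
        super().__init__()
        self.to_scale = nn.Linear(dim, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, degradation_emb):
        scale = torch.sigmoid(self.to_scale(degradation_emb))[:, :, None, None]
        return self.conv(feat * scale)            # degradation-conditioned features

# In the paper the encoder is trained so that patches sharing a degradation map to
# nearby representations (a contrastive-style objective); that loss is omitted here.
lr = torch.rand(4, 3, 48, 48)
feat = torch.rand(4, 64, 48, 48)
emb = DegradationEncoder()(lr)
out = DegradationAwareBlock()(feat, emb)
print(out.shape)  # torch.Size([4, 64, 48, 48])
```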


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Wang et al. explored the sparsity in image SR to improve the inference efficiency of SR networks and developed a Sparse Mask SR (SMSR) network that learns sparse masks to prune redundant computation.
Abstract: Current CNN-based super-resolution (SR) methods process all locations equally with computational resources being uniformly assigned in space. However, since missing details in low-resolution (LR) images mainly exist in regions of edges and textures, less computational resources are required for those flat regions. Therefore, existing CNN-based methods involve redundant computation in flat regions, which increases their computational cost and limits their applications on mobile devices. In this paper, we explore the sparsity in image SR to improve inference efficiency of SR networks. Specifically, we develop a Sparse Mask SR (SMSR) network to learn sparse masks to prune redundant computation. Within our SMSR, spatial masks learn to identify "important" regions while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately localized and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41%/33%/27% FLOPs being reduced for ×2/3/4 SR. Code is available at: https://github.com/LongguangWang/SMSR.

123 citations
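
A hedged sketch of the spatial/channel masking idea in PyTorch (a hypothetical module; the masks are applied as dense multiplications for clarity, whereas the real FLOP savings require sparse kernels and the paper's Gumbel-softmax binarization):

```python
import torch
import torch.nn as nn

class SparseMaskBlock(nn.Module):
    """Sketch: a spatial mask marks 'important' pixels, a channel mask marks the
    channels kept in 'unimportant' (flat) regions."""
    def __init__(self, channels=64):
        super().__init__()
        self.spatial_head = nn.Conv2d(channels, 1, 3, padding=1)
        self.channel_head = nn.Linear(channels, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, tau=1.0):
        # Soft masks for illustration; the paper makes them (nearly) binary.
        spatial = torch.sigmoid(self.spatial_head(x) / tau)             # (B, 1, H, W)
        channel = torch.sigmoid(self.channel_head(x.mean(dim=(2, 3))))  # (B, C)
        channel = channel[:, :, None, None]
        feat = self.conv(x)
        # Important locations keep all channels; flat regions keep only marked channels.
        return feat * spatial + feat * channel * (1.0 - spatial)

x = torch.rand(2, 64, 32, 32)
print(SparseMaskBlock()(x).shape)  # torch.Size([2, 64, 32, 32])
```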


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Hu et al. proposed SpinNet to extract robust and general 3D local features which are rotationally invariant whilst sufficiently informative to enable accurate point cloud registration and reconstruction, by mapping the input local surface into a carefully designed cylindrical space that enables end-to-end optimization with an SO(2) equivariant representation.
Abstract: Extracting robust and general 3D local features is key to downstream tasks such as point cloud registration and reconstruction. Existing learning-based local descriptors are either sensitive to rotation transformations, or rely on classical handcrafted features which are neither general nor representative. In this paper, we introduce a new, yet conceptually simple, neural architecture, termed SpinNet, to extract local features which are rotationally invariant whilst sufficiently informative to enable accurate registration. A Spatial Point Transformer is first introduced to map the input local surface into a carefully designed cylindrical space, enabling end-to-end optimization with SO(2) equivariant representation. A Neural Feature Extractor which leverages the powerful point-based and 3D cylindrical convolutional neural layers is then utilized to derive a compact and representative descriptor for matching. Extensive experiments on both indoor and outdoor datasets demonstrate that SpinNet outperforms existing state-of-the-art techniques by a large margin. More critically, it has the best generalization ability across unseen scenarios with different sensor modalities. The code is available at https://github.com/QingyongHu/SpinNet.

75 citations
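
The cylindrical mapping can be illustrated with a few lines of NumPy; this sketch covers only the coordinate transform (the reference axis and in-plane basis are assumptions), not the paper's Spatial Point Transformer or its 3D cylindrical convolution layers:

```python
import numpy as np

def to_cylindrical(neighbors, center, z_axis):
    """Map neighboring points of a local patch into a cylindrical frame (sketch).

    neighbors: (N, 3) points around `center`; z_axis: (3,) reference axis
    (e.g. an estimated normal). Rotations about z_axis only shift theta,
    which is what SO(2)-equivariant layers are designed to absorb."""
    z_axis = z_axis / np.linalg.norm(z_axis)
    local = neighbors - center                      # translate to the local origin
    z = local @ z_axis                              # height along the reference axis
    planar = local - np.outer(z, z_axis)            # projection onto the x-y plane
    rho = np.linalg.norm(planar, axis=1)            # radius
    ref = np.array([1.0, 0.0, 0.0])                 # arbitrary in-plane reference
    if abs(ref @ z_axis) > 0.9:
        ref = np.array([0.0, 1.0, 0.0])
    x_dir = np.cross(z_axis, ref); x_dir /= np.linalg.norm(x_dir)
    y_dir = np.cross(z_axis, x_dir)
    theta = np.arctan2(planar @ y_dir, planar @ x_dir)
    return np.stack([rho, theta, z], axis=1)        # (N, 3) cylindrical coordinates

pts = np.random.rand(128, 3)
cyl = to_cylindrical(pts, pts.mean(0), np.array([0.0, 0.0, 1.0]))
print(cyl.shape)  # (128, 3)
```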


Journal ArticleDOI
TL;DR: This paper presents an end-to-end trainable convolution neural network to fully use cost volumes for stereo matching, and investigates the problem of developing a robust model to perform well across multiple datasets with different characteristics.
Abstract: For CNN-based stereo matching methods, cost volumes play an important role in achieving good matching accuracy. In this paper, we present an end-to-end trainable convolution neural network to fully use cost volumes for stereo matching. Our network consists of three sub-modules, i.e., shared feature extraction, initial disparity estimation, and disparity refinement. Cost volumes are calculated at multiple levels using the shared features, and are used in both initial disparity estimation and disparity refinement sub-modules. To improve the efficiency of disparity refinement, multi-scale feature constancy is introduced to measure the correctness of the initial disparity in feature space. These sub-modules of our network are tightly-coupled, making it compact and easy to train. Moreover, we investigate the problem of developing a robust model to perform well across multiple datasets with different characteristics. We achieve this by introducing a two-stage finetuning scheme to gently transfer the model to target datasets. Specifically, in the first stage, the model is finetuned using both a large synthetic dataset and the target datasets with a relatively large learning rate, while in the second stage the model is trained using only the target datasets with a small learning rate. The proposed method is tested on several benchmarks including the Middlebury 2014, KITTI 2015, ETH3D 2017, and SceneFlow datasets. Experimental results show that our method achieves the state-of-the-art performance on all the datasets. The proposed method also won the 1st prize on the Stereo task of Robust Vision Challenge 2018.

74 citations
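
As background for the cost volumes discussed above, here is a common correlation-style construction in PyTorch (a generic sketch; the paper's multi-level volumes and feature-constancy refinement are not reproduced):

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (B, max_disp, H, W) cost volume from (B, C, H, W) features (sketch).

    The cost at disparity d compares the left pixel (x, y) with the right pixel
    (x - d, y); out-of-view positions are left at zero."""
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_left[:, :, :, d:] *
                                   feat_right[:, :, :, :-d]).mean(dim=1)
    return volume

left = torch.rand(1, 32, 64, 128)
right = torch.rand(1, 32, 64, 128)
cost = correlation_cost_volume(left, right, max_disp=48)
prob = F.softmax(cost, dim=1)  # turn per-disparity correlation into a distribution
print(cost.shape)  # torch.Size([1, 48, 64, 128])
```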


Journal ArticleDOI
TL;DR: In this article, a local feature aggregation module is introduced to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details for large-scale point cloud semantic segmentation.
Abstract: We study the problem of efficient semantic segmentation for large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Comparative experiments show that our RandLA-Net can process 1 million points in a single pass up to 200× faster than existing approaches. Moreover, extensive experiments on several large-scale point cloud datasets, including Semantic3D, SemanticKITTI, Toronto3D, S3DIS and NPM3D, demonstrate the state-of-the-art semantic segmentation performance of our RandLA-Net.

70 citations
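
The sampling step the paper builds on is deliberately simple; the sketch below (illustrative function name and tensor shapes) shows the random downsampling of a point cloud that the local feature aggregation module then has to compensate for:

```python
import torch

def random_downsample(points, features, ratio=0.25):
    """Randomly keep a fraction of points (the cheap sampling RandLA-Net builds on).

    points: (N, 3), features: (N, C). Random sampling is O(1) per point, unlike
    farthest point sampling, but may drop informative points by chance; the paper
    compensates by enlarging each remaining point's receptive field."""
    n = points.shape[0]
    keep = torch.randperm(n)[: max(1, int(n * ratio))]
    return points[keep], features[keep]

pts = torch.rand(100_000, 3)
feats = torch.rand(100_000, 32)
sub_pts, sub_feats = random_downsample(pts, feats)
print(sub_pts.shape, sub_feats.shape)  # torch.Size([25000, 3]) torch.Size([25000, 32])
```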


Journal ArticleDOI
TL;DR: A deformable convolution network (LF-DFnet) is proposed to handle the disparity problem for LF image SR; it uses an angular deformable alignment module (ADAM) for feature-level alignment, so that angular information can be well incorporated and encoded into the features of each view, which benefits the SR reconstruction of all LF images.
Abstract: Light field (LF) cameras can record scenes from multiple perspectives, and thus introduce beneficial angular information for image super-resolution (SR). However, it is challenging to incorporate angular information due to disparities among LF images. In this paper, we propose a deformable convolution network (i.e., LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) for feature-level alignment. Based on ADAM, we further propose a collect-and-distribute approach to perform bidirectional alignment between the center-view feature and each side-view feature. Using our approach, angular information can be well incorporated and encoded into features of each view, which benefits the SR reconstruction of all LF images. Moreover, we develop a baseline-adjustable LF dataset to evaluate SR performance under different disparity variations. Experiments on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet can generate high-resolution images with more faithful details and achieve state-of-the-art reconstruction accuracy. Besides, our LF-DFnet is more robust to disparity variations, which has not been well addressed in literature.

45 citations
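
A minimal sketch of deformable feature alignment between a side view and the center view, using torchvision's DeformConv2d (the offset predictor and channel sizes are assumptions; the paper's ADAM and its collect-and-distribute scheme are more elaborate):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AngularAlignSketch(nn.Module):
    """Sketch of deformable alignment: offsets are predicted from the concatenated
    features, then a deformable convolution samples the side-view feature at
    disparity-shifted locations to align it towards the center view."""
    def __init__(self, channels=32, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.align = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, side_feat, center_feat):
        offsets = self.offset_pred(torch.cat([side_feat, center_feat], dim=1))
        return self.align(side_feat, offsets)   # side view aligned towards the center view

side = torch.rand(1, 32, 64, 64)
center = torch.rand(1, 32, 64, 64)
print(AngularAlignSketch()(side, center).shape)  # torch.Size([1, 32, 64, 64])
```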


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Xu et al. presented a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid, which can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet.
Abstract: Real-time performance of stereo matching networks is important for many applications, such as automatic driving, robot navigation and augmented reality (AR). Although significant progress has been made in stereo matching networks in recent years, it is still challenging to balance real-time performance and accuracy. In this paper, we present a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid. The slicing layer is parameter-free, which allows us to obtain a high quality cost volume of high resolution from a low-resolution cost volume under the guide of the learned guidance map efficiently. The proposed cost volume upsampling module can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet. The resulting networks are accelerated several times while maintaining comparable accuracy. Furthermore, we design a real-time network (named BGNet) based on this module, which outperforms existing published real-time deep stereo matching networks, as well as some complex networks on the KITTI stereo datasets. The code is available at https://github.com/YuhuaXu/BGNet.

37 citations
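
The parameter-free slicing operation can be written directly with grid_sample; the sketch below (tensor layout and names are assumptions, not the authors' code) upsamples a low-resolution grid to full resolution under a guidance map via trilinear interpolation:

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guidance):
    """Slice a low-resolution bilateral grid with a full-resolution guidance map (sketch).

    grid:     (B, C, D, Hg, Wg) coarse grid (e.g. C = disparity planes, D = guidance bins)
    guidance: (B, 1, H, W) learned guidance map with values in [0, 1]
    returns:  (B, C, H, W), obtained by parameter-free trilinear interpolation."""
    b = grid.shape[0]
    h, w = guidance.shape[-2:]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=grid.device),
        torch.linspace(-1, 1, w, device=grid.device),
        indexing="ij",
    )
    xs = xs.expand(b, h, w)
    ys = ys.expand(b, h, w)
    zs = guidance.squeeze(1) * 2 - 1                         # map [0, 1] -> [-1, 1]
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)  # (B, 1, H, W, 3) as (x, y, z)
    sliced = F.grid_sample(grid, coords, align_corners=True) # (B, C, 1, H, W)
    return sliced.squeeze(2)

grid = torch.rand(1, 48, 8, 32, 64)        # e.g. 48 disparity planes on a 32x64 grid
guidance = torch.rand(1, 1, 256, 512)      # full-resolution guidance map
print(slice_bilateral_grid(grid, guidance).shape)  # torch.Size([1, 48, 256, 512])
```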


Journal ArticleDOI
TL;DR: A gated recurrent multiattention neural network (GRMA-Net) that uses multilevel attention modules to focus on informative regions and extract more discriminative features, which are then arranged as spatial sequences and fed into a deep gated recurrent unit (GRU) to capture long-range dependency and contextual relationships.
Abstract: With the advances of deep learning, many recent CNN-based methods have yielded promising results for image classification. In very high-resolution (VHR) remote sensing images, the contributions of different regions to image classification can vary significantly, because informative areas are generally limited and scattered throughout the whole image. Therefore, how to pay more attention to these informative areas and better incorporate them over long distances are two main challenges to be addressed. In this article, we propose a gated recurrent multiattention neural network (GRMA-Net) to address these problems. Because informative features generally occur at multiple stages in a network (i.e., local texture features at shallow layers and global profile features at deep layers), we use multilevel attention modules to focus on informative regions to extract more discriminative features. Then, these features are arranged as spatial sequences and fed into a deep-gated recurrent unit (GRU) to capture long-range dependency and contextual relationship. We evaluate our method on the UC Merced (UCM), Aerial Image dataset (AID), NWPU-RESISC (NWPU), and Optimal-31 (Optimal) datasets. Experimental results have demonstrated the superior performance of our method as compared to other state-of-the-art methods.

36 citations


Journal ArticleDOI
TL;DR: This paper proposes a simple yet effective Point Context Encoding (PointCE) module to capture semantic contexts of a point cloud and adaptively highlight intermediate feature maps, and introduces a Semantic Context Encoding loss (SCE-loss) to supervise the network to learn rich semantic context features.
Abstract: Semantic context plays a significant role in image segmentation. However, few prior works have explored semantic contexts for 3D point cloud segmentation. In this paper, we propose a simple yet effective Point Context Encoding (PointCE) module to capture semantic contexts of a point cloud and adaptively highlight intermediate feature maps. We also introduce a Semantic Context Encoding loss (SCE-loss) to supervise the network to learn rich semantic context features. To avoid hyperparameter tuning and achieve better convergence performance, we further propose a geometric mean loss to integrate both SCE-loss and segmentation loss. Our PointCE module is general and lightweight, and can be integrated into any point cloud segmentation architecture to improve its segmentation performance with only marginal extra overheads. Experimental results on the ScanNet, S3DIS and Semantic3D datasets show that consistent and significant improvement can be achieved for several different networks by integrating our PointCE module.

35 citations
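
The geometric mean combination of the two losses is simple enough to show directly; a sketch, assuming a plain two-term geometric mean (the paper's exact formulation may differ):

```python
import torch

def geometric_mean_loss(sce_loss, seg_loss, eps=1e-8):
    """Combine the semantic-context loss and the segmentation loss by their geometric
    mean, so neither term needs a hand-tuned weighting factor (sketch of the idea)."""
    return torch.sqrt((sce_loss + eps) * (seg_loss + eps))

sce = torch.tensor(0.7, requires_grad=True)
seg = torch.tensor(1.3, requires_grad=True)
loss = geometric_mean_loss(sce, seg)
loss.backward()
print(float(loss))  # ~0.954
```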


Proceedings Article
01 Jan 2021
TL;DR: Wang et al. proposed a scale-aware knowledge transfer paradigm to transfer knowledge from scale-specific networks to a scale-arbitrary network, which can achieve promising results for non-integer and asymmetric image super-resolution.
Abstract: Recently, the performance of single image super-resolution (SR) has been significantly improved with powerful networks. However, these networks are developed for image SR with a single specific integer scale (e.g., ×2, ×3, ×4), and cannot be used for non-integer and asymmetric SR. In this paper, we propose to learn a scale-arbitrary image SR network from scale-specific networks. Specifically, we propose a plug-in module for existing SR networks to perform scale-arbitrary SR, which consists of multiple scale-aware feature adaption blocks and a scale-aware upsampling layer. Moreover, we introduce a scale-aware knowledge transfer paradigm to transfer knowledge from scale-specific networks to the scale-arbitrary network. Our plug-in module can be easily adapted to existing networks to achieve scale-arbitrary SR. These networks plugged with our module can achieve promising results for non-integer and asymmetric SR while maintaining state-of-the-art performance for SR with integer scale factors. Besides, the additional computational and memory cost of our module is very small.
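
A hedged sketch of a scale-conditioned upsampling layer for arbitrary and asymmetric scale factors (module structure and the sigmoid modulation are illustrative, not the paper's plug-in module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareUpsample(nn.Module):
    """Sketch: features are conditioned on the scale factors, then resampled to any
    output size with bilinear interpolation."""
    def __init__(self, channels=64):
        super().__init__()
        self.scale_mlp = nn.Sequential(nn.Linear(2, channels), nn.ReLU(),
                                       nn.Linear(channels, channels))
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feat, scale_h, scale_w):
        b, c, h, w = feat.shape
        scale = torch.tensor([[scale_h, scale_w]], dtype=feat.dtype, device=feat.device)
        mod = self.scale_mlp(scale).view(1, c, 1, 1)   # scale-conditioned modulation
        feat = feat * torch.sigmoid(mod)
        out_h, out_w = int(round(h * scale_h)), int(round(w * scale_w))
        feat = F.interpolate(feat, size=(out_h, out_w), mode="bilinear", align_corners=False)
        return self.to_rgb(feat)

feat = torch.rand(1, 64, 48, 48)
sr = ScaleAwareUpsample()(feat, scale_h=2.5, scale_w=3.2)  # non-integer, asymmetric
print(sr.shape)  # torch.Size([1, 3, 120, 154])
```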

Posted Content
TL;DR: A dense nested interactive module (DNIM) is proposed to achieve progressive interaction among high-level and low-level features, together with a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features.
Abstract: Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied for infrared small targets since pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNANet) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction among high-level and low-level features. With the repeated interaction in DNIM, infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNANet, contextual information of small targets can be well incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
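
A CBAM-style sketch of cascaded channel-then-spatial attention, in the spirit of the CSAM described above (the paper's exact block may differ):

```python
import torch
import torch.nn as nn

class CascadedAttention(nn.Module):
    """Channel attention re-weights feature channels, then spatial attention
    re-weights locations, so faint small-target responses can be emphasized."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                               # re-weight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial(pooled)                       # re-weight locations

x = torch.rand(2, 64, 32, 32)
print(CascadedAttention()(x).shape)  # torch.Size([2, 64, 32, 32])
```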

Posted ContentDOI
TL;DR: Hu et al. built a large-scale satellite video dataset with rich annotations for the task of moving object detection and tracking, established the first public benchmark for moving object detection and tracking in satellite videos, and extensively evaluated the performance of several representative approaches on the dataset.
Abstract: Satellite video cameras can provide continuous observation for a large-scale area, which is important for many remote sensing applications. However, achieving moving object detection and tracking in satellite videos remains challenging due to the insufficient appearance information of objects and lack of high-quality datasets. In this paper, we first build a large-scale satellite video dataset with rich annotations for the task of moving object detection and tracking. This dataset is collected by the Jilin-1 satellite constellation and composed of 47 high-quality videos with 1,646,038 instances of interest for object detection and 3,711 trajectories for object tracking. We then introduce a motion modeling baseline to improve the detection rate and reduce false alarms based on accumulative multi-frame differencing and robust matrix completion. Finally, we establish the first public benchmark for moving object detection and tracking in satellite videos, and extensively evaluate the performance of several representative approaches on our dataset. Comprehensive experimental analyses and insightful conclusions are also provided. The dataset is available at https://github.com/QingyongHu/VISO.
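
A NumPy sketch of the accumulative multi-frame differencing part of the baseline (the threshold and the median background are illustrative assumptions; the robust matrix completion step is not shown):

```python
import numpy as np

def accumulative_frame_difference(frames, threshold=15.0):
    """Sketch of accumulative multi-frame differencing for tiny moving objects.

    frames: (T, H, W) registered grayscale satellite frames. Per-frame differences
    against a median background are accumulated, so a small mover that is barely
    visible in any single difference image becomes salient over time."""
    frames = frames.astype(np.float32)
    background = np.median(frames, axis=0)              # crude static background
    acc = np.abs(frames - background).mean(axis=0)      # accumulated (averaged) difference
    return acc > threshold                              # binary candidate mask

frames = np.random.randint(0, 255, size=(20, 128, 128)).astype(np.uint8)
mask = accumulative_frame_difference(frames)
print(mask.shape, mask.dtype)  # (128, 128) bool
```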

Proceedings ArticleDOI
01 Jun 2021
TL;DR: Wang et al. proposed a symmetric bi-directional parallax attention module (biPAM) and an inline occlusion handling scheme to effectively interact cross-view information.
Abstract: Although recent years have witnessed the great advances in stereo image super-resolution (SR), the beneficial information provided by binocular systems has not been fully used. Since stereo images are highly symmetric under epipolar constraint, in this paper, we improve the performance of stereo image SR by exploiting symmetry cues in stereo image pairs. Specifically, we propose a symmetric bi-directional parallax attention module (biPAM) and an inline occlusion handling scheme to effectively interact cross-view information. Then, we design a Siamese network equipped with a biPAM to super-resolve both sides of views in a highly symmetric manner. Finally, we design several illuminance-robust losses to enhance stereo consistency. Experiments on four public datasets demonstrate the superior performance of our method. Source code is available at https://github.com/YingqianWang/iPASSR.
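
A sketch of row-wise (epipolar) cross-view attention, the core mechanism behind parallax attention; this single-direction version with hypothetical layer names omits the paper's bi-directional design and occlusion handling:

```python
import torch
import torch.nn as nn

class ParallaxAttentionSketch(nn.Module):
    """Attention is restricted to pixels on the same row, i.e. along the epipolar line
    of a rectified stereo pair, to warp right-view features towards the left view."""
    def __init__(self, channels=64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_left, feat_right):
        q = self.q(feat_left)                     # (B, C, H, W)
        k = self.k(feat_right)
        v = self.v(feat_right)
        scores = torch.einsum("bchw,bchv->bhwv", q, k) / q.shape[1] ** 0.5  # (B, H, W, W)
        attn = torch.softmax(scores, dim=-1)
        warped = torch.einsum("bhwv,bchv->bchw", attn, v)  # right features warped to left
        return warped

left = torch.rand(1, 64, 30, 90)
right = torch.rand(1, 64, 30, 90)
print(ParallaxAttentionSketch()(left, right).shape)  # torch.Size([1, 64, 30, 90])
```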

Posted Content
TL;DR: A comprehensive survey of recent achievements in scene classification using deep learning covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods is provided.
Abstract: Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision. The rise of large-scale datasets, which constitute the corresponding dense sampling of diverse real-world scenes, and the renaissance of deep learning techniques, which learn powerful feature representations directly from big raw data, have been bringing remarkable progress in the field of scene representation and classification. To help researchers master needed advances in this field, the goal of this paper is to provide a comprehensive survey of recent achievements in scene classification using deep learning. More than 200 major publications are included in this survey covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods. In retrospect of what has been achieved so far, this paper is also concluded with a list of promising research opportunities.

Journal ArticleDOI
TL;DR: Zhang et al. leverage global distribution differences by introducing an adversarial loss into the training stage of self-supervised depth estimation, which can be back-propagated to the depth estimation module to improve its performance.
Abstract: Loss function plays a key role in self-supervised monocular depth estimation methods. Current reprojection loss functions are hand-designed and mainly focus on local patch similarity but overlook the global distribution differences between a synthetic image and a target image. In this paper, we leverage global distribution differences by introducing an adversarial loss into the training stage of self-supervised depth estimation. Specifically, we formulate this task as a novel view synthesis problem. We use a depth estimation module and a pose estimation module to form a generator, and then design a discriminator to learn the global distribution differences between real and synthetic images. With the learned global distribution differences, the adversarial loss can be back-propagated to the depth estimation module to improve its performance. Experiments on the KITTI dataset have demonstrated the effectiveness of the adversarial loss. The adversarial loss is further combined with the reprojection loss to achieve the state-of-the-art performance on the KITTI dataset.
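
A minimal sketch of combining a photometric reprojection loss with an adversarial term for the depth/pose generator (the non-saturating GAN formulation and the weighting factor are assumptions, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def generator_loss(synth_img, target_img, disc_logits_on_synth, weight_adv=0.01):
    """synth_img is the view synthesised from the predicted depth and pose; the
    adversarial term pushes its global distribution towards real images, while the
    reprojection term keeps local patch similarity."""
    reprojection = F.l1_loss(synth_img, target_img)
    adversarial = F.binary_cross_entropy_with_logits(
        disc_logits_on_synth, torch.ones_like(disc_logits_on_synth))
    return reprojection + weight_adv * adversarial

synth = torch.rand(2, 3, 96, 320)
target = torch.rand(2, 3, 96, 320)
logits = torch.randn(2, 1)   # discriminator output on the synthesised images
print(float(generator_loss(synth, target, logits)))
```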

Journal ArticleDOI
TL;DR: Li et al. proposed a self-supervised framework for multi-task learning on depth, camera motion and semantics from panoramic videos, which is based on differentiable warping of adjacent views to the target.
Abstract: With the advent of virtual reality and augmented reality applications, omnidirectional imaging and 360° cameras become increasingly popular in many scenarios such as entertainment and autonomous systems. In this paper, we propose a self-supervised framework for multi-task learning on depth, camera motion and semantics from panoramic videos. Specifically, our method is based on differentiable warping of adjacent views to the target. Two improvements are provided. First, we introduce a view synthesis module based on equirectangular projection to enable direct optimization on panoramic images. Second, we introduce a self-supervised segmentation branch to involve the constraint of semantic consistency for further improvement. Extensive experiments on two 360° video and two 360° image datasets demonstrate that our method outperforms the state-of-the-art and achieves favorable cross-modality performance.

Journal ArticleDOI
TL;DR: A Neural Architecture Search method is proposed to search a Feature Pyramid Network (FPN) module for 3D indoor point cloud semantic segmentation; the searched module is generic and effective and can be added to existing segmentation networks to augment their segmentation performance.
Abstract: Semantic segmentation of 3D Light Detection and Ranging (LiDAR) indoor point clouds using deep learning has been an active topic in recent years. However, most deep neural networks on point clouds conduct multi-level feature fusion via a simple U-shape architecture, which lacks enough capacity on both classification and localization in the segmentation task. In this paper, we propose a Neural Architecture Search (NAS) method to search a Feature Pyramid Network (FPN) module for 3D indoor point cloud semantic segmentation. Specifically, we aim to automatically find an effective feature pyramid architecture as a feature fusion neck in a designed novel pyramidal search space covering all information communication paths for multi-level features. The searched FPN module, named SFPN, contains the most important connections among all the potential paths to fuse representations at different levels. Our proposed SFPN is generic and effective, and can be added to existing segmentation networks to augment the segmentation performance. Extensive experiments on ScanNet and S3DIS show that consistent and remarkable gains of segmentation performance can be achieved by different classical networks combined with SFPN. In particular, PointNet++-SFPN achieves mIoU gains of 7.8% on ScanNet v2 and 4.7% on S3DIS, and PointConv-SFPN achieves 4.5% and 3.7% improvement respectively on the above datasets.

Journal ArticleDOI
Qian Yin, Ting Liu, Zaiping Lin, Wei An, Yulan Guo 
TL;DR: This letter proposes a novel object detection method based on a spatial–temporal tensor data structure to exploit the inner spatial and temporal correlation within a satellite video and extends the decomposition formulation with bounded noise to achieve robust performance under complex backgrounds.
Abstract: Low-rank matrix decomposition approaches have achieved significant progress in small and dim object detection in satellite videos. However, it is still challenging to achieve robust performance and fast processing under complex and highly heterogeneous backgrounds since satellite video data can neither adequately fit the foreground structure nor the background model in the existing matrix decomposition models. In this letter, we propose a novel object detection method based on a spatial–temporal tensor data structure. First, we construct a tensor data structure to exploit the inner spatial and temporal correlation within a satellite video. Second, we extend the decomposition formulation with bounded noise to achieve robust performance under complex backgrounds. This formulation integrates low-rank background, structured sparse foreground, and their noises into a tensor decomposition problem. For background separation, a weighted Schatten p-norm is incorporated to provide an adaptive threshold for the singular values of the background tensor. Finally, the proposed model is solved using the alternating direction method of multipliers (ADMM) scheme. Experimental results on various real scenes demonstrate the superiority of the proposed method against the compared approaches.

Proceedings Article
01 Jan 2021
TL;DR: Li et al. proposed Dynamic sparse-to-dense Cross Modal Learning (DsCML) to increase the sufficiency of multi-modality information interaction for domain adaptation.
Abstract: Domain adaptation is critical for success when confronted with the lack of annotations in a new domain. Given the huge time consumption of the labeling process on 3D point clouds, domain adaptation for 3D semantic segmentation is highly desirable. With the rise of multi-modal datasets, large amounts of 2D images are accessible besides 3D point clouds. In light of this, we propose to further leverage 2D data for 3D domain adaptation by intra and inter domain cross modal learning. As for intra-domain cross modal learning, most existing works sample the dense 2D pixel-wise features into the same size with sparse 3D point-wise features, resulting in the abandonment of numerous useful 2D features. To address this problem, we propose Dynamic sparse-to-dense Cross Modal Learning (DsCML) to increase the sufficiency of multi-modality information interaction for domain adaptation. For inter-domain cross modal learning, we further advance Cross Modal Adversarial Learning (CMAL) on 2D and 3D data which contain different semantic content, aiming to promote high-level modal complementarity. We evaluate our model under various multi-modality domain adaptation settings including day-to-night, country-to-country and dataset-to-dataset, bringing large improvements over both uni-modal and multi-modal domain adaptation methods on all settings.

Proceedings ArticleDOI
17 Oct 2021
TL;DR: MSLD introduces a multi-spectral feature compressor based on the two-dimensional (2D) discrete cosine transform (DCT) to compress features while preserving diversity information, and outperforms state-of-the-art methods (including LaneATT and UFLD).
Abstract: It is desirable to maintain both high accuracy and runtime efficiency in lane detection. State-of-the-art methods mainly address the efficiency problem by direct compression of high-dimensional features. These methods usually suffer from information loss and cannot achieve satisfactory accuracy performance. To ensure the diversity of features and subsequently maintain information as much as possible, we introduce multi-frequency analysis into lane detection. Specifically, we propose a multi-spectral feature compressor (MSFC) based on two-dimensional (2D) discrete cosine transform (DCT) to compress features while preserving diversity information. We group features and associate each group with an individual frequency component, which incurs only 1/7 overhead of one-dimensional convolution operation but preserves more information. Moreover, to further enhance the discriminability of features, we design a multi-spectral lane feature aggregator (MSFA) based on one-dimensional (1D) DCT to aggregate features from each lane according to their corresponding frequency components. The proposed method outperforms the state-of-the-art methods (including LaneATT and UFLD) on TuSimple, CULane, and LLAMAS benchmarks. For example, our method achieves 76.32% F1 at 237 FPS and 76.98% F1 at 164 FPS on CULane, which is 1.23% and 0.30% higher than LaneATT. Our code and models are available at https://github.com/harrylin-hyl/MSLD.
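
A sketch of DCT-based feature compression in PyTorch: channels are split into groups and each group is pooled with a different 2D DCT basis instead of plain average pooling (frequency choices and shapes are assumptions, not the paper's MSFC configuration):

```python
import math
import torch
import torch.nn as nn

def dct2d_basis(u, v, h, w):
    """One 2D DCT-II basis function of size (h, w) for frequency indices (u, v)."""
    ys = torch.arange(h, dtype=torch.float32)
    xs = torch.arange(w, dtype=torch.float32)
    cos_y = torch.cos(math.pi * (ys + 0.5) * u / h)
    cos_x = torch.cos(math.pi * (xs + 0.5) * v / w)
    return cos_y[:, None] * cos_x[None, :]                # (h, w)

class MultiSpectralPool(nn.Module):
    """Each channel group is compressed with its own frequency component, so the
    pooled descriptor preserves more diversity than global average pooling."""
    def __init__(self, channels, h, w, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        assert channels % len(freqs) == 0
        basis = torch.stack([dct2d_basis(u, v, h, w) for u, v in freqs])  # (G, h, w)
        self.register_buffer("basis", basis)
        self.group = channels // len(freqs)

    def forward(self, x):                                  # x: (B, C, h, w)
        b, c, h, w = x.shape
        x = x.view(b, self.basis.shape[0], self.group, h, w)
        pooled = (x * self.basis[None, :, None]).sum(dim=(3, 4))  # (B, G, group)
        return pooled.reshape(b, c)                        # compressed per-channel descriptor

x = torch.rand(2, 64, 10, 25)
print(MultiSpectralPool(64, 10, 25)(x).shape)  # torch.Size([2, 64])
```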

Journal ArticleDOI
TL;DR: A novel framework is proposed for 3D human trajectory reconstruction in an indoor scene using monocular surveillance videos and static point clouds without any initiative cooperation; it achieves accurate trajectory reconstruction results on real-world videos.

Posted ContentDOI
05 Mar 2021-medRxiv
TL;DR: In this article, a computer-aided pituitary microadenoma (PM) diagnosis (PM-CAD) system based on deep learning was employed to assist radiologists in clinical workflow.
Abstract: Pituitary microadenoma (PM) is often difficult to detect by MR imaging alone. We employed a computer-aided PM diagnosis (PM-CAD) system based on deep learning to assist radiologists in clinical workflow. We enrolled 1,228 participants and stratified them into 3 non-overlapping cohorts for training, validation and testing purposes. Our PM-CAD system outperformed 6 existing established convolutional neural network models for detection of PM. In the test dataset, the diagnostic accuracy of the PM-CAD system was comparable to that of radiologists with > 10 years of professional expertise (94% versus 95%). The diagnostic accuracy in the internal and external datasets was 94% and 90%, respectively. Importantly, the PM-CAD system detected the presence of PM that had been previously misdiagnosed by radiologists. This is the first report showing that a PM-CAD system is a viable tool for detecting PM. Our results suggest that the PM-CAD system is applicable to radiology departments, especially in primary health care institutions.

Journal ArticleDOI
08 Jul 2021
TL;DR: SiFi is a self-updating system for indoor semantic floorplans; it uses a crowdsourcing-based task model to attract users to contribute semantic-rich videos and uses the maximum likelihood estimation method to solve the text inference problem.
Abstract: Due to the rapid development of indoor location-based services, automatically deriving an indoor semantic floorplan becomes a highly promising technique for ubiquitous applications. To make an indoor semantic floorplan fully practical, it is essential to handle the dynamics of semantic information. Despite several methods proposed for automatic construction and semantic labeling of indoor floorplans, this problem has not been well studied and remains open. In this article, we present a system called SiFi to provide accurate and automatic self-updating service. It updates semantics with instant videos acquired by mobile devices in indoor scenes. First, a crowdsourced-based task model is designed to attract users to contribute semantic-rich videos. Second, we use the maximum likelihood estimation method to solve the text inferring problem as the sequential relationship of texts provides additional geometrical constraints. Finally, we formulate the semantic update as an inference problem to accurately label semantics at correct locations on the indoor floorplans. Extensive experiments have been conducted across 9 weeks in a shopping mall with more than 250 stores. Experimental results show that SiFi achieves 84.5% accuracy of semantic update.

Journal ArticleDOI
TL;DR: A distortion-aware monocular omnidirectional (DAMO) network is proposed to estimate dense depth maps from indoor panoramas; it exploits deformable convolution to adjust its sampling grids to geometric distortions.
Abstract: Image distortion is a main challenge for tasks on panoramas. In this work, we propose a Distortion-Aware Monocular Omnidirectional (DAMO) network to estimate dense depth maps from indoor panoramas. First, we introduce a distortion-aware module to extract semantic features from omnidirectional images. Specifically, we exploit deformable convolution to adjust its sampling grids to geometric distortions on panoramas. We also utilize a strip pooling module to sample against horizontal distortion introduced by inverse gnomonic projection. Second, we introduce a plug-and-play spherical-aware weight matrix for our loss function to handle the uneven distribution of areas projected from a sphere. Experiments on the 360D dataset show that the proposed method can effectively extract semantic features from distorted panoramas and alleviate the supervision bias caused by distortion. It achieves the state-of-the-art performance on the 360D dataset with high efficiency.
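
The spherical-aware weighting can be sketched directly: weight each equirectangular row by the cosine of its latitude so that over-represented polar rows contribute less to the loss (an illustrative form, not necessarily the paper's exact weight matrix):

```python
import math
import torch

def spherical_weights(height, width):
    """Per-pixel weights for an equirectangular image, proportional to cos(latitude),
    so rows near the poles, which cover tiny spherical areas, contribute less (sketch)."""
    lat = (torch.arange(height, dtype=torch.float32) + 0.5) / height * math.pi - math.pi / 2
    w = torch.cos(lat).clamp(min=0.0)                     # (H,)
    return w[:, None].expand(height, width)               # (H, W)

def weighted_depth_loss(pred, gt):
    """L1 depth loss re-weighted by spherical area (illustrative, not the paper's exact loss)."""
    weights = spherical_weights(*pred.shape[-2:]).to(pred.device)
    return (weights * (pred - gt).abs()).mean() / weights.mean()

pred = torch.rand(1, 1, 256, 512)
gt = torch.rand(1, 1, 256, 512)
print(float(weighted_depth_loss(pred, gt)))
```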

Posted Content
TL;DR: In this article, a weak supervision method is proposed to augment the total amount of available supervision signals by leveraging the semantic similarity between neighboring points, achieving state-of-the-art performance on six large-scale open datasets under weak supervision schemes.
Abstract: We study the problem of labelling effort for semantic segmentation of large-scale 3D point clouds. Existing works usually rely on densely annotated point-level semantic labels to provide supervision for network training. However, in real-world scenarios that contain billions of points, it is impractical and extremely costly to manually annotate every single point. In this paper, we first investigate whether dense 3D labels are truly required for learning meaningful semantic representations. Interestingly, we find that the segmentation performance of existing works only drops slightly given as few as 1% of the annotations. However, beyond this point (e.g. 1 per thousand and below) existing techniques fail catastrophically. To this end, we propose a new weak supervision method to implicitly augment the total amount of available supervision signals, by leveraging the semantic similarity between neighboring points. Extensive experiments demonstrate that the proposed Semantic Query Network (SQN) achieves state-of-the-art performance on six large-scale open datasets under weak supervision schemes, while requiring only 1000x fewer labeled points for training. The code is available at this https URL.

Posted Content
TL;DR: Wang et al. proposed an unsupervised degradation representation learning scheme for blind super-resolution without explicit degradation estimation, which learns abstract representations to distinguish various degradations in the representation space rather than performing explicit estimation in the pixel space.
Abstract: Most existing CNN-based super-resolution (SR) methods are developed based on an assumption that the degradation is fixed and known (e.g., bicubic downsampling). However, these methods suffer a severe performance drop when the real degradation is different from their assumption. To handle various unknown degradations in real-world applications, previous methods rely on degradation estimation to reconstruct the SR image. Nevertheless, degradation estimation methods are usually time-consuming and may lead to SR failure due to large estimation errors. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. Moreover, we introduce a Degradation-Aware SR (DASR) network with flexible adaption to various degradations based on the learned representations. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on both synthetic and real images show that our network achieves state-of-the-art performance for the blind SR task. Code is available at: this https URL.

Journal ArticleDOI
TL;DR: In this paper, a light field refocusing method is proposed to improve the imaging quality of camera arrays by first estimating the disparity and then rendering the unfocused region (bokeh) with a depth-based anisotropic filter.
Abstract: Camera arrays provide spatial and angular information within a single snapshot. With refocusing methods, focal planes can be altered after exposure. In this letter, we propose a light field refocusing method to improve the imaging quality of camera arrays. In our method, the disparity is first estimated. Then, the unfocused region (bokeh) is rendered by using a depth-based anisotropic filter. Finally, the refocused image is produced by a reconstruction-based superresolution approach where the bokeh image is used as a regularization term. Our method can selectively refocus images with focused region being superresolved and bokeh being aesthetically rendered. Our method also enables postadjustment of depth of field. We conduct experiments on both public and self-developed datasets. Our method achieves superior visual performance with acceptable computational cost as compared to other state-of-the-art methods. Code is available at this https URL.

Posted Content
TL;DR: Point Spatial-Temporal Transformer (PST2) is proposed to learn spatial-temporal representations from dynamic 3D point cloud sequences; it consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module.
Abstract: Effective learning of spatial-temporal information within a point cloud sequence is highly important for many down-stream tasks such as 4D semantic segmentation and 3D action recognition. In this paper, we propose a novel framework named Point Spatial-Temporal Transformer (PST2) to learn spatial-temporal representations from dynamic 3D point cloud sequences. Our PST2 consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module. Our STSA module is introduced to capture the spatial-temporal context information across adjacent frames, while the RE module is proposed to aggregate features across neighbors to enhance the resolution of feature maps. We test the effectiveness of our PST2 with two different tasks on point cloud sequences, i.e., 4D semantic segmentation and 3D action recognition. Extensive experiments on three benchmarks show that our PST2 outperforms existing methods on all datasets. The effectiveness of our STSA and RE modules has also been justified with ablation experiments.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: CGAN-Net as mentioned in this paper proposes to calculate the dense similarity matrix in coarse semantic prediction maps, instead of the high-dimensional latent feature map, which is not only computationally and memory efficient but helps to learn query-dependent global context.
Abstract: By introducing various non-local blocks to capture the long-range dependencies, remarkable progress has been achieved in semantic segmentation recently. However, the improvement in segmentation accuracy usually comes at the price of significant reductions in network efficiency, as non-local blocks usually require expensive computation and memory cost for dense pixel-to-pixel correlation. In this paper, we introduce a Class-Guided Asymmetric Non-local Network (CGAN-Net) to enhance the class-discriminability in the learned feature map, while maintaining real-time efficiency. The key to our approach is to calculate the dense similarity matrix in coarse semantic prediction maps, instead of the high-dimensional latent feature map. This is not only computationally and memory efficient, but helps to learn query-dependent global context. Experiments conducted on Cityscapes and CamVid demonstrate the compelling performance of our CGAN-Net. In particular, our network achieves 76.8% mean IoU on the Cityscapes test set with a speed of 38 FPS for 1024×2048 images on a single Tesla V100 GPU.