Author

Qiuyu Mao

Bio: Qiuyu Mao is an academic researcher from the University of Science and Technology of China. The author has contributed to research on the topics of object detection and the image plane. The author has an h-index of 1 and has co-authored 2 publications, which have received 2 citations.

Papers
Posted Content
TL;DR: VPFNet, as discussed by the authors, aligns and aggregates point cloud and image data at 'virtual' points; with their density lying between that of the 3D points and the 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing.
Abstract: It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is not trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent proposals generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at the 'virtual' points. Particularly, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors, and thus preserve more information for processing. Moreover, we also investigate the data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution towards 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset, and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021. The network design also takes computation efficiency into consideration: we can achieve an FPS of 15 on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.
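The central idea, aggregating multi-modal features at virtual points that are denser than the LiDAR returns but sparser than the pixel grid, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Python illustration that assumes the points are already expressed in the camera frame and that a 3x4 projection matrix P is available; the function names, the jitter-based virtual-point sampling, and the nearest-pixel feature lookup are assumptions for illustration, not VPFNet's actual sampling or aggregation scheme.

```python
import numpy as np

def project_to_image(points_xyz, P):
    """Project Nx3 camera-frame points onto the image plane with a 3x4 projection matrix."""
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # N x 4
    uvw = homo @ P.T                                                   # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                                    # N x 2 pixel coords

def fuse_at_virtual_points(lidar_points, image_feats, P, n_virtual=2048):
    """Toy virtual-point fusion: sample 'virtual' 3D points around the sparse LiDAR points
    (denser than the point cloud, sparser than the pixel grid), project them into the image,
    and pair each with a nearest-pixel image feature."""
    # Jitter random copies of the real points to densify the aggregation locations.
    idx = np.random.randint(0, lidar_points.shape[0], size=n_virtual)
    virtual_xyz = lidar_points[idx] + np.random.normal(scale=0.1, size=(n_virtual, 3))

    uv = np.round(project_to_image(virtual_xyz, P)).astype(int)
    h, w, _ = image_feats.shape
    uv[:, 0] = np.clip(uv[:, 0], 0, w - 1)
    uv[:, 1] = np.clip(uv[:, 1], 0, h - 1)

    img_part = image_feats[uv[:, 1], uv[:, 0]]     # n_virtual x C image features
    return np.hstack([virtual_xyz, img_part])      # fused per-virtual-point features
```

In a full detector the fused per-virtual-point features would feed a 3D detection head; that part is omitted here.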

10 citations

Posted Content
TL;DR: In this article, a survey of multi-sensor fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs, is presented.
Abstract: In the past few years, we have witnessed rapid development of autonomous driving. However, achieving full autonomy remains a daunting task due to the complex and dynamic driving environment. As a result, self-driving cars are equipped with a suite of sensors to conduct robust and accurate environment perception. As the number and type of sensors keep increasing, combining them for better perception is becoming a natural trend. So far, there has been no in-depth review that focuses on multi-sensor fusion based perception. To bridge this gap and motivate future research, this survey is devoted to reviewing recent fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce the background of popular sensors for autonomous cars, including their common data representations as well as object detection networks developed for each type of sensor data. Next, we discuss some popular datasets for multi-modal 3D object detection, with a special focus on the sensor data included in each dataset. Then we present in-depth reviews of recent multi-modal 3D detection networks by considering the following three aspects of the fusion: fusion location, fusion data representation, and fusion granularity. After a detailed review, we discuss open challenges and point out possible solutions. We hope that our detailed review can help researchers embark on investigations in the area of multi-modal 3D object detection.
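The survey's three-axis view of fusion (location, data representation, granularity) can be written down as a small data structure. The sketch below is purely illustrative; the category names and the example method are assumptions and do not necessarily match the survey's exact taxonomy.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FusionLocation(Enum):
    EARLY = auto()    # fuse raw or low-level sensor data before detection
    MIDDLE = auto()   # fuse intermediate feature maps inside the network
    LATE = auto()     # fuse per-sensor detections or decisions

class FusionRepresentation(Enum):
    POINT = auto()          # 3D points carrying sampled image data
    VOXEL_OR_BEV = auto()   # voxel grids or bird's-eye-view feature maps
    PROPOSAL = auto()       # region proposals shared across modalities

class FusionGranularity(Enum):
    ROI_LEVEL = auto()      # one fused feature per region of interest
    POINT_LEVEL = auto()    # fusion at individual points or pixels

@dataclass
class FusionMethod:
    name: str
    location: FusionLocation
    representation: FusionRepresentation
    granularity: FusionGranularity

# e.g. a hypothetical point-level, middle-fusion method operating on points
example = FusionMethod("some-method", FusionLocation.MIDDLE,
                       FusionRepresentation.POINT, FusionGranularity.POINT_LEVEL)
```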

3 citations


Cited by
Journal ArticleDOI
TL;DR: Zhou et al., as discussed by the authors, align and aggregate the point cloud and image data at virtual points, which can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing.
Abstract: It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is non-trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent approaches generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, these approaches often suffer from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at the "virtual" points. Particularly, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors, and thus preserve more information for processing. Moreover, we also investigate the data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution towards 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset, and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate $AP_{3D}$ and 91.86% moderate $AP_{BEV}$ on the KITTI test set. The network design also takes computation efficiency into consideration: we can achieve an FPS of 15 on a single NVIDIA RTX 2080Ti GPU. The source code is available at https://github.com/zhukevkesky/VPFNet.git.

10 citations

Journal ArticleDOI
TL;DR: Wang et al., as mentioned in this paper, propose an effective pipeline that combines SVI and point clouds for the inventory of street furniture, which consists of three steps; in the first, off-the-shelf street furniture detection models are applied to SVI to generate two-dimensional (2D) proposals, from which 3D point cloud frustums are cropped accordingly.
Abstract: An outdated or sketchy inventory of street furniture may misguide planners on the renovation and upgrade of transportation infrastructure, thus posing potential threats to traffic safety. Previous studies have used point clouds or street-view imagery (SVI) for street furniture inventory, but a gap remains in balancing semantic richness, localization accuracy and working efficiency. Therefore, this paper proposes an effective pipeline that combines SVI and point clouds for the inventory of street furniture. The proposed pipeline encompasses three steps: (1) Off-the-shelf street furniture detection models are applied to SVI to generate two-dimensional (2D) proposals, and three-dimensional (3D) point cloud frustums are cropped accordingly; (2) The instance mask and the instance 3D bounding box are predicted for each frustum using a multi-task neural network; (3) Frustums from adjacent perspectives are associated and fused via multi-object tracking, after which the object-centric instance segmentation outputs the final street furniture with 3D locations and semantic labels. This pipeline was validated on datasets collected in Shanghai and Wuhan, producing a component-level street furniture inventory of nine classes. The instance-level mean recall and precision reach 86.4% and 80.9% in Shanghai and 83.2% and 87.8% in Wuhan, respectively, and the point-level mean recall, precision, and weighted coverage all exceed 73.7%.
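The frustum-cropping operation in step (1) is purely geometric and easy to sketch. The snippet below is a minimal illustration under the assumption that the point cloud is already expressed in the camera frame and that a 3x4 projection matrix is available; the function name and interface are made up for illustration, and the multi-task network of step (2) and the tracking-based fusion of step (3) are not reproduced.

```python
import numpy as np

def crop_frustum(points_xyz, box_2d, P):
    """Keep the points whose image-plane projections fall inside a 2D proposal.
    points_xyz: N x 3 points in the camera frame,
    box_2d: (u_min, v_min, u_max, v_max) in pixels,
    P: 3 x 4 camera projection matrix."""
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    uvw = homo @ P.T
    in_front = uvw[:, 2] > 0                          # discard points behind the camera
    depth = np.where(uvw[:, 2:3] == 0, 1e-6, uvw[:, 2:3])
    uv = uvw[:, :2] / depth                           # pixel coordinates

    u_min, v_min, u_max, v_max = box_2d
    inside = (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) & \
             (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    return points_xyz[in_front & inside]              # the cropped point-cloud frustum
```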

4 citations

Journal ArticleDOI
01 Oct 2022 - Sensors
TL;DR: This paper presents a two-stage solution to 3D vehicle detection and segmentation that outperforms all published solutions in terms of 6 degrees of freedom error (6 DoF err).
Abstract: In this paper, we present a two-stage solution to 3D vehicle detection and segmentation. The first stage combines the EfficientNetB3 architecture with multi-parallel residual blocks (inspired by the CenterNet architecture) for 3D localization and pose estimation of the vehicles in the scene. The second stage takes the output of the first stage (cropped car images) as input to train EfficientNetB3 for the image recognition task. Using predefined 3D models, we substitute each vehicle in the scene with its match using the rotation matrix and translation vector from the first stage to obtain the 3D detection bounding boxes and segmentation masks. We trained our models on an open-source dataset (ApolloCar3D). Our method outperforms all published solutions in terms of 6 degrees of freedom error (6 DoF err).
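The substitution step described above is essentially a rigid transform of a template mesh by the estimated rotation matrix and translation vector. The sketch below illustrates that step only, with made-up function names and a toy cube standing in for a vehicle template; it is not the paper's actual 3D models or code.

```python
import numpy as np

def place_template(template_vertices, R, t):
    """Transform a predefined 3D vehicle template into the scene using the rotation
    matrix R (3x3) and translation vector t (3,) estimated in the first stage."""
    return template_vertices @ R.T + t          # V x 3 vertices in scene coordinates

def bbox_from_vertices(vertices):
    """Axis-aligned 3D bounding box (min corner, max corner) of the placed template."""
    return vertices.min(axis=0), vertices.max(axis=0)

# Toy usage: a unit-cube 'vehicle' template, a 30-degree yaw rotation, and a translation.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
yaw = np.deg2rad(30.0)
R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
              [np.sin(yaw),  np.cos(yaw), 0.0],
              [0.0,          0.0,         1.0]])
t = np.array([10.0, 2.0, 0.0])
corner_min, corner_max = bbox_from_vertices(place_template(cube, R, t))
```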

3 citations

Posted Content
TL;DR: TransVG, as mentioned in this paper, is a transformer-based framework for visual grounding that establishes the multi-modal correspondence by leveraging transformers; the authors empirically show that complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
Abstract: In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image. The state-of-the-art methods, including two-stage or one-stage ones, rely on a complex module with manually designed mechanisms to perform the query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image scene graphs, makes the models easily overfit to datasets with specific scenarios and limits the full interaction between the visual and linguistic contexts. To avoid this caveat, we propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we reformulate visual grounding as a direct coordinate regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments are conducted on five widely used datasets, and a series of state-of-the-art records are set by our TransVG. We establish a benchmark for transformer-based visual grounding frameworks and will make our code available to the public.
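The abstract's key claim, that a plain stack of transformer encoder layers plus direct coordinate regression can replace hand-designed fusion modules, can be sketched compactly. The PyTorch snippet below is a toy stand-in under assumed feature dimensions, token counts, and a learnable regression token; it is not the paper's actual TransVG architecture.

```python
import torch
import torch.nn as nn

class MinimalGroundingHead(nn.Module):
    """Toy fusion-and-regression head: a plain stack of transformer encoder layers over
    concatenated visual and linguistic tokens, followed by direct regression of the
    normalized box coordinates (cx, cy, w, h)."""
    def __init__(self, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable query token
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: B x Nv x D, text_tokens: B x Nt x D (already embedded/projected)
        b = visual_tokens.size(0)
        tokens = torch.cat([self.reg_token.expand(b, -1, -1),
                            visual_tokens, text_tokens], dim=1)
        fused = self.encoder(tokens)
        return self.box_head(fused[:, 0])   # box regressed from the leading token

# toy usage with random features
head = MinimalGroundingHead()
box = head(torch.randn(2, 196, 256), torch.randn(2, 20, 256))   # -> 2 x 4
```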

2 citations