Author

Qiuyu Mao

Bio: Qiuyu Mao is an academic researcher from the University of Science and Technology of China. The author has contributed to research on the topics of object detection and the image plane. The author has an h-index of 1 and has co-authored 2 publications, which have received 2 citations.

Papers
Posted Content
TL;DR: VPFNet, as discussed by the authors, aligns and aggregates point cloud and image data at 'virtual' points; with their density lying between that of the 3D points and the 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing.
Abstract: It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is not trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent proposals generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at the 'virtual' points. Particularly, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors, and thus preserve more information for processing. Moreover, we also investigate the data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution towards 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset, and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021. The network design also takes computation efficiency into consideration: we can achieve an FPS of 15 on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.
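The central idea, aggregating multi-modal features at virtual points that are denser than the LiDAR returns but sparser than the pixel grid, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Python illustration that assumes the points are already expressed in the camera frame and that a 3x4 projection matrix P is available; the function names, the jitter-based virtual-point sampling, and the nearest-pixel feature lookup are assumptions for illustration, not VPFNet's actual sampling or aggregation scheme.

```python
import numpy as np

def project_to_image(points_xyz, P):
    """Project Nx3 camera-frame points onto the image plane with a 3x4 projection matrix."""
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # N x 4
    uvw = homo @ P.T                                                   # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                                    # N x 2 pixel coords

def fuse_at_virtual_points(lidar_points, image_feats, P, n_virtual=2048):
    """Toy virtual-point fusion: sample 'virtual' 3D points around the sparse LiDAR points
    (denser than the point cloud, sparser than the pixel grid), project them into the image,
    and pair each with a nearest-pixel image feature."""
    # Jitter random copies of the real points to densify the aggregation locations.
    idx = np.random.randint(0, lidar_points.shape[0], size=n_virtual)
    virtual_xyz = lidar_points[idx] + np.random.normal(scale=0.1, size=(n_virtual, 3))

    uv = np.round(project_to_image(virtual_xyz, P)).astype(int)
    h, w, _ = image_feats.shape
    uv[:, 0] = np.clip(uv[:, 0], 0, w - 1)
    uv[:, 1] = np.clip(uv[:, 1], 0, h - 1)

    img_part = image_feats[uv[:, 1], uv[:, 0]]     # n_virtual x C image features
    return np.hstack([virtual_xyz, img_part])      # fused per-virtual-point features
```

In a full detector the fused per-virtual-point features would feed a 3D detection head; that part is omitted here.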

10 citations

Posted Content
TL;DR: In this article, a survey of multi-sensor fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs, is presented.
Abstract: In the past few years, we have witnessed rapid development of autonomous driving. However, achieving full autonomy remains a daunting task due to the complex and dynamic driving environment. As a result, self-driving cars are equipped with a suite of sensors to conduct robust and accurate environment perception. As the number and type of sensors keep increasing, combining them for better perception is becoming a natural trend. So far, there has been no in-depth review that focuses on multi-sensor fusion based perception. To bridge this gap and motivate future research, this survey is devoted to reviewing recent fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce the background of popular sensors for autonomous cars, including their common data representations as well as object detection networks developed for each type of sensor data. Next, we discuss some popular datasets for multi-modal 3D object detection, with a special focus on the sensor data included in each dataset. Then we present in-depth reviews of recent multi-modal 3D detection networks by considering the following three aspects of the fusion: fusion location, fusion data representation, and fusion granularity. After a detailed review, we discuss open challenges and point out possible solutions. We hope that our detailed review can help researchers embark on investigations in the area of multi-modal 3D object detection.
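The survey's three-axis view of fusion (location, data representation, granularity) can be written down as a small data structure. The sketch below is purely illustrative; the category names and the example method are assumptions and do not necessarily match the survey's exact taxonomy.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FusionLocation(Enum):
    EARLY = auto()    # fuse raw or low-level sensor data before detection
    MIDDLE = auto()   # fuse intermediate feature maps inside the network
    LATE = auto()     # fuse per-sensor detections or decisions

class FusionRepresentation(Enum):
    POINT = auto()          # 3D points carrying sampled image data
    VOXEL_OR_BEV = auto()   # voxel grids or bird's-eye-view feature maps
    PROPOSAL = auto()       # region proposals shared across modalities

class FusionGranularity(Enum):
    ROI_LEVEL = auto()      # one fused feature per region of interest
    POINT_LEVEL = auto()    # fusion at individual points or pixels

@dataclass
class FusionMethod:
    name: str
    location: FusionLocation
    representation: FusionRepresentation
    granularity: FusionGranularity

# e.g. a hypothetical point-level, middle-fusion method operating on points
example = FusionMethod("some-method", FusionLocation.MIDDLE,
                       FusionRepresentation.POINT, FusionGranularity.POINT_LEVEL)
```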

3 citations


Cited by
Journal ArticleDOI
TL;DR: Zhou et al., as discussed by the authors, align and aggregate the point cloud and image data at virtual points, which can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing.
Abstract: It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is non-trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent approaches generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, these approaches often suffer from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at the "virtual" points. Particularly, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors, and thus preserve more information for processing. Moreover, we also investigate the data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution towards 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset, and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate $AP_{3D}$ and 91.86% moderate $AP_{BEV}$ on the KITTI test set. The network design also takes computation efficiency into consideration: we can achieve an FPS of 15 on a single NVIDIA RTX 2080Ti GPU. The source code is available at https://github.com/zhukevkesky/VPFNet.git.

10 citations

Journal ArticleDOI
TL;DR: Wang et al., as mentioned in this paper, propose an effective pipeline that combines SVI and point clouds for the inventory of street furniture, which consists of three steps; in the first, off-the-shelf street furniture detection models are applied to SVI to generate two-dimensional (2D) proposals, from which 3D point cloud frustums are cropped accordingly.
Abstract: An outdated or sketchy inventory of street furniture may misguide planners on the renovation and upgrade of transportation infrastructure, thus posing potential threats to traffic safety. Previous studies have used point clouds or street-view imagery (SVI) for street furniture inventory, but a gap remains in balancing semantic richness, localization accuracy and working efficiency. Therefore, this paper proposes an effective pipeline that combines SVI and point clouds for the inventory of street furniture. The proposed pipeline encompasses three steps: (1) Off-the-shelf street furniture detection models are applied to SVI to generate two-dimensional (2D) proposals, and three-dimensional (3D) point cloud frustums are cropped accordingly; (2) The instance mask and the instance 3D bounding box are predicted for each frustum using a multi-task neural network; (3) Frustums from adjacent perspectives are associated and fused via multi-object tracking, after which the object-centric instance segmentation outputs the final street furniture with 3D locations and semantic labels. This pipeline was validated on datasets collected in Shanghai and Wuhan, producing a component-level street furniture inventory of nine classes. The instance-level mean recall and precision reach 86.4% and 80.9% in Shanghai and 83.2% and 87.8% in Wuhan, respectively, and the point-level mean recall, precision, and weighted coverage all exceed 73.7%.
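The frustum-cropping operation in step (1) is purely geometric and easy to sketch. The snippet below is a minimal illustration under the assumption that the point cloud is already expressed in the camera frame and that a 3x4 projection matrix is available; the function name and interface are made up for illustration, and the multi-task network of step (2) and the tracking-based fusion of step (3) are not reproduced.

```python
import numpy as np

def crop_frustum(points_xyz, box_2d, P):
    """Keep the points whose image-plane projections fall inside a 2D proposal.
    points_xyz: N x 3 points in the camera frame,
    box_2d: (u_min, v_min, u_max, v_max) in pixels,
    P: 3 x 4 camera projection matrix."""
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    uvw = homo @ P.T
    in_front = uvw[:, 2] > 0                          # discard points behind the camera
    depth = np.where(uvw[:, 2:3] == 0, 1e-6, uvw[:, 2:3])
    uv = uvw[:, :2] / depth                           # pixel coordinates

    u_min, v_min, u_max, v_max = box_2d
    inside = (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) & \
             (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    return points_xyz[in_front & inside]              # the cropped point-cloud frustum
```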

4 citations

Journal ArticleDOI
01 Oct 2022 - Sensors
TL;DR: This paper presents a two-stage solution to 3D vehicle detection and segmentation that outperforms all published solutions in terms of 6 degrees of freedom error (6 DoF err).
Abstract: In this paper, we present a two-stage solution to 3D vehicle detection and segmentation. The first stage combines the EfficientNetB3 architecture with multi-parallel residual blocks (inspired by the CenterNet architecture) for 3D localization and pose estimation of the vehicles in the scene. The second stage takes the output of the first stage (cropped car images) as input to train EfficientNetB3 for the image recognition task. Using predefined 3D models, we substitute each vehicle in the scene with its match using the rotation matrix and translation vector from the first stage to obtain the 3D detection bounding boxes and segmentation masks. We trained our models on an open-source dataset (ApolloCar3D). Our method outperforms all published solutions in terms of 6 degrees of freedom error (6 DoF err).
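The substitution step described above is essentially a rigid transform of a template mesh by the estimated rotation matrix and translation vector. The sketch below illustrates that step only, with made-up function names and a toy cube standing in for a vehicle template; it is not the paper's actual 3D models or code.

```python
import numpy as np

def place_template(template_vertices, R, t):
    """Transform a predefined 3D vehicle template into the scene using the rotation
    matrix R (3x3) and translation vector t (3,) estimated in the first stage."""
    return template_vertices @ R.T + t          # V x 3 vertices in scene coordinates

def bbox_from_vertices(vertices):
    """Axis-aligned 3D bounding box (min corner, max corner) of the placed template."""
    return vertices.min(axis=0), vertices.max(axis=0)

# Toy usage: a unit-cube 'vehicle' template, a 30-degree yaw rotation, and a translation.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
yaw = np.deg2rad(30.0)
R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
              [np.sin(yaw),  np.cos(yaw), 0.0],
              [0.0,          0.0,         1.0]])
t = np.array([10.0, 2.0, 0.0])
corner_min, corner_max = bbox_from_vertices(place_template(cube, R, t))
```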

3 citations

Posted Content
TL;DR: TransVG, as mentioned in this paper, is a transformer-based framework for visual grounding that establishes the multi-modal correspondence by leveraging transformers; the authors empirically show that complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
Abstract: In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image. The state-of-the-art methods, including two-stage or one-stage ones, rely on a complex module with manually designed mechanisms to perform the query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image scene graphs, makes the models easily overfit to datasets with specific scenarios and limits the full interaction between the visual and linguistic contexts. To avoid this caveat, we propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we reformulate visual grounding as a direct coordinate regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments are conducted on five widely used datasets, and a series of state-of-the-art records are set by our TransVG. We establish a benchmark for transformer-based visual grounding frameworks and will make our code available to the public.
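The abstract's key claim, that a plain stack of transformer encoder layers plus direct coordinate regression can replace hand-designed fusion modules, can be sketched compactly. The PyTorch snippet below is a toy stand-in under assumed feature dimensions, token counts, and a learnable regression token; it is not the paper's actual TransVG architecture.

```python
import torch
import torch.nn as nn

class MinimalGroundingHead(nn.Module):
    """Toy fusion-and-regression head: a plain stack of transformer encoder layers over
    concatenated visual and linguistic tokens, followed by direct regression of the
    normalized box coordinates (cx, cy, w, h)."""
    def __init__(self, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable query token
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: B x Nv x D, text_tokens: B x Nt x D (already embedded/projected)
        b = visual_tokens.size(0)
        tokens = torch.cat([self.reg_token.expand(b, -1, -1),
                            visual_tokens, text_tokens], dim=1)
        fused = self.encoder(tokens)
        return self.box_head(fused[:, 0])   # box regressed from the leading token

# toy usage with random features
head = MinimalGroundingHead()
box = head(torch.randn(2, 196, 256), torch.randn(2, 20, 256))   # -> 2 x 4
```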

2 citations