
Showing papers by "Hang Zhao" published in 2021


Posted Content•
TL;DR: In this article, the authors introduce a large-scale interactive motion dataset with over 100,000 scenes, each 20 seconds long at 10 Hz, collected by mining for interesting interactions between vehicles, pedestrians, and cyclists across six cities within the United States.
Abstract: As autonomous driving systems mature, motion forecasting has received increasing attention as a critical requirement for planning. Of particular importance are interactive situations such as merges, unprotected turns, etc., where predicting individual object motion is not sufficient. Joint predictions of multiple objects are required for effective route planning. There has been a critical need for high-quality motion data that is rich in both interactions and annotation to develop motion planning models. In this work, we introduce the most diverse interactive motion dataset to our knowledge, and provide specific labels for interacting objects suitable for developing joint prediction models. With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways. It was collected by mining for interesting interactions between vehicles, pedestrians, and cyclists across six cities within the United States. We use a high-accuracy 3D auto-labeling system to generate high quality 3D bounding boxes for each road agent, and provide corresponding high definition 3D maps for each scene. Furthermore, we introduce a new set of metrics that provides a comprehensive evaluation of both single agent and joint agent interaction motion forecasting models. Finally, we provide strong baseline models for individual-agent prediction and joint-prediction. We hope that this new large-scale interactive motion dataset will provide new opportunities for advancing motion forecasting models.

63 citations
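As a quick sanity check of the reported scale, the figures in the abstract are mutually consistent. The snippet below uses an assumed illustrative scene count (the paper only says "over 100,000"); it is not a number taken from the dataset release.

```python
# Hedged sanity check of the reported dataset scale.
# 103,354 scenes is an assumed illustrative figure, not the paper's exact count.
num_scenes = 103_354
scene_seconds = 20          # each scene is 20 s long
sample_rate_hz = 10         # trajectories sampled at 10 Hz

total_hours = num_scenes * scene_seconds / 3600
frames_per_scene = scene_seconds * sample_rate_hz

print(f"~{total_hours:.0f} hours of driving data")   # ~574 hours, matching ">570 hours"
print(f"{frames_per_scene} timestamps per scene")    # 200 timestamps at 10 Hz
```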


Proceedings Article•
22 Aug 2021
TL;DR: DenseTNT as mentioned in this paper proposes an anchor-free and end-to-end trajectory prediction model that directly outputs a set of trajectories from dense goal candidates and uses an offline optimization-based technique to provide multi-future pseudo-labels.
Abstract: Due to the stochasticity of human behaviors, predicting the future trajectories of road agents is challenging for autonomous driving. Recently, goal-based multi-trajectory prediction methods are proved to be effective, where they first score over-sampled goal candidates and then select a final set from them. However, these methods usually involve goal predictions based on sparse pre-defined anchors and heuristic goal selection algorithms. In this work, we propose an anchor-free and end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates. In addition, we introduce an offline optimization-based technique to provide multi-future pseudo-labels for our final online model. Experiments show that DenseTNT achieves state-of-the-art performance, ranking 1st on the Argoverse motion forecasting benchmark and being the 1st place winner of the 2021 Waymo Open Dataset Motion Prediction Challenge.

47 citations
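To make the "dense goal candidates" idea concrete, here is a minimal sketch (not the authors' implementation): a small MLP scores every candidate goal point against a shared context feature, and the highest-probability goals are kept. All layer sizes and sampling choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseGoalScorer(nn.Module):
    """Toy dense-goal head: assigns a probability to every candidate goal point.
    Shapes and layer sizes are illustrative, not DenseTNT's actual architecture."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, goal_xy, context):
        # goal_xy: (N, 2) dense candidate coordinates sampled near lanes
        # context: (feat_dim,) encoded scene/agent feature shared by all candidates
        ctx = context.expand(goal_xy.size(0), -1)
        scores = self.mlp(torch.cat([goal_xy, ctx], dim=-1)).squeeze(-1)
        return scores.softmax(dim=0)          # dense goal probability distribution

scorer = DenseGoalScorer()
goals = torch.rand(5000, 2) * 100             # dense candidates in a 100 m region
probs = scorer(goals, torch.rand(64))
topk = probs.topk(6).indices                  # keep a handful of likely goals
print(goals[topk])
```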


Proceedings Article•DOI•
28 Jun 2021
TL;DR: In this article, a hierarchical graph generation model is proposed to generate high-quality and diverse high-definition maps through a coarse-to-fine approach, which significantly outperforms baseline methods.
Abstract: High Definition (HD) maps are maps with precise definitions of road lanes and rich semantics of the traffic rules. They are critical for several key stages in an autonomous driving system, including motion forecasting and planning. However, only a small number of real-world road topologies and geometries are available, which significantly limits our ability to test how well the self-driving stack generalizes to new, unseen scenarios. To address this issue, we introduce a new challenging task to generate HD maps. In this work, we explore several autoregressive models using different data representations, including sequence, plain graph, and hierarchical graph. We propose HDMapGen, a hierarchical graph generation model capable of producing high-quality and diverse HD maps through a coarse-to-fine approach. Experiments on the Argoverse dataset and an in-house dataset show that HDMapGen significantly outperforms baseline methods. Additionally, we demonstrate that HDMapGen achieves high scalability and efficiency.

23 citations
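The coarse-to-fine, hierarchical generation scheme can be illustrated with a toy two-stage autoregressive loop: first emit global key points one at a time until a stop decision, then fill each coarse edge with intermediate lane points. This is a sketch of the general idea only; the stop head and decoder below are random stand-ins, not HDMapGen.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_global_nodes(max_nodes=6):
    """Coarse stage: autoregressively emit key points until a stop token."""
    nodes = [np.zeros(2)]
    for _ in range(max_nodes - 1):
        if rng.random() < 0.2:                   # stand-in for a learned stop head
            break
        step = rng.normal(scale=20.0, size=2)    # stand-in for a learned decoder step
        nodes.append(nodes[-1] + step)
    return np.stack(nodes)

def refine_edge(p, q, n_points=5):
    """Fine stage: fill each coarse edge with intermediate lane points."""
    t = np.linspace(0, 1, n_points)[:, None]
    jitter = rng.normal(scale=0.5, size=(n_points, 2))
    return (1 - t) * p + t * q + jitter

coarse = sample_global_nodes()
fine = [refine_edge(coarse[i], coarse[i + 1]) for i in range(len(coarse) - 1)]
print(len(coarse), "key points,", sum(len(f) for f in fine), "refined lane points")
```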


Posted Content•
TL;DR: Opposite to traditional knowledge distillation, where the student is designed to be lightweight and inferior to the teacher, it is observed that a multimodal student model consistently rectifies pseudo labels and generalizes better than its teacher.
Abstract: The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data. Since existing datasets and well-trained models are primarily unimodal, the modality gap between a unimodal network and unlabeled multimodal data poses an interesting problem: how to transfer a pre-trained unimodal network to perform the same task on unlabeled multimodal data? In this work, we propose multimodal knowledge expansion (MKE), a knowledge distillation-based framework to effectively utilize multimodal data without requiring labels. Opposite to traditional knowledge distillation, where the student is designed to be lightweight and inferior to the teacher, we observe that a multimodal student model consistently denoises pseudo labels and generalizes better than its teacher. Extensive experiments on four tasks and different modalities verify this finding. Furthermore, we connect the mechanism of MKE to semi-supervised learning and offer both empirical and theoretical explanations to understand the denoising capability of a multimodal student.

10 citations
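The distillation recipe described here is simple enough to sketch end to end: a pretrained unimodal teacher pseudo-labels the unlabeled multimodal data, and a multimodal student is trained on those pseudo labels. All modules, dimensions, and modalities below are placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn

teacher = nn.Linear(32, 10)        # stands in for a pretrained unimodal network
student = nn.Linear(32 + 16, 10)   # multimodal student sees both modalities

opt = torch.optim.SGD(student.parameters(), lr=0.1)
for _ in range(100):
    x_a = torch.randn(64, 32)      # modality the teacher understands (e.g. RGB)
    x_b = torch.randn(64, 16)      # extra unlabeled modality (e.g. depth or audio)

    with torch.no_grad():
        pseudo = teacher(x_a).argmax(dim=-1)     # teacher's pseudo labels

    logits = student(torch.cat([x_a, x_b], dim=-1))
    loss = nn.functional.cross_entropy(logits, pseudo)
    opt.zero_grad(); loss.backward(); opt.step()
```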


Posted Content•
TL;DR: Li et al. as discussed by the authors proposed an online map learning method that dynamically constructs HD maps from local sensor observations, providing semantic and geometric priors to self-driving vehicles in a more scalable way than traditional pre-annotated HD maps.
Abstract: High-definition map (HD map) construction is a crucial problem for autonomous driving. This problem typically involves collecting high-quality point clouds, fusing multiple point clouds of the same scene, annotating map elements, and updating maps constantly. This pipeline, however, requires a vast amount of human efforts and resources which limits its scalability. Additionally, traditional HD maps are coupled with centimeter-level accurate localization which is unreliable in many scenarios. In this paper, we argue that online map learning, which dynamically constructs the HD maps based on local sensor observations, is a more scalable way to provide semantic and geometry priors to self-driving vehicles than traditional pre-annotated HD maps. Meanwhile, we introduce an online map learning method, titled HDMapNet. It encodes image features from surrounding cameras and/or point clouds from LiDAR, and predicts vectorized map elements in the bird's-eye view. We benchmark HDMapNet on the nuScenes dataset and show that in all settings, it performs better than baseline methods. Of note, our fusion-based HDMapNet outperforms existing methods by more than 50% in all metrics. To accelerate future research, we develop customized metrics to evaluate map learning performance, including both semantic-level and instance-level ones. By introducing this method and metrics, we invite the community to study this novel map learning problem. We will release our code and evaluation kit to facilitate future development.

6 citations
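The interface the abstract describes (surround-view images and/or LiDAR in, bird's-eye-view map elements out) can be sketched as below. Every module is a placeholder, including the crude "view transform"; this is not HDMapNet's architecture.

```python
import torch
import torch.nn as nn

class ToyBEVMapNet(nn.Module):
    """Placeholder surround-camera -> BEV semantic map head (not HDMapNet itself)."""
    def __init__(self, n_classes=3, bev=(200, 200)):
        super().__init__()
        self.bev = bev
        self.img_enc = nn.Conv2d(3, 16, 3, stride=4, padding=1)
        self.bev_head = nn.Conv2d(16, n_classes, 1)   # e.g. lane / boundary / crossing

    def forward(self, imgs):                  # imgs: (B, n_cams, 3, H, W)
        b, n, c, h, w = imgs.shape
        feats = self.img_enc(imgs.flatten(0, 1))             # per-camera features
        feats = feats.view(b, n, 16, *feats.shape[-2:]).mean(dim=1)
        feats = nn.functional.interpolate(feats, size=self.bev)  # crude stand-in view transform
        return self.bev_head(feats)           # (B, n_classes, 200, 200) BEV logits

net = ToyBEVMapNet()
print(net(torch.rand(1, 6, 3, 128, 352)).shape)
```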


Posted Content•
TL;DR: In this article, the authors verify the existence of complete collapse and discover another reachable collapse pattern that is usually overlooked, namely dimensional collapse, and connect dimensional collapse with strong correlations between axes and consider such connection as a strong motivation for feature decorrelation.
Abstract: In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations. A potential issue of this idea is the existence of completely collapsed solutions (i.e., constant features), which are typically avoided implicitly by carefully chosen implementation details. In this work, we study a relatively concise framework containing the most common components from recent approaches. We verify the existence of complete collapse and discover another reachable collapse pattern that is usually overlooked, namely dimensional collapse. We connect dimensional collapse with strong correlations between axes and consider such connection as a strong motivation for feature decorrelation (i.e., standardizing the covariance matrix). The capability of correlation as an unsupervised metric and the gains from feature decorrelation are verified empirically to highlight the importance and the potential of this insight.

4 citations
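The diagnostic and remedy the abstract points to can be sketched directly: measure the cross-dimension correlation of a batch of embeddings, and penalize off-diagonal correlations so the covariance is pushed toward the identity. This is one simple way to "standardize the covariance matrix", not the paper's exact objective.

```python
import torch

def correlation_matrix(z, eps=1e-5):
    """Pearson correlation across feature dimensions for a batch z of shape (N, D)."""
    z = (z - z.mean(dim=0)) / (z.std(dim=0) + eps)
    return (z.T @ z) / (z.size(0) - 1)

def decorrelation_loss(z):
    """Penalize off-diagonal correlations; dimensional collapse shows up as large values."""
    c = correlation_matrix(z)
    off_diag = c - torch.diag(torch.diag(c))
    return (off_diag ** 2).sum() / z.size(1)

z = torch.randn(256, 128) @ torch.randn(128, 128)   # correlated toy embeddings
print(float(decorrelation_loss(z)))                  # large value -> dimensions entangled
```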


Proceedings Article•DOI•
30 Aug 2021
TL;DR: In this article, a contrastive learning-based adversarial approach for voice conversion is proposed, which requires only efficient one-way GAN training by taking advantage of contrastive learning.
Abstract: Cycle consistent generative adversarial network (CycleGAN) and variational autoencoder (VAE) based models have gained popularity in non-parallel voice conversion recently. However, they often suffer from difficult training process and unsatisfactory results. In this paper, we propose CVC, a contrastive learning-based adversarial approach for voice conversion. Compared to previous CycleGAN-based methods, CVC requires only efficient one-way GAN training by taking advantage of contrastive learning. When it comes to non-parallel one-to-one voice conversion, CVC is on par with or better than CycleGAN and VAE while effectively reducing training time. CVC further demonstrates superior performance in many-to-one voice conversion, enabling the conversion from unseen speakers.

4 citations
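The contrastive term that lets a one-way GAN preserve content can be sketched as a frame-wise InfoNCE loss: frame t of the converted utterance should match frame t of the source, with all other frames as negatives. Feature shapes and the temperature are assumptions, not CVC's published settings.

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(src_feat, conv_feat, tau=0.07):
    """src_feat, conv_feat: (T, D) frame-level features of the source / converted speech.
    Frame t of the converted utterance is positive for frame t of the source."""
    src = F.normalize(src_feat, dim=-1)
    conv = F.normalize(conv_feat, dim=-1)
    logits = conv @ src.T / tau               # (T, T) similarity matrix
    targets = torch.arange(src.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

src = torch.randn(100, 256)
conv = src + 0.1 * torch.randn(100, 256)      # converted speech keeps linguistic content
print(float(frame_contrastive_loss(src, conv)))
```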


Posted Content•
Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang
TL;DR: In this paper, the authors prove that learning with multiple modalities achieves a smaller population risk than using only a subset of modalities, since the former yields a more accurate estimate of the latent space representation.
Abstract: The world provides us with data of multiple modalities. Intuitively, models fusing data from different modalities outperform unimodal models, since more information is aggregated. Recently, joining the success of deep learning, there is an influential line of work on deep multimodal learning, which has remarkable empirical results on various applications. However, theoretical justifications in this field are notably lacking. Can multimodal learning provably perform better than unimodal? In this paper, we answer this question under a most popular multimodal learning framework, which firstly encodes features from different modalities into a common latent space and seamlessly maps the latent representations into the task space. We prove that learning with multiple modalities achieves a smaller population risk than only using its subset of modalities. The main intuition is that the former has a more accurate estimate of the latent space representation. To the best of our knowledge, this is the first theoretical treatment to capture important qualitative phenomena observed in real multimodal applications. Combining with experiment results, we show that multimodal learning does possess an appealing formal guarantee.

2 citations
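In schematic form, the claimed guarantee compares the population risk of the learner that fuses all modalities against that of any subset learner. The notation below is mine for illustration, not the paper's, and it omits the estimation-error terms the formal statement carries.

```latex
% Schematic restatement (illustrative notation, estimation terms omitted):
% with modality set $\mathcal{M}$ and any subset $\mathcal{N} \subseteq \mathcal{M}$,
r\big(\hat{g}_{\mathcal{M}}\big) \;\le\; r\big(\hat{g}_{\mathcal{N}}\big),
\qquad \mathcal{N} \subseteq \mathcal{M},
\quad \text{where } r(g) = \mathbb{E}_{(x,y)}\big[\ell\big(g(x), y\big)\big],
% because the full-modality learner estimates the shared latent representation
% more accurately than the subset learner.
```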


Proceedings Article•
02 May 2021
TL;DR: In this paper, the authors verify the existence of complete collapse and discover another reachable collapse pattern that is usually overlooked, namely dimensional collapse, and connect dimensional collapse with strong correlations between axes and consider such connection as a strong motivation for feature decorrelation.
Abstract: In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations. A potential issue of this idea is the existence of completely collapsed solutions (i.e., constant features), which are typically avoided implicitly by carefully chosen implementation details. In this work, we study a relatively concise framework containing the most common components from recent approaches. We verify the existence of complete collapse and discover another reachable collapse pattern that is usually overlooked, namely dimensional collapse. We connect dimensional collapse with strong correlations between axes and consider such connection as a strong motivation for feature decorrelation (i.e., standardizing the covariance matrix). The capability of correlation as an unsupervised metric and the gains from feature decorrelation are verified empirically to highlight the importance and the potential of this insight.

2 citations


Posted Content•
TL;DR: Neural Dubber as mentioned in this paper is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech; an image-based speaker embedding module developed for the multi-speaker setting further enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face.
Abstract: Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

1 citation
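One way to picture how video conditioning might enter a TTS forward pass is sketched below: lip-frame features attend over phoneme encodings so the alignment (and hence prosody and timing) follows the lip motion, and a face-derived speaker embedding shifts the decoder input. This is purely illustrative and is not Neural Dubber's architecture; all modules and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ToyVideoConditionedTTS(nn.Module):
    """Illustrative text + lip-video -> mel sketch (not the Neural Dubber model)."""
    def __init__(self, d=128, n_mels=80):
        super().__init__()
        self.text_enc = nn.Embedding(64, d)            # phoneme embeddings
        self.lip_enc = nn.Linear(512, d)               # per-frame lip features
        self.attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.speaker_proj = nn.Linear(256, d)          # image-based speaker embedding
        self.mel_head = nn.Linear(d, n_mels)

    def forward(self, phonemes, lip_feats, face_emb):
        txt = self.text_enc(phonemes)                  # (B, L, d)
        lips = self.lip_enc(lip_feats)                 # (B, T, d) video frames
        # video frames query the text, so alignment follows the lip motion
        aligned, _ = self.attn(lips, txt, txt)
        aligned = aligned + self.speaker_proj(face_emb).unsqueeze(1)
        return self.mel_head(aligned)                  # (B, T, n_mels)

m = ToyVideoConditionedTTS()
out = m(torch.randint(0, 64, (1, 20)), torch.rand(1, 75, 512), torch.rand(1, 256))
print(out.shape)                                       # one mel frame per video frame
```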


Posted Content•
TL;DR: CivT as mentioned in this paper introduces lightweight teachers with different architectural inductive biases (e.g., convolution and involution) to co-advise the student transformer, which improves the student's performance during distillation.
Abstract: Transformers recently are adapted from the community of natural language processing as a promising substitute of convolution-based neural networks for visual learning tasks. However, its supremacy degenerates given an insufficient amount of training data (e.g., ImageNet). To make it into practical utility, we propose a novel distillation-based method to train vision transformers. Unlike previous works, where merely heavy convolution-based teachers are provided, we introduce lightweight teachers with different architectural inductive biases (e.g., convolution and involution) to co-advise the student transformer. The key is that teachers with different inductive biases attain different knowledge despite that they are trained on the same dataset, and such different knowledge compounds and boosts the student's performance during distillation. Equipped with this cross inductive bias distillation method, our vision transformers (termed as CivT) outperform all previous transformers of the same architecture on ImageNet.
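The co-advising idea reduces to distilling from two teachers with different inductive biases at once, e.g. by averaging two soft KL terms alongside the usual hard-label loss. The models below are linear stand-ins and the temperature is an assumed value; this is a sketch of the loss shape, not the paper's training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(3 * 32 * 32, 10)       # stand-in for a small vision transformer
conv_teacher = nn.Linear(3 * 32 * 32, 10)  # stand-in for a convolutional teacher
inv_teacher = nn.Linear(3 * 32 * 32, 10)   # stand-in for an involution teacher

def co_advise_loss(x, y, T=4.0):
    s = student(x)
    hard = F.cross_entropy(s, y)
    soft = 0.0
    for t in (conv_teacher, inv_teacher):    # teachers with different inductive biases
        with torch.no_grad():
            p = F.softmax(t(x) / T, dim=-1)
        soft = soft + F.kl_div(F.log_softmax(s / T, dim=-1), p, reduction="batchmean")
    return hard + (T * T) * soft / 2         # average the two distillation terms

x, y = torch.randn(16, 3 * 32 * 32), torch.randint(0, 10, (16,))
print(float(co_advise_loss(x, y)))
```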

Posted Content•
TL;DR: In this article, the authors show that the self-supervised loss can be decomposed into exploration for novel states and robustness improvement from nuisance elimination; the resulting IM-SSR can be easily plugged into any reinforcement learning algorithm with self-supervised auxiliary objectives at nearly no additional cost.
Abstract: In vision-based reinforcement learning (RL) tasks, it is prevalent to assign the auxiliary task with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in the auxiliary task, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
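At its core, using the self-supervised loss as an intrinsic reward is one line of reward shaping: pay out the (detached) auxiliary loss as an exploration bonus on top of the environment reward. The scale beta below is an assumed hyperparameter for illustration.

```python
import torch

def shaped_reward(extrinsic_reward, ssl_loss, beta=0.1):
    """IM-SSR-style shaping (sketch): a high self-supervised loss flags novel or
    noisy observations, so it is paid out as an intrinsic exploration bonus."""
    intrinsic = ssl_loss.detach()            # never backprop the reward into the encoder
    return extrinsic_reward + beta * intrinsic

# e.g. inside an off-policy update loop:
r_env = torch.tensor([0.0, 1.0, 0.0])        # sparse task reward for a minibatch
aux = torch.tensor([2.3, 0.4, 1.7])          # per-sample contrastive/auxiliary loss
print(shaped_reward(r_env, aux))
```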

Posted Content•
TL;DR: In this article, the authors proposed Uni-Modal Teacher, which combines the fusion objective and uni-modal distillation to tackle the modality failure problem, and showed that their method not only drastically improves the representation of each modality, but also improves the overall multimodal task performance.
Abstract: Learning multi-modal representations is an essential step towards real-world robotic applications, and various multi-modal fusion models have been developed for this purpose. However, we observe that existing models, whose objectives are mostly based on joint training, often suffer from learning inferior representations of each modality. We name this problem Modality Failure, and hypothesize that the imbalance of modalities and the implicit bias of common objectives in fusion method prevent encoders of each modality from sufficient feature learning. To this end, we propose a new multi-modal learning method, Uni-Modal Teacher, which combines the fusion objective and uni-modal distillation to tackle the modality failure problem. We show that our method not only drastically improves the representation of each modality, but also improves the overall multi-modal task performance. Our method can be effectively generalized to most multi-modal fusion approaches. We achieve more than 3% improvement on the VGGSound audio-visual classification task, as well as improving performance on the NYU depth V2 RGB-D image segmentation task.
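The combination of a fusion objective with per-modality distillation can be sketched schematically: each encoder is pulled toward a frozen unimodal teacher while the fused head is trained on the task. All modules, feature sizes, and the weighting lam are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_a, enc_b = nn.Linear(128, 64), nn.Linear(64, 64)          # e.g. audio / visual encoders
fusion_head = nn.Linear(128, 10)
teacher_a, teacher_b = nn.Linear(128, 64), nn.Linear(64, 64)   # frozen unimodal teachers

def umt_loss(x_a, x_b, y, lam=1.0):
    f_a, f_b = enc_a(x_a), enc_b(x_b)
    task = F.cross_entropy(fusion_head(torch.cat([f_a, f_b], dim=-1)), y)
    with torch.no_grad():
        t_a, t_b = teacher_a(x_a), teacher_b(x_b)
    distill = F.mse_loss(f_a, t_a) + F.mse_loss(f_b, t_b)       # keep each encoder strong
    return task + lam * distill

x_a, x_b, y = torch.randn(8, 128), torch.randn(8, 64), torch.randint(0, 10, (8,))
print(float(umt_loss(x_a, x_b, y)))
```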

Posted Content•
TL;DR: The authors proposed an anchor-free model, named DenseTNT, which performs dense goal probability estimation for trajectory prediction and achieved state-of-the-art performance on the Waymo Open Dataset Motion Prediction Challenge.
Abstract: In autonomous driving, goal-based multi-trajectory prediction methods are proved to be effective recently, where they first score goal candidates, then select a final set of goals, and finally complete trajectories based on the selected goals. However, these methods usually involve goal predictions based on sparse predefined anchors. In this work, we propose an anchor-free model, named DenseTNT, which performs dense goal probability estimation for trajectory prediction. Our model achieves state-of-the-art performance, and ranks 1st on the Waymo Open Dataset Motion Prediction Challenge.

Journal Article•
TL;DR: This work introduces a self-supervised exploration policy by incentivizing the agent to maximize multisensory incongruity, which can be measured in two aspects: perception incongruity and action incongruity.
Abstract: Efficient exploration is a long-standing problem in reinforcement learning. In this work, we introduce a self-supervised exploration policy by incentivizing the agent to maximize multisensory incongruity, which can be measured in two aspects: perception incongruity and action incongruity. The former represents the uncertainty in the multisensory fusion model, while the latter represents the uncertainty in the agent's policy. Specifically, an alignment predictor is trained to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. The policy takes the multisensory observations with sensory-wise dropout as input, and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Our formulation allows the agent to learn skills by exploring in a self-supervised manner without any external rewards. Besides, our method enables the agent to learn a compact multimodal representation from hard examples, which further improves the sample efficiency of our policy learning. We demonstrate the efficacy of this formulation across a variety of benchmark environments including object manipulation and audio-visual games.
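The two intrinsic signals described above can be sketched directly: the alignment predictor's error gives perception incongruity, and the variance of actions under sensory-wise dropout gives action incongruity. The networks below are linear stand-ins; shapes and the dropout scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

align_predictor = nn.Linear(32 + 32, 1)      # stand-in: are vision and audio aligned?
policy = nn.Linear(32 + 32, 4)               # stand-in policy over 4 actions

def intrinsic_reward(vision, audio, n_dropout=8):
    # Perception incongruity: alignment predictor error (1 = predicted aligned pair).
    pred = torch.sigmoid(align_predictor(torch.cat([vision, audio], dim=-1)))
    perception = (1.0 - pred).squeeze(-1)

    # Action incongruity: variance of actions under sensory-wise dropout.
    actions = []
    for _ in range(n_dropout):
        mask_v = (torch.rand(()) > 0.5).float()
        mask_a = (torch.rand(()) > 0.5).float()
        actions.append(policy(torch.cat([vision * mask_v, audio * mask_a], dim=-1)))
    action = torch.stack(actions).var(dim=0).mean(dim=-1)

    return perception + action                # reward both kinds of incongruity

print(intrinsic_reward(torch.randn(2, 32), torch.randn(2, 32)))
```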

Posted Content•
TL;DR: DenseTNT as discussed by the authors proposes an anchor-free and end-to-end trajectory prediction model that directly outputs a set of trajectories from dense goal candidates and uses an offline optimization-based technique to provide multi-future pseudo-labels.
Abstract: Due to the stochasticity of human behaviors, predicting the future trajectories of road agents is challenging for autonomous driving. Recently, goal-based multi-trajectory prediction methods are proved to be effective, where they first score over-sampled goal candidates and then select a final set from them. However, these methods usually involve goal predictions based on sparse pre-defined anchors and heuristic goal selection algorithms. In this work, we propose an anchor-free and end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates. In addition, we introduce an offline optimization-based technique to provide multi-future pseudo-labels for our final online model. Experiments show that DenseTNT achieves state-of-the-art performance, ranking 1st on the Argoverse motion forecasting benchmark and being the 1st place winner of the 2021 Waymo Open Dataset Motion Prediction Challenge.



Posted Content•
Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
TL;DR: Neural Dubber as mentioned in this paper is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech; an image-based speaker embedding module developed for the multi-speaker setting further enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face.
Abstract: Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.


Posted Content•
TL;DR: In this paper, a top-down approach is proposed for multi-camera 3D object detection, which extracts 2D features from multiple camera images and then uses a sparse set of 3D objects queries to index into these features, linking 3D positions to multi-view images using camera transformation matrices.
Abstract: We introduce a framework for multi-camera 3D object detection. In contrast to existing works, which estimate 3D bounding boxes directly from monocular images or use depth prediction networks to generate input for 3D object detection from 2D information, our method manipulates predictions directly in 3D space. Our architecture extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these 2D features, linking 3D positions to multi-view images using camera transformation matrices. Finally, our model makes a bounding box prediction per object query, using a set-to-set loss to measure the discrepancy between the ground-truth and the prediction. This top-down approach outperforms its bottom-up counterpart in which object bounding box prediction follows per-pixel depth estimation, since it does not suffer from the compounding error introduced by a depth prediction model. Moreover, our method does not require post-processing such as non-maximum suppression, dramatically improving inference speed. We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
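The query-to-image link the abstract describes can be sketched in a few lines: project each query's 3D reference point into every camera with its projection matrix, bilinearly sample the 2D feature map there, and fuse across cameras. Matrices, feature shapes, and the averaging step are illustrative assumptions, not the model's exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_image_features(ref_points, feat_maps, proj_mats):
    """ref_points: (Q, 3) 3D reference points, one per object query.
    feat_maps: (N, C, H, W) per-camera feature maps; proj_mats: (N, 3, 4)."""
    Q = ref_points.size(0)
    homo = torch.cat([ref_points, torch.ones(Q, 1)], dim=-1)       # homogeneous (Q, 4)
    gathered = []
    for feat, P in zip(feat_maps, proj_mats):
        uvw = homo @ P.T                                           # project to image plane
        uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-5)               # pixel coordinates
        grid = uv / torch.tensor([feat.size(-1), feat.size(-2)]) * 2 - 1
        sampled = F.grid_sample(feat[None], grid[None, :, None],   # bilinear sampling
                                align_corners=False)
        gathered.append(sampled[0, :, :, 0].T)                     # (Q, C) per camera
    return torch.stack(gathered).mean(dim=0)                       # fuse over cameras

feats = sample_image_features(torch.rand(10, 3) * 5,
                              torch.rand(6, 64, 32, 88),
                              torch.rand(6, 3, 4))
print(feats.shape)                                                  # (10, 64)
```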

Posted Content•
TL;DR: Li et al. as mentioned in this paper proposed a local semantic map learning method, dubbed HDMapNet, which dynamically constructs vectorized semantics from onboard sensor observations and predicts vectorized map elements in the bird's-eye view.
Abstract: Estimating local semantics from sensory inputs is a central component for high-definition map constructions in autonomous driving. However, traditional pipelines require a vast amount of human efforts and resources in annotating and maintaining the semantics in the map, which limits its scalability. In this paper, we introduce the problem of local semantic map learning, which dynamically constructs the vectorized semantics based on onboard sensor observations. Meanwhile, we introduce a local semantic map learning method, dubbed HDMapNet. HDMapNet encodes image features from surrounding cameras and/or point clouds from LiDAR, and predicts vectorized map elements in the bird's-eye view. We benchmark HDMapNet on nuScenes dataset and show that in all settings, it performs better than baseline methods. Of note, our fusion-based HDMapNet outperforms existing methods by more than 50% in all metrics. In addition, we develop semantic-level and instance-level metrics to evaluate the map learning performance. Finally, we showcase our method is capable of predicting a locally consistent map. By introducing the method and metrics, we invite the community to study this novel map learning problem. Code and evaluation kit will be released to facilitate future development.
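The abstract emphasizes new semantic-level and instance-level metrics. A semantic-level metric of the kind described can be sketched as per-class IoU between predicted and ground-truth rasterized BEV label maps (instance-level matching, e.g. via Chamfer distance, would sit on top). This is an illustration, not the released evaluation kit.

```python
import numpy as np

def bev_semantic_iou(pred, gt, n_classes=3):
    """pred, gt: (H, W) integer BEV label maps; returns per-class IoU."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union else float("nan"))
    return ious

pred = np.random.randint(0, 3, size=(200, 200))
gt = np.random.randint(0, 3, size=(200, 200))
print(bev_semantic_iou(pred, gt))
```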

Posted Content•
Bowen Li, Yiming Li, Junjie Ye, Changhong Fu, Hang Zhao 
TL;DR: In this article, a new predictive visual tracking baseline is developed to compensate for the latency stemming from the onboard computation, which can provide a more realistic evaluation of the trackers for the robotic applications.
Abstract: As a crucial robotic perception capability, visual tracking has been intensively studied recently. In the real-world scenarios, the onboard processing time of the image streams inevitably leads to a discrepancy between the tracking results and the real-world states. However, existing visual tracking benchmarks commonly run the trackers offline and ignore such latency in the evaluation. In this work, we aim to deal with a more realistic problem of latency-aware tracking. The state-of-the-art trackers are evaluated in the aerial scenarios with new metrics jointly assessing the tracking accuracy and efficiency. Moreover, a new predictive visual tracking baseline is developed to compensate for the latency stemming from the onboard computation. Our latency-aware benchmark can provide a more realistic evaluation of the trackers for the robotic applications. Besides, exhaustive experiments have proven the effectiveness of the proposed predictive visual tracking baseline approach.
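The latency-compensation idea can be sketched with a constant-velocity extrapolation: when the tracker's result arrives it describes an already-stale frame, so a simple motion model pushes the box forward to the current timestamp. The constant-velocity choice and the numbers below are assumptions for illustration, not the paper's predictive module.

```python
import numpy as np

def compensate_latency(box_t, box_prev, dt_result, latency):
    """box_t, box_prev: [cx, cy, w, h] from the two most recent processed frames,
    dt_result: time between them, latency: how stale box_t already is."""
    velocity = (np.asarray(box_t) - np.asarray(box_prev)) / dt_result
    return np.asarray(box_t) + velocity * latency    # extrapolate the box to "now"

# Tracker output refers to a frame captured 80 ms ago; processed frames are 33 ms apart.
stale = [120.0, 85.0, 40.0, 60.0]
prev = [117.0, 84.0, 40.0, 60.0]
print(compensate_latency(stale, prev, dt_result=0.033, latency=0.080))
```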