
Showing papers by "Hang Zhao published in 2020"


Proceedings Article • DOI
14 Jun 2020
TL;DR: In this paper, a large-scale, high-quality, and diverse self-driving dataset is presented, consisting of LiDAR and camera data captured across a range of urban and suburban geographies.
Abstract: The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community’s contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes that each span 20 seconds, consisting of well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.

789 citations
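A minimal sketch of iterating over one released scene, assuming the public waymo-open-dataset package and TensorFlow; the file name is hypothetical and exact proto fields may differ across dataset releases.

# Minimal sketch: iterating the frames of one Waymo Open Dataset segment.
# Assumes the public `waymo-open-dataset` package and TensorFlow are installed.
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

segment_path = "segment-XXXX_with_camera_labels.tfrecord"  # hypothetical filename
dataset = tf.data.TFRecordDataset(segment_path, compression_type="")

for raw in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(raw.numpy()))
    # Each frame carries synchronized, calibrated sensor data and annotations.
    print(frame.context.name,
          len(frame.images),        # camera images; 2D boxes live in frame.camera_labels
          len(frame.lasers),        # LiDAR returns
          len(frame.laser_labels))  # 3D bounding boxes with consistent tracking IDs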


Posted Content
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, Cordelia Schmid
TL;DR: VectorNet is introduced, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components, achieving state-of-the-art performance on the Argoverse dataset.
Abstract: Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird-eye images and encode them with convolutional neural networks (ConvNets), our approach operates on a vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also outperforms the state of the art on the Argoverse dataset.

276 citations
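As a reading aid, here is a minimal PyTorch sketch of the two-level structure the abstract describes: a subgraph encodes the vectors of each polyline (road component or agent trajectory), and a global interaction module applies self-attention across polyline features. Layer sizes, the pooling choice, and the attention module are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of a VectorNet-style hierarchy (not the authors' code).
# Assumes input of shape [num_polylines, num_vectors, vector_dim], where each
# polyline is one road component (lane, crosswalk) or one agent trajectory.
import torch
import torch.nn as nn

class PolylineSubgraph(nn.Module):
    """Encode the vectors of one polyline and pool them into a single feature."""
    def __init__(self, vector_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vector_dim, hidden_dim),
                                 nn.LayerNorm(hidden_dim), nn.ReLU())
    def forward(self, vectors):                      # [P, V, vector_dim]
        node_feats = self.mlp(vectors)                # per-vector features
        return node_feats.max(dim=1).values           # max-pool over vectors -> [P, hidden]

class GlobalInteractionGraph(nn.Module):
    """Model high-order interactions among all polyline features with self-attention."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
    def forward(self, polyline_feats):                # [P, hidden]
        x = polyline_feats.unsqueeze(0)               # add batch dim -> [1, P, hidden]
        out, _ = self.attn(x, x, x)
        return out.squeeze(0)                         # context-aware polyline features

subgraph = PolylineSubgraph(vector_dim=9, hidden_dim=64)
global_graph = GlobalInteractionGraph(hidden_dim=64)
vectors = torch.randn(12, 19, 9)                      # 12 polylines, 19 vectors each
context_feats = global_graph(subgraph(vectors))       # [12, 64], fed to a prediction head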


Posted Content
TL;DR: The key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states, which leads to the target-driven trajectory prediction (TNT) framework.
Abstract: Predicting the future behavior of moving agents is essential for real world applications. It is challenging as the intent of the agent and the corresponding behavior is unknown and intrinsically multimodal. Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states. This leads to our target-driven trajectory prediction (TNT) framework. TNT has three stages which are trained end-to-end. It first predicts an agent's potential target states $T$ steps into the future, by encoding its interactions with the environment and the other agents. TNT then generates trajectory state sequences conditioned on targets. A final stage estimates trajectory likelihoods and a final compact set of trajectory predictions is selected. This is in contrast to previous work which models agent intents as latent variables, and relies on test-time sampling to generate diverse trajectories. We benchmark TNT on trajectory prediction of vehicles and pedestrians, where we outperform state-of-the-art on Argoverse Forecasting, INTERACTION, Stanford Drone and an in-house Pedestrian-at-Intersection dataset.

211 citations
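A hedged sketch of the three-stage flow described above, with invented head names and shapes just to make the data flow concrete; the paper's actual target sampling, decoders, and training losses are richer than this.

# Hedged sketch of the TNT data flow (module names and shapes are assumptions).
import torch
import torch.nn as nn

class TNTSketch(nn.Module):
    def __init__(self, ctx_dim=64, horizon=30, top_k=6):
        super().__init__()
        self.top_k = top_k
        # Stage 1: score candidate target states (e.g. points sampled near lanes).
        self.target_head = nn.Sequential(nn.Linear(ctx_dim + 2, 64), nn.ReLU(),
                                         nn.Linear(64, 1))
        # Stage 2: decode a trajectory conditioned on the scene context and one target.
        self.traj_head = nn.Sequential(nn.Linear(ctx_dim + 2, 128), nn.ReLU(),
                                       nn.Linear(128, horizon * 2))
        # Stage 3: estimate a likelihood for each decoded trajectory.
        self.score_head = nn.Sequential(nn.Linear(ctx_dim + horizon * 2, 64), nn.ReLU(),
                                        nn.Linear(64, 1))

    def forward(self, context, candidates):
        # context:    [ctx_dim] scene/agent encoding (e.g. from a VectorNet-style encoder)
        # candidates: [N, 2] candidate target locations T steps into the future
        ctx = context.expand(candidates.size(0), -1)
        target_logits = self.target_head(torch.cat([ctx, candidates], -1)).squeeze(-1)
        top = target_logits.topk(self.top_k).indices               # keep the most likely targets
        trajs = self.traj_head(torch.cat([ctx[top], candidates[top]], -1))
        scores = self.score_head(torch.cat([ctx[top], trajs], -1)).squeeze(-1)
        return trajs.view(self.top_k, -1, 2), scores.softmax(-1)   # compact trajectory set

model = TNTSketch()
trajs, probs = model(torch.randn(64), torch.randn(50, 2))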


Proceedings Article • DOI
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, Cordelia Schmid
14 Jun 2020
TL;DR: VectorNet, as presented in this paper, is a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all the components.
Abstract: Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird-eye images and encode them with convolutional neural networks (ConvNets), our approach operates on the primitive vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also obtains state-of-the-art performance on the Argoverse dataset.

210 citations


Proceedings Article • DOI
14 Jun 2020
TL;DR: This work proposes "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music, which adopts a context-aware graph network to integrate visual semantic context with body dynamics and applies an audio-visual fusion model to associate body movements with the corresponding audio signals.
Abstract: Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical flow like motion feature representations, which exhibit limited abilities to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same types, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements upon benchmark metrics for hetero-musical separation tasks (i.e. different instruments); 2) new ability for effective homo-musical separation for piano, flute, and trumpet duets, which to our best knowledge has never been achieved with alternative methods.

191 citations
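A hedged sketch of the pipeline described above: per-frame keypoints are aggregated into a motion feature (a simple pooling-plus-GRU stand-in for the paper's context-aware graph network), which is then fused with the mixture spectrogram to predict a separation mask. Shapes, module sizes, and the fusion operator are assumptions.

# Hedged sketch of a keypoint-conditioned audio-visual separation model.
# Pooling over joints stands in for the paper's context-aware graph network.
import torch
import torch.nn as nn

class KeypointGraphNet(nn.Module):
    """Aggregate per-frame body/finger keypoints into a motion feature sequence."""
    def __init__(self, num_joints=25, feat_dim=64):
        super().__init__()
        self.joint_mlp = nn.Sequential(nn.Linear(2, feat_dim), nn.ReLU())
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
    def forward(self, keypoints):                          # [B, T, num_joints, 2] (x, y)
        joint_feats = self.joint_mlp(keypoints).mean(dim=2)  # pool over joints -> [B, T, F]
        motion, _ = self.temporal(joint_feats)
        return motion                                       # [B, T, F]

class AudioVisualMask(nn.Module):
    """Fuse motion features with the mixture spectrogram to predict a separation mask."""
    def __init__(self, n_freq=256, feat_dim=64):
        super().__init__()
        self.audio_enc = nn.Linear(n_freq, feat_dim)
        self.mask_head = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, n_freq), nn.Sigmoid())
    def forward(self, mixture_spec, motion):                # [B, T, n_freq], [B, T, feat_dim]
        a = self.audio_enc(mixture_spec)
        mask = self.mask_head(torch.cat([a, motion], dim=-1))
        return mask * mixture_spec                          # separated spectrogram for one player

body_net, sep_net = KeypointGraphNet(), AudioVisualMask()
spec = torch.randn(1, 100, 256).abs()                       # mixture magnitude spectrogram
pose = torch.randn(1, 100, 25, 2)                           # tracked keypoints for one musician
separated = sep_net(spec, body_net(pose))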


Posted Content
TL;DR: This work shows that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects.
Abstract: We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including methods that require semantic input. Code is at this https URL .

54 citations
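The two priors stated above map naturally onto loss terms. Below is a hedged sketch of generic sparsity and piecewise-constancy penalties on a predicted per-pixel 3D translation field; the exact functional forms and weights used in the paper may differ.

# Hedged sketch of regularizers for a dense 3D object-translation field.
# `trans_field` is a per-pixel 3D translation map predicted alongside depth/ego-motion.
import torch

def sparsity_loss(trans_field):
    """Encourage most of the field to be zero (most of the scene is static)."""
    # trans_field: [B, 3, H, W]
    return trans_field.abs().mean()

def constancy_loss(trans_field):
    """Encourage the field to be piecewise constant (rigid objects move as a unit)."""
    dx = (trans_field[..., :, 1:] - trans_field[..., :, :-1]).abs().mean()
    dy = (trans_field[..., 1:, :] - trans_field[..., :-1, :]).abs().mean()
    return dx + dy

trans_field = torch.randn(2, 3, 128, 416, requires_grad=True)
photometric = torch.tensor(0.0)          # placeholder for the photometric consistency term
loss = photometric + 0.1 * sparsity_loss(trans_field) + 0.05 * constancy_loss(trans_field)
loss.backward()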


Posted Content
TL;DR: This paper proposes "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music; it adopts a context-aware graph network to integrate visual semantic context with body dynamics, and applies an audio-visual fusion model to associate body movements with the corresponding audio signals.
Abstract: Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical flow like motion feature representations, which exhibit limited abilities to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same types, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements upon benchmark metrics for hetero-musical separation tasks (i.e. different instruments); 2) new ability for effective homo-musical separation for piano, flute, and trumpet duets, which to our best knowledge has never been achieved with alternative methods. Project page: this http URL.

32 citations


30 Oct 2020
TL;DR: This paper proposes a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision.
Abstract: We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including methods that require semantic input. Code is at this https URL .

20 citations


Proceedings Article • DOI
01 Mar 2020
TL;DR: Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that the AlignNet method far outperforms the state-of-the-art methods.
Abstract: We present AlignNet, a model that synchronizes videos with reference audios under non-uniform and irregular misalignments. AlignNet learns the end-to-end dense correspondence between each frame of a video and an audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and affinity function. Together with the model, we release a dancing dataset Dance50 for training and evaluation. Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Code, dataset and sample videos are available at our project page.

17 citations
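A hedged sketch of the affinity-and-warping idea in the abstract: per-frame video and audio features are compared with a dot-product affinity, and the audio timeline is soft-warped onto the video timeline. The pyramidal, coarse-to-fine processing is omitted, and all shapes and names are assumptions.

# Hedged sketch of dense audio-video alignment via an affinity matrix and soft warping.
import torch
import torch.nn.functional as F

def soft_align(video_feats, audio_feats):
    # video_feats: [Tv, D] per-video-frame features; audio_feats: [Ta, D] per-audio-step features
    affinity = video_feats @ audio_feats.t() / video_feats.size(1) ** 0.5   # [Tv, Ta]
    attn = F.softmax(affinity, dim=1)          # per video frame, a distribution over audio steps
    warped_audio = attn @ audio_feats          # audio features warped onto the video timeline
    # A soft estimate of the audio index matched to each video frame:
    positions = attn @ torch.arange(audio_feats.size(0), dtype=torch.float).unsqueeze(1)
    return warped_audio, positions.squeeze(1)  # [Tv, D], [Tv]

video = torch.randn(50, 128)
audio = torch.randn(200, 128)
warped, frame_to_audio = soft_align(video, audio)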


Posted Content
TL;DR: AlignNet as discussed by the authors learns the end-to-end dense correspondence between each frame of a video and an audio by using attention, pyramidal processing, warping, and affinity function.
Abstract: We present AlignNet, a model that synchronizes videos with reference audios under non-uniform and irregular misalignments. AlignNet learns the end-to-end dense correspondence between each frame of a video and an audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and affinity function. Together with the model, we release a dancing dataset Dance50 for training and evaluation. Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Project video and code are available at this https URL.

16 citations


19 Aug 2020
TL;DR: Target-Driven Trajectory Prediction (TNT) as mentioned in this paper predicts an agent's potential target states $T$ steps into the future, by encoding its interactions with the environment and the other agents.
Abstract: Predicting the future behavior of moving agents is essential for real world applications. It is challenging as the intent of the agent and the corresponding behavior is unknown and intrinsically multimodal. Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states. This leads to our target-driven trajectory prediction (TNT) framework. TNT has three stages which are trained end-to-end. It first predicts an agent's potential target states $T$ steps into the future, by encoding its interactions with the environment and the other agents. TNT then generates trajectory state sequences conditioned on targets. A final stage estimates trajectory likelihoods and a final compact set of trajectory predictions is selected. This is in contrast to previous work which models agent intents as latent variables, and relies on test-time sampling to generate diverse trajectories. We benchmark TNT on trajectory prediction of vehicles and pedestrians, where we outperform state-of-the-art on Argoverse Forecasting, INTERACTION, Stanford Drone and an in-house Pedestrian-at-Intersection dataset.

Posted Content
TL;DR: CVC, a contrastive learning-based adversarial model for voice conversion, requires only one-way GAN training for non-parallel one-to-one voice conversion, while improving speech quality and reducing training time.
Abstract: Cycle consistent generative adversarial network (CycleGAN) and variational autoencoder (VAE) based models have gained popularity in non-parallel voice conversion recently. However, they often suffer from a difficult training process and unsatisfactory results. In this paper, we propose CVC, a contrastive learning-based adversarial approach for voice conversion. Compared to previous CycleGAN-based methods, CVC only requires an efficient one-way GAN training by taking advantage of contrastive learning. When it comes to non-parallel one-to-one voice conversion, CVC is on par with or better than CycleGAN and VAE while effectively reducing training time. CVC further demonstrates superior performance in many-to-one voice conversion, enabling the conversion from unseen speakers.
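A hedged sketch of a patch-wise contrastive (InfoNCE-style) objective of the kind the abstract alludes to for one-way training: frame embeddings of the converted speech are pulled toward the same time index of the source speech and pushed away from other frames. The encoder, temperature, and sampling scheme are assumptions, not CVC's exact formulation.

# Hedged sketch of a frame-wise InfoNCE loss for one-way (non-cycle) voice conversion.
# `src_feats`/`out_feats` are frame-level embeddings of the source and converted speech
# from some shared encoder; the actual CVC objective may be structured differently.
import torch
import torch.nn.functional as F

def frame_nce_loss(src_feats, out_feats, temperature=0.07):
    # src_feats, out_feats: [T, D] frame-level embeddings, aligned in time
    src = F.normalize(src_feats, dim=1)
    out = F.normalize(out_feats, dim=1)
    logits = out @ src.t() / temperature        # [T, T]: each output frame vs. all source frames
    targets = torch.arange(src.size(0))         # the positive is the same time index
    return F.cross_entropy(logits, targets)

src = torch.randn(120, 256)
converted = torch.randn(120, 256)
adv_loss = torch.tensor(0.0)                    # placeholder for the one-way GAN loss
total = adv_loss + frame_nce_loss(src, converted)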

Posted Content
TL;DR: This report summarizes challenges organized to find state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing, and introduces a new evaluation metric, the IoU curve proposed by Zhang et al. (2020), to measure the quality of the generated object localization maps.
Abstract: Learning from imperfect data becomes an issue in many industrial applications after the research community has made profound progress in supervised learning from perfectly annotated datasets. The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate the research in developing novel approaches that would harness the imperfect data and improve the data-efficiency during training. A massive amount of user-generated data is nowadays available on multiple internet services. How to leverage those data and improve the machine learning models is a high impact problem. We organize the challenges in conjunction with the workshop. The goal of these challenges is to find the state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing. There are three tracks in the challenge, i.e., weakly supervised semantic segmentation (Track 1), weakly supervised scene parsing (Track 2), and weakly supervised object localization (Track 3). In Track 1, based on ILSVRC DET, we provide pixel-level annotations of 15K images from 200 categories for evaluation. In Track 2, we provide point-based annotations for the training set of ADE20K. In Track 3, based on ILSVRC CLS-LOC, we provide pixel-level annotations of 44,271 images for evaluation. Besides, we further introduce a new evaluation metric proposed by Zhang et al. (2020), i.e., the IoU curve, to measure the quality of the generated object localization maps. This technical report summarizes the highlights from the challenge. The challenge submission server and the leaderboard will remain open for researchers who are interested. More details regarding the challenge and the benchmarks are available at this https URL
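The report only names the IoU-curve metric; one plausible reading (an assumption, not the official definition) is to sweep a threshold over a localization heatmap and record the IoU between the binarized region and the ground-truth box mask at each threshold:

# Hedged sketch of one plausible reading of the "IoU curve" metric; the official
# definition in the cited work may differ.
import numpy as np

def iou_curve(heatmap, gt_box, thresholds=np.linspace(0.05, 0.95, 19)):
    # heatmap: [H, W] localization map in [0, 1]; gt_box: (x1, y1, x2, y2) in pixels
    h, w = heatmap.shape
    x1, y1, x2, y2 = gt_box
    gt_mask = np.zeros((h, w), dtype=bool)
    gt_mask[y1:y2, x1:x2] = True
    curve = []
    for t in thresholds:
        pred_mask = heatmap >= t
        inter = np.logical_and(pred_mask, gt_mask).sum()
        union = np.logical_or(pred_mask, gt_mask).sum()
        curve.append(inter / union if union else 0.0)
    return thresholds, np.array(curve)

heatmap = np.random.rand(224, 224)          # stand-in for a generated localization map
ths, ious = iou_curve(heatmap, gt_box=(60, 60, 160, 180))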

Posted Content
TL;DR: This work proposes to learn forward and inverse dynamics in a fully unsupervised manner via contrastive estimation in the feature space of states and actions with data collected from random exploration.
Abstract: Developing agents that can perform complex control tasks from high dimensional observations such as pixels is challenging due to difficulties in learning dynamics efficiently. In this work, we propose to learn forward and inverse dynamics in a fully unsupervised manner via contrastive estimation. Specifically, we train a forward dynamics model and an inverse dynamics model in the feature space of states and actions with data collected from random exploration. Unlike most existing deterministic models, our energy-based model takes into account the stochastic nature of agent-environment interactions. We demonstrate the efficacy of our approach across a variety of tasks including goal-directed planning and imitation from observations. Project videos and code are at this https URL.
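A hedged sketch of contrastive (InfoNCE-style) estimation of a forward dynamics model in feature space, with the inverse model reduced to a simple regression head; encoders, dimensions, and losses are illustrative assumptions, not the paper's energy-based formulation.

# Hedged sketch of contrastive forward/inverse dynamics in feature space.
# Encoders and heads are toy MLPs; the paper's architecture and energy model differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_enc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
forward_model = nn.Sequential(nn.Linear(32 + 4, 128), nn.ReLU(), nn.Linear(128, 32))
inverse_model = nn.Sequential(nn.Linear(32 + 32, 128), nn.ReLU(), nn.Linear(128, 4))

def contrastive_dynamics_loss(s, a, s_next, temperature=0.1):
    # s, s_next: [B, 64] observation features (e.g. from pixels); a: [B, 4] actions
    z, z_next = state_enc(s), state_enc(s_next)
    z_pred = forward_model(torch.cat([z, a], dim=1))
    logits = F.normalize(z_pred, dim=1) @ F.normalize(z_next, dim=1).t() / temperature
    labels = torch.arange(s.size(0))              # positive pair: same transition index
    nce = F.cross_entropy(logits, labels)         # other transitions in the batch are negatives
    a_hat = inverse_model(torch.cat([z, z_next], dim=1))
    return nce + F.mse_loss(a_hat, a)             # inverse dynamics as simple regression here

s, a, s_next = torch.randn(32, 64), torch.randn(32, 4), torch.randn(32, 64)
loss = contrastive_dynamics_loss(s, a, s_next)
loss.backward()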

01 Jan 2020
TL;DR: The authors propose to learn forward and inverse dynamics in a fully unsupervised manner via contrastive estimation, and demonstrate the efficacy of their approach across a variety of tasks including goal-directed planning and imitation from observations.
Abstract: Developing agents that can perform complex control tasks from high dimensional observations such as pixels is challenging due to difficulties in learning dynamics efficiently. In this work, we propose to learn forward and inverse dynamics in a fully unsupervised manner via contrastive estimation. Specifically, we train a forward dynamics model and an inverse dynamics model in the feature space of states and actions with data collected from random exploration. Unlike most existing deterministic models, our energy-based model takes into account the stochastic nature of agent-environment interactions. We demonstrate the efficacy of our approach across a variety of tasks including goal-directed planning and imitation from observations. Project videos and code are at this https URL.