scispace - formally typeset
Author

Hang Zhao

Other affiliations: Zhejiang University, Nvidia, New York University
Bio: Hang Zhao is an academic researcher at Tsinghua University. His research spans computer science and engineering. He has an h-index of 32 and has co-authored 83 publications receiving 12,696 citations. Previous affiliations include Zhejiang University and Nvidia.

Papers published on a yearly basis

Papers
Book ChapterDOI
08 Sep 2018
TL;DR: PixelPlayer learns to locate image regions that produce sound and to separate the input audio into a set of components representing the sound from each pixel, enabling applications such as adjusting the volume of individual sound sources.
Abstract: We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources.
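The Mix-and-Separate framework trains on mixtures it creates itself: mix the audio of two videos, then ask the model to recover each source from the mixture. A minimal numpy sketch of that self-supervision signal (the ideal ratio mask stands in for the vision-conditioned network's predicted mask; all names here are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
spec_a = np.abs(rng.normal(size=(64, 100)))  # magnitude spectrogram, source A
spec_b = np.abs(rng.normal(size=(64, 100)))  # magnitude spectrogram, source B
mixture = spec_a + spec_b                    # the self-supervised training input

# Ratio mask per source: the mixture supplies the separation target for free,
# with no manual labels -- the core of the Mix-and-Separate idea.
mask_a = spec_a / (mixture + 1e-8)
recovered_a = mask_a * mixture

print(np.allclose(recovered_a, spec_a, atol=1e-4))  # → True
```

In the full system the mask is predicted from visual features of each pixel, which is what grounds sound sources in image regions.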

341 citations

Posted Content
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, Cordelia Schmid
TL;DR: VectorNet is introduced, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components and obtains state-of-the-art performance on the Argoverse dataset.
Abstract: Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird's-eye view images and encode them with convolutional neural networks (ConvNets), our approach operates on a vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also outperforms the state of the art on the Argoverse dataset.
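The vector representation replaces rasterized map images: each lane or trajectory becomes a short sequence of start/end vectors, and a polyline-level feature is built with a shared MLP plus permutation-invariant pooling. A toy numpy sketch of that input encoding (the single-layer "MLP", feature sizes, and attribute layout are assumptions for illustration, not VectorNet's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def polyline_to_vectors(points, poly_id):
    """Turn an (N, 2) polyline into N-1 vectors [x_s, y_s, x_e, y_e, id]."""
    starts, ends = points[:-1], points[1:]
    ids = np.full((len(starts), 1), poly_id, dtype=float)
    return np.concatenate([starts, ends, ids], axis=1)

W = rng.normal(size=(5, 16))  # toy single-layer "MLP" weights

def polyline_feature(vectors):
    hidden = np.maximum(vectors @ W, 0.0)  # shared ReLU MLP per vector
    return hidden.max(axis=0)              # permutation-invariant max-pool

lane = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.5], [3.0, 1.0]])
vecs = polyline_to_vectors(lane, poly_id=0)
feat = polyline_feature(vecs)
print(vecs.shape, feat.shape)  # → (3, 5) (16,)
```

The polyline features would then interact through a global graph (attention over all polylines), which is where the "high-order interactions" in the abstract are modeled.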

276 citations

Proceedings ArticleDOI
07 Aug 2018
TL;DR: RF-Pose3D is the first system to infer 3D human skeletons from RF signals; it requires no sensors on the body, works with multiple people and across walls and occlusions, and generates dynamic skeletons that follow people as they move, walk, or sit.
Abstract: This paper introduces RF-Pose3D, the first system that infers 3D human skeletons from RF signals. It requires no sensors on the body, and works with multiple people and across walls and occlusions. Further, it generates dynamic skeletons that follow the people as they move, walk or sit. As such, RF-Pose3D provides a significant leap in RF-based sensing and enables new applications in gaming, healthcare, and smart homes. RF-Pose3D is based on a novel convolutional neural network (CNN) architecture that performs high-dimensional convolutions by decomposing them into low-dimensional operations. This property allows the network to efficiently condense the spatio-temporal information in RF signals. The network first zooms in on the individuals in the scene, and crops the RF signals reflected off each person. For each individual, it localizes and tracks their body parts - head, shoulders, arms, wrists, hip, knees, and feet. Our evaluation results show that RF-Pose3D tracks each keypoint on the human body with an average error of 4.2 cm, 4.0 cm, and 4.9 cm along the X, Y, and Z axes respectively. It maintains this accuracy even in the presence of multiple people, and in new environments that it has not seen in the training set. Demo videos are available at our website: http://rfpose3d.csail.mit.edu.
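The key architectural trick described above is decomposing high-dimensional convolutions into low-dimensional ones. A generic numpy illustration of why such decompositions are valid and cheap (this is the standard separable-kernel identity, not the paper's actual network): a 2-D convolution with a rank-1 kernel equals a 1-D convolution along rows followed by one along columns.

```python
import numpy as np

def conv_valid_1d(x, k):
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def conv_valid_2d(img, K):
    h = img.shape[0] - K.shape[0] + 1
    w = img.shape[1] - K.shape[1] + 1
    return np.array([[np.sum(img[i:i + K.shape[0], j:j + K.shape[1]] * K)
                      for j in range(w)] for i in range(h)])

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
k_row, k_col = rng.normal(size=3), rng.normal(size=3)
K = np.outer(k_row, k_col)               # separable (rank-1) 2-D kernel

direct = conv_valid_2d(img, K)           # full 2-D convolution: k*k work/site
rows = np.array([conv_valid_1d(r, k_col) for r in img])            # 1-D pass 1
separable = np.array([conv_valid_1d(c, k_row) for c in rows.T]).T  # 1-D pass 2

print(np.allclose(direct, separable))  # → True, at 2k instead of k*k cost
```

The same principle, applied to the 4-D spatio-temporal RF tensors, is what lets the network condense RF reflections efficiently.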

258 citations

Proceedings ArticleDOI
11 Apr 2019
TL;DR: Quantitative and qualitative evaluations show that, compared to previous models relying on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.
Abstract: Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the task of sound localization and separation. Our system is composed of an end-to-end learnable model called Deep Dense Trajectory (DDT) and a curriculum learning scheme. It exploits the inherent coherence of audio-visual signals from large quantities of unlabeled videos. Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, our motion-based system improves performance in separating musical instrument sounds. Furthermore, it separates sound components from duets of the same category of instruments, a challenging problem that has not been addressed before.
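A loose sketch of the dense-trajectory idea underlying the motion cue (assumed details for illustration only, not the paper's learned DDT model): starting from a grid of points, each frame's displacement field advects the points, and the accumulated displacements form a per-point trajectory feature.

```python
import numpy as np

def track(points, flows):
    """points: (N, 2) initial positions; flows: list of (H, W, 2) fields."""
    traj = [points.copy()]
    for flow in flows:
        idx = np.round(traj[-1]).astype(int)          # nearest-pixel lookup
        idx[:, 0] = np.clip(idx[:, 0], 0, flow.shape[0] - 1)
        idx[:, 1] = np.clip(idx[:, 1], 0, flow.shape[1] - 1)
        traj.append(traj[-1] + flow[idx[:, 0], idx[:, 1]])
    return np.stack(traj)                             # (T+1, N, 2) trajectories

# Synthetic test: constant rightward flow of 1 px/frame over 4 frames.
flows = [np.tile([0.0, 1.0], (8, 8, 1)) for _ in range(4)]
pts = np.array([[2.0, 2.0], [5.0, 1.0]])
traj = track(pts, flows)
print(traj[-1])  # each point has moved 4 px along the second axis
```

In the full system such trajectories are learned end-to-end and aligned with the audio, so that instruments moving differently can be separated even when they look alike.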

246 citations

Proceedings ArticleDOI
02 Jun 2013
TL;DR: This work shows that New York City is a multipath-rich environment when using highly directional steerable horn antennas, with an average of 2.5 signal lobes at any receiver location, and proposes a new lobe modeling technique for building a statistical channel model of lobe path loss and shadow fading.
Abstract: Propagation measurements at 28 GHz were conducted in outdoor urban environments in New York City using four different transmitter locations and 83 receiver locations with distances of up to 500 m. A 400 megachip-per-second channel sounder with steerable 24.5 dBi horn antennas at the transmitter and receiver was used to measure the angular distributions of received multipath power over a wide range of propagation distances and urban settings. Measurements were also made to study the small-scale fading of closely-spaced power delay profiles recorded at half-wavelength (5.35 mm) increments along a small-scale linear track (10 wavelengths, or 107 mm) at two different receiver locations. Our measurements indicate that power levels for small-scale fading do not significantly fluctuate from the mean power level at a fixed angle of arrival. We propose here a new lobe modeling technique that can be used to create a statistical channel model for lobe path loss and shadow fading, and we provide many model statistics as a function of transmitter-receiver separation distance. Our work shows that New York City is a multipath-rich environment when using highly directional steerable horn antennas, and that an average of 2.5 signal lobes exists at any receiver location, where each lobe has an average total angle spread of 40.3° and an RMS angle spread of 7.8°. This work aims to create a 28 GHz statistical spatial channel model for future 5G cellular networks.
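Lobe path loss with shadow fading is typically modeled with the log-distance form PL(d) = PL(d0) + 10·n·log10(d/d0) + X_sigma, where X_sigma is zero-mean Gaussian shadowing in dB. A generic sketch with illustrative parameters (the reference loss of 61.4 dB is the 1 m free-space value at 28 GHz; the exponent and sigma are placeholders, not the paper's fitted lobe statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

def path_loss_db(d_m, d0_m=1.0, pl_d0_db=61.4, n=3.0, sigma_db=8.0, rng=rng):
    """Log-distance path loss with lognormal shadow fading (all in dB)."""
    shadow = rng.normal(0.0, sigma_db, size=np.shape(d_m))  # X_sigma term
    return pl_d0_db + 10.0 * n * np.log10(np.asarray(d_m) / d0_m) + shadow

dists = np.array([10.0, 100.0, 500.0])
print(path_loss_db(dists, sigma_db=0.0))  # deterministic part only
```

Setting sigma_db=0 exposes the distance-dependent trend; the measured lobe statistics in the paper would supply the actual n and sigma per lobe.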

239 citations


Cited by
Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper exploits global context information via different-region-based context aggregation through a pyramid pooling module, combined with the proposed pyramid scene parsing network (PSPNet), to produce high-quality results on the scene parsing task.
Abstract: Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
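The pyramid pooling module can be sketched in a few lines (simplified numpy version with assumed bin sizes; the real module also applies a 1x1 convolution per branch and bilinear upsampling): average-pool the feature map into several grid sizes, upsample each grid back to the input resolution, and concatenate with the original features so every position sees multi-scale global context.

```python
import numpy as np

def avg_pool_grid(feat, bins):
    """Average-pool a (C, H, W) feature map into a (C, bins, bins) grid."""
    c, h, w = feat.shape
    hs, ws = h // bins, w // bins
    out = np.zeros((c, bins, bins))
    for i in range(bins):
        for j in range(bins):
            out[:, i, j] = feat[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean(axis=(1, 2))
    return out

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
    c, h, w = feat.shape
    branches = [feat]
    for b in bin_sizes:
        pooled = avg_pool_grid(feat, b)
        # Nearest-neighbour upsample back to (H, W) by repetition.
        branches.append(pooled.repeat(h // b, axis=1).repeat(w // b, axis=2))
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 6, 6))
out = pyramid_pooling(feat)
print(out.shape)  # → (20, 6, 6)
```

The bins=1 branch is a global prior (one value per channel broadcast everywhere), which is what supplies the "global context information" the TL;DR refers to.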

10,189 citations

Journal ArticleDOI
TL;DR: This paper discusses all of these topics, identifying key challenges for future research and preliminary 5G standardization activities, while providing a comprehensive overview of the current literature, and in particular of the papers appearing in this special issue.
Abstract: What will 5G be? What it will not be is an incremental advance on 4G. The previous four generations of cellular technology have each been a major paradigm shift that has broken backward compatibility. Indeed, 5G will need to be a paradigm shift that includes very high carrier frequencies with massive bandwidths, extreme base station and device densities, and unprecedented numbers of antennas. However, unlike the previous four generations, it will also be highly integrative: tying any new 5G air interface and spectrum together with LTE and WiFi to provide universal high-rate coverage and a seamless user experience. To support this, the core network will also have to reach unprecedented levels of flexibility and intelligence, spectrum regulation will need to be rethought and improved, and energy and cost efficiencies will become even more critical considerations. This paper discusses all of these topics, identifying key challenges for future research and preliminary 5G standardization activities, while providing a comprehensive overview of the current literature, and in particular of the papers appearing in this special issue.

7,139 citations

Book ChapterDOI
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam
08 Sep 2018
TL;DR: This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
Abstract: Spatial pyramid pooling modules or encoder-decoder structures are used in deep neural networks for the semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at https://github.com/tensorflow/models/tree/master/research/deeplab.
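The depthwise separable convolution mentioned above factors a full convolution into a per-channel spatial filter plus a 1x1 channel mixer. A simplified numpy sketch (no atrous rate, toy sizes, illustrative only) showing both the operation and the parameter saving:

```python
import numpy as np

def depthwise_separable(x, dw, pw):
    """x: (C, H, W); dw: (C, k, k), one filter per channel; pw: (C_out, C)."""
    c, h, w = x.shape
    k = dw.shape[1]
    oh, ow = h - k + 1, w - k + 1
    mid = np.zeros((c, oh, ow))
    for ch in range(c):                       # depthwise: channels stay separate
        for i in range(oh):
            for j in range(ow):
                mid[ch, i, j] = np.sum(x[ch, i:i+k, j:j+k] * dw[ch])
    return np.einsum('oc,chw->ohw', pw, mid)  # pointwise: 1x1 channel mixing

c_in, c_out, k = 32, 64, 3
params_standard = c_out * c_in * k * k        # full convolution
params_separable = c_in * k * k + c_out * c_in
print(params_standard, params_separable)      # → 18432 2336

rng = np.random.default_rng(0)
y = depthwise_separable(rng.normal(size=(3, 6, 6)),
                        rng.normal(size=(3, 3, 3)),
                        rng.normal(size=(5, 3)))
print(y.shape)  # → (5, 4, 4)
```

Applying this factorization inside ASPP and the decoder is what makes the resulting encoder-decoder "faster and stronger" at comparable accuracy.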

7,113 citations

Journal ArticleDOI
TL;DR: The motivation for new mm-wave cellular systems, methodology, and hardware for measurements are presented and a variety of measurement results are offered that show 28 and 38 GHz frequencies can be used when employing steerable directional antennas at base stations and mobile devices.
Abstract: The global bandwidth shortage facing wireless carriers has motivated the exploration of the underutilized millimeter wave (mm-wave) frequency spectrum for future broadband cellular communication networks. There is, however, little knowledge about cellular mm-wave propagation in densely populated indoor and outdoor environments. Obtaining this information is vital for the design and operation of future fifth generation cellular networks that use the mm-wave spectrum. In this paper, we present the motivation for new mm-wave cellular systems, methodology, and hardware for measurements and offer a variety of measurement results that show 28 and 38 GHz frequencies can be used when employing steerable directional antennas at base stations and mobile devices.

6,708 citations

Posted Content
TL;DR: The proposed `DeepLabv3' system significantly improves over the previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
Abstract: In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed `DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
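Atrous (dilated) convolution spaces the filter taps `rate` samples apart, enlarging the field-of-view from k to k + (k-1)(rate-1) without adding parameters. A 1-D numpy sketch of the mechanism (illustrative, not the paper's implementation):

```python
import numpy as np

def atrous_conv1d(x, k, rate):
    """1-D atrous convolution: filter taps spaced `rate` samples apart."""
    span = (len(k) - 1) * rate + 1            # effective field-of-view
    n = len(x) - span + 1
    taps = np.arange(len(k)) * rate           # dilated tap positions
    return np.array([np.dot(x[i + taps], k) for i in range(n)])

x = np.arange(10, dtype=float)
k = np.array([1.0, 1.0, 1.0])

dense = atrous_conv1d(x, k, rate=1)    # ordinary convolution, span 3
dilated = atrous_conv1d(x, k, rate=2)  # same 3 weights, span 5
print(dense[:3], dilated[:3])          # → [3. 6. 9.] [ 6.  9. 12.]
```

Stacking such filters in cascade or in parallel with different rates, as the abstract describes, covers multiple object scales while keeping feature resolution high.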

5,691 citations