Showing papers by "Hang Zhao" published in 2017


Proceedings ArticleDOI
21 Jul 2017
TL;DR: The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced and it is shown that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.
Abstract: Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade and improve over the baselines. We further show that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.
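As a rough illustration of how such a benchmark is typically scored, the sketch below computes pixel accuracy and mean IoU over the 150 ADE20K classes from predicted and ground-truth label maps. The class indexing (1–150, with 0 as the unlabeled ignore value) and the array shapes are illustrative assumptions, not details taken from the paper's released evaluation toolkit.

```python
# Hedged sketch: pixel accuracy and mean IoU over 150 scene-parsing classes.
# Class ids 1..150 and ignore label 0 are assumptions for illustration only.
import numpy as np

def parsing_metrics(pred, gt, num_classes=150, ignore_label=0):
    """pred, gt: integer label maps of shape (H, W); 0 marks unlabeled pixels."""
    valid = gt != ignore_label
    pixel_acc = np.mean(pred[valid] == gt[valid])

    ious = []
    for c in range(1, num_classes + 1):           # classes assumed to be 1..150
        pred_c, gt_c = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:                            # class absent from both maps
            continue
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return pixel_acc, float(np.mean(ious))

# Example call with random label maps, just to show the signature.
rng = np.random.default_rng(0)
pred = rng.integers(1, 151, size=(384, 384))
gt = rng.integers(0, 151, size=(384, 384))
print(parsing_metrics(pred, gt))
```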

2,233 citations


Journal ArticleDOI
TL;DR: It is shown that the quality of the results improves significantly with better loss functions, even when the network architecture is left unchanged, and a novel, differentiable error function is proposed.
Abstract: Neural networks are becoming central in several areas of computer vision and image processing, and different architectures have been proposed to solve specific problems. The impact of the loss layer of neural networks, however, has not received much attention in the context of image processing: the default and virtually only choice is $\ell_2$. In this paper, we bring attention to alternative choices for image restoration. In particular, we show the importance of perceptually-motivated losses when the resulting image is to be evaluated by a human observer. We compare the performance of several losses and propose a novel, differentiable error function. We show that the quality of the results improves significantly with better loss functions, even when the network architecture is left unchanged.
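To make the idea concrete, here is a minimal PyTorch sketch of swapping the default $\ell_2$ criterion for a perceptually motivated mix of L1 and a structural-similarity term, in the spirit of the losses the paper compares. The single-scale SSIM with a uniform window and the weighting `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: replacing MSE with a mix of L1 and (1 - SSIM).
# Window size, constants, and alpha are illustrative choices.
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, c1=0.01**2, c2=0.03**2):
    """Single-scale SSIM for images in [0, 1], shape (N, C, H, W)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def restoration_loss(pred, target, alpha=0.8):
    """Weighted mix of a structural term and L1; alpha balances the two."""
    return alpha * (1.0 - ssim(pred, target)) + (1 - alpha) * F.l1_loss(pred, target)

pred = torch.rand(2, 3, 64, 64, requires_grad=True)
target = torch.rand(2, 3, 64, 64)
loss = restoration_loss(pred, target)
loss.backward()   # fully differentiable, so it can stand in for the usual MSE criterion
```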

1,758 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: Duckietown is an open, inexpensive and flexible platform for autonomy education and research that comprises small autonomous vehicles built from off-the-shelf components, and cities complete with roads, signage, traffic lights, obstacles, and citizens in need of transportation.
Abstract: Duckietown is an open, inexpensive and flexible platform for autonomy education and research. The platform comprises small autonomous vehicles (“Duckiebots”) built from off-the-shelf components, and cities (“Duckietowns”) complete with roads, signage, traffic lights, obstacles, and citizens (duckies) in need of transportation. The Duckietown platform offers a wide range of functionalities at a low cost. Duckiebots sense the world with only one monocular camera and perform all processing onboard with a Raspberry Pi 2, yet are able to: follow lanes while avoiding obstacles, pedestrians (duckies) and other Duckiebots, localize within a global map, navigate a city, and coordinate with other Duckiebots to avoid collisions. Duckietown is a useful tool since educators and researchers can save money and time by not having to develop all of the necessary supporting infrastructure and capabilities. All materials are available as open source, and the hope is that others in the community will adopt the platform for education and research.
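As a loose illustration of the kind of monocular processing a Duckiebot-style robot might run onboard, the sketch below thresholds a camera frame for white and yellow lane markings and derives a crude steering cue. The colour ranges and the heading heuristic are illustrative assumptions; the actual Duckietown software stack is not reproduced here.

```python
# Hedged sketch: HSV thresholding of lane markings from a single camera frame.
# Colour ranges and the steering heuristic are illustrative assumptions.
import cv2
import numpy as np

def lane_masks(bgr_frame):
    """Return binary masks for white and yellow lane markings."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    white = cv2.inRange(hsv, (0, 0, 160), (180, 60, 255))     # low saturation, bright
    yellow = cv2.inRange(hsv, (20, 80, 80), (35, 255, 255))   # hue band around yellow
    return white, yellow

def steering_hint(white, yellow):
    """Crude heading cue: offset of the detected lane pixels from image centre."""
    cols = np.where((white | yellow).any(axis=0))[0]
    if cols.size == 0:
        return 0.0
    return (cols.mean() - white.shape[1] / 2) / (white.shape[1] / 2)

frame = np.zeros((120, 160, 3), dtype=np.uint8)   # stand-in for a camera frame
w, y = lane_masks(frame)
print(steering_hint(w, y))
```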

181 citations


Posted Content
TL;DR: On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.
Abstract: This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage both consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting dataset is dubbed HACS Clips. Through a separate process we also collect annotations defining action segment boundaries. This resulting dataset is called HACS Segments. Overall, HACS Clips consists of 1.5M annotated clips sampled from 504K untrimmed videos, and HACS Segments contains 139K action segments densely annotated in 50K untrimmed videos spanning 200 action categories. HACS Clips contains more labeled examples than any existing video benchmark. This renders our dataset both a large-scale action recognition benchmark and an excellent source for spatiotemporal feature learning. In our transfer learning experiments on three target datasets, HACS Clips outperforms Kinetics-600, Moments-In-Time, and Sports-1M as a pretraining source. On HACS Segments, we evaluate state-of-the-art methods of action proposal generation and action localization, and highlight the new challenges posed by our dense temporal annotations.
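A minimal sketch of the consensus/disagreement idea, under the assumption of just two classifiers and hand-picked thresholds: clips on which both classifiers are confidently positive are treated as likely positives, while clips that split the classifiers are flagged for human verification.

```python
# Hedged sketch: mining candidate clips from classifier consensus/disagreement.
# The two-classifier setup and thresholds are illustrative assumptions.
import numpy as np

def mine_clips(scores_a, scores_b, agree_thresh=0.9, disagree_gap=0.5):
    """scores_a, scores_b: per-clip action probabilities from two classifiers."""
    consensus = (scores_a > agree_thresh) & (scores_b > agree_thresh)
    disagreement = np.abs(scores_a - scores_b) > disagree_gap
    return np.where(consensus)[0], np.where(disagreement)[0]

rng = np.random.default_rng(1)
a, b = rng.random(1000), rng.random(1000)
likely_positive, needs_annotation = mine_clips(a, b)
print(len(likely_positive), "consensus clips,", len(needs_annotation), "sent to annotators")
```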

122 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this article, a joint image pixel and word concept embeddings framework is proposed, where word concepts are connected by semantic relations and the trained joint embedding space is further explored to show its interpretability.
Abstract: Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and we explore several evaluation metrics for this problem. Our approach is a joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on the ADE20K dataset, which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.
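A minimal sketch of the joint-embedding idea: pixel features are projected into a shared space and each pixel is labelled with the nearest word-concept vector by cosine similarity. The feature and embedding dimensions, the linear projection, and the random concept vectors are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: open-vocabulary pixel labelling via a shared pixel/word space.
# Dimensions, projection, and concept vectors are placeholders for illustration.
import torch
import torch.nn.functional as F

feat_dim, embed_dim, num_concepts = 256, 300, 3000
project = torch.nn.Linear(feat_dim, embed_dim)            # pixel features -> embedding space
concept_vectors = F.normalize(torch.randn(num_concepts, embed_dim), dim=1)

pixel_features = torch.randn(64 * 64, feat_dim)           # flattened feature map
pixel_embed = F.normalize(project(pixel_features), dim=1)

similarity = pixel_embed @ concept_vectors.t()            # cosine similarity to every concept
labels = similarity.argmax(dim=1).reshape(64, 64)         # open-vocabulary label map
print(labels.shape)
```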

52 citations


Posted Content
TL;DR: The proposed procedure dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers, thus generating labels for highly informative samples at little cost.
Abstract: This paper describes a procedure for the creation of large-scale video datasets for action classification and localization from unconstrained, realistic web data. The scalability of the proposed procedure is demonstrated by building a novel video benchmark, named SLAC (Sparsely Labeled ACtions), consisting of over 520K untrimmed videos and 1.75M clip annotations spanning 200 action categories. Using our proposed framework, annotating a clip takes merely 8.8 seconds on average. This represents a saving in labeling time of over 95% compared to the traditional procedure of manual trimming and localization of actions. Our approach dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers. A human annotator can disambiguate whether such a clip truly contains the hypothesized action in a handful of seconds, thus generating labels for highly informative samples at little cost. We show that our large-scale dataset can be used to effectively pre-train action recognition models, significantly improving final metrics on smaller-scale benchmarks after fine-tuning. On Kinetics, UCF-101, and HMDB-51, models pre-trained on SLAC outperform baselines trained from scratch by 2.0%, 20.1%, and 35.4% in top-1 accuracy, respectively, when RGB input is used. Furthermore, we introduce a simple procedure that leverages the sparse labels in SLAC to pre-train action localization models. On THUMOS14 and ActivityNet-v1.3, our localization model improves the mAP of the baseline model by 8.6% and 2.5%, respectively.
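As a rough sketch of the pre-train/fine-tune recipe reported above, the code below takes a torchvision video backbone (standing in for a SLAC-pretrained model), swaps its classifier head for the target benchmark's classes, and fine-tunes with a smaller learning rate on the backbone. The model choice, class count, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: fine-tuning a pretrained video backbone on a smaller benchmark.
# r3d_18 is randomly initialised here and merely stands in for a SLAC checkpoint.
import torch
import torchvision

model = torchvision.models.video.r3d_18()                # stand-in for pretrained weights
num_target_classes = 101                                  # e.g. a UCF-101-sized benchmark
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)

# Fine-tune the whole network, with a smaller learning rate for the backbone.
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc")]
optimizer = torch.optim.SGD(
    [{"params": backbone_params, "lr": 1e-3},
     {"params": model.fc.parameters(), "lr": 1e-2}],
    momentum=0.9, weight_decay=1e-4)

clips = torch.randn(2, 3, 16, 112, 112)                   # dummy clips: (batch, C, T, H, W)
loss = torch.nn.functional.cross_entropy(model(clips), torch.tensor([0, 1]))
loss.backward()
optimizer.step()
```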

38 citations


Posted Content
TL;DR: A joint image pixel and word concept embeddings framework is proposed in which word concepts are connected by semantic relations, and its open-vocabulary prediction ability is validated on the ADE20K dataset, which covers a wide variety of scenes and objects.
Abstract: Recognizing arbitrary objects in the wild has been a challenging problem due to the limitations of existing classification models and datasets. In this paper, we propose a new task that aims at parsing scenes with a large and open vocabulary, and we explore several evaluation metrics for this problem. Our approach to this problem is a joint image pixel and word concept embeddings framework, where word concepts are connected by semantic relations. We validate the open vocabulary prediction ability of our framework on the ADE20K dataset, which covers a wide variety of scenes and objects. We further explore the trained joint embedding space to show its interpretability.

35 citations