
Showing papers by "Jia Deng published in 2018"


Book Chapter•DOI•
Hei Law, Jia Deng
08 Sep 2018
TL;DR: CornerNet as mentioned in this paper detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network.
Abstract: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.1% AP on MS COCO, outperforming all existing one-stage detectors.
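The abstract describes corner pooling only at a high level. As a rough illustration (not the authors' implementation), top-left corner pooling can be read as two directional running-max scans whose results are summed; the NumPy sketch below assumes a single 2D feature map, and the function name is illustrative.

```python
import numpy as np

def top_left_corner_pool(feat):
    """Simplified top-left corner pooling on one 2D feature map:
    for each location, take the running max over everything to its
    right plus the running max over everything below it."""
    h, w = feat.shape
    right_max = feat.copy()
    for j in range(w - 2, -1, -1):                  # scan right-to-left
        right_max[:, j] = np.maximum(right_max[:, j], right_max[:, j + 1])
    down_max = feat.copy()
    for i in range(h - 2, -1, -1):                  # scan bottom-to-top
        down_max[i, :] = np.maximum(down_max[i, :], down_max[i + 1, :])
    return right_max + down_max

pooled = top_left_corner_pool(np.random.rand(8, 8).astype(np.float32))
```

A bottom-right corner pooling layer would scan in the opposite directions (left-to-right and top-to-bottom).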

1,642 citations


Posted Content•
TL;DR: CornerNet, a new approach to object detection where an object bounding box is detected as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network, is proposed.
Abstract: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.

739 citations


Proceedings Article•DOI•
01 Jun 2018
TL;DR: TAL-Net as mentioned in this paper improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster RCNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

647 citations


Proceedings Article•DOI•
12 Mar 2018
TL;DR: This paper proposes a Human-Object Region-based Convolutional Neural Network (HO-RCNN) for HOI detection, built around the Interaction Pattern, a novel DNN input that characterizes the spatial relations between two bounding boxes.
Abstract: We study the problem of detecting human-object interactions (HOI) in static images, defined as predicting a human and an object bounding box with an interaction class label that connects them. HOI detection is a fundamental problem in computer vision as it provides semantic information about the interactions among the detected objects. We introduce HICO-DET, a new large benchmark for HOI detection, by augmenting the current HICO classification benchmark with instance annotations. To solve the task, we propose Human-Object Region-based Convolutional Neural Networks (HO-RCNN). At the core of our HO-RCNN is the Interaction Pattern, a novel DNN input that characterizes the spatial relations between two bounding boxes. Experiments on HICO-DET demonstrate that our HO-RCNN, by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.
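The Interaction Pattern is described above only as a DNN input that characterizes the spatial relations between two bounding boxes. One plausible reading, sketched below under that assumption, is a two-channel binary map over the union of the human and object boxes, one channel per box; the function name and resolution are illustrative, not the paper's specification.

```python
import numpy as np

def interaction_pattern(human_box, object_box, size=64):
    """Rasterize two boxes (x1, y1, x2, y2) into a 2-channel binary map
    over their union region, resampled to a fixed resolution."""
    ux1 = min(human_box[0], object_box[0])
    uy1 = min(human_box[1], object_box[1])
    ux2 = max(human_box[2], object_box[2])
    uy2 = max(human_box[3], object_box[3])
    uw, uh = max(ux2 - ux1, 1e-6), max(uy2 - uy1, 1e-6)

    pattern = np.zeros((2, size, size), dtype=np.float32)
    for c, box in enumerate((human_box, object_box)):
        x1 = int((box[0] - ux1) / uw * (size - 1))
        y1 = int((box[1] - uy1) / uh * (size - 1))
        x2 = int((box[2] - ux1) / uw * (size - 1))
        y2 = int((box[3] - uy1) / uh * (size - 1))
        pattern[c, y1:y2 + 1, x1:x2 + 1] = 1.0   # fill the box's channel
    return pattern

p = interaction_pattern((10, 20, 60, 120), (50, 80, 140, 160))
```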

377 citations


Proceedings Article•
Lanlan Liu, Jia Deng
29 Apr 2018
TL;DR: Dynamic Deep Neural Networks (D2NNs) as mentioned in this paper augment a feed-forward deep neural network with controller modules; each controller module is a sub-network whose output is a decision that controls whether other modules can execute.
Abstract: We introduce Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. Given an input, only a subset of D2NN neurons are executed, and the particular subset is determined by the D2NN itself. By pruning unnecessary computation depending on input, D2NNs provide a way to improve computational efficiency. To achieve dynamic selective execution, a D2NN augments a feed-forward deep neural network (directed acyclic graph of differentiable modules) with controller modules. Each controller module is a sub-network whose output is a decision that controls whether other modules can execute. A D2NN is trained end to end. Both regular and controller modules in a D2NN are learnable and are jointly trained to optimize both accuracy and efficiency. Such training is achieved by integrating backpropagation with reinforcement learning. With extensive experiments of various D2NN architectures on image classification tasks, we demonstrate that D2NNs are general and flexible, and can effectively optimize accuracy-efficiency trade-offs.
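As a minimal sketch of the controller idea (assuming a single controller gating one downstream block, with illustrative class and layer names; the paper's actual modules and its reinforcement-learning training are not reproduced here):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Toy dynamic-execution block: a small controller looks at the input
    and decides whether the (more expensive) regular module runs at all."""
    def __init__(self, dim):
        super().__init__()
        self.controller = nn.Linear(dim, 1)        # decision sub-network
        self.module = nn.Sequential(               # regular module it controls
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        execute = torch.sigmoid(self.controller(x)).round()   # hard 0/1 decision
        if execute.sum() == 0:
            return x                               # skip the regular module entirely
        return x + execute * self.module(x)

block = GatedBlock(16)
y = block(torch.randn(4, 16))
```

Note that the hard 0/1 decision is not differentiable, which is why the paper trains the controllers by combining backpropagation with reinforcement learning.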

169 citations


Proceedings Article•DOI•
Lei Huang, Dawei Yang, Bo Lang, Jia Deng
14 Dec 2018
TL;DR: Decorrelated Batch Normalization (DBN) as mentioned in this paper extends Batch Normalization, which accelerates the training of deep models by centering and scaling activations within mini-batches, by additionally whitening the activations.
Abstract: Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them. We explore multiple whitening techniques, and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.
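For intuition, a mini-batch ZCA whitening step of the kind the abstract refers to can be sketched as below. This is a simplified NumPy version for fully connected activations; the learnable scale and shift parameters and the running statistics of an actual normalization layer are omitted.

```python
import numpy as np

def zca_whiten_batch(x, eps=1e-5):
    """Whiten a mini-batch (N, D) of activations with ZCA: center,
    decorrelate via the eigendecomposition of the batch covariance,
    then rotate back so dimensions keep their original orientation."""
    mu = x.mean(axis=0, keepdims=True)
    xc = x - mu
    cov = xc.T @ xc / x.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    w = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T   # ZCA matrix
    return xc @ w

acts = np.random.randn(128, 32)
white = zca_whiten_batch(acts)
```

Using the full ZCA matrix (rather than stopping at the PCA rotation) is what avoids the stochastic axis swapping problem the abstract mentions, since each output dimension stays aligned with its input dimension.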

164 citations


Posted Content•
Zachary Teed, Jia Deng
TL;DR: DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation, composing a collection of classical geometric algorithms that are converted into trainable modules and combined into an end-to-end differentiable architecture.
Abstract: We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth. Code is available this https URL.
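The alternation between motion and depth estimation can be pictured with the toy loop below; `depth_net` and `motion_net` are hypothetical stand-ins for the paper's trainable geometric modules, and the dummy usage only demonstrates the control flow.

```python
import numpy as np

def deepv2d_style_inference(frames, depth_net, motion_net, num_iters=4):
    """Sketch of alternating inference: refine camera motion given the
    current depth, then refine depth given the current motion."""
    depth = np.ones(frames[0].shape[:2])           # coarse initial depth
    for _ in range(num_iters):
        poses = motion_net(frames, depth)          # motion update
        depth = depth_net(frames, poses)           # depth update
    return depth

# Toy usage with dummy modules, just to show the loop structure.
frames = [np.zeros((48, 64, 3)) for _ in range(5)]
motion_net = lambda f, d: [np.eye(4) for _ in f]
depth_net = lambda f, p: np.ones(f[0].shape[:2]) * 2.0
depth = deepv2d_style_inference(frames, depth_net, motion_net)
```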

89 citations


Posted Content•
TL;DR: Decorrelated Batch Normalization (DBN) as discussed by the authors extends Batch Normalization, which accelerates the training of deep models by centering and scaling activations within mini-batches, by additionally whitening the activations.
Abstract: Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them. We explore multiple whitening techniques, and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.

87 citations


Posted Content•
TL;DR: This work investigates whether motion representations are indeed missing in the spatial stream, shows that there is significant room for improvement, and demonstrates that these motion representations can be improved using distillation, that is, by tuning the spatial stream to mimic the temporal stream, effectively combining both models into a single stream.
Abstract: State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow.
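A common way to realize this kind of distillation, sketched here as an assumption rather than the paper's exact objective, is to train the spatial stream on both the action labels and the temporal stream's softened predictions; `alpha` and `T` are assumed knobs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(spatial_logits, flow_logits, labels, alpha=0.5, T=1.0):
    """Toy distillation objective in the spirit of D3D: the spatial (RGB)
    stream is supervised by the labels and by the temporal (flow) stream."""
    ce = F.cross_entropy(spatial_logits, labels)
    kd = F.kl_div(
        F.log_softmax(spatial_logits / T, dim=1),
        F.softmax(flow_logits.detach() / T, dim=1),   # teacher is frozen
        reduction="batchmean")
    return (1 - alpha) * ce + alpha * kd

logits_rgb = torch.randn(8, 400, requires_grad=True)
logits_flow = torch.randn(8, 400)
labels = torch.randint(0, 400, (8,))
loss = distillation_loss(logits_rgb, logits_flow, labels)
```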

50 citations


Proceedings Article•DOI•
Dawei Yang, Jia Deng
18 Jun 2018
TL;DR: This paper proposes an approach that does not need any external shape dataset to render synthetic images and achieves state-of-the-art performance on a shape-from-shading benchmark.
Abstract: In this paper, we address the shape-from-shading problem by training deep networks with synthetic images. Unlike conventional approaches that combine deep learning and synthetic imagery, we propose an approach that does not need any external shape dataset to render synthetic images. Our approach consists of two synergistic processes: the evolution of complex shapes from simple primitives, and the training of a deep network for shape-from-shading. The evolution generates better shapes guided by the network training, while the training improves by using the evolved shapes. We show that our approach achieves state-of-the-art performance on a shape-from-shading benchmark.
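The interplay between shape evolution and network training can be illustrated with a generic evolution loop; the sketch below is only a schematic (toy "shapes", with `fitness` standing in for how hard a shape currently is for the network, and `mutate` for the primitive-based shape edits), not the authors' procedure.

```python
import random

def evolve_shapes(population, fitness, mutate, generations=10, keep=0.5):
    """Keep the shapes the current network finds hardest (highest fitness)
    and mutate them to produce the next generation."""
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: max(1, int(len(scored) * keep))]
        children = [mutate(random.choice(survivors))
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return population

# Toy usage: numbers stand in for shapes.
pop = [random.random() for _ in range(20)]
best = evolve_shapes(pop,
                     fitness=lambda s: s,
                     mutate=lambda s: min(1.0, s + 0.1 * random.random()))
```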

22 citations


Posted Content•
Dawei Yang, Chaowei Xiao, Bo Li, Jia Deng, Mingyan Liu 
TL;DR: This paper aims to project optimized "adversarial meshes" to 2D with a photorealistic renderer while still misleading different machine learning models, and proposes synthesizing a realistic 3D mesh and placing it in a scene that mimics similar rendering conditions in order to attack different machine learning models.
Abstract: Highly expressive models such as deep neural networks (DNNs) have been widely applied to various applications and achieved increasing success. However, recent studies show that such machine learning models appear to be vulnerable against adversarial examples. So far adversarial examples have been heavily explored for 2D images, while few works have been conducted to understand vulnerabilities of 3D objects which exist in the real world, where 3D objects are projected to 2D domains by photo taking for different learning (recognition) tasks. In this paper, we consider adversarial behaviors in practical scenarios by manipulating the shape and texture of a given 3D mesh representation of an object. Our goal is to project the optimized "adversarial meshes" to 2D with a photorealistic renderer, and still be able to mislead different machine learning models. Extensive experiments show that by generating unnoticeable 3D adversarial perturbation on shape or texture for a 3D mesh, the corresponding projected 2D instance can either lead classifiers to misclassify the victim object as an arbitrary malicious target, or hide any target object within the scene from object detectors. We conduct human studies to show that our optimized adversarial 3D perturbation is highly unnoticeable for human vision systems. In addition to the subtle perturbation for a given 3D mesh, we also propose to synthesize a realistic 3D mesh and put it in a scene mimicking similar rendering conditions, and therefore attack different machine learning models. In-depth analysis of transferability among various 3D renderers and vulnerable regions of meshes is provided to help better understand adversarial behaviors in the real world.

Journal Article•DOI•
TL;DR: A computer vision-based method to assess the technical skill level of surgeons by analyzing the movement of robotic instruments in robotic surgical videos, leveraging the power of crowd workers on the internet to obtain high-quality data in a scalable and cost-efficient way.

Posted Content•
TL;DR: This work explores unconventional narrow-precision floating-point representations as they relate to inference accuracy and efficiency in order to steer the improved design of future DNN platforms, and presents a novel technique that drastically reduces the time required to derive the optimal precision configuration.
Abstract: With ever-increasing computational demand for deep learning, it is critical to investigate the implications of the numeric representation and precision of DNN model weights and activations on computational efficiency. In this work, we explore unconventional narrow-precision floating-point representations as it relates to inference accuracy and efficiency to steer the improved design of future DNN platforms. We show that inference using these custom numeric representations on production-grade DNNs, including GoogLeNet and VGG, achieves an average speedup of 7.6x with less than 1% degradation in inference accuracy relative to a state-of-the-art baseline platform representing the most sophisticated hardware using single-precision floating point. To facilitate the use of such customized precision, we also present a novel technique that drastically reduces the time required to derive the optimal precision configuration.
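To make "custom numeric representations" concrete, the sketch below rounds values to a narrow floating-point format with configurable exponent and mantissa widths. It is a simplified model (no subnormals, symmetric exponent range, round-to-nearest mantissa) and is not the precision-selection technique the abstract mentions.

```python
import numpy as np

def quantize_float(x, exp_bits=5, man_bits=4):
    """Round values to a toy narrow floating-point format with the given
    exponent and mantissa bit widths."""
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)
    out = np.zeros_like(mag)
    nz = mag > 0
    exp = np.floor(np.log2(mag[nz]))
    exp_max = 2 ** (exp_bits - 1) - 1
    exp = np.clip(exp, -exp_max, exp_max)          # clamp to representable range
    frac = mag[nz] / 2 ** exp                       # significand, nominally in [1, 2)
    frac = np.round(frac * 2 ** man_bits) / 2 ** man_bits
    out[nz] = frac * 2 ** exp
    return sign * out

w = np.random.randn(5).astype(np.float32)
wq = quantize_float(w, exp_bits=5, man_bits=4)      # e.g. a 5-bit-exponent, 4-bit-mantissa format
```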

Proceedings Article•DOI•
Mahmoud Azab, Mingzhe Wang, Max Smith, Noriyuki Kojima, Jia Deng, Rada Mihalcea
01 Jan 2018
TL;DR: A new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework is proposed, and it significantly outperforms several competitive baselines on the average weighted F-score metric.
Abstract: We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in an unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

Posted Content•
TL;DR: This paper proposes meshAdv to generate "adversarial 3D meshes" from objects that have rich shape features but minimal textural variation, using a differentiable renderer to manipulate the shape or texture of the objects and propagate gradients.
Abstract: Highly expressive models such as deep neural networks (DNNs) have been widely applied to various applications. However, recent studies show that DNNs are vulnerable to adversarial examples, which are carefully crafted inputs aiming to mislead the predictions. Currently, the majority of these studies have focused on perturbation added to image pixels, while such manipulation is not physically realistic. Some works have tried to overcome this limitation by attaching printable 2D patches or painting patterns onto surfaces, but can be potentially defended because 3D shape features are intact. In this paper, we propose meshAdv to generate "adversarial 3D meshes" from objects that have rich shape features but minimal textural variation. To manipulate the shape or texture of the objects, we make use of a differentiable renderer to compute accurate shading on the shape and propagate the gradient. Extensive experiments show that the generated 3D meshes are effective in attacking both classifiers and object detectors. We evaluate the attack under different viewpoints. In addition, we design a pipeline to perform black-box attack on a photorealistic renderer with unknown rendering parameters.

Proceedings Article•DOI•
01 Jan 2018
TL;DR: In this article, a new deep network architecture called Dynamic Spatial Memory Network (DSMN) is proposed for question-answering; it specializes in answering questions that admit latent visual representations and learns to generate and reason over such representations.
Abstract: In this paper, we study the problem of geometric reasoning (a form of visual reasoning) in the context of question-answering. We introduce Dynamic Spatial Memory Network (DSMN), a new deep network architecture that specializes in answering questions that admit latent visual representations, and learns to generate and reason over such representations. Further, we propose two synthetic benchmarks, FloorPlanQA and ShapeIntersection, to evaluate the geometric reasoning capability of QA systems. Experimental results validate the effectiveness of our proposed DSMN for visual thinking tasks.

Posted Content•
TL;DR: TAL-Net as discussed by the authors improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

Posted Content•
TL;DR: In this paper, a quality assessment network is proposed to identify high-quality reconstructions obtained from Structure-from-Motion (SfM) on Internet videos and a new dataset called YouTube3D is constructed.
Abstract: Depth estimation from a single image in the wild remains a challenging problem. One main obstacle is the lack of high-quality training data for images in the wild. In this paper we propose a method to automatically generate such data through Structure-from-Motion (SfM) on Internet videos. The core of this method is a Quality Assessment Network that identifies high-quality reconstructions obtained from SfM. Using this method, we collect single-view depth training data from a large number of YouTube videos and construct a new dataset called YouTube3D. Experiments show that YouTube3D is useful in training depth estimation networks and advances the state of the art of single-view depth estimation in the wild.

Posted Content•
TL;DR: Dynamic Spatial Memory Network (DSMN) is proposed for visual thinking tasks, along with two synthetic benchmarks, FloorPlanQA and ShapeIntersection, for evaluating the geometric reasoning capability of QA systems; experimental results validate the effectiveness of DSMN.
Abstract: In this paper, we study the problem of geometric reasoning in the context of question-answering. We introduce Dynamic Spatial Memory Network (DSMN), a new deep network architecture designed for answering questions that admit latent visual representations. DSMN learns to generate and reason over such representations. Further, we propose two synthetic benchmarks, FloorPlanQA and ShapeIntersection, to evaluate the geometric reasoning capability of QA systems. Experimental results validate the effectiveness of our proposed DSMN for visual thinking tasks.

Posted Content•
Mahmoud Azab, Mingzhe Wang, Max Smith, Noriyuki Kojima, Jia Deng, Rada Mihalcea
TL;DR: This article proposes a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.
Abstract: We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in an unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

Patent•
Yuan Li, Hartwig Adam, Jia Deng, Nan Ding
10 Jul 2018

Journal Article•
TL;DR: This work builds the first step of the overall system, which involves translating all the raw documents, transcribing and translating audio and video data, and building a graph from all the extracted entities, events, and relations.
Abstract: Understanding current world events in real-time involves sifting through news articles, tweets, photos, and videos from many different perspectives. The goal of the DARPA-funded AIDA project is to automate much of this process, building a knowledge base that can be queried to strategically generate hypotheses about different aspects of an event. We are participating in this project as a TA1 team, and we are building the first step of the overall system. Given raw multimodal input (e.g., text, images, video), our goal is to generate a knowledge graph with entities, events, and relations. Figure 1 shows an overview of our pipeline. The first stage is pre-processing. This involves translating all the raw documents, as well as transcribing and translating audio and video data. All the translated information is input to our main processing module that extracts entities, events, and relations. Entities are extracted from both text and video data. In the final, output generation stage of the pipeline, we build a graph from all of the entities, events, and relations.