
Showing papers by "Jia Deng published in 2018"


Book Chapter•DOI•
Hei Law, Jia Deng
08 Sep 2018
TL;DR: CornerNet as mentioned in this paper detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network.
Abstract: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.1% AP on MS COCO, outperforming all existing one-stage detectors.
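The abstract describes corner pooling only at a high level. As a rough illustration (not the authors' implementation), top-left corner pooling can be read as two directional running-max scans whose results are summed; the NumPy sketch below assumes a single 2D feature map, and the function name is illustrative.

```python
import numpy as np

def top_left_corner_pool(feat):
    """Simplified top-left corner pooling on one 2D feature map:
    for each location, take the running max over everything to its
    right plus the running max over everything below it."""
    h, w = feat.shape
    right_max = feat.copy()
    for j in range(w - 2, -1, -1):                  # scan right-to-left
        right_max[:, j] = np.maximum(right_max[:, j], right_max[:, j + 1])
    down_max = feat.copy()
    for i in range(h - 2, -1, -1):                  # scan bottom-to-top
        down_max[i, :] = np.maximum(down_max[i, :], down_max[i + 1, :])
    return right_max + down_max

pooled = top_left_corner_pool(np.random.rand(8, 8).astype(np.float32))
```

A bottom-right corner pooling layer would scan in the opposite directions (left-to-right and top-to-bottom).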

1,642 citations


Posted Content•
TL;DR: CornerNet, a new approach to object detection where an object bounding box is detected as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network, is proposed.
Abstract: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.

739 citations


Proceedings Article•DOI•
01 Jun 2018
TL;DR: TAL-Net as mentioned in this paper improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster RCNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

647 citations


Proceedings Article•DOI•
12 Mar 2018
TL;DR: This paper proposes a Human-Object Region-based Convolutional Neural Network (HO-RCNN) for HOI detection, built around the Interaction Pattern, a novel DNN input that characterizes the spatial relations between two bounding boxes.
Abstract: We study the problem of detecting human-object interactions (HOI) in static images, defined as predicting a human and an object bounding box with an interaction class label that connects them. HOI detection is a fundamental problem in computer vision as it provides semantic information about the interactions among the detected objects. We introduce HICO-DET, a new large benchmark for HOI detection, by augmenting the current HICO classification benchmark with instance annotations. To solve the task, we propose Human-Object Region-based Convolutional Neural Networks (HO-RCNN). At the core of our HO-RCNN is the Interaction Pattern, a novel DNN input that characterizes the spatial relations between two bounding boxes. Experiments on HICO-DET demonstrate that our HO-RCNN, by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.
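The Interaction Pattern is described above only as a DNN input that characterizes the spatial relations between two bounding boxes. One plausible reading, sketched below under that assumption, is a two-channel binary map over the union of the human and object boxes, one channel per box; the function name and resolution are illustrative, not the paper's specification.

```python
import numpy as np

def interaction_pattern(human_box, object_box, size=64):
    """Rasterize two boxes (x1, y1, x2, y2) into a 2-channel binary map
    over their union region, resampled to a fixed resolution."""
    ux1 = min(human_box[0], object_box[0])
    uy1 = min(human_box[1], object_box[1])
    ux2 = max(human_box[2], object_box[2])
    uy2 = max(human_box[3], object_box[3])
    uw, uh = max(ux2 - ux1, 1e-6), max(uy2 - uy1, 1e-6)

    pattern = np.zeros((2, size, size), dtype=np.float32)
    for c, box in enumerate((human_box, object_box)):
        x1 = int((box[0] - ux1) / uw * (size - 1))
        y1 = int((box[1] - uy1) / uh * (size - 1))
        x2 = int((box[2] - ux1) / uw * (size - 1))
        y2 = int((box[3] - uy1) / uh * (size - 1))
        pattern[c, y1:y2 + 1, x1:x2 + 1] = 1.0   # fill the box's channel
    return pattern

p = interaction_pattern((10, 20, 60, 120), (50, 80, 140, 160))
```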

377 citations


Proceedings Article•
Lanlan Liu, Jia Deng
29 Apr 2018
TL;DR: Dynamic Deep Neural Networks (D2NNs) as mentioned in this paper augment a feed-forward deep neural network with controller modules; each controller module is a sub-network whose output is a decision that controls whether other modules can execute.
Abstract: We introduce Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. Given an input, only a subset of D2NN neurons are executed, and the particular subset is determined by the D2NN itself. By pruning unnecessary computation depending on input, D2NNs provide a way to improve computational efficiency. To achieve dynamic selective execution, a D2NN augments a feed-forward deep neural network (directed acyclic graph of differentiable modules) with controller modules. Each controller module is a sub-network whose output is a decision that controls whether other modules can execute. A D2NN is trained end to end. Both regular and controller modules in a D2NN are learnable and are jointly trained to optimize both accuracy and efficiency. Such training is achieved by integrating backpropagation with reinforcement learning. With extensive experiments of various D2NN architectures on image classification tasks, we demonstrate that D2NNs are general and flexible, and can effectively optimize accuracy-efficiency trade-offs.
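As a minimal sketch of the controller idea (assuming a single controller gating one downstream block, with illustrative class and layer names; the paper's actual modules and its reinforcement-learning training are not reproduced here):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Toy dynamic-execution block: a small controller looks at the input
    and decides whether the (more expensive) regular module runs at all."""
    def __init__(self, dim):
        super().__init__()
        self.controller = nn.Linear(dim, 1)        # decision sub-network
        self.module = nn.Sequential(               # regular module it controls
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        execute = torch.sigmoid(self.controller(x)).round()   # hard 0/1 decision
        if execute.sum() == 0:
            return x                               # skip the regular module entirely
        return x + execute * self.module(x)

block = GatedBlock(16)
y = block(torch.randn(4, 16))
```

Note that the hard 0/1 decision is not differentiable, which is why the paper trains the controllers by combining backpropagation with reinforcement learning.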

169 citations


Proceedings Article•DOI•
Lei Huang, Dawei Yang, Bo Lang, Jia Deng
14 Dec 2018
TL;DR: Decorrelated Batch Normalization (DBN) as mentioned in this paper extends Batch Normalization, which accelerates the training of deep models by centering and scaling activations within mini-batches, by additionally whitening the activations.
Abstract: Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them. We explore multiple whitening techniques, and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.
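For intuition, a mini-batch ZCA whitening step of the kind the abstract refers to can be sketched as below. This is a simplified NumPy version for fully connected activations; the learnable scale and shift parameters and the running statistics of an actual normalization layer are omitted.

```python
import numpy as np

def zca_whiten_batch(x, eps=1e-5):
    """Whiten a mini-batch (N, D) of activations with ZCA: center,
    decorrelate via the eigendecomposition of the batch covariance,
    then rotate back so dimensions keep their original orientation."""
    mu = x.mean(axis=0, keepdims=True)
    xc = x - mu
    cov = xc.T @ xc / x.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    w = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T   # ZCA matrix
    return xc @ w

acts = np.random.randn(128, 32)
white = zca_whiten_batch(acts)
```

Using the full ZCA matrix (rather than stopping at the PCA rotation) is what avoids the stochastic axis swapping problem the abstract mentions, since each output dimension stays aligned with its input dimension.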

164 citations


Posted Content•
Zachary Teed, Jia Deng
TL;DR: DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation, composing a collection of classical geometric algorithms that are converted into trainable modules and combined into an end-to-end differentiable architecture.
Abstract: We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth. Code is available this https URL.
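The alternation between motion and depth estimation can be pictured with the toy loop below; `depth_net` and `motion_net` are hypothetical stand-ins for the paper's trainable geometric modules, and the dummy usage only demonstrates the control flow.

```python
import numpy as np

def deepv2d_style_inference(frames, depth_net, motion_net, num_iters=4):
    """Sketch of alternating inference: refine camera motion given the
    current depth, then refine depth given the current motion."""
    depth = np.ones(frames[0].shape[:2])           # coarse initial depth
    for _ in range(num_iters):
        poses = motion_net(frames, depth)          # motion update
        depth = depth_net(frames, poses)           # depth update
    return depth

# Toy usage with dummy modules, just to show the loop structure.
frames = [np.zeros((48, 64, 3)) for _ in range(5)]
motion_net = lambda f, d: [np.eye(4) for _ in f]
depth_net = lambda f, p: np.ones(f[0].shape[:2]) * 2.0
depth = deepv2d_style_inference(frames, depth_net, motion_net)
```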

89 citations


Posted Content•
TL;DR: Decorrelated Batch Normalization (DBN) as discussed by the authors extends Batch Normalization, which accelerates the training of deep models by centering and scaling activations within mini-batches, by additionally whitening the activations.
Abstract: Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them. We explore multiple whitening techniques, and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.

87 citations


Posted Content•
TL;DR: This work investigates whether motion representations are indeed missing in the spatial stream, shows that there is significant room for improvement, and demonstrates that these motion representations can be improved using distillation, that is, by tuning the spatial stream to mimic the temporal stream, effectively combining both models into a single stream.
Abstract: State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow.
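A common way to realize this kind of distillation, sketched here as an assumption rather than the paper's exact objective, is to train the spatial stream on both the action labels and the temporal stream's softened predictions; `alpha` and `T` are assumed knobs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(spatial_logits, flow_logits, labels, alpha=0.5, T=1.0):
    """Toy distillation objective in the spirit of D3D: the spatial (RGB)
    stream is supervised by the labels and by the temporal (flow) stream."""
    ce = F.cross_entropy(spatial_logits, labels)
    kd = F.kl_div(
        F.log_softmax(spatial_logits / T, dim=1),
        F.softmax(flow_logits.detach() / T, dim=1),   # teacher is frozen
        reduction="batchmean")
    return (1 - alpha) * ce + alpha * kd

logits_rgb = torch.randn(8, 400, requires_grad=True)
logits_flow = torch.randn(8, 400)
labels = torch.randint(0, 400, (8,))
loss = distillation_loss(logits_rgb, logits_flow, labels)
```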

50 citations


Proceedings Article•DOI•
Dawei Yang, Jia Deng
18 Jun 2018
TL;DR: This paper proposes an approach that does not need any external shape dataset to render synthetic images and achieves state-of-the-art performance on a shape-from-shading benchmark.
Abstract: In this paper, we address the shape-from-shading problem by training deep networks with synthetic images. Unlike conventional approaches that combine deep learning and synthetic imagery, we propose an approach that does not need any external shape dataset to render synthetic images. Our approach consists of two synergistic processes: the evolution of complex shapes from simple primitives, and the training of a deep network for shape-from-shading. The evolution generates better shapes guided by the network training, while the training improves by using the evolved shapes. We show that our approach achieves state-of-the-art performance on a shape-from-shading benchmark.
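The interplay between shape evolution and network training can be illustrated with a generic evolution loop; the sketch below is only a schematic (toy "shapes", with `fitness` standing in for how hard a shape currently is for the network, and `mutate` for the primitive-based shape edits), not the authors' procedure.

```python
import random

def evolve_shapes(population, fitness, mutate, generations=10, keep=0.5):
    """Keep the shapes the current network finds hardest (highest fitness)
    and mutate them to produce the next generation."""
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: max(1, int(len(scored) * keep))]
        children = [mutate(random.choice(survivors))
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return population

# Toy usage: numbers stand in for shapes.
pop = [random.random() for _ in range(20)]
best = evolve_shapes(pop,
                     fitness=lambda s: s,
                     mutate=lambda s: min(1.0, s + 0.1 * random.random()))
```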

22 citations


Posted Content•
Dawei Yang, Chaowei Xiao, Bo Li, Jia Deng, Mingyan Liu 
TL;DR: This paper aims to project optimized "adversarial meshes" to 2D with a photorealistic renderer while still misleading different machine learning models, and proposes synthesizing a realistic 3D mesh and placing it in a scene that mimics similar rendering conditions in order to attack different machine learning models.
Abstract: Highly expressive models such as deep neural networks (DNNs) have been widely applied to various applications and achieved increasing success. However, recent studies show that such machine learning models appear to be vulnerable against adversarial examples. So far adversarial examples have been heavily explored for 2D images, while few works have been conducted to understand vulnerabilities of 3D objects which exist in the real world, where 3D objects are projected to 2D domains by photo taking for different learning (recognition) tasks. In this paper, we consider adversarial behaviors in practical scenarios by manipulating the shape and texture of a given 3D mesh representation of an object. Our goal is to project the optimized "adversarial meshes" to 2D with a photorealistic renderer, and still be able to mislead different machine learning models. Extensive experiments show that by generating unnoticeable 3D adversarial perturbation on shape or texture for a 3D mesh, the corresponding projected 2D instance can either lead classifiers to misclassify the victim object as an arbitrary malicious target, or hide any target object within the scene from object detectors. We conduct human studies to show that our optimized adversarial 3D perturbation is highly unnoticeable for human vision systems. In addition to the subtle perturbation for a given 3D mesh, we also propose to synthesize a realistic 3D mesh and put it in a scene mimicking similar rendering conditions, and therefore attack different machine learning models. In-depth analysis of transferability among various 3D renderers and vulnerable regions of meshes is provided to help better understand adversarial behaviors in the real world.

Journal Article•DOI•
TL;DR: A computer vision-based method to assess the technical skill level of surgeons by analyzing the movement of robotic instruments in robotic surgical videos, leveraging the power of crowd workers on the internet to obtain high-quality data in a scalable and cost-efficient way.

Posted Content•
TL;DR: This work explores unconventional narrow-precision floating-point representations as they relate to inference accuracy and efficiency in order to steer the improved design of future DNN platforms, and presents a novel technique that drastically reduces the time required to derive the optimal precision configuration.
Abstract: With ever-increasing computational demand for deep learning, it is critical to investigate the implications of the numeric representation and precision of DNN model weights and activations on computational efficiency. In this work, we explore unconventional narrow-precision floating-point representations as it relates to inference accuracy and efficiency to steer the improved design of future DNN platforms. We show that inference using these custom numeric representations on production-grade DNNs, including GoogLeNet and VGG, achieves an average speedup of 7.6x with less than 1% degradation in inference accuracy relative to a state-of-the-art baseline platform representing the most sophisticated hardware using single-precision floating point. To facilitate the use of such customized precision, we also present a novel technique that drastically reduces the time required to derive the optimal precision configuration.
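To make "custom numeric representations" concrete, the sketch below rounds values to a narrow floating-point format with configurable exponent and mantissa widths. It is a simplified model (no subnormals, symmetric exponent range, round-to-nearest mantissa) and is not the precision-selection technique the abstract mentions.

```python
import numpy as np

def quantize_float(x, exp_bits=5, man_bits=4):
    """Round values to a toy narrow floating-point format with the given
    exponent and mantissa bit widths."""
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)
    out = np.zeros_like(mag)
    nz = mag > 0
    exp = np.floor(np.log2(mag[nz]))
    exp_max = 2 ** (exp_bits - 1) - 1
    exp = np.clip(exp, -exp_max, exp_max)          # clamp to representable range
    frac = mag[nz] / 2 ** exp                       # significand, nominally in [1, 2)
    frac = np.round(frac * 2 ** man_bits) / 2 ** man_bits
    out[nz] = frac * 2 ** exp
    return sign * out

w = np.random.randn(5).astype(np.float32)
wq = quantize_float(w, exp_bits=5, man_bits=4)      # e.g. a 5-bit-exponent, 4-bit-mantissa format
```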

Proceedings Article•DOI•
Mahmoud Azab, Mingzhe Wang, Max Smith, Noriyuki Kojima, Jia Deng, Rada Mihalcea
01 Jan 2018
TL;DR: A new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework is proposed, and it significantly outperforms several competitive baselines on the average weighted F-score metric.
Abstract: We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in an unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

Posted Content•
TL;DR: This paper proposes meshAdv to generate "adversarial 3D meshes" from objects that have rich shape features but minimal textural variation, using a differentiable renderer to manipulate the shape or texture of the objects and propagate gradients.
Abstract: Highly expressive models such as deep neural networks (DNNs) have been widely applied to various applications. However, recent studies show that DNNs are vulnerable to adversarial examples, which are carefully crafted inputs aiming to mislead the predictions. Currently, the majority of these studies have focused on perturbation added to image pixels, while such manipulation is not physically realistic. Some works have tried to overcome this limitation by attaching printable 2D patches or painting patterns onto surfaces, but can be potentially defended because 3D shape features are intact. In this paper, we propose meshAdv to generate "adversarial 3D meshes" from objects that have rich shape features but minimal textural variation. To manipulate the shape or texture of the objects, we make use of a differentiable renderer to compute accurate shading on the shape and propagate the gradient. Extensive experiments show that the generated 3D meshes are effective in attacking both classifiers and object detectors. We evaluate the attack under different viewpoints. In addition, we design a pipeline to perform black-box attack on a photorealistic renderer with unknown rendering parameters.

Proceedings Article•DOI•
01 Jan 2018
TL;DR: In this article, a new deep network architecture called Dynamic Spatial Memory Network (DSMN) is proposed for question-answering; it specializes in answering questions that admit latent visual representations and learns to generate and reason over such representations.
Abstract: In this paper, we study the problem of geometric reasoning (a form of visual reasoning) in the context of question-answering. We introduce Dynamic Spatial Memory Network (DSMN), a new deep network architecture that specializes in answering questions that admit latent visual representations, and learns to generate and reason over such representations. Further, we propose two synthetic benchmarks, FloorPlanQA and ShapeIntersection, to evaluate the geometric reasoning capability of QA systems. Experimental results validate the effectiveness of our proposed DSMN for visual thinking tasks.

Posted Content•
TL;DR: TAL-Net as discussed by the authors improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

Posted Content•
TL;DR: In this paper, a quality assessment network is proposed to identify high-quality reconstructions obtained from Structure-from-Motion (SfM) on Internet videos and a new dataset called YouTube3D is constructed.
Abstract: Depth estimation from a single image in the wild remains a challenging problem. One main obstacle is the lack of high-quality training data for images in the wild. In this paper we propose a method to automatically generate such data through Structure-from-Motion (SfM) on Internet videos. The core of this method is a Quality Assessment Network that identifies high-quality reconstructions obtained from SfM. Using this method, we collect single-view depth training data from a large number of YouTube videos and construct a new dataset called YouTube3D. Experiments show that YouTube3D is useful in training depth estimation networks and advances the state of the art of single-view depth estimation in the wild.

Posted Content•
TL;DR: Dynamic Spatial Memory Network (DSMN) is proposed for visual thinking tasks, along with two synthetic benchmarks, FloorPlanQA and ShapeIntersection, for evaluating the geometric reasoning capability of QA systems; experimental results validate the effectiveness of DSMN.
Abstract: In this paper, we study the problem of geometric reasoning in the context of question-answering. We introduce Dynamic Spatial Memory Network (DSMN), a new deep network architecture designed for answering questions that admit latent visual representations. DSMN learns to generate and reason over such representations. Further, we propose two synthetic benchmarks, FloorPlanQA and ShapeIntersection, to evaluate the geometric reasoning capability of QA systems. Experimental results validate the effectiveness of our proposed DSMN for visual thinking tasks.

Posted Content•
Mahmoud Azab, Mingzhe Wang, Max Smith, Noriyuki Kojima, Jia Deng, Rada Mihalcea
TL;DR: This article proposes a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.
Abstract: We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in an unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of the Big Bang Theory TV show and eighteen full movies covering different genres. Our experiments show that our multimodal model significantly outperforms several competitive baselines on the average weighted F-score metric. To demonstrate the effectiveness of our framework, we design an end-to-end memory network model that leverages our speaker naming model and achieves state-of-the-art results on the subtitles task of the MovieQA 2017 Challenge.

Patent•
Yuan Li, Hartwig Adam, Jia Deng, Nan Ding
10 Jul 2018

Journal Article•
TL;DR: This work builds the first step of the overall system, which involves translating all the raw documents, transcribing and translating audio and video data, and building a graph from all the extracted entities, events, and relations.
Abstract: Understanding current world events in real-time involves sifting through news articles, tweets, photos, and videos from many different perspectives. The goal of the DARPA-funded AIDA project is to automate much of this process, building a knowledge base that can be queried to strategically generate hypotheses about different aspects of an event. We are participating in this project as a TA1 team, and we are building the first step of the overall system. Given raw multimodal input (e.g., text, images, video), our goal is to generate a knowledge graph with entities, events, and relations. Figure 1 shows an overview of our pipeline. The first stage is pre-processing. This involves translating all the raw documents, as well as transcribing and translating audio and video data. All the translated information is input to our main processing module that extracts entities, events, and relations. Entities are extracted from both text and video data. In the final, output generation stage of the pipeline, we build a graph from all of the entities, events, and relations.