
Showing papers by "Jia Deng published in 2020"


Book Chapter•DOI•
Zachary Teed1, Jia Deng1•
23 Aug 2020
TL;DR: RAFT as mentioned in this paper extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes.
Abstract: We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at https://github.com/princeton-vl/RAFT.
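
As a rough illustration of the loop described in the abstract, the following is a heavily simplified, self-contained PyTorch sketch; the helper names (all_pairs_correlation, lookup, raft_style_flow), the shapes, and the toy update rule are illustrative placeholders, not the released princeton-vl/RAFT code, which uses a multi-scale correlation pyramid and a convolutional GRU.

import torch
import torch.nn.functional as F

def all_pairs_correlation(f1, f2):
    # f1, f2: (B, D, H, W) per-pixel features -> (B, H*W, H, W) correlation volume
    B, D, H, W = f1.shape
    corr = torch.einsum('bdi,bdj->bij', f1.flatten(2), f2.flatten(2)) / D ** 0.5
    return corr.view(B, H * W, H, W)

def lookup(corr, flow):
    # Bilinearly sample, for every source pixel, the correlation at its current
    # flow target. corr: (B, HW, H, W), flow: (B, 2, H, W) -> (B, 1, H, W).
    B, HW, H, W = corr.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack([xs, ys]).float()                        # (2, H, W), (x, y) order
    tgt = base.unsqueeze(0) + flow                              # target coordinates per pixel
    gx = 2.0 * tgt[:, 0] / (W - 1) - 1.0                        # normalize to [-1, 1]
    gy = 2.0 * tgt[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B * HW, 1, 1, 2)
    out = F.grid_sample(corr.reshape(B * HW, 1, H, W), grid, align_corners=True)
    return out.view(B, 1, H, W)

def raft_style_flow(f1, f2, num_iters=12):
    B, D, H, W = f1.shape
    corr = all_pairs_correlation(f1, f2)
    flow = torch.zeros(B, 2, H, W)            # flow field initialized to zero
    hidden = torch.zeros(B, D, H, W)          # recurrent state of the update unit
    for _ in range(num_iters):
        corr_feat = lookup(corr, flow)                 # lookup on the correlation volume
        hidden = torch.tanh(hidden + corr_feat)        # stand-in for the conv-GRU
        flow = flow + 0.1 * hidden[:, :2]              # residual flow update
    return flow

# e.g. raft_style_flow(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))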

1,006 citations


Journal Article•DOI•
Hei Law1, Jia Deng1•
TL;DR: CornerNet is proposed, a new approach to object detection in which an object bounding box is detected as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network.
Abstract: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.1% AP on MS COCO, outperforming all existing one-stage detectors.
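
For readers unfamiliar with corner pooling, the following is a small hedged sketch of a top-left corner pooling operation; the function name and the cumulative-max formulation are an illustration of the idea, not the kernels from the CornerNet release.

import torch

def top_left_corner_pool(x):
    # x: (B, C, H, W). For each location, take the max over everything to its right
    # (same row) and everything below it (same column), then sum the two maxima.
    right_to_left = x.flip(-1).cummax(dim=-1).values.flip(-1)   # max over columns >= j
    bottom_to_top = x.flip(-2).cummax(dim=-2).values.flip(-2)   # max over rows >= i
    return right_to_left + bottom_to_top

# e.g. pooled = top_left_corner_pool(torch.randn(1, 256, 64, 64))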

539 citations


Proceedings Article•DOI•
Jonathan C. Stroud1, David A. Ross1, Chen Sun1, Jia Deng1, Rahul Sukthankar1 •
01 Mar 2020
TL;DR: Distilled 3D Network (D3D) as mentioned in this paper improves the spatial stream's motion representations by tuning it to mimic the temporal stream, effectively combining both models into a single stream that achieves performance on par with the two-stream approach.
Abstract: State-of-the-art methods for action recognition commonly use two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both streams are 3D Convolutional Neural Networks, which use spatiotemporal filters. These filters can respond to motion, and therefore should allow the network to learn motion representations, removing the need for optical flow. However, we still see significant benefits in performance by feeding optical flow into the temporal stream, indicating that the spatial stream is "missing" some of the signal that the temporal stream captures. In this work, we first investigate whether motion representations are indeed missing in the spatial stream, and show that there is significant room for improvement. Second, we demonstrate that these motion representations can be improved using distillation, that is, by tuning the spatial stream to mimic the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with the two-stream approach, with no need to compute optical flow during inference.
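
The distillation idea can be pictured with a short, hypothetical training-loss sketch; spatial_net, temporal_net, and the returned feature tensors are placeholders rather than the paper's actual interfaces.

import torch
import torch.nn.functional as F

def d3d_style_loss(spatial_net, temporal_net, rgb_clip, flow_clip, labels, lam=1.0):
    logits, spatial_feat = spatial_net(rgb_clip)            # trainable student (RGB stream)
    with torch.no_grad():
        _, temporal_feat = temporal_net(flow_clip)          # frozen teacher (flow stream)
    cls_loss = F.cross_entropy(logits, labels)              # standard action classification
    distill_loss = F.mse_loss(spatial_feat, temporal_feat)  # tune the spatial stream to mimic the temporal one
    return cls_loss + lam * distill_loss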

134 citations


Posted Content•
TL;DR: A framework for research and evaluation in Embodied AI is described, based on a canonical task, Rearrangement, which can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings.
Abstract: We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specified by object poses, by images, by a description in language, or by letting the agent experience the environment in the goal state. We characterize rearrangement scenarios along different axes and describe metrics for benchmarking rearrangement performance. To facilitate research and exploration, we present experimental testbeds of rearrangement scenarios in four different simulation environments. We anticipate that other datasets will be released and new simulation platforms will be built to support training of rearrangement agents and their deployment on physical systems.

111 citations


Proceedings Article•DOI•
Kaiyu Yang1, Klint Qinami1, Li Fei-Fei2, Jia Deng1, Olga Russakovsky1 •
27 Jan 2020
TL;DR: In this article, the authors examine three key factors within the person subtree of ImageNet that may lead to problematic behavior in downstream computer vision technology: the stagnant concept vocabulary of WordNet, the attempt at exhaustive illustration of all categories with images, and the inequality of representation in the images within concepts.
Abstract: Computer vision technology is being used by many but remains representative of only a few. People have reported misbehavior of computer vision models, including offensive prediction results and lower performance for underrepresented groups. Current computer vision models are typically developed using datasets consisting of manually annotated images or videos; the data and label distributions in these datasets are critical to the models' behavior. In this paper, we examine ImageNet, a large-scale ontology of images that has spurred the development of many modern computer vision methods. We consider three key factors within the person subtree of ImageNet that may lead to problematic behavior in downstream computer vision technology: (1) the stagnant concept vocabulary of WordNet, (2) the attempt at exhaustive illustration of all categories with images, and (3) the inequality of representation in the images within concepts. We seek to illuminate the root causes of these concerns and take the first steps to mitigate them constructively.

108 citations


Journal Article•DOI•
11 Feb 2020-BJUI
TL;DR: The recall of a deep learning method for automatically identifying kidney stone composition from digital photographs is assessed on a set of stones of varied compositions.
Abstract: Objectives To assess the recall of a deep learning (DL) method to automatically detect kidney stones composition from digital photographs of stones. Materials and methods A total of 63 human kidney stones of varied compositions were obtained from a stone laboratory including calcium oxalate monohydrate (COM), uric acid (UA), magnesium ammonium phosphate hexahydrate (MAPH/struvite), calcium hydrogen phosphate dihydrate (CHPD/brushite), and cystine stones. At least two images of the stones, both surface and inner core, were captured on a digital camera for all stones. A deep convolutional neural network (CNN), ResNet-101 (ResNet, Microsoft), was applied as a multi-class classification model, to each image. This model was assessed using leave-one-out cross-validation with the primary outcome being network prediction recall. Results The composition prediction recall for each composition was as follows: UA 94% (n = 17), COM 90% (n = 21), MAPH/struvite 86% (n = 7), cystine 75% (n = 4), CHPD/brushite 71% (n = 14). The overall weighted recall of the CNNs composition analysis was 85% for the entire cohort. Specificity and precision for each stone type were as follows: UA (97.83%, 94.12%), COM (97.62%, 95%), struvite (91.84%, 71.43%), cystine (98.31%, 75%), and brushite (96.43%, 75%). Conclusion Deep CNNs can be used to identify kidney stone composition from digital photographs with good recall. Future work is needed to see if DL can be used for detecting stone composition during digital endoscopy. This technology may enable integrated endoscopic and laser systems that automatically provide laser settings based on stone composition recognition with the goal to improve surgical efficiency.
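
For reference, the per-class recall, precision, and specificity figures quoted above can be computed from leave-one-out predictions as in the following sketch; the function and example inputs are illustrative, not code from the study.

import numpy as np

def per_class_metrics(y_true, y_pred, classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        tn = np.sum((y_pred != c) & (y_true != c))
        stats[c] = {
            'recall':      tp / (tp + fn) if tp + fn else float('nan'),
            'precision':   tp / (tp + fp) if tp + fp else float('nan'),
            'specificity': tn / (tn + fp) if tn + fp else float('nan'),
        }
    return stats

# e.g. per_class_metrics(['UA', 'COM', 'UA'], ['UA', 'UA', 'COM'], ['UA', 'COM'])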

77 citations


Posted Content•
Zachary Teed1, Jia Deng1•
TL;DR: Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow that extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes, achieves state-of-the-art performance.
Abstract: We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at this https URL.

70 citations


Proceedings Article•DOI•
14 Jun 2020
TL;DR: This work evaluates various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks, preparing a suite of synthetic data that enables an endless supply of annotated images as well as full control over dataset difficulty.
Abstract: Recent advances have spurred incredible progress in self-supervised pretraining for vision. We investigate what factors may play a role in the utility of these pretraining methods for practitioners. To do this, we evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. We prepare a suite of synthetic data that enables an endless supply of annotated images as well as full control over dataset difficulty. Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows as well as how the utility changes as a function of the downstream task and the properties of the training data. We also find that linear evaluation does not correlate with finetuning performance. Code and data is available at github.com/princeton-vl/selfstudy.
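
The distinction between linear evaluation and finetuning mentioned above can be summarized with a minimal sketch, assuming a generic backbone module that outputs feature vectors; the names and dimensions are placeholders, not the paper's setup.

import torch.nn as nn

def linear_eval_model(backbone, feat_dim, num_classes):
    # backbone: any nn.Module mapping an input batch to (B, feat_dim) features.
    for p in backbone.parameters():
        p.requires_grad = False                       # freeze the pretrained features
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))

def finetune_model(backbone, feat_dim, num_classes):
    # identical head, but every parameter (backbone included) stays trainable
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))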

64 citations


Posted Content•
TL;DR: In this paper, the authors investigate the utility of self-supervised pretraining for vision and find that the utility changes as a function of the downstream task and the properties of the training data.
Abstract: Recent advances have spurred incredible progress in self-supervised pretraining for vision. We investigate what factors may play a role in the utility of these pretraining methods for practitioners. To do this, we evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. We prepare a suite of synthetic data that enables an endless supply of annotated images as well as full control over dataset difficulty. Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows as well as how the utility changes as a function of the downstream task and the properties of the training data. We also find that linear evaluation does not correlate with finetuning performance. Code and data is available at this https URL.

51 citations


Proceedings Article•
Zachary Teed1, Jia Deng1•
30 Apr 2020
TL;DR: DeepV2D as discussed by the authors combines the representation ability of neural networks with the geometric principles governing image formation in an end-to-end architecture for predicting depth from video.
Abstract: We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth.
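
A minimal sketch of the alternation described above, with motion_module and depth_module as hypothetical callables rather than DeepV2D's actual components.

def deepv2d_style_inference(frames, motion_module, depth_module, init_depth, num_iters=5):
    depth, poses = init_depth, None
    for _ in range(num_iters):
        poses = motion_module(frames, depth)   # motion estimation given the current depth
        depth = depth_module(frames, poses)    # depth estimation given the current motion
    return depth, poses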

48 citations


Posted Content•
Zachary Teed1, Jia Deng1•
TL;DR: RAFT-3D is introduced, a new deep architecture for scene flow based on the RAFT model developed for optical flow; it iteratively updates a dense field of pixelwise SE3 motion instead of 2D motion, using rigid-motion embeddings that represent a soft grouping of pixels into rigid objects.
Abstract: We address the problem of scene flow: given a pair of stereo or RGB-D video frames, estimate pixelwise 3D motion. We introduce RAFT-3D, a new deep architecture for scene flow. RAFT-3D is based on the RAFT model developed for optical flow but iteratively updates a dense field of pixelwise SE3 motion instead of 2D motion. A key innovation of RAFT-3D is rigid-motion embeddings, which represent a soft grouping of pixels into rigid objects. Integral to rigid-motion embeddings is Dense-SE3, a differentiable layer that enforces geometric consistency of the embeddings. Experiments show that RAFT-3D achieves state-of-the-art performance. On FlyingThings3D, under the two-view evaluation, we improved the best published accuracy (d < 0.05) from 34.3% to 83.7%. On KITTI, we achieve an error of 5.77, outperforming the best published method (6.31), despite using no object instance supervision. Code is available at this https URL.

Proceedings Article•DOI•
14 Jun 2020
TL;DR: This work presents Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images, and expects OASIS to be a useful resource for 3D vision research.
Abstract: Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: https://pvl.cs.princeton.edu/OASIS.

Posted Content•
Jonathan C. Stroud1, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid •
TL;DR: This work proposes a data collection process and uses it to collect 70M video clips, and trains a model to pair each video with its associated text, which leads to improvements over from-scratch training on all benchmarks and outperforms many methods for self-supervised and webly-supervised video representation learning.
Abstract: Videos found on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use such text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We fine-tune the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pretraining video representations. Specifically, it leads to improvements over from-scratch training on all benchmarks, outperforms many methods for self-supervised and webly-supervised video representation learning, and achieves an improvement of 2.2% accuracy on HMDB-51.
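
One common way to train such a video-text pairing model is a symmetric contrastive objective; the sketch below is a generic illustration under that assumption, not necessarily the exact loss used in the paper.

import torch
import torch.nn.functional as F

def video_text_pairing_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D); row i of each comes from the same clip.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                      # similarity of all video/text pairs
    targets = torch.arange(v.size(0), device=v.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))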

Proceedings Article•
01 May 2020
TL;DR: The challenging but realistic aspects of LifeQA are analyzed, and several state-of-the-art video question answering models are applied to provide benchmarks for future research.
Abstract: We introduce LifeQA, a benchmark dataset for video question answering that focuses on day-to-day real-life situations. Current video question answering datasets consist of movies and TV shows. However, it is well-known that these visual domains are not representative of our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA, and we apply several state-of-the-art video question answering models to provide benchmarks for future research. The full dataset is publicly available at https://lit.eecs.umich.edu/lifeqa/.

Posted Content•
Kaiyu Yang1, Jia Deng1•
TL;DR: This paper proposes a novel transition system called attach-juxtapose, which represents a partial sentence using a single tree and in which each action adds exactly one token into the partial tree, and develops a strongly incremental parser based on it.
Abstract: Parsing sentences into syntax trees can benefit downstream applications in NLP. Transition-based parsers build trees by executing actions in a state transition system. They are computationally efficient, and can leverage machine learning to predict actions based on partial trees. However, existing transition-based parsers are predominantly based on the shift-reduce transition system, which does not align with how humans are known to parse sentences. Psycholinguistic research suggests that human parsing is strongly incremental: humans grow a single parse tree by adding exactly one token at each step. In this paper, we propose a novel transition system called attach-juxtapose. It is strongly incremental; it represents a partial sentence using a single tree; each action adds exactly one token into the partial tree. Based on our transition system, we develop a strongly incremental parser. At each step, it encodes the partial tree using a graph neural network and predicts an action. We evaluate our parser on Penn Treebank (PTB) and Chinese Treebank (CTB). On PTB, it outperforms existing parsers trained with only constituency trees; and it performs on par with state-of-the-art parsers that use dependency trees as additional training data. On CTB, our parser establishes a new state of the art. Code is available at this https URL.
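
A toy sketch of the two action types, attach and juxtapose, each adding exactly one token to a single partial tree; the Node class and the function signatures are simplified stand-ins for the paper's formal definitions.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def attach(target, token, parent_label=None):
    # Attach the new token (optionally wrapped in a new nonterminal) under `target`,
    # a node on the rightmost chain of the partial tree.
    leaf = Node(token)
    target.children.append(Node(parent_label, [leaf]) if parent_label else leaf)

def juxtapose(parent, target, token, new_label, parent_label=None):
    # Replace `target` (a child of `parent`) with a new node whose children are
    # `target` and the subtree built from the new token.
    leaf = Node(token)
    right = Node(parent_label, [leaf]) if parent_label else leaf
    merged = Node(new_label, [target, right])
    parent.children[parent.children.index(target)] = merged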

Proceedings Article•
Mingzhe Wang1, Jia Deng•
17 Feb 2020
TL;DR: In this article, a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover is proposed, and experiments on real-world tasks demonstrate that synthetic data from their approach improves the theorem prover and advances the state of the art of automated theorem proving.
Abstract: We consider the task of automated theorem proving, a key AI task. Deep learning has shown promise for training theorem provers, but there are limited human-written theorems and proofs available for supervised learning. To address this limitation, we propose to learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover. Experiments on real-world tasks demonstrate that synthetic data from our approach improves the theorem prover and advances the state of the art of automated theorem proving in Metamath. Code is available at this https URL.
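
The overall generate-then-train loop can be sketched schematically as below; generator and prover are placeholder objects with assumed methods, not the released code.

def train_with_synthetic_data(generator, prover, human_data, num_rounds=3):
    data = list(human_data)                  # start from human-written theorems and proofs
    for _ in range(num_rounds):
        data += generator.synthesize()       # add synthetic theorem/proof pairs
        prover.train(data)                   # retrain or finetune the prover on the union
    return prover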

Posted Content•
TL;DR: Rel3D is constructed: the first large-scale, human-annotated dataset for grounding spatial relations in 3D, and it is empirically validated that minimally contrastive examples can diagnose issues with current relation detection models as well as lead to sample-efficient training.
Abstract: Understanding spatial relations (e.g., "laptop on table") in visual input is important for both humans and robots. Existing datasets are insufficient as they lack large-scale, high-quality 3D ground truth information, which is critical for learning spatial relations. In this paper, we fill this gap by constructing Rel3D: the first large-scale, human-annotated dataset for grounding spatial relations in 3D. Rel3D enables quantifying the effectiveness of 3D information in predicting spatial relations on large-scale human data. Moreover, we propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias. The 3D scenes in our dataset come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other. We empirically validate that minimally contrastive examples can diagnose issues with current relation detection models as well as lead to sample-efficient training. Code and data are available at this https URL.

Posted Content•
Mingzhe Wang1, Jia Deng1•
TL;DR: This work proposes to learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover, and demonstrates that synthetic data from this approach improves the theorem prover and advances the state of the art of automated theorem proving in Metamath.
Abstract: We consider the task of automated theorem proving, a key AI task. Deep learning has shown promise for training theorem provers, but there are limited human-written theorems and proofs available for supervised learning. To address this limitation, we propose to learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover. Experiments on real-world tasks demonstrate that synthetic data from our approach improves the theorem prover and advances the state of the art of automated theorem proving in Metamath. Code is available at this https URL.

Proceedings Article•
Kaiyu Yang1, Jia Deng•
01 Jan 2020
TL;DR: This article proposes a novel transition system called attach-juxtapose, which represents a partial sentence using a single tree; each action adds exactly one token into the partial tree, and a strongly incremental parser is built on top of this system.
Abstract: Parsing sentences into syntax trees can benefit downstream applications in NLP. Transition-based parsers build trees by executing actions in a state transition system. They are computationally efficient, and can leverage machine learning to predict actions based on partial trees. However, existing transition-based parsers are predominantly based on the shift-reduce transition system, which does not align with how humans are known to parse sentences. Psycholinguistic research suggests that human parsing is strongly incremental: humans grow a single parse tree by adding exactly one token at each step. In this paper, we propose a novel transition system called attach-juxtapose. It is strongly incremental; it represents a partial sentence using a single tree; each action adds exactly one token into the partial tree. Based on our transition system, we develop a strongly incremental parser. At each step, it encodes the partial tree using a graph neural network and predicts an action. We evaluate our parser on Penn Treebank (PTB) and Chinese Treebank (CTB). On PTB, it outperforms existing parsers trained with only constituency trees; and it performs on par with state-of-the-art parsers that use dependency trees as additional training data. On CTB, our parser establishes a new state of the art. Code is available at this https URL.

Proceedings Article•DOI•
14 Jun 2020
TL;DR: This work proposes a new method that optimizes the generation of 3D training data based on what it calls the "hybrid gradient": the design decisions are parametrized as a real vector, and the approximate gradient and the analytical gradient are combined to obtain the hybrid gradient of the network performance with respect to this vector.
Abstract: Synthetic images rendered by graphics engines are a promising source for training deep networks. However, it is challenging to ensure that they can help train a network to perform well on real images, because a graphics-based generation pipeline requires numerous design decisions such as the selection of 3D shapes and the placement of the camera. In this work, we propose a new method that optimizes the generation of 3D training data based on what we call "hybrid gradient". We parametrize the design decisions as a real vector, and combine the approximate gradient and the analytical gradient to obtain the hybrid gradient of the network performance with respect to this vector. We evaluate our approach on the task of estimating surface normal, depth or intrinsic decomposition from a single image. Experiments on standard benchmarks show that our approach can outperform the prior state of the art on optimizing the generation of 3D training data, particularly in terms of computational efficiency.
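
As a generic illustration of combining an approximate and an analytical gradient, the sketch below differentiates a pipeline L(g(theta)) whose first stage g is treated as a black box; this is one standard chain-rule construction, offered only as an illustration of the general idea and not necessarily the specific combination defined in the paper.

import numpy as np

def hybrid_grad(theta, g, dL_dx, eps=1e-2):
    # theta: float design vector; g: non-differentiable stage mapping theta -> data x;
    # dL_dx: analytical gradient of the downstream objective at x. Placeholder callables.
    x0 = g(theta)
    gx = dL_dx(x0)                                            # analytical part, shaped like x0
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        jac_col = (g(theta + e) - g(theta - e)) / (2 * eps)   # i-th column of dg/dtheta
        grad[i] = float(jac_col @ gx)                         # chain rule
    return grad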

Book Chapter•DOI•
23 Aug 2020
TL;DR: UniLoss as discussed by the authors is a unified framework to generate surrogate losses for training deep networks with gradient descent, reducing the amount of manual design of task-specific surrogate losses and allowing different tasks and metrics to be optimized within one framework.
Abstract: We introduce UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent, reducing the amount of manual design of task-specific surrogate losses. Our key observation is that in many cases, evaluating a model with a performance metric on a batch of examples can be refactored into four steps: from input to real-valued scores, from scores to comparisons of pairs of scores, from comparisons to binary variables, and from binary variables to the final performance metric. Using this refactoring we generate differentiable approximations for each non-differentiable step through interpolation. Using UniLoss, we can optimize for different tasks and metrics using one unified framework, achieving comparable performance compared with task-specific losses. We validate the effectiveness of UniLoss on three tasks and four datasets. Code is available at https://github.com/princeton-vl/uniloss.
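
A toy example of the relaxation idea: a hard comparison between scores is replaced by a sigmoid so that an accuracy-like metric becomes differentiable. This is a simplified illustration of the four-step refactoring, not the interpolation scheme from the UniLoss paper.

import torch

def soft_pairwise_accuracy(scores, labels, tau=0.1):
    # scores: (B, C) real-valued outputs; labels: (B,) int64 class indices.
    correct_score = scores.gather(1, labels[:, None])           # (B, 1)
    soft_wins = torch.sigmoid((correct_score - scores) / tau)   # relaxed pairwise comparisons
    mask = torch.ones_like(scores).scatter_(1, labels[:, None], 0.0)
    # Average the relaxed "true class beats class c" indicators over classes and batch.
    return (soft_wins * mask).sum(1).div(mask.sum(1)).mean()

# Maximizing this surrogate (or minimizing 1 minus it) stands in for optimizing accuracy.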

Posted Content•
Ankit Goyal1, Jia Deng1•
TL;DR: PackIt is presented, a virtual environment to evaluate and potentially learn the ability to do geometric planning, where an agent needs to take a sequence of actions to pack a set of objects into a box with limited space.
Abstract: The ability to jointly understand the geometry of objects and plan actions for manipulating them is crucial for intelligent agents. We refer to this ability as geometric planning. Recently, many interactive environments have been proposed to evaluate intelligent agents on various skills, however, none of them cater to the needs of geometric planning. We present PackIt, a virtual environment to evaluate and potentially learn the ability to do geometric planning, where an agent needs to take a sequence of actions to pack a set of objects into a box with limited space. We also construct a set of challenging packing tasks using an evolutionary algorithm. Further, we study various baselines for the task that include model-free learning-based and heuristic-based methods, as well as search-based optimization methods that assume access to the model of the environment. Code and data are available at this https URL.

Posted Content•
TL;DR: UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent, reduces the amount of manual design of task-specific surrogate losses while achieving performance comparable to task-specific losses.
Abstract: We introduce UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent, reducing the amount of manual design of task-specific surrogate losses. Our key observation is that in many cases, evaluating a model with a performance metric on a batch of examples can be refactored into four steps: from input to real-valued scores, from scores to comparisons of pairs of scores, from comparisons to binary variables, and from binary variables to the final performance metric. Using this refactoring we generate differentiable approximations for each non-differentiable step through interpolation. Using UniLoss, we can optimize for different tasks and metrics using one unified framework, achieving comparable performance compared with task-specific losses. We validate the effectiveness of UniLoss on three tasks and four datasets. Code is available at this https URL.

Proceedings Article•
Ankit Goyal1, Jia Deng1•
12 Jul 2020
TL;DR: In this article, the authors present PackIt, a virtual environment to evaluate and potentially learn the ability to do geometric planning, where an agent needs to take a sequence of actions to pack a set of objects into a box with limited space.
Abstract: The ability to jointly understand the geometry of objects and plan actions for manipulating them is crucial for intelligent agents. We refer to this ability as geometric planning. Recently, many interactive environments have been proposed to evaluate intelligent agents on various skills, however, none of them cater to the needs of geometric planning. We present PackIt, a virtual environment to evaluate and potentially learn the ability to do geometric planning, where an agent needs to take a sequence of actions to pack a set of objects into a box with limited space. We also construct a set of challenging packing tasks using an evolutionary algorithm. Further, we study various baselines for the task that include model-free learning-based and heuristic-based methods, as well as search-based optimization methods that assume access to the model of the environment. Code and data are available at this https URL.

Posted Content•
TL;DR: Open Annotations of Single Image Surfaces (OASIS) as mentioned in this paper is a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images.
Abstract: Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: this https URL.

Proceedings Article•
01 Jan 2020
TL;DR: Rel3D as discussed by the authors is the first large-scale, human-annotated dataset for grounding spatial relations in 3D, which enables quantifying the effectiveness of 3D information in predicting spatial relations on large scale human data.
Abstract: Understanding spatial relations (e.g., "laptop on table") in visual input is important for both humans and robots. Existing datasets are insufficient as they lack large-scale, high-quality 3D ground truth information, which is critical for learning spatial relations. In this paper, we fill this gap by constructing Rel3D: the first large-scale, human-annotated dataset for grounding spatial relations in 3D. Rel3D enables quantifying the effectiveness of 3D information in predicting spatial relations on large-scale human data. Moreover, we propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias. The 3D scenes in our dataset come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other. We empirically validate that minimally contrastive examples can diagnose issues with current relation detection models as well as lead to sample-efficient training. Code and data are available at this https URL.