Author

Jia Deng

Bio: Jia Deng is an academic researcher from Princeton University. The author has contributed to research on topics including object detection and convolutional neural networks. The author has an h-index of 50 and has co-authored 148 publications receiving 73,461 citations. Previous affiliations of Jia Deng include the University of Michigan and Carnegie Mellon University.


Papers
Proceedings Article
01 Sep 2017
TL;DR: A deep learning-based approach to premise selection (selecting mathematical statements relevant for proving a given conjecture) that represents a higher-order logic formula as a graph invariant to variable renaming while still fully preserving syntactic and semantic information.
Abstract: We propose a deep learning-based approach to the problem of premise selection: selecting mathematical statements relevant for proving a given conjecture. We represent a higher-order logic formula as a graph that is invariant to variable renaming but still fully preserves syntactic and semantic information. We then embed the graph into a vector via a novel embedding method that preserves the information of edge ordering. Our approach achieves state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.
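
The key step is canonicalizing variables so that renaming-equivalent formulas map to the same graph while each variable's binding structure survives. A minimal Python sketch of that idea (the term encoding and function below are illustrative assumptions, not the paper's code):

```python
# Every variable keeps its own node, so binding structure is preserved,
# but all variable nodes share the anonymous label "VAR"; edges record
# their child position so argument order survives for the
# ordering-aware embedding.

def term_to_graph(term, graph=None, var_nodes=None):
    """term: a variable name starting with '?', or a tuple
    (function_symbol, arg_1, ..., arg_n)."""
    if graph is None:
        graph = {"labels": [], "edges": []}  # edge = (parent, child, position)
        var_nodes = {}
    if isinstance(term, str):                # a variable occurrence
        if term not in var_nodes:            # one shared node per variable...
            var_nodes[term] = len(graph["labels"])
            graph["labels"].append("VAR")    # ...with an anonymous label
        return var_nodes[term], graph
    node = len(graph["labels"])
    graph["labels"].append(term[0])          # function or constant symbol
    for pos, arg in enumerate(term[1:]):
        child, _ = term_to_graph(arg, graph, var_nodes)
        graph["edges"].append((node, child, pos))
    return node, graph

# f(x, g(x, y)) and f(a, g(a, b)) produce identical graphs:
_, g1 = term_to_graph(("f", "?x", ("g", "?x", "?y")))
_, g2 = term_to_graph(("f", "?a", ("g", "?a", "?b")))
assert g1 == g2
```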

85 citations

Journal ArticleDOI
11 Feb 2020-BJUI
TL;DR: A deep learning method for automatically identifying kidney stone composition from digital photographs is assessed, with prediction recall measured across stones of five different compositions.
Abstract: Objectives To assess the recall of a deep learning (DL) method to automatically detect kidney stone composition from digital photographs of stones. Materials and methods A total of 63 human kidney stones of varied compositions were obtained from a stone laboratory, including calcium oxalate monohydrate (COM), uric acid (UA), magnesium ammonium phosphate hexahydrate (MAPH/struvite), calcium hydrogen phosphate dihydrate (CHPD/brushite), and cystine stones. At least two images of each stone, both surface and inner core, were captured on a digital camera. A deep convolutional neural network (CNN), ResNet-101 (ResNet, Microsoft), was applied as a multi-class classification model to each image. This model was assessed using leave-one-out cross-validation, with the primary outcome being network prediction recall. Results The composition prediction recall for each composition was as follows: UA 94% (n = 17), COM 90% (n = 21), MAPH/struvite 86% (n = 7), cystine 75% (n = 4), CHPD/brushite 71% (n = 14). The overall weighted recall of the CNN's composition analysis was 85% for the entire cohort. Specificity and precision for each stone type were as follows: UA (97.83%, 94.12%), COM (97.62%, 95%), struvite (91.84%, 71.43%), cystine (98.31%, 75%), and brushite (96.43%, 75%). Conclusion Deep CNNs can be used to identify kidney stone composition from digital photographs with good recall. Future work is needed to see whether DL can be used to detect stone composition during digital endoscopy. This technology may enable integrated endoscopic and laser systems that automatically provide laser settings based on stone composition recognition, with the goal of improving surgical efficiency.
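
The evaluation protocol is straightforward to reproduce in outline. A hedged sketch using torchvision's ResNet-101 (assumes `dataset` is a list of (3xHxW image tensor, label index) pairs and torchvision >= 0.13; the authors' exact training setup is not given in the abstract):

```python
import torch
import torch.nn as nn
from torchvision import models

CLASSES = ["UA", "COM", "MAPH/struvite", "cystine", "CHPD/brushite"]

def make_model():
    # ImageNet-pretrained ResNet-101 with a fresh 5-way classification head
    model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, len(CLASSES))
    return model

def loocv_recall(dataset, epochs=5, lr=1e-4):
    hits = {c: 0 for c in CLASSES}
    totals = {c: 0 for c in CLASSES}
    for i in range(len(dataset)):                    # hold out one image at a time
        train = [dataset[j] for j in range(len(dataset)) if j != i]
        model, loss_fn = make_model(), nn.CrossEntropyLoss()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for x, y in train:
                opt.zero_grad()
                loss = loss_fn(model(x.unsqueeze(0)), torch.tensor([y]))
                loss.backward()
                opt.step()
        model.eval()
        x, y = dataset[i]
        with torch.no_grad():
            pred = model(x.unsqueeze(0)).argmax(dim=1).item()
        totals[CLASSES[y]] += 1
        hits[CLASSES[y]] += int(pred == y)
    # per-class recall: correct predictions / actual members of the class
    return {c: hits[c] / totals[c] for c in CLASSES if totals[c]}
```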

77 citations

Posted Content
TL;DR: Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow, extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes, achieving state-of-the-art performance.
Abstract: We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at this https URL.
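
The all-pairs correlation volume is the architecture's central data structure. A minimal PyTorch sketch of how such a volume and its pooled pyramid levels can be computed (an illustration of the idea; the released code is at the URL mentioned above):

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(fmap1, fmap2, num_levels=4):
    """fmap1, fmap2: (B, D, H, W) per-pixel features of frames 1 and 2.
    Level k of the result has shape (B, H, W, H/2^k, W/2^k)."""
    b, d, h, w = fmap1.shape
    f1 = fmap1.reshape(b, d, h * w)
    f2 = fmap2.reshape(b, d, h * w)
    corr = torch.einsum("bdm,bdn->bmn", f1, f2) / d**0.5  # all-pairs dot products
    pyramid = [corr.reshape(b, h, w, h, w)]               # the 4D volume
    for _ in range(num_levels - 1):
        # pool only over the frame-2 dimensions, keeping frame 1 at
        # full resolution, as the paper describes
        prev = pyramid[-1]
        c = prev.reshape(b * h * w, 1, *prev.shape[3:])
        c = F.avg_pool2d(c, kernel_size=2, stride=2)
        pyramid.append(c.reshape(b, h, w, *c.shape[2:]))
    return pyramid
```

The recurrent update operator then indexes into these pyramid levels around each pixel's current flow estimate, which is what makes the iterative refinement cheap despite the volume covering all pairs.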

70 citations

01 Jun 2012
TL;DR: A novel learning technique is proposed that scales logarithmically with the number of classes in both training and testing, improving both the accuracy and the efficiency of the previous state of the art while achieving a 31-fold reduction in training time on 10 thousand classes.
Abstract: Visual recognition remains one of the grand goals of artificial intelligence research. One major challenge is endowing machines with the human ability to recognize tens of thousands of categories. Moving beyond previous work that is mostly focused on hundreds of categories, we make progress toward human-scale visual recognition. Specifically, our contributions are as follows. First, we have constructed ImageNet, a large-scale image ontology. The Fall 2011 version consists of 22 thousand categories and 14 million images; it depicts each category by an average of 650 images collected from the Internet and verified by multiple humans. To the best of our knowledge this is currently the largest human-verified dataset in terms of both the number of categories and the number of images. Given the large amount of human effort required, the traditional approach to dataset collection, involving in-house annotation by a small number of human subjects, becomes infeasible. In this dissertation we describe how ImageNet has been built through quality-controlled, cost-effective, large-scale online crowdsourcing. Next, we use ImageNet to conduct the first benchmarking study of state-of-the-art recognition algorithms at human scale. By experimenting on 10 thousand categories, we discover that the previous state-of-the-art performance is still low (6.4%). We further observe that the confusion among categories is hierarchically structured at large scale, a key insight that leads to our subsequent contributions. Third, we study how to efficiently classify tens of thousands of categories by exploiting the structure of visual confusion among categories. We propose a novel learning technique that scales logarithmically with the number of classes in both training and testing, improving both the accuracy and the efficiency of the previous state of the art while achieving a 31-fold reduction in training time on 10 thousand classes.
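
The logarithmic scaling comes from routing each example down a tree over classes rather than evaluating every one-vs-rest classifier. A hedged sketch of the general label-tree idea (not the dissertation's exact algorithm; the median split below is a stand-in for the learned, confusion-based grouping, and X, y are assumed to be a (N, d) feature array and an (N,) integer label array with every class represented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class LabelTree:
    def __init__(self, classes):
        self.classes = list(classes)
        self.leaf = len(self.classes) == 1
        if not self.leaf:
            mid = len(self.classes) // 2
            self.groups = (set(self.classes[:mid]), set(self.classes[mid:]))
            self.router = LogisticRegression(max_iter=1000)
            self.children = [None, None]

    def fit(self, X, y):
        if self.leaf:
            return self
        side = np.array([0 if label in self.groups[0] else 1 for label in y])
        self.router.fit(X, side)          # learn to route toward the true group
        for s in (0, 1):
            mask = side == s
            self.children[s] = LabelTree(sorted(self.groups[s])).fit(X[mask], y[mask])
        return self

    def predict_one(self, x):
        node = self
        while not node.leaf:              # O(log K) router calls per example
            node = node.children[int(node.router.predict(x[None])[0])]
        return node.classes[0]
```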

70 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A new benchmark with crowdsourced ground-truth affordances on 20 PASCAL VOC object classes and 957 action classes is introduced, and a number of significant insights are presented that reveal the most effective ways of collecting knowledge of semantic affordances.
Abstract: Affordances are fundamental attributes of objects. Affordances reveal the functionalities of objects and the possible actions that can be performed on them. Understanding affordances is crucial for recognizing human activities in visual data and for robots to interact with the world. In this paper we introduce the new problem of mining the knowledge of semantic affordance: given an object, determining whether an action can be performed on it. This is equivalent to connecting verb nodes and noun nodes in WordNet, or filling an affordance matrix encoding the plausibility of each action-object pair. We introduce a new benchmark with crowdsourced ground truth affordances on 20 PASCAL VOC object classes and 957 action classes. We explore a number of approaches including text mining, visual mining, and collaborative filtering. Our analyses yield a number of significant insights that reveal the most effective ways of collecting knowledge of semantic affordances.
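
Of the approaches explored, collaborative filtering treats the affordance matrix like a recommender-system ratings matrix. A minimal sketch of that idea via low-rank matrix factorization (illustrative; the function and hyperparameters are assumptions, not the paper's method):

```python
import numpy as np

def complete_affordances(M, mask, rank=10, lr=0.05, epochs=500, seed=0):
    """M: (num_actions, num_objects) plausibility matrix; mask: 1 where a
    crowdsourced judgment exists, 0 where the entry is missing."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(M.shape[0], rank))  # action embeddings
    B = rng.normal(scale=0.1, size=(M.shape[1], rank))  # object embeddings
    for _ in range(epochs):
        err = mask * (A @ B.T - M)   # gradients flow only through observed cells
        grad_A, grad_B = err @ B, err.T @ A
        A -= lr * grad_A
        B -= lr * grad_B
    return A @ B.T                   # dense scores, including unobserved pairs
```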

68 citations


Cited by
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
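
A minimal residual block capturing the reformulation the abstract describes (a PyTorch sketch, not the reference implementation):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # F(x): the residual function
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                 # add back the identity shortcut
```

Because the layers learn a residual F(x) added to an identity shortcut, a block can default to the identity mapping, which is what keeps very deep stacks easy to optimize.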

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieves state-of-the-art classification performance on the ImageNet LSVRC benchmarks.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
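
A sketch of the described architecture in PyTorch (layer sizes follow the paper; an illustration, not the original two-GPU implementation):

```python
import torch.nn as nn

alexnet_like = nn.Sequential(                 # expects 227x227 RGB input
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                    # class logits for the softmax
)
```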

73,978 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
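
The design rule is easy to state in code: stacks of 3x3 convolutions between pooling steps, so depth grows while each layer's receptive field stays small (two stacked 3x3 layers cover a 5x5 region with fewer parameters than one 5x5 layer). A hedged PyTorch sketch of a VGG-16-style layout (illustrative, not the released models):

```python
import torch.nn as nn

def vgg_stack(in_ch, out_ch, num_convs):
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))  # halve resolution after each stack
    return nn.Sequential(*layers)

# 13 convolutional weight layers from five stacks; adding the three
# fully-connected layers gives the 16 weight layers of VGG-16.
features = nn.Sequential(
    vgg_stack(3, 64, 2), vgg_stack(64, 128, 2), vgg_stack(128, 256, 3),
    vgg_stack(256, 512, 3), vgg_stack(512, 512, 3),
)
```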

55,235 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
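
The semantic hierarchy comes directly from WordNet's hypernym structure. A small illustration using NLTK's WordNet interface (assumes the WordNet corpus has been downloaded, e.g. via nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

synset = wn.synset("kit_fox.n.01")        # one ImageNet category
path = synset.hypernym_paths()[0]         # root-to-leaf semantic path
print(" > ".join(s.name().split(".")[0] for s in path))
# entity > physical_entity > object > whole > living_thing > organism >
# animal > chordate > vertebrate > mammal > placental > carnivore >
# canine > fox > kit_fox
```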

49,639 citations