
Showing papers by "Jia Deng published in 2017"


Proceedings Article•
01 Jan 2017
TL;DR: In this article, associative embedding is used to supervise convolutional neural networks for the task of detection and grouping; the technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions.
Abstract: We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. A number of computer vision problems can be framed in this manner including multi-person pose estimation, instance segmentation, and multi-object tracking. Usually the grouping of detections is achieved with multi-stage pipelines, instead we propose an approach that teaches a network to simultaneously output detections and group assignments. This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions. We show how to apply this method to multi-person pose estimation and report state-of-the-art performance on the MPII and MS-COCO datasets.
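As a concrete illustration, the grouping supervision can be written in a few lines. Below is a minimal sketch of the pull/push tag loss commonly associated with associative embedding, assuming scalar tags and pre-grouped detection indices; the paper's exact weighting and push term may differ.

```python
import torch

def associative_embedding_loss(tags, groups, sigma=1.0):
    """tags: (N,) scalar embedding per detection.
    groups: list of LongTensors indexing detections of the same instance."""
    means = torch.stack([tags[idx].mean() for idx in groups])
    # Pull: each detection's tag should match its group's reference tag.
    pull = torch.stack([((tags[idx] - m) ** 2).mean()
                        for idx, m in zip(groups, means)]).mean()
    # Push: reference tags of different groups should stay far apart.
    diff = means[:, None] - means[None, :]
    push = torch.exp(-diff ** 2 / (2 * sigma ** 2))
    n = len(groups)
    push = (push.sum() - n) / max(n * (n - 1), 1)  # mean of off-diagonal terms
    return pull + push
```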

603 citations


Journal Article•DOI•
TL;DR: A method that estimates socioeconomic characteristics of regions spanning 200 US cities by using 50 million images of street scenes gathered with Google Street View cars, suggesting that automated systems for monitoring demographics may effectively complement labor-intensive approaches, with the potential to measure demographics with fine spatial resolution, in close to real time.
Abstract: The United States spends more than $250 million each year on the American Community Survey (ACS), a labor-intensive door-to-door study that measures statistics relating to race, gender, education, occupation, unemployment, and other demographic factors. Although a comprehensive source of data, the lag between demographic changes and their appearance in the ACS can exceed several years. As digital imagery becomes ubiquitous and machine vision techniques improve, automated data analysis may become an increasingly practical supplement to the ACS. Here, we present a method that estimates socioeconomic characteristics of regions spanning 200 US cities by using 50 million images of street scenes gathered with Google Street View cars. Using deep learning-based computer vision techniques, we determined the make, model, and year of all motor vehicles encountered in particular neighborhoods. Data from this census of motor vehicles, which enumerated 22 million automobiles in total (8% of all automobiles in the United States), were used to accurately estimate income, race, education, and voting patterns at the zip code and precinct level. (The average US precinct contains ∼1,000 people.) The resulting associations are surprisingly simple and powerful. For instance, if the number of sedans encountered during a drive through a city is higher than the number of pickup trucks, the city is likely to vote for a Democrat during the next presidential election (88% chance); otherwise, it is likely to vote Republican (82%). Our results suggest that automated systems for monitoring demographics may effectively complement labor-intensive approaches, with the potential to measure demographics with fine spatial resolution, in close to real time.
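The headline association reduces to a one-line decision rule. The sketch below simply restates the abstract's reported numbers; it is not the paper's full regression model.

```python
def predicted_vote(sedans_seen, pickups_seen):
    # More sedans than pickups on a drive through a city predicts a Democratic
    # vote (88% reported accuracy); otherwise Republican (82%).
    return "Democrat" if sedans_seen > pickups_seen else "Republican"
```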

323 citations


Proceedings Article•
22 Jun 2017
TL;DR: A method for training a convolutional neural network to take in an input image and produce a full graph definition, end-to-end in a single stage with the use of associative embeddings.
Abstract: Graphs are a useful abstraction of image content. Not only can graphs represent details about individual objects in a scene but they can capture the interactions between pairs of objects. We present a method for training a convolutional neural network such that it takes in an input image and produces a full graph definition. This is done end-to-end in a single stage with the use of associative embeddings. The network learns to simultaneously identify all of the elements that make up a graph and piece them together. We benchmark on the Visual Genome dataset, and demonstrate state-of-the-art performance on the challenging task of scene graph generation.
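One way to read the abstract's "piece them together" step: vertices and edge endpoints each emit embedding tags, and an edge attaches to the vertices whose tags lie nearest. The decoding rule below is a hedged sketch of that idea, not necessarily the paper's exact matching procedure.

```python
import torch

def connect_edges(vertex_tags, edge_src_tags, edge_dst_tags):
    """vertex_tags: (V, D); edge_src_tags, edge_dst_tags: (E, D).
    Attach each detected edge to the nearest-tag source and target vertex."""
    src = torch.cdist(edge_src_tags, vertex_tags).argmin(dim=1)  # (E,) vertex ids
    dst = torch.cdist(edge_dst_tags, vertex_tags).argmin(dim=1)  # (E,) vertex ids
    return src, dst
```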

215 citations


Proceedings Article•DOI•
06 Nov 2017
TL;DR: In this paper, the authors pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores, and use features from a deep Convolutional Neural Network (CNN) to directly optimize for a novel structured objective.
Abstract: We address the problem of temporal action localization in videos. We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores. Additionally, our model classifies the start, middle, and end of each action as separate components, allowing our system to explicitly model each action's temporal evolution and take advantage of informative temporal dependencies present in that structure. In this framework, we localize actions by searching for the structured maximal sum, a problem for which we develop a novel, provably-efficient algorithmic solution. The frame-wise classification scores are computed using features from a deep Convolutional Neural Network (CNN), which are trained end-to-end to directly optimize for a novel structured objective. We evaluate our system on the THUMOS 14 action detection benchmark and achieve competitive performance.
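For a single class and no start/middle/end decomposition, the maximal-sum window search reduces to the classic maximum-subarray scan over frame scores. The sketch below is that simplification, not the paper's full structured algorithm.

```python
def best_window(frame_scores):
    # Kadane-style O(T) scan: find the contiguous window whose frame-wise
    # classification scores sum highest. The paper's full objective adds
    # separate start/middle/end components on top of this idea.
    best, best_span = float("-inf"), (0, 0)
    cur, start = 0.0, 0
    for t, s in enumerate(frame_scores):
        if cur <= 0.0:
            cur, start = s, t       # restart the window at frame t
        else:
            cur += s                # extend the current window
        if cur > best:
            best, best_span = cur, (start, t)
    return best_span, best
```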

169 citations


Proceedings Article•DOI•
Yu-Wei Chao, Jimei Yang, Brian Price, Scott Cohen, Jia Deng
21 Jul 2017
TL;DR: The 3D Pose Forecasting Network (3D-PFNet) is proposed, which integrates recent advances in single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space.
Abstract: This paper presents the first study on forecasting human dynamics from static images. The problem is to input a single RGB image and generate a sequence of upcoming human body poses in 3D. To address the problem, we propose the 3D Pose Forecasting Network (3D-PFNet). Our 3D-PFNet integrates recent advances in single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space. We train our 3D-PFNet using a three-step training strategy to leverage diverse sources of training data, including image- and video-based human pose datasets and 3D motion capture (MoCap) data. We demonstrate competitive performance of our 3D-PFNet on 2D pose forecasting and 3D structure recovery through quantitative and qualitative results.
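A rough skeleton of the pipeline the abstract describes: encode the image, unroll a recurrent decoder to forecast future 2D poses, and lift each to 3D. Every concrete choice below (layer sizes, the GRU decoder, the linear lifting head) is a placeholder assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PoseForecastSkeleton(nn.Module):
    def __init__(self, feat=256, joints=17, steps=8):
        super().__init__()
        self.steps = steps
        self.encoder = nn.Sequential(                      # stand-in image encoder
            nn.Conv2d(3, 16, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat))
        self.decoder = nn.GRUCell(feat, feat)              # sequence prediction
        self.pose2d = nn.Linear(feat, joints * 2)          # 2D pose per step
        self.lift3d = nn.Linear(joints * 2, joints * 3)    # 2D -> 3D lifting

    def forward(self, img):
        h = self.encoder(img)                              # image feature
        x, poses3d = h, []
        for _ in range(self.steps):
            h = self.decoder(x, h)
            poses3d.append(self.lift3d(self.pose2d(h)))
        return torch.stack(poses3d, dim=1)                 # (B, steps, joints*3)
```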

125 citations


Proceedings Article•
01 Sep 2017
TL;DR: A deep learning-based approach to premise selection (selecting mathematical statements relevant for proving a given conjecture) that represents a higher-order logic formula as a graph invariant to variable renaming while still fully preserving syntactic and semantic information.
Abstract: We propose a deep learning-based approach to the problem of premise selection: selecting mathematical statements relevant for proving a given conjecture. We represent a higher-order logic formula as a graph that is invariant to variable renaming but still fully preserves syntactic and semantic information. We then embed the graph into a vector via a novel embedding method that preserves the information of edge ordering. Our approach achieves state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.
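The renaming-invariant graph can be illustrated with a toy converter: every variable node carries the generic label "VAR", bound occurrences point back at their binder's variable node, and each edge records its argument position so edge ordering is preserved. This is a hedged toy version, not the paper's exact construction.

```python
def formula_to_graph(term, graph=None, scope=None):
    """term: nested tuples, e.g. ("ALL", "x", ("APP", ("CONST", "P"), ("VAR", "x"))).
    Returns (graph, root). Variable nodes share the label "VAR", so the graph
    is invariant to renaming; edges keep an argument position for ordering."""
    if graph is None:
        graph = {"labels": [], "edges": []}   # edges: (parent, child, position)
    if scope is None:
        scope = {}
    head = term[0]
    if head == "VAR":
        if term[1] in scope:                  # bound occurrence: reuse binder's node
            return graph, scope[term[1]]
        node = len(graph["labels"])           # free variable: fresh anonymous node
        graph["labels"].append("VAR")
        return graph, node
    node = len(graph["labels"])
    graph["labels"].append(term[1] if head == "CONST" else head)
    if head == "CONST":
        return graph, node
    if head in ("LAMBDA", "ALL", "EXISTS"):   # binder: one shared VAR node
        var_node = len(graph["labels"])
        graph["labels"].append("VAR")
        graph["edges"].append((node, var_node, 0))
        _, body = formula_to_graph(term[2], graph, dict(scope, **{term[1]: var_node}))
        graph["edges"].append((node, body, 1))
        return graph, node
    for pos, child in enumerate(term[1:]):    # application and other operators
        _, c = formula_to_graph(child, graph, scope)
        graph["edges"].append((node, c, pos))
    return graph, node
```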

85 citations


Proceedings Article•DOI•
22 Dec 2017
TL;DR: This paper collected human-annotated surface normals and used them to help train a neural network that directly predicts pixel-wise depth, and proposed two novel loss functions for training with surface normal annotations.
Abstract: We study the problem of single-image depth estimation for images in the wild. We collect human annotated surface normals and use them to help train a neural network that directly predicts pixel-wise depth. We propose two novel loss functions for training with surface normal annotations. Experiments on NYU Depth, KITTI, and our own dataset demonstrate that our approach can significantly improve the quality of depth estimation in the wild.
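One plausible form of a loss that supervises pixel-wise depth with annotated normals (the paper's two losses may be formulated differently): estimate a normal from finite-difference depth gradients and penalize its angular deviation from the annotation.

```python
import torch
import torch.nn.functional as F

def normal_consistency_loss(depth, normals):
    """depth: (B,1,H,W) predicted depth; normals: (B,3,H,W) annotated unit normals.
    Under orthographic projection, (-dz/dx, -dz/dy, 1) is normal to the surface."""
    dzdx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1))        # x gradient
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))  # y gradient
    est = F.normalize(torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1), dim=1)
    cos = (est * F.normalize(normals, dim=1)).sum(dim=1)
    return (1.0 - cos).mean()   # zero when estimated and annotated normals align
```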

42 citations


Proceedings Article•
01 Jan 2017
TL;DR: In this article, the authors leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data.
Abstract: Targeted socio-economic policies require an accurate understanding of a country’s demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years which is costly and labor intensive, data-driven, machine learning-driven approaches are cheaper and faster—with the potential ability to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date consisting of over 2600 classes of cars comprised of images from Google Street View and other web sources, classified by car experts to account for even the most subtle of visual differences. We use this data to construct the largest scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r=0.82), Massachusetts department of vehicle registration, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighborhoods allowing us to perform the first large scale sociological analysis of cities using computer vision techniques.

39 citations


Posted Content•
TL;DR: In this article, the authors leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data.
Abstract: Targeted socioeconomic policies require an accurate understanding of a country's demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years, which is costly and labor intensive, data-driven, machine learning-driven approaches are cheaper and faster, with the potential ability to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date consisting of over 2600 classes of cars comprised of images from Google Street View and other web sources, classified by car experts to account for even the most subtle of visual differences. We use this data to construct the largest scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r=0.82), Massachusetts department of vehicle registration, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighborhoods allowing us to perform the first large scale sociological analysis of cities using computer vision techniques.

39 citations


Posted Content•
TL;DR: A method that determines socioeconomic trends from 50 million images of street scenes, gathered in 200 American cities by Google Street View cars, suggests that automated systems for monitoring demographic trends may effectively complement labor-intensive approaches, with the potential to detect trends with fine spatial resolution, in close to real time.
Abstract: The United States spends more than $1B each year on initiatives such as the American Community Survey (ACS), a labor-intensive door-to-door study that measures statistics relating to race, gender, education, occupation, unemployment, and other demographic factors. Although a comprehensive source of data, the lag between demographic changes and their appearance in the ACS can exceed half a decade. As digital imagery becomes ubiquitous and machine vision techniques improve, automated data analysis may provide a cheaper and faster alternative. Here, we present a method that determines socioeconomic trends from 50 million images of street scenes, gathered in 200 American cities by Google Street View cars. Using deep learning-based computer vision techniques, we determined the make, model, and year of all motor vehicles encountered in particular neighborhoods. Data from this census of motor vehicles, which enumerated 22M automobiles in total (8% of all automobiles in the US), were used to accurately estimate income, race, education, and voting patterns, with single-precinct resolution. (The average US precinct contains approximately 1000 people.) The resulting associations are surprisingly simple and powerful. For instance, if the number of sedans encountered during a 15-minute drive through a city is higher than the number of pickup trucks, the city is likely to vote for a Democrat during the next Presidential election (88% chance); otherwise, it is likely to vote Republican (82%). Our results suggest that automated systems for monitoring demographic trends may effectively complement labor-intensive approaches, with the potential to detect trends with fine spatial resolution, in close to real time.

33 citations


Posted Content•
TL;DR: Experiments demonstrate that the proposed Human-Object Region-based Convolutional Neural Networks (HO-RCNN), by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.
Abstract: We study the problem of detecting human-object interactions (HOI) in static images, defined as predicting a human and an object bounding box with an interaction class label that connects them. HOI detection is a fundamental problem in computer vision as it provides semantic information about the interactions among the detected objects. We introduce HICO-DET, a new large benchmark for HOI detection, by augmenting the current HICO classification benchmark with instance annotations. To solve the task, we propose Human-Object Region-based Convolutional Neural Networks (HO-RCNN). At the core of our HO-RCNN is the Interaction Pattern, a novel DNN input that characterizes the spatial relations between two bounding boxes. Experiments on HICO-DET demonstrate that our HO-RCNN, by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.
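The Interaction Pattern described in the abstract can be sketched as a two-channel binary map over the union of the two boxes, one channel per box. Details such as aspect-ratio-preserving padding are omitted, so treat this as an approximation of the paper's input.

```python
import numpy as np

def interaction_pattern(human_box, object_box, size=64):
    """Boxes are (x1, y1, x2, y2). Channel 0 marks the human box, channel 1
    the object box, both rasterized inside their union window."""
    x1 = min(human_box[0], object_box[0]); y1 = min(human_box[1], object_box[1])
    x2 = max(human_box[2], object_box[2]); y2 = max(human_box[3], object_box[3])
    w, h = float(x2 - x1), float(y2 - y1)
    pattern = np.zeros((2, size, size), dtype=np.float32)
    for c, (bx1, by1, bx2, by2) in enumerate((human_box, object_box)):
        u1 = int(round((bx1 - x1) / w * size)); u2 = int(round((bx2 - x1) / w * size))
        v1 = int(round((by1 - y1) / h * size)); v2 = int(round((by2 - y1) / h * size))
        pattern[c, v1:v2, u1:u2] = 1.0      # fill the box's footprint
    return pattern
```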

Posted Content•
Yu-Wei Chao, Jimei Yang, Brian Price, Scott Cohen, Jia Deng
TL;DR: Chao et al. propose the 3D Pose Forecasting Network (3D-PFNet), which combines single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space.
Abstract: This paper presents the first study on forecasting human dynamics from static images. The problem is to input a single RGB image and generate a sequence of upcoming human body poses in 3D. To address the problem, we propose the 3D Pose Forecasting Network (3D-PFNet). Our 3D-PFNet integrates recent advances in single-image human pose estimation and sequence prediction, and converts the 2D predictions into 3D space. We train our 3D-PFNet using a three-step training strategy to leverage diverse sources of training data, including image- and video-based human pose datasets and 3D motion capture (MoCap) data. We demonstrate competitive performance of our 3D-PFNet on 2D pose forecasting and 3D pose recovery through quantitative and qualitative results.

06 Nov 2017
TL;DR: In this article, a computer vision-based method was proposed to assess the technical skill level of surgeons by analyzing the movement of robotic instruments in robotic surgical videos, using peer evaluations of skill as the reference standard.
Abstract: In this paper, we propose a computer vision-based method to assess the technical skill level of surgeons by analyzing the movement of robotic instruments in robotic surgical videos. First, our method leverages the power of crowd workers on the internet to obtain high-quality data in a scalable and cost-efficient way. Second, we utilize the high-quality data to train an accurate and efficient robotic instrument tracker based on the state-of-the-art Hourglass Networks. Third, we assess the movement of the robotic instruments and automatically classify the technical level of a surgeon with a linear classifier, using peer evaluations of skill as the reference standard. Since the proposed method relies only on video data, it has the potential to be transferred to other minimally invasive surgical procedures.
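The third step, classifying skill from instrument movement with a linear classifier, might look like the sketch below. The specific features (path length, speed statistics, a smoothness proxy) are illustrative assumptions, not necessarily the ones used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def motion_features(track):
    """track: (T, 2) per-frame instrument tip positions from the tracker."""
    step = np.diff(track, axis=0)                  # per-frame displacement
    speed = np.linalg.norm(step, axis=1)
    return np.array([speed.sum(),                  # total path length
                     speed.mean(), speed.std(),    # speed statistics
                     np.abs(np.diff(speed)).mean()])  # crude smoothness proxy

# Hypothetical usage with a list of tracks and peer-assigned skill labels:
# X = np.stack([motion_features(t) for t in tracks])
# clf = LogisticRegression().fit(X, skill_labels)
```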

Posted Content•
Lanlan Liu, Jia Deng
TL;DR: This work introduces Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution, and demonstrates that D2NNs are general and flexible, and can effectively optimize accuracy-efficiency trade-offs.
Abstract: We introduce Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. Given an input, only a subset of D2NN neurons are executed, and the particular subset is determined by the D2NN itself. By pruning unnecessary computation depending on input, D2NNs provide a way to improve computational efficiency. To achieve dynamic selective execution, a D2NN augments a feed-forward deep neural network (directed acyclic graph of differentiable modules) with controller modules. Each controller module is a sub-network whose output is a decision that controls whether other modules can execute. A D2NN is trained end to end. Both regular and controller modules in a D2NN are learnable and are jointly trained to optimize both accuracy and efficiency. Such training is achieved by integrating backpropagation with reinforcement learning. With extensive experiments of various D2NN architectures on image classification tasks, we demonstrate that D2NNs are general and flexible, and can effectively optimize accuracy-efficiency trade-offs.
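A minimal sketch of the controller idea, assuming a single gate between a cheap and an expensive branch. For clarity both branches run here, whereas a real D2NN would skip the un-selected computation; training the hard decision needs REINFORCE or a similar estimator, per the abstract's backpropagation-plus-reinforcement-learning scheme.

```python
import torch
import torch.nn as nn

class TinyD2NN(nn.Module):
    def __init__(self, d=64, classes=10):
        super().__init__()
        self.controller = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
        self.cheap = nn.Linear(d, classes)                       # low-cost path
        self.expensive = nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                       nn.Linear(256, classes))  # high-cost path

    def forward(self, x):
        p = torch.sigmoid(self.controller(x))     # per-input execution probability
        gate = torch.bernoulli(p)                 # hard, sampled decision
        out = gate * self.expensive(x) + (1 - gate) * self.cheap(x)
        return out, p                             # p feeds the efficiency cost term
```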

Proceedings Article•DOI•
02 May 2017
TL;DR: This work introduces a graph-based crowdsourcing algorithm to automatically group visually indistinguishable objects together, and presents the largest fine-grained visual dataset reported to date, with 2,657 categories of cars annotated at 1/20th the cost of hiring experts.
Abstract: We present a crowdsourcing workflow to collect image annotations for visually similar synthetic categories without requiring experts. In animals, there is a direct link between taxonomy and visual similarity: e.g. a collie (a type of dog) looks more similar to other collies (e.g. a smooth collie) than to a greyhound (another type of dog). However, in synthetic categories such as cars, objects with similar taxonomy can have very different appearance: e.g. a 2011 Ford F-150 Supercrew-HD looks the same as a 2011 Ford F-150 Supercrew-LL but very different from a 2011 Ford F-150 Supercrew-SVT. We introduce a graph-based crowdsourcing algorithm to automatically group visually indistinguishable objects together. Using our workflow, we label 712,430 images with ~1,000 Amazon Mechanical Turk workers, resulting in the largest fine-grained visual dataset reported to date, with 2,657 categories of cars annotated at 1/20th the cost of hiring experts.
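The grouping step can be pictured as connected components over a graph whose edges link category pairs workers judged indistinguishable. Below is a hedged toy version of that idea using union-find; the paper's algorithm handles noisy votes more carefully.

```python
def group_indistinguishable(n, same_votes, threshold=0.5):
    """n: number of fine-grained categories. same_votes: {(i, j): fraction of
    workers who judged i and j visually indistinguishable}."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    for (i, j), frac in same_votes.items():
        if frac >= threshold:               # enough workers agree: merge groups
            parent[find(i)] = find(j)
    groups = {}
    for c in range(n):
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())
```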

Posted Content•
TL;DR: In this paper, a crowdsourcing workflow is presented to collect image annotations for visually similar synthetic categories without requiring experts, yielding the largest fine-grained visual dataset reported to date, with 2,657 categories of cars annotated at 1/20th the cost of hiring experts.
Abstract: We present a crowdsourcing workflow to collect image annotations for visually similar synthetic categories without requiring experts. In animals, there is a direct link between taxonomy and visual similarity: e.g. a collie (a type of dog) looks more similar to other collies (e.g. a smooth collie) than to a greyhound (another type of dog). However, in synthetic categories such as cars, objects with similar taxonomy can have very different appearance: e.g. a 2011 Ford F-150 Supercrew-HD looks the same as a 2011 Ford F-150 Supercrew-LL but very different from a 2011 Ford F-150 Supercrew-SVT. We introduce a graph-based crowdsourcing algorithm to automatically group visually indistinguishable objects together. Using our workflow, we label 712,430 images with ~1,000 Amazon Mechanical Turk workers, resulting in the largest fine-grained visual dataset reported to date, with 2,657 categories of cars annotated at 1/20th the cost of hiring experts.

Posted Content•
Dawei Yang, Jia Deng
TL;DR: In this article, the shape-from-shading problem is addressed by training deep networks with synthetic images, using an approach that needs no external shape dataset to render the synthetic images and achieves state-of-the-art performance.
Abstract: In this paper, we address the shape-from-shading problem by training deep networks with synthetic images. Unlike conventional approaches that combine deep learning and synthetic imagery, we propose an approach that does not need any external shape dataset to render synthetic images. Our approach consists of two synergistic processes: the evolution of complex shapes from simple primitives, and the training of a deep network for shape-from-shading. The evolution generates better shapes guided by the network training, while the training improves by using the evolved shapes. We show that our approach achieves state-of-the-art performance on a shape-from-shading benchmark.
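The two synergistic processes can be sketched as a loop. The `mutate`, `render`, and `network` interfaces below are hypothetical stand-ins, and the paper's actual selection criterion may differ from the hardest-example rule used here.

```python
import random

def coevolve(shapes, mutate, render, network, rounds=10):
    """Evolve training shapes against the current network: mutants whose
    renderings the network reconstructs worst become the next generation,
    then the network trains on their renderings."""
    for _ in range(rounds):
        pool = shapes + [mutate(random.choice(shapes)) for _ in shapes]
        pool.sort(key=lambda s: network.error(render(s)), reverse=True)
        shapes = pool[:len(shapes)]                   # hardest shapes survive
        network.train_on([render(s) for s in shapes])
    return shapes
```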

Posted Content•
TL;DR: This work collects human-annotated surface normals and uses them to help train a neural network that directly predicts pixel-wise depth, and proposes two novel loss functions for training with surface normal annotations.
Abstract: We study the problem of single-image depth estimation for images in the wild. We collect human-annotated surface normals and use them to train a neural network that directly predicts pixel-wise depth. We propose two novel loss functions for training with surface normal annotations. Experiments on NYU Depth and our own dataset demonstrate that our approach can significantly improve the quality of depth estimation in the wild.

Journal Article•DOI•
TL;DR: Computer video analysis can be used to predict skill in practicing robotic surgeons; in the future, methods that use deep learning to track instruments and calculate skill may have significant implications for credentialing and quality improvement.

Posted Content•
TL;DR: In this paper, a convolutional neural network learns to simultaneously identify all of the elements that make up a graph and piece them together, achieving state-of-the-art performance on the Visual Genome dataset.
Abstract: Graphs are a useful abstraction of image content. Not only can graphs represent details about individual objects in a scene but they can capture the interactions between pairs of objects. We present a method for training a convolutional neural network such that it takes in an input image and produces a full graph definition. This is done end-to-end in a single stage with the use of associative embeddings. The network learns to simultaneously identify all of the elements that make up a graph and piece them together. We benchmark on the Visual Genome dataset, and demonstrate state-of-the-art performance on the challenging task of scene graph generation.


Posted Content•
TL;DR: This work addresses the problem of temporal action localization in videos by posing action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores.
Abstract: We address the problem of temporal action localization in videos. We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores. Additionally, our model classifies the start, middle, and end of each action as separate components, allowing our system to explicitly model each action's temporal evolution and take advantage of informative temporal dependencies present in this structure. In this framework, we localize actions by searching for the structured maximal sum, a problem for which we develop a novel, provably-efficient algorithmic solution. The frame-wise classification scores are computed using features from a deep Convolutional Neural Network (CNN), which are trained end-to-end to directly optimize for a novel structured objective. We evaluate our system on the THUMOS 14 action detection benchmark and achieve competitive performance.



Posted Content•
TL;DR: In this article, a deep learning-based approach is proposed to select mathematical statements relevant for proving a given conjecture, which achieves state-of-the-art results on the HolStep dataset.
Abstract: We propose a deep learning-based approach to the problem of premise selection: selecting mathematical statements relevant for proving a given conjecture. We represent a higher-order logic formula as a graph that is invariant to variable renaming but still fully preserves syntactic and semantic information. We then embed the graph into a vector via a novel embedding method that preserves the information of edge ordering. Our approach achieves state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.