
Showing papers by "Takeo Kanade published in 2016"


Proceedings ArticleDOI
30 Jan 2016
TL;DR: Convolutional networks are incorporated into the pose machine framework to learn image features and image-dependent spatial models for pose estimation, implicitly modeling long-range dependencies between variables in structured prediction tasks such as articulated pose estimation.
Abstract: Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.
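
The staged refinement the abstract describes, convolutional stages operating on the belief maps of previous stages with a loss at every stage, can be sketched as follows. This is a minimal PyTorch illustration with assumed layer sizes and module names (Stage, PoseMachine, a toy MSE target), not the authors' released architecture.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One refinement stage: image features + previous beliefs -> new beliefs."""
    def __init__(self, feat_ch, n_parts):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + n_parts, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, n_parts, 1),          # per-part belief maps
        )
    def forward(self, feats, beliefs):
        return self.net(torch.cat([feats, beliefs], dim=1))

class PoseMachine(nn.Module):
    def __init__(self, n_parts=14, n_stages=3, feat_ch=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_ch, 9, padding=4),
                                      nn.ReLU())
        self.first = nn.Conv2d(feat_ch, n_parts, 1)    # initial beliefs
        self.stages = nn.ModuleList(Stage(feat_ch, n_parts)
                                    for _ in range(n_stages))
    def forward(self, img):
        feats = self.backbone(img)
        beliefs = [self.first(feats)]
        for stage in self.stages:       # each stage refines the previous maps
            beliefs.append(stage(feats, beliefs[-1]))
        return beliefs                  # one set of belief maps per stage

# Intermediate supervision: the loss is applied at every stage, which is
# what replenishes back-propagated gradients deep in the sequence.
model = PoseMachine()
img = torch.randn(2, 3, 64, 64)
target = torch.randn(2, 14, 64, 64)     # stand-in ground-truth belief maps
loss = sum(nn.functional.mse_loss(b, target) for b in model(img))
loss.backward()
```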

2,687 citations


Posted Content
TL;DR: This work designs a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference in structured prediction tasks such as articulated pose estimation.
Abstract: Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference. Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets.

317 citations


Posted Content
TL;DR: Data seems cheap to get, and in many ways it is, but the process of creating a high quality labeled dataset from a mass of data is time-consuming and expensive.
Abstract: Data seems cheap to get, and in many ways it is, but the process of creating a high quality labeled dataset from a mass of data is time-consuming and expensive. With the advent of rich 3D repositories, photo-realistic rendering systems offer the opportunity to provide nearly limitless data. Yet, their primary value for visual learning may be the quality of the data they can provide rather than the quantity. Rendering engines offer the promise of perfect labels in addition to the data: what the precise camera pose is; what the precise lighting location, temperature, and distribution is; what the geometry of the object is. In this work we focus on semi-automating dataset creation through use of synthetic data and apply this method to an important task -- object viewpoint estimation. Using state-of-the-art rendering software we generate a large labeled dataset of cars rendered densely in viewpoint space. We investigate the effect of rendering parameters on estimation performance and show realism is important. We show that generalizing from synthetic data is not harder than the domain adaptation required between two real-image datasets and that combining synthetic images with a small amount of real data improves estimation accuracy.
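
The final experiment, combining synthetic images with a small amount of real data, can be sketched as a weighted mixture of two datasets. A minimal PyTorch sketch with made-up tensors, 10-degree azimuth bins as viewpoint labels, and an assumed 10x upweighting of the scarce real images:

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Stand-in datasets: many rendered cars, few real ones, each labeled with
# a viewpoint bin (36 bins of 10 degrees; all values here are toys).
synthetic = TensorDataset(torch.randn(1000, 3, 32, 32),
                          torch.randint(0, 36, (1000,)))
real      = TensorDataset(torch.randn(50, 3, 32, 32),
                          torch.randint(0, 36, (50,)))

combined = ConcatDataset([synthetic, real])
# Upweight the scarce real images so every batch mixes both domains.
weights = torch.cat([torch.full((len(synthetic),), 1.0),
                     torch.full((len(real),), 10.0)])
sampler = WeightedRandomSampler(weights, num_samples=len(combined))
loader = DataLoader(combined, batch_size=32, sampler=sampler)

images, viewpoint_bins = next(iter(loader))   # mixed-domain training batch
```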

103 citations


Book ChapterDOI
08 Oct 2016
TL;DR: Data seems cheap to get, and in many ways it is, but the process of creating a high quality labeled dataset from a mass of data is time-consuming and expensive.
Abstract: Data seems cheap to get, and in many ways it is, but the process of creating a high quality labeled dataset from a mass of data is time-consuming and expensive.

62 citations


Journal ArticleDOI
TL;DR: Experiments on three types of cell populations validate that the interactive cell segmentation algorithm quickly reaches high-quality results with minimal human intervention and is significantly more efficient than alternative methods, since the most informative samples are selected for human annotation/verification early.
Abstract: Automatic cell segmentation can hardly be flawless due to the complexity of image data, particularly when time-lapse experiments last for a long time without biomarkers. To address this issue, we propose an interactive cell segmentation method that classifies feature-homogeneous superpixels into specific classes, guided by human interventions. Specifically, we actively select the most informative superpixels by minimizing the expected prediction error, which is upper bounded by the transductive Rademacher complexity, and then query for human annotations. After propagating the user-specified labels to the remaining unlabeled superpixels via an affinity graph, error-prone superpixels are selected automatically and human verification is requested on them; once erroneous segmentation is detected and corrected, the correction is propagated efficiently over a gradually augmented graph to unlabeled superpixels so that analogous errors are fixed at the same time. The correction propagation step is conducted efficiently by introducing a verification propagation matrix rather than rebuilding the affinity graph and re-performing the label propagation from the beginning. We repeat this procedure until most superpixels are classified into a specific category with high confidence. Experiments performed on three types of cell populations validate that our interactive cell segmentation algorithm quickly reaches high-quality results with minimal human intervention and is significantly more efficient than alternative methods, since the most informative samples are selected for human annotation/verification early.
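
The label propagation step, in which user-specified superpixel labels spread over an affinity graph to the unlabeled superpixels, has a classic closed form. The sketch below is generic Zhou-style propagation on a toy five-superpixel graph, not the paper's verification propagation matrix; the affinities and alpha are assumptions.

```python
import numpy as np

def propagate_labels(W, Y0, alpha=0.9):
    """W: (n, n) symmetric affinities; Y0: (n, k) one-hot seeds, zero rows if unlabeled."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                  # normalized affinity
    n = W.shape[0]
    F = np.linalg.solve(np.eye(n) - alpha * S, Y0)   # closed-form fixed point
    return F.argmax(axis=1)

# Toy graph: 5 superpixels, the user labels only superpixels 0 and 4.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], float)
Y0 = np.zeros((5, 2))
Y0[0, 0] = 1     # annotated as "cell"
Y0[4, 1] = 1     # annotated as "background"
print(propagate_labels(W, Y0))   # unlabeled superpixels inherit nearby labels
```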

51 citations


Patent
20 May 2016
TL;DR: An image processing system and/or method obtains source images in which a damaged vehicle is represented, and performs image processing techniques to determine, predict, estimate, and/or detect damage that has occurred at various locations on the vehicle.
Abstract: An image processing system and/or method obtains source images in which a damaged vehicle is represented, and performs image processing techniques to determine, predict, estimate, and/or detect damage that has occurred at various locations on the vehicle. The image processing techniques may include generating a composite image of the damaged vehicle, aligning and/or isolating the image, applying convolutional neural network techniques to the image to generate damage parameter values, where each value corresponds to damage in a particular location of the vehicle, and/or other techniques. Based on the damage values, the image processing system/method generates and displays a heat map for the vehicle, where each color and/or color gradation corresponds to respective damage at a respective location on the vehicle. The heat map may be manipulated by the user, and may include user controls for displaying additional information corresponding to the damage at a particular location on the vehicle.
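
The heat-map step, mapping per-location damage values to colors and gradations over the vehicle image, can be sketched as a colormap lookup plus blending. The coarse region grid, colormap choice, and blending weights below are illustrative assumptions, not the patented pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colormaps

# Suppose the network emits one damage value per cell of a coarse grid
# laid over the composite vehicle image (higher = more damage; toy values).
damage = np.array([[0.1, 0.2, 0.1],
                   [0.3, 0.9, 0.4],
                   [0.1, 0.5, 0.2]])

vehicle_img = np.ones((300, 450, 3)) * 0.8            # stand-in photo
heat = colormaps["jet"](damage / damage.max())[..., :3]
heat = np.kron(heat, np.ones((100, 150, 1)))          # upsample grid to image size

plt.imshow(0.6 * vehicle_img + 0.4 * heat)            # blended overlay
plt.axis("off")
plt.show()
```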

36 citations


Posted Content
TL;DR: In this article, the Panoptic Studio is used to capture the 3D motion of a group of people engaged in a social interaction, and a modularized system consisting of integrated structural, hardware, and software innovations is presented.
Abstract: We present an approach to capture the 3D motion of a group of people engaged in a social interaction. The core challenges in capturing social interactions are: (1) occlusion is functional and frequent; (2) subtle motion needs to be measured over a space large enough to host a social group; (3) human appearance and configuration variation is immense; and (4) attaching markers to the body may prime the nature of interactions. The Panoptic Studio is a system organized around the thesis that social interactions should be measured through the integration of perceptual analyses over a large variety of view points. We present a modularized system designed around this principle, consisting of integrated structural, hardware, and software innovations. The system takes, as input, 480 synchronized video streams of multiple people engaged in social activities, and produces, as output, the labeled time-varying 3D structure of anatomical landmarks on individuals in the space. Our algorithm is designed to fuse the "weak" perceptual processes in the large number of views by progressively generating skeletal proposals from low-level appearance cues, and a framework for temporal refinement is also presented by associating body parts to reconstructed dense 3D trajectory stream. Our system and method are the first in reconstructing full body motion of more than five people engaged in social interactions without using markers. We also empirically demonstrate the impact of the number of views in achieving this goal.
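
Fusing weak 2D part detections from many calibrated views into a labeled 3D landmark ultimately rests on multi-view triangulation. A minimal numpy sketch using linear (DLT) triangulation, with random synthetic cameras standing in for the studio's 480 calibrated views:

```python
import numpy as np

def triangulate(P_list, xy_list):
    """P_list: per-view 3x4 projection matrices; xy_list: per-view (x, y) pixels."""
    A = []
    for P, (x, y) in zip(P_list, xy_list):
        A.append(x * P[2] - P[0])   # each view adds two linear constraints
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null vector = homogeneous 3D point
    return X[:3] / X[3]

# Toy setup: a known 3D joint observed by six random cameras.
rng = np.random.default_rng(0)
X_true = np.array([0.5, 1.0, 2.0, 1.0])
Ps, xys = [], []
for _ in range(6):
    P = rng.normal(size=(3, 4))
    x = P @ X_true
    Ps.append(P)
    xys.append((x[0] / x[2], x[1] / x[2]))
print(triangulate(Ps, xys))         # recovers ~[0.5, 1.0, 2.0]
```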

17 citations


Posted Content
TL;DR: This work introduces the concept of a Visual Compiler that generates a scene specific pedestrian detector and pose estimator without any pedestrian observations, and demonstrates that when real human annotated data is scarce or non-existent, this data generation strategy can provide an excellent solution for bootstrapping human detection and pose estimation.
Abstract: We introduce the concept of a Visual Compiler that generates a scene-specific pedestrian detector and pose estimator without any pedestrian observations. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, the Visual Compiler first infers geometrically and photometrically accurate images of humans in that scene through the use of computer graphics rendering. Using these renders, we learn a scene- and region-specific, spatially-varying fully convolutional neural network for simultaneous detection, pose estimation and segmentation of pedestrians. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy can provide an excellent solution for bootstrapping human detection and pose estimation. Experimental results show that our approach outperforms off-the-shelf state-of-the-art pedestrian detectors and pose estimators that are trained on real data.
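
The geometric starting point, using camera parameters and scene layout to decide where and at what scale to render a pedestrian, reduces to pinhole projection of a ground-plane point. A toy numpy sketch with assumed intrinsics and camera height:

```python
import numpy as np

# World convention here: y points down, matching image coordinates.
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0,   0,   1]], float)        # assumed intrinsics
R = np.eye(3)
t = np.array([0.0, 1.6, 0.0])               # camera 1.6 m above the ground

def project(X):
    x = K @ (R @ X + t)
    return x[:2] / x[2]

feet = np.array([1.0, 0.0, 8.0])            # a spot on the ground plane
head = feet + np.array([0.0, -1.7, 0.0])    # a 1.7 m tall pedestrian
print(project(feet), project(head))         # ~170 px apart: render scale and
                                            # placement for this scene region
```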

13 citations


Journal ArticleDOI
TL;DR: This CVIU special issue gathers recent and varied works on assistive computer vision and robotics, with applications in robotics such as multi-modal human-robot interaction, autonomous navigation, object usage, place recognition, robotic manipulation, and egocentric vision.

8 citations


Journal ArticleDOI
TL;DR: The proposed object representation consists of an approximated geometry model and a viewpoint-scale invariant appearance model, which makes it possible to model a new object online and provides robustness to viewpoint variation and occlusion.
Abstract: Various object representations have been widely used for many tasks such as object detection, recognition, and tracking. Most of them require an intensive training process on a large database collected in advance, and it is hard to add models of a previously unobserved object that is not in the database. In this paper, we investigate how to create a representation of a new and unknown object online, and how to apply it to practical applications like object detection and tracking. To make this viable, we utilize a sensor fusion approach using a camera and a single-line scan LIDAR. The proposed representation consists of an approximated geometry model and a viewpoint-scale invariant appearance model, which makes it extremely simple to match the model and the observation. This property makes it possible to model a new object online, and provides robustness to viewpoint variation and occlusion. The representation has benefits of both an implicit model (referred to as a view-based model) and an explicit model (referred to as a shape-based model). Intensive experiments using synthetic and real data demonstrate the viability of the proposed object representation in both modeling and detecting/tracking objects.
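
The scale half of the viewpoint-scale invariance is where the single-line LIDAR earns its keep: metric depth lets an image patch be resampled to a canonical metres-per-pixel template before matching. A back-of-the-envelope sketch with assumed focal length, range, and object width:

```python
f_px = 700.0             # camera focal length in pixels (assumed)
depth_m = 12.5           # single-line LIDAR range to the object
object_width_m = 1.8     # approximate geometry model, e.g. a car's width

# Pixel width the object should occupy at this depth (pinhole model):
width_px = f_px * object_width_m / depth_m
print(round(width_px))   # ~101 px: crop this window, then resize it to a
                         # fixed template size so matching stays trivial
```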

2 citations


Book ChapterDOI
01 Jan 2016
TL;DR: This chapter focuses on image analysis and understanding of live cell populations in time-lapse phase contrast microscopy, covering state-of-the-art algorithms for cell segmentation and cell behavior understanding.
Abstract: This chapter focuses on image analysis and understanding of live cell populations in time lapse phase contrast microscopy. The computer vision tasks involve cell segmentation and cell behavior understanding, including cell migration, division (mitosis), death (apoptosis), and differentiation. We will describe the problem definition for each topic, introduce the general schools of approaches that have been explored, discuss details of the state-of-the-art algorithms, and propose promising directions for future investigation.

Book ChapterDOI
20 Nov 2016
TL;DR: This work proposes a second order linear regression method that is both compact and robust against strong rotations, and provides a closed form solution, making the method fast to train.
Abstract: Recent methods for facial landmark location perform well on close-to-frontal faces but have problems generalising to large head rotations. In order to address this issue we propose a second order linear regression method that is both compact and robust against strong rotations. We provide a closed form solution, making the method fast to train. We test the method's performance on two challenging datasets. The first has been used intensively by the community. The second has been specially generated from a well known 3D face dataset. It is considerably more challenging, including a high diversity of rotations and more samples than any other existing public dataset. The proposed method is compared against state-of-the-art approaches, including RCPR, CGPRT, LBF, CFSS, and GSDM. Results on both datasets show that the proposed method offers state-of-the-art performance on near frontal view data, improves on state-of-the-art methods for more challenging head rotation problems, and keeps a compact model size.
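
The "closed form solution, fast to train" claim is the familiar property of regularised linear least squares. The sketch below is plain ridge regression from features to landmark displacements on synthetic data; the paper's second order formulation is more elaborate, so treat this only as the flavour of the closed-form solve.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(500, 128))   # per-sample feature vectors (toy)
dS = rng.normal(size=(500, 136))    # target shape updates: 68 (x, y) pairs

lam = 1e-2                          # ridge term keeps the solve stable
# Closed form: W = (Phi^T Phi + lam I)^{-1} Phi^T dS, one linear solve.
W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(128), Phi.T @ dS)

pred = Phi @ W                      # test time is a single matrix product
print(pred.shape)                   # (500, 136)
```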
