Open Access · Posted Content

Learning to Act Properly: Predicting and Explaining Affordances from Images

TL;DR
This paper proposes a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object, although the work is limited to a single object.
Abstract
We address the problem of affordance reasoning in diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world as well as the social norms imposed by the society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions would be taken. We collect a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.
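The graph-propagation idea in the abstract can be sketched as follows. This is a minimal illustration of neighborhood message passing over a scene graph, not the paper's actual architecture; the adjacency matrix, features, and weight matrix are hypothetical toy values.

```python
import numpy as np

def propagate(node_feats, adj, weight, steps=2):
    """Simple neighborhood message passing: each node averages its
    neighbors' features, applies a linear map, then a ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1.0)   # row-normalized adjacency
    h = node_feats
    for _ in range(steps):
        h = np.maximum(norm_adj @ h @ weight, 0.0)
    return h

# Hypothetical 3-object scene graph: object 0 touches objects 1 and 2.
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
feats = np.eye(3)        # one-hot object features
w = np.eye(3)            # identity transform, for illustration only
out = propagate(feats, adj, w)
```

After two propagation steps each object's representation mixes in information from its two-hop neighborhood, which is what lets per-object affordance predictions depend on scene context.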


Citations
Proceedings Article

From Recognition to Cognition: Visual Commonsense Reasoning

TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Posted Content

Grounded Human-Object Interaction Hotspots from Video

TL;DR: This work learns human-object interaction "hotspots" directly from videos of real human behavior: by anticipating afforded actions, it infers a spatial hotspot map indicating where an object would be manipulated in a potential interaction, even if the object is currently at rest.
Posted Content

Generating 3D People in Scenes without People

TL;DR: The approach synthesizes realistic and expressive 3D human bodies that naturally interact with the 3D environment, which will be useful for numerous applications, e.g. generating training data for human pose estimation, video games, and VR/AR.
Posted Content

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments

TL;DR: This work builds a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes to predict semantically plausible and physically feasible human poses within a given scene.
Posted Content

Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention

TL;DR: This paper shows how combining the visual attention map with the NL representation of relevant scene graph entities, carefully selected using a language model, can give reasonable textual explanations without the need for any additionally collected data.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.
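The residual idea summarized above can be sketched in a few lines; this is a schematic numpy illustration of the skip connection, not the paper's convolutional blocks, and the weights shown are hypothetical.

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x): the layers fit a residual F rather than the full
    mapping, which eases optimization as depth grows."""
    h = np.maximum(x @ w1, 0.0)   # inner transform + ReLU
    return x + h @ w2             # skip connection adds the input back

x = np.ones((1, 4))
w_zero = np.zeros((4, 4))        # zero weights make F(x) = 0 ...
y = residual_block(x, w_zero, w_zero)
# ... so the block reduces to the identity: extra layers can default
# to "do nothing" instead of having to learn an identity mapping.
```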
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
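The adaptive moment estimates mentioned in the TL;DR can be written out explicitly. A minimal sketch of one Adam update follows, using the standard default hyperparameters from the paper; the quadratic toy objective and learning rate are illustrative choices.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using bias-corrected moving averages of the
    gradient (first moment m) and squared gradient (second moment v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction for small t
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.05)
```

Because the update divides the first moment by the square root of the second, the effective step size is roughly bounded by the learning rate regardless of the gradient's scale.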
Journal Article

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark for object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Book

The Ecological Approach to Visual Perception

TL;DR: The relationship between stimulation and stimulus information for visual perception is discussed in detail, and the authors also present experimental evidence for direct perception of motion in the world and movement of the self.
Posted Content

Semi-Supervised Classification with Graph Convolutional Networks

TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.
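The "efficient variant of convolutional neural networks which operate directly on graphs" summarized above is the GCN layer H' = ReLU(D^-1/2 (A+I) D^-1/2 H W). A minimal numpy sketch of that single layer follows; the two-node graph and identity weights are hypothetical toy inputs.

```python
import numpy as np

def gcn_layer(a, feats, w):
    """One graph-convolution layer in the Kipf & Welling style:
    symmetric normalization of the self-looped adjacency, then a
    shared linear map and ReLU."""
    a_hat = a + np.eye(a.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # D^-1/2
    norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ w, 0.0)

a = np.array([[0., 1.],
              [1., 0.]])    # two connected nodes
h = gcn_layer(a, np.eye(2), np.eye(2))
```

The symmetric D^-1/2 normalization keeps the propagation operator's spectrum bounded, which is what makes stacking several such layers numerically stable.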