Open Access · Posted Content

Learning to Act Properly: Predicting and Explaining Affordances from Images

TL;DR
This paper proposes a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object, although the work is limited to a single object.
Abstract
We address the problem of affordance reasoning in diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world as well as the social norms imposed by the society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions would be taken. We collect a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.
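The graph-propagation idea in the abstract can be sketched as follows. This is a minimal illustration of neighborhood message passing over a scene graph, not the paper's actual architecture; the adjacency matrix, features, and weight matrix are hypothetical toy values.

```python
import numpy as np

def propagate(node_feats, adj, weight, steps=2):
    """Simple neighborhood message passing: each node averages its
    neighbors' features, applies a linear map, then a ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1.0)   # row-normalized adjacency
    h = node_feats
    for _ in range(steps):
        h = np.maximum(norm_adj @ h @ weight, 0.0)
    return h

# Hypothetical 3-object scene graph: object 0 touches objects 1 and 2.
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
feats = np.eye(3)        # one-hot object features
w = np.eye(3)            # identity transform, for illustration only
out = propagate(feats, adj, w)
```

After two propagation steps each object's representation mixes in information from its two-hop neighborhood, which is what lets per-object affordance predictions depend on scene context.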


Citations
Proceedings Article

From Recognition to Cognition: Visual Commonsense Reasoning

TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Posted Content

Grounded Human-Object Interaction Hotspots from Video

TL;DR: This work learns human-object interaction "hotspots" directly from videos of real human behavior: by anticipating afforded actions, it infers a spatial hotspot map indicating where an object would be manipulated in a potential interaction, even if the object is currently at rest.
Posted Content

Generating 3D People in Scenes without People

TL;DR: The approach synthesizes realistic and expressive 3D human bodies that naturally interact with the 3D environment, which will be useful for numerous applications, e.g. generating training data for human pose estimation, video games, and VR/AR.
Posted Content

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments

TL;DR: This work builds a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes to predict semantically plausible and physically feasible human poses within a given scene.
Posted Content

Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention

TL;DR: This paper shows how combining the visual attention map with the NL representation of relevant scene graph entities, carefully selected using a language model, can give reasonable textual explanations without the need for any additionally collected data.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.
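The residual idea summarized above can be sketched in a few lines; this is a schematic numpy illustration of the skip connection, not the paper's convolutional blocks, and the weights shown are hypothetical.

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x): the layers fit a residual F rather than the full
    mapping, which eases optimization as depth grows."""
    h = np.maximum(x @ w1, 0.0)   # inner transform + ReLU
    return x + h @ w2             # skip connection adds the input back

x = np.ones((1, 4))
w_zero = np.zeros((4, 4))        # zero weights make F(x) = 0 ...
y = residual_block(x, w_zero, w_zero)
# ... so the block reduces to the identity: extra layers can default
# to "do nothing" instead of having to learn an identity mapping.
```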
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
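The adaptive moment estimates mentioned in the TL;DR can be written out explicitly. A minimal sketch of one Adam update follows, using the standard default hyperparameters from the paper; the quadratic toy objective and learning rate are illustrative choices.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using bias-corrected moving averages of the
    gradient (first moment m) and squared gradient (second moment v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction for small t
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.05)
```

Because the update divides the first moment by the square root of the second, the effective step size is roughly bounded by the learning rate regardless of the gradient's scale.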
Journal Article

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark for object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Book

The Ecological Approach to Visual Perception

TL;DR: The relationship between stimulation and stimulus information for visual perception is discussed in detail, and the authors also present experimental evidence for direct perception of motion in the world and movement of the self.
Posted Content

Semi-Supervised Classification with Graph Convolutional Networks

TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.
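The "efficient variant of convolutional neural networks which operate directly on graphs" summarized above is the GCN layer H' = ReLU(D^-1/2 (A+I) D^-1/2 H W). A minimal numpy sketch of that single layer follows; the two-node graph and identity weights are hypothetical toy inputs.

```python
import numpy as np

def gcn_layer(a, feats, w):
    """One graph-convolution layer in the Kipf & Welling style:
    symmetric normalization of the self-looped adjacency, then a
    shared linear map and ReLU."""
    a_hat = a + np.eye(a.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # D^-1/2
    norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ w, 0.0)

a = np.array([[0., 1.],
              [1., 0.]])    # two connected nodes
h = gcn_layer(a, np.eye(2), np.eye(2))
```

The symmetric D^-1/2 normalization keeps the propagation operator's spectrum bounded, which is what makes stacking several such layers numerically stable.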