Open AccessPosted Content
Learning to Act Properly: Predicting and Explaining Affordances from Images
Reads0
Chats0
TLDR
In this paper, a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object is proposed. But their work is limited to a single object.Abstract:
We address the problem of affordance reasoning in diverse scenes that appear in the real world. Affordances relate the agent's actions to their effects when taken on the surrounding objects. In our work, we take the egocentric view of the scene, and aim to reason about action-object affordances that respect both the physical world as well as the social norms imposed by the society. We also aim to teach artificial agents why some actions should not be taken in certain situations, and what would likely happen if these actions would be taken. We collect a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning. We propose a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object. Our model is showcased through various ablation studies, pointing to successes and challenges in this complex task.read more
Citations
More filters
Proceedings ArticleDOI
From Recognition to Cognition: Visual Commonsense Reasoning
TL;DR: To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Posted Content
Grounded Human-Object Interaction Hotspots from Video.
TL;DR: This work proposes an approach to learn human-object interaction "hotspots" directly from video by watching videos of real human behavior and anticipating afforded actions, and infers a spatial hotspot map indicating where an object would be manipulated in a potential interaction even if the object is currently at rest.
Posted Content
Generating 3D People in Scenes without People
TL;DR: The approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with 3D environment that will be useful for numerous applications; e.g. to generate training data for human pose estimation, in video games and in VR/AR.
Posted Content
Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
TL;DR: This work builds a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes to predict semantically plausible and physically feasible human poses within a given scene.
Posted Content
Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention.
TL;DR: This paper shows how combining the visual attention map with the NL representation of relevant scene graph entities, carefully selected using a language model, can give reasonable textual explanations without the need of any additional collected data.
References
More filters
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal ArticleDOI
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky,Jia Deng,Hao Su,Jonathan Krause,Sanjeev Satheesh,Sean Ma,Zhiheng Huang,Andrej Karpathy,Aditya Khosla,Michael S. Bernstein,Alexander C. Berg,Li Fei-Fei +11 more
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Book
The Ecological Approach to Visual Perception
TL;DR: The relationship between Stimulation and Stimulus Information for visual perception is discussed in detail in this article, where the authors also present experimental evidence for direct perception of motion in the world and movement of the self.
Posted Content
Semi-Supervised Classification with Graph Convolutional Networks
Thomas Kipf,Max Welling +1 more
TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.