Open Access, Proceedings Article

Visual Abductive Reasoning

TLDR
The authors propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations.
Abstract
Abductive reasoning seeks the likeliest possible explanation for partial observations. Although abduction is frequently employed in everyday human reasoning, it is rarely explored in the computer vision literature. In this paper, we propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations. Given an incomplete set of visual events, AI systems are required not only to describe what is observed, but also to infer the hypothesis that best explains the visual premise. Based on our large-scale VAR dataset, we devise a strong baseline model, REASONER (causal-and-cascaded reasoning Transformer). First, to capture the causal structure of the observations, a contextualized directional position embedding strategy is adopted in the encoder, which yields discriminative representations for the premise and hypothesis. Then, multiple decoders are cascaded to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that REASONER surpasses many well-known video-language models, while still being far behind human performance. This work is expected to foster future efforts in the reasoning-beyond-observation paradigm.


Citations
Journal Article

Local-Global Context Aware Transformer for Language-Guided Video Segmentation

TL;DR: In this article, the Transformer architecture is augmented with a finite memory for language-guided video segmentation (LVS); the memory is designed to persistently preserve global video content and dynamically gather local temporal context and segmentation history.
Journal Article

Cascade-refine model for cephalometric landmark detection in high-resolution orthodontic images

TL;DR: Zhang et al. interpreted the cascade-connected neural network (CCNN) as a discrete approximation of ordinary differential equations and proposed a cascade-refine model, which takes advantage of CCNNs and overcomes the limitations on number and depth by sharing parameters among stacked network backbones.
Journal Article

Cross-modal transformer with language query for referring image segmentation

TL;DR: Zhang et al. proposed a cross-modal transformer (CMT) with language queries for referring image segmentation, which combines the mutual guidance of vision and language.
Journal Article

Boundary-constrained interpretable image reconstruction network for deep compressive sensing

TL;DR: This paper proposes the edge-guided interpretable image compressive sensing network (EGINet), which combines an edge-aware feature extraction module, an edge-guided intermediate-variable updating module, and an intermediate-variable-guided image reconstruction module.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Journal Article

A learning algorithm for continually running fully recurrent neural networks

TL;DR: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Proceedings Article

CIDEr: Consensus-based image description evaluation

TL;DR: A novel paradigm for evaluating image descriptions that uses human consensus is proposed and a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources is evaluated.
Journal Article

Cloze procedure: a new tool for measuring readability

TL;DR: This is the first comprehensive statement of a research method and its theory and findings from three pilot studies and two experiments in which “cloze procedure” results are compared with those of two readability formulas.