Proceedings ArticleDOI

Vision GNN: An Image is Worth Graph of Nodes

Kai Han, +4 more
- arXiv:2206.00272
TLDR
This paper proposes representing the image as a graph structure and introduces a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks; extensive experiments on image recognition and object detection demonstrate the superiority of the ViG architecture.
Abstract
Network architecture plays a key role in deep learning-based computer vision systems. The widely used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible enough to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks. We first split the image into a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on this graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: a Grapher module with graph convolution for aggregating and updating graph information, and an FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built in different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNNs on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models.
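The abstract describes the two building blocks concretely enough to sketch: patches become nodes, a k-nearest-neighbor graph connects them, a Grapher module aggregates neighbor information with graph convolution, and a two-layer FFN transforms node features. Below is a minimal PyTorch sketch of one such block, assuming a max-relative style aggregation; the names (knn_graph, GrapherFFN) and the hyperparameters are hypothetical illustrations, not the authors' released implementation (see the linked repositories for that).

```python
# Minimal sketch of a ViG-style Grapher + FFN block (hypothetical names,
# not the authors' released code). Assumes patch features x of shape
# (B, N, C) and a max-relative style aggregation over a kNN graph.
import torch
import torch.nn as nn


def knn_graph(x, k=9):
    """Indices of the k nearest neighbors of each patch node, by feature distance."""
    dist = torch.cdist(x, x)                                 # (B, N, N)
    return dist.topk(k + 1, largest=False).indices[..., 1:]  # drop self -> (B, N, k)


class GrapherFFN(nn.Module):
    """One block: graph convolution over the kNN graph, then a two-layer FFN."""

    def __init__(self, dim, k=9, ffn_ratio=4):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(2 * dim, dim)        # fuse node feature with its aggregate
        self.ffn = nn.Sequential(                # FFN with two linear layers
            nn.Linear(dim, ffn_ratio * dim),
            nn.GELU(),
            nn.Linear(ffn_ratio * dim, dim),
        )

    def forward(self, x):                        # x: (B, N, C)
        idx = knn_graph(x, self.k)               # (B, N, k)
        nbrs = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1),      # (B, N, N, C)
            2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)),  # (B, N, k, C)
        )
        # Max-relative aggregation: elementwise max over neighbor differences.
        agg = (nbrs - x.unsqueeze(2)).max(dim=2).values        # (B, N, C)
        x = x + self.fc(torch.cat([x, agg], dim=-1))           # Grapher, residual
        return x + self.ffn(x)                                 # FFN, residual
```

For example, 196 patch embeddings of dimension 192 (a 14x14 grid from a 224x224 image) pass through the block with shape unchanged:

```python
block = GrapherFFN(dim=192)
out = block(torch.randn(2, 196, 192))   # -> (2, 196, 192)
```

Stacking such blocks gives an isotropic variant; a pyramid variant would additionally downsample the node set between stages.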



Citations
Journal ArticleDOI

MetaFormer Baselines for Vision

TL;DR: This paper introduces several baseline models under MetaFormer built from the most basic or common token mixers, ConvFormer and CAFormer, and demonstrates their strong performance.
Journal ArticleDOI

A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective

TL;DR: A comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective, dividing their applications into categories according to the modality of input data: 2D natural images, videos, 3D data, vision + language, and medical images.
Proceedings ArticleDOI

Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore

TL;DR: In this paper, a novel visual isometric invariant feature (VIIF) is proposed to improve anomaly discrimination and to greatly reduce the redundant features stored in the memory bank M.
Journal ArticleDOI

A Generalization of ViT/MLP-Mixer to Graphs

TL;DR: Graph ViT/MLP-Mixer proposes an alternative approach for overcoming the structural limitations of GNNs by leveraging the ViT and MLP-Mixer architectures introduced in computer vision.
Journal ArticleDOI

ViGU: Vision GNN U-Net for Fast MRI

TL;DR: In this article, a U-shaped network is developed using several graph blocks in symmetric encoder and decoder paths; it can also benefit from Generative Adversarial Networks, yielding the variant ViGU-GAN.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
Journal ArticleDOI

ImageNet classification with deep convolutional neural networks

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes, employing a recently developed regularization method called "dropout" that proved to be very effective.
Book ChapterDOI

Microsoft COCO: Common Objects in Context

TL;DR: A new dataset aimed at advancing the state of the art in object recognition by placing it in the broader context of scene understanding, built by gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings ArticleDOI

Fully convolutional networks for semantic segmentation

TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.