Proceedings ArticleDOI

From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason

TL;DR: This work presents a VQA model that can read scene texts and reason over a knowledge graph to arrive at an accurate answer, and introduces text-KVQA, the first dataset to identify the need for bridging text recognition with knowledge graph based reasoning.
Abstract: Text present in images is not merely a collection of strings; it provides useful cues about the image. Despite their utility for better image understanding, scene texts are not used in traditional visual question answering (VQA) models. In this work, we present a VQA model which can read scene texts and perform reasoning on a knowledge graph to arrive at an accurate answer. Our proposed model has three mutually interacting modules: i. a proposal module to get word and visual content proposals from the image, ii. a fusion module to fuse these proposals, the question and the knowledge base to mine relevant facts and represent them as a multi-relational graph, and iii. a reasoning module to perform novel gated graph neural network based reasoning on this graph. The performance of our knowledge-enabled VQA model is evaluated on our newly introduced dataset, viz. text-KVQA. To the best of our knowledge, this is the first dataset which identifies the need for bridging text recognition with knowledge graph based reasoning. Through extensive experiments, we show that our proposed method outperforms traditional VQA methods as well as methods for question answering over knowledge bases on text-KVQA.
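To make the fusion step concrete, here is a minimal, illustrative Python sketch (not the authors' implementation) of how word and visual content proposals could be combined with retrieved knowledge facts into the multi-relational graph that the reasoning module operates on; all helper names and example facts are invented for illustration.

```python
from dataclasses import dataclass, field

# Illustrative sketch (not the authors' code) of the fusion stage: knowledge
# facts retrieved around anchor entities are combined with word and visual
# content proposals into a multi-relational graph for the reasoning module.

@dataclass
class MultiRelationalGraph:
    nodes: set = field(default_factory=set)     # entities, proposals, image node
    edges: list = field(default_factory=list)   # (head, relation, tail) triplets

    def add_fact(self, head, relation, tail):
        self.nodes.update([head, tail])
        self.edges.append((head, relation, tail))

def build_graph(word_proposals, visual_proposals, facts):
    g = MultiRelationalGraph()
    for h, r, t in facts:                        # facts mined from the knowledge base
        g.add_fact(h, r, t)
    for w in word_proposals:                     # recognized scene text
        g.add_fact("image", "contains_text", w)
    for v in visual_proposals:                   # e.g. scene category from a scene classifier
        g.add_fact("image", "depicts", v)
    return g

graph = build_graph(
    word_proposals=["Subway"],
    visual_proposals=["fast food restaurant"],
    facts=[("Subway", "instance_of", "fast food chain"),
           ("Subway", "product", "sandwich")],
)
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")   # 5 nodes, 4 edges
```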


Citations
Posted Content
TL;DR: Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy).
Abstract: We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at this http URL

96 citations


Cites methods from "From Strings to Things: Knowledge-E..."

  • ...The ST-VQA [5] and TextVQA [35] datasets were introduced in parallel in 2019 and were quickly followed by more research [36, 11, 39]....


Proceedings ArticleDOI
12 Oct 2020
TL;DR: The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and learning to aggregate the useful image- and question-dependent knowledge which is then utilized to boost the accuracy in answering visual questions.
Abstract: Given an image and a natural language question, Visual Question Answering (VQA) aims at answering the textual question correctly. Most VQA approaches in the literature aim to find answers to the questions based solely on analyzing the given images and questions. Other works that try to incorporate external knowledge into VQA adopt a query-based search on knowledge graphs to obtain the answer. However, these works suffer from the following problem: the model training process heavily relies on ground-truth knowledge facts, which serve as supervised information; missing these ground-truth knowledge facts during training leads to failures in producing the correct answers. To solve this challenging issue, we propose a Knowledge Graph Augmented (KG-Aug) model which conducts context-aware knowledge aggregation on external knowledge graphs, requiring no ground-truth knowledge facts for extra supervision. The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and of learning to aggregate the useful image- and question-dependent knowledge, which is then utilized to boost the accuracy in answering visual questions. We carry out extensive experiments to validate the effectiveness of our proposed KG-Aug model against several baseline approaches on various datasets.
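The following is a rough NumPy sketch, under our own simplifying assumptions rather than the paper's actual architecture, of the two ingredients described above: retrieving a knowledge subgraph around seed entities and aggregating node embeddings weighted by their relevance to the question.

```python
import numpy as np

# Rough sketch of context-aware knowledge aggregation: retrieve a small
# subgraph around seed entities, then pool node embeddings with weights
# given by their similarity to the question representation.

def retrieve_subgraph(kg_edges, seed_entities, hops=1):
    """kg_edges: iterable of (head, relation, tail); returns edges within `hops` of the seeds."""
    frontier, kept = set(seed_entities), set()
    for _ in range(hops):
        nxt = set()
        for h, r, t in kg_edges:
            if h in frontier or t in frontier:
                kept.add((h, r, t))
                nxt.update([h, t])
        frontier |= nxt
    return sorted(kept)

def aggregate(node_embeddings, question_vec):
    """Attention-style pooling of node embeddings against the question vector."""
    embs = np.stack(list(node_embeddings.values()))      # (N, d)
    scores = embs @ question_vec                         # relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ embs                                # (d,) knowledge feature

rng = np.random.default_rng(0)
kg = [("eiffel_tower", "located_in", "paris"), ("paris", "country", "france")]
sub = retrieve_subgraph(kg, {"eiffel_tower"}, hops=2)
nodes = {n for h, _, t in sub for n in (h, t)}
embeddings = {n: rng.normal(size=8) for n in nodes}
knowledge_feature = aggregate(embeddings, question_vec=rng.normal(size=8))
```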

46 citations


Cites background from "From Strings to Things: Knowledge-E..."

  • ...Recently, there are also related works that exploit retrieved knowledge facts through memory networks [14, 20] or graph neural networks [21]....


Journal ArticleDOI
TL;DR: A comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective, dividing their applications into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images.
Abstract: Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
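For readers unfamiliar with the local neighborhood aggregation that the survey contrasts with the graph Transformer's global attention, a single mean-aggregation GNN layer can be sketched as follows; this is purely didactic and not tied to any particular model covered by the survey.

```python
import numpy as np

# Didactic one-layer example of local neighborhood aggregation in a GNN
# (the operation that graph Transformers relax by attending over all nodes).

def gnn_layer(node_feats, adjacency, weight):
    """node_feats: (N, d); adjacency: binary (N, N); weight: (d, d_out)."""
    degree = np.maximum(adjacency.sum(axis=1, keepdims=True), 1.0)
    neighbor_mean = adjacency @ node_feats / degree       # average over neighbors only
    return np.maximum(0.0, (node_feats + neighbor_mean) @ weight)  # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 nodes, 8-dim features
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = gnn_layer(X, A, rng.normal(size=(8, 16)))             # (4, 16) updated node features
```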

16 citations

Proceedings ArticleDOI
Hao Li, Xu Li, Belhal Karimi, Jie Chen, Mingming Sun 
09 May 2022
TL;DR: A novel Dual Message-passing enhanced Graph Neural Network (DM-GNN) is introduced, which can obtain a balanced representation by properly encoding multi-scale scene graph information.
Abstract: Modeling visual question answering (VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which causes false attribute selection or missing relations (Figure 1(a) of that paper). This is because these models cannot balance all kinds of information in scene graphs, neglecting relation and attribute information. In this paper, we introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN), which can obtain a balanced representation by properly encoding multi-scale scene graph information. Specifically, we (i) transform the scene graph into two graphs with diversified focuses on objects and relations, and design a dual structure to encode them, which increases the weights from relations; (ii) fuse the encoder output with attribute features, which increases the weights from attributes; and (iii) propose a message-passing mechanism to enhance the information transfer between objects, relations and attributes. We conduct extensive experiments on datasets including GQA, VG, motif-VG and achieve new state of the art.
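As an illustration of the "two graphs" idea (our own simplified reading, not the authors' code), a scene graph can be re-expressed in an object-centric view and a relation-centric view in which each relation is promoted to a node:

```python
# Illustrative sketch: an object-centric graph keeps objects as nodes with
# predicates as edge labels, while a relation-centric dual graph promotes
# each relation to a node so message passing can also weight relation and
# attribute information.

scene_graph = {
    "objects": {"man": ["tall"], "horse": ["brown"]},        # object -> attributes
    "relations": [("man", "riding", "horse")],               # (subject, predicate, object)
}

def object_view(sg):
    """Edges connect objects directly; predicates become edge labels."""
    return [(s, p, o) for s, p, o in sg["relations"]]

def relation_view(sg):
    """Each relation becomes its own node linked to its subject and object."""
    edges = []
    for i, (s, p, o) in enumerate(sg["relations"]):
        rel_node = f"{p}#{i}"
        edges += [(s, "subj_of", rel_node), (rel_node, "obj_is", o)]
    return edges

print(object_view(scene_graph))     # [('man', 'riding', 'horse')]
print(relation_view(scene_graph))   # [('man', 'subj_of', 'riding#0'), ('riding#0', 'obj_is', 'horse')]
```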

9 citations

References
Proceedings ArticleDOI
01 Apr 2017
TL;DR: This paper explores a simple and efficient baseline for text classification, fastText, which is often on par with deep learning classifiers in terms of accuracy and many orders of magnitude faster for training and evaluation.
Abstract: This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
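The core of the baseline is easy to state in code: average the embeddings of a sentence's words (and character n-grams, omitted here) and apply a linear classifier. The sketch below uses random stand-ins for the learned embedding and classifier parameters.

```python
import numpy as np

# Minimal sketch of the fastText-style classifier: a sentence is represented
# as the mean of its word embeddings and scored by a linear layer.
# Parameters below are random placeholders, not trained values.

def sentence_vector(tokens, embeddings, dim=10):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classify(tokens, embeddings, W, b):
    logits = W @ sentence_vector(tokens, embeddings) + b
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "great", "terrible"]
emb = {w: rng.normal(size=10) for w in vocab}
W, b = rng.normal(size=(2, 10)), np.zeros(2)     # 2 classes
print(classify("the movie was great".split(), emb, W, b))
```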

3,765 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The task of free-form and open-ended Visual Question Answering (VQA) is proposed: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Abstract: We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.
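The "automatic evaluation" mentioned above refers to the benchmark's consensus metric: an answer receives min(number of matching human answers / 3, 1) credit. The official protocol additionally averages over subsets of the ten human answers, which this simplified sketch omits.

```python
# Simplified VQA accuracy: an answer is counted fully correct if at least
# three of the ten human annotators gave it, and partially correct otherwise.
# (The official evaluation also averages over annotator subsets.)

def vqa_accuracy(predicted, human_answers):
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

humans = ["yes"] * 8 + ["no"] * 2
print(vqa_accuracy("yes", humans))   # 1.0
print(vqa_accuracy("no", humans))    # ~0.67
```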

3,513 citations

Journal ArticleDOI
TL;DR: The Places Database is described: a repository of 10 million scene photographs labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world; state-of-the-art Convolutional Neural Networks trained on it provide baselines that significantly outperform previous approaches.
Abstract: The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database along with the Places-CNNs offers a novel resource to guide future progress on scene recognition problems.
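In the context of the citing paper, a Places-style scene classifier supplies the "visual content proposals": the top-k scene categories by softmax score (compare the example proposals quoted in the citation contexts below). A toy sketch with made-up logits:

```python
import numpy as np

# Sketch of deriving visual content proposals from a scene classifier such
# as a Places-CNN: take the top-k categories by softmax score. The category
# list and logits here are invented for illustration.

categories = ["fast food restaurant", "shop front", "bookstore", "street"]
logits = np.array([3.1, 2.4, 0.2, 0.9])                  # from the CNN's final layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top_k = [categories[i] for i in np.argsort(-probs)[:2]]
print(top_k)   # ['fast food restaurant', 'shop front']
```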

3,215 citations


"From Strings to Things: Knowledge-E..." refers background or methods in this paper

  • ...Since category names in our dataset are not exactly the same in Places, we could not perform quantitative analysis on visual content evaluation of places....


  • ...To this end, we rely on Places [55] for scene recognition and a fine-tuned VGG-16 model for representing visual contents from movie posters and book covers....


  • ...Word proposals [16]: Subway, open Visual content proposals [55]: fast food restaurant, shop front...


  • ...We use Places [55] and a VGG-16 finetuned model for recognizing these visual contents for categories-(i) and (ii), respectively....


Journal ArticleDOI
TL;DR: This collaboratively edited knowledgebase provides a common source of data for Wikipedia, and everyone else, to help improve the quality of the encyclopedia.
Abstract: This collaboratively edited knowledgebase provides a common source of data for Wikipedia, and everyone else.
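Facts like those harvested for text-KVQA can be pulled from Wikidata as (entity, relation, value) triples through its public SPARQL endpoint. The query below (occupations of Douglas Adams, entity Q42, property P106) is only a generic example of that kind of access, not the paper's actual crawling pipeline.

```python
import requests

# Minimal example of retrieving triples from Wikidata's public SPARQL
# endpoint; the entity (Q42) and property (P106, occupation) are arbitrary
# examples, unrelated to the entities used to build text-KVQA.

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?occupation ?occupationLabel WHERE {
  wd:Q42 wdt:P106 ?occupation .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "kvqa-demo/0.1 (example script)"},
)
for row in response.json()["results"]["bindings"]:
    print(("Q42", "P106", row["occupationLabel"]["value"]))
```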

2,809 citations


"From Strings to Things: Knowledge-E..." refers background in this paper

  • ..., Wikidata [44], IMDb [1], a book catalogue [13]....


  • ...Our newly introduced dataset is much larger in scale as compared to the three aforementioned works [8, 32, 42] and more importantly, backed up by web-scale knowledge facts harvested from various sources, e.g., Wikidata [44], IMDb [1], a book catalogue [13]....


  • ...To construct these three knowledge bases, we crawl open-source world knowledge bases, e.g., Wikidata [3], IMDb [1] and book catalogue provided by [18] around the anchor entities.1 Each knowledge fact is a triplet connecting two entities with a relation....


  • ...Further, with the access to rich open-source knowledge graphs such as Wikidata [44], we could ask a series of natural questions, such as, Can I get a Sandwich here?, Is this a French brand?, and so on, which are not possible to ask in traditional VQA [5] as well as knowledge-enabled VQA models [47, 48]....


Proceedings Article
01 Apr 2016
TL;DR: This work studies feature learning techniques for graph-structured inputs and achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
Abstract: Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
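For intuition, a single GGNN propagation step can be sketched in NumPy as follows: messages are summed per relation type and each node state is updated with a GRU-style gate. Dimensions, adjacency structure and parameters are illustrative, not those of the cited implementation.

```python
import numpy as np

# Sketch of one gated graph neural network (GGNN) propagation step:
# per-relation message passing followed by a GRU-style state update.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, adj_per_relation, W_msg, Wz, Uz, Wr, Ur, Wh, Uh):
    """H: (N, d) node states; adj_per_relation: dict relation -> (N, N) adjacency."""
    # Message passing: sum over relation types of A_rel @ H @ W_rel
    M = sum(A @ H @ W_msg[rel] for rel, A in adj_per_relation.items())
    z = sigmoid(M @ Wz + H @ Uz)            # update gate
    r = sigmoid(M @ Wr + H @ Ur)            # reset gate
    h_tilde = np.tanh(M @ Wh + (r * H) @ Uh)
    return (1 - z) * H + z * h_tilde        # gated state update

rng = np.random.default_rng(0)
N, d = 4, 8
H = rng.normal(size=(N, d))
adj = {"is_a": np.eye(N, k=1), "has": np.eye(N, k=-1)}      # two relation types
W_msg = {rel: rng.normal(size=(d, d)) for rel in adj}
params = [rng.normal(size=(d, d)) for _ in range(6)]        # Wz, Uz, Wr, Ur, Wh, Uh
H_next = ggnn_step(H, adj, W_msg, *params)                  # (4, 8) updated states
```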

2,518 citations


"From Strings to Things: Knowledge-E..." refers background or methods in this paper

  • ...This motivated us to utilize the capability of graph representation learning in the form of gated graph neural networks (GGNN) [27]....


  • ...A natural choice for this is ‘gated graph neural network’ (GGNN) [27] which is emerging as a powerful tool to perform reasoning over graphs....


  • ...bolic QA [27] to more complex visual reasoning [30]....


  • ..., visual contents, recognized words, question and knowledge facts, and perform a reasoning on a multi-relational graph using a novel gated graph neural network [27] formulation....


  • ...We choose gated graph neural network (GGNN) [27] for this task....
