Proceedings ArticleDOI

From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason

TL;DR: This work presents a VQA model that can read scene texts and reason over a knowledge graph to arrive at an accurate answer, and introduces text-KVQA, the first dataset to identify the need for bridging text recognition with knowledge graph based reasoning.
Abstract: Text present in images is not merely a collection of strings; it provides useful cues about the image. Despite their utility for better image understanding, scene texts are not used in traditional visual question answering (VQA) models. In this work, we present a VQA model which can read scene texts and perform reasoning on a knowledge graph to arrive at an accurate answer. Our proposed model has three mutually interacting modules: i. a proposal module to get word and visual content proposals from the image, ii. a fusion module to fuse these proposals, the question and the knowledge base to mine relevant facts and represent them as a multi-relational graph, and iii. a reasoning module to perform novel gated graph neural network based reasoning on this graph. The performance of our knowledge-enabled VQA model is evaluated on our newly introduced dataset, viz. text-KVQA. To the best of our knowledge, this is the first dataset which identifies the need for bridging text recognition with knowledge graph based reasoning. Through extensive experiments, we show that our proposed method outperforms traditional VQA methods as well as methods for question answering over knowledge bases on text-KVQA.
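To make the fusion step concrete, here is a minimal, illustrative Python sketch (not the authors' implementation) of how word and visual content proposals could be combined with retrieved knowledge facts into the multi-relational graph that the reasoning module operates on; all helper names and example facts are invented for illustration.

```python
from dataclasses import dataclass, field

# Illustrative sketch (not the authors' code) of the fusion stage: knowledge
# facts retrieved around anchor entities are combined with word and visual
# content proposals into a multi-relational graph for the reasoning module.

@dataclass
class MultiRelationalGraph:
    nodes: set = field(default_factory=set)     # entities, proposals, image node
    edges: list = field(default_factory=list)   # (head, relation, tail) triplets

    def add_fact(self, head, relation, tail):
        self.nodes.update([head, tail])
        self.edges.append((head, relation, tail))

def build_graph(word_proposals, visual_proposals, facts):
    g = MultiRelationalGraph()
    for h, r, t in facts:                        # facts mined from the knowledge base
        g.add_fact(h, r, t)
    for w in word_proposals:                     # recognized scene text
        g.add_fact("image", "contains_text", w)
    for v in visual_proposals:                   # e.g. scene category from a scene classifier
        g.add_fact("image", "depicts", v)
    return g

graph = build_graph(
    word_proposals=["Subway"],
    visual_proposals=["fast food restaurant"],
    facts=[("Subway", "instance_of", "fast food chain"),
           ("Subway", "product", "sandwich")],
)
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")   # 5 nodes, 4 edges
```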


Citations
Posted Content
TL;DR: Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy).
Abstract: We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at this http URL

96 citations


Cites methods from "From Strings to Things: Knowledge-E..."

  • ...The ST-VQA [5] and TextVQA [35] datasets were introduced in parallel in 2019 and were quickly followed by more research [36, 11, 39]....


Proceedings ArticleDOI
12 Oct 2020
TL;DR: The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and learning to aggregate the useful image- and question-dependent knowledge which is then utilized to boost the accuracy in answering visual questions.
Abstract: Given an image and a natural language question, Visual Question Answering (VQA) aims at answering the textual question correctly. Most VQA approaches in the literature aim to find answers to the questions based solely on analyzing the given images and questions. Other works that try to incorporate external knowledge into VQA adopt a query-based search on knowledge graphs to obtain the answer. However, these works suffer from the following problem: the model training process heavily relies on ground-truth knowledge facts, which serve as supervised information; missing these ground-truth knowledge facts during training leads to failures in producing the correct answers. To solve this challenging issue, we propose a Knowledge Graph Augmented (KG-Aug) model which conducts context-aware knowledge aggregation on external knowledge graphs, requiring no ground-truth knowledge facts for extra supervision. The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and of learning to aggregate the useful image- and question-dependent knowledge, which is then utilized to boost the accuracy in answering visual questions. We carry out extensive experiments to validate the effectiveness of our proposed KG-Aug model against several baseline approaches on various datasets.
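The following is a rough NumPy sketch, under our own simplifying assumptions rather than the paper's actual architecture, of the two ingredients described above: retrieving a knowledge subgraph around seed entities and aggregating node embeddings weighted by their relevance to the question.

```python
import numpy as np

# Rough sketch of context-aware knowledge aggregation: retrieve a small
# subgraph around seed entities, then pool node embeddings with weights
# given by their similarity to the question representation.

def retrieve_subgraph(kg_edges, seed_entities, hops=1):
    """kg_edges: iterable of (head, relation, tail); returns edges within `hops` of the seeds."""
    frontier, kept = set(seed_entities), set()
    for _ in range(hops):
        nxt = set()
        for h, r, t in kg_edges:
            if h in frontier or t in frontier:
                kept.add((h, r, t))
                nxt.update([h, t])
        frontier |= nxt
    return sorted(kept)

def aggregate(node_embeddings, question_vec):
    """Attention-style pooling of node embeddings against the question vector."""
    embs = np.stack(list(node_embeddings.values()))      # (N, d)
    scores = embs @ question_vec                         # relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ embs                                # (d,) knowledge feature

rng = np.random.default_rng(0)
kg = [("eiffel_tower", "located_in", "paris"), ("paris", "country", "france")]
sub = retrieve_subgraph(kg, {"eiffel_tower"}, hops=2)
nodes = {n for h, _, t in sub for n in (h, t)}
embeddings = {n: rng.normal(size=8) for n in nodes}
knowledge_feature = aggregate(embeddings, question_vec=rng.normal(size=8))
```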

46 citations


Cites background from "From Strings to Things: Knowledge-E..."

  • ...Recently, there are also related works that exploit retrieved knowledge facts through memory networks [14, 20] or graph neural networks [21]....


Journal ArticleDOI
TL;DR: A comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective, dividing their applications into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images.
Abstract: Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
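For readers unfamiliar with the local neighborhood aggregation that the survey contrasts with the graph Transformer's global attention, a single mean-aggregation GNN layer can be sketched as follows; this is purely didactic and not tied to any particular model covered by the survey.

```python
import numpy as np

# Didactic one-layer example of local neighborhood aggregation in a GNN
# (the operation that graph Transformers relax by attending over all nodes).

def gnn_layer(node_feats, adjacency, weight):
    """node_feats: (N, d); adjacency: binary (N, N); weight: (d, d_out)."""
    degree = np.maximum(adjacency.sum(axis=1, keepdims=True), 1.0)
    neighbor_mean = adjacency @ node_feats / degree       # average over neighbors only
    return np.maximum(0.0, (node_feats + neighbor_mean) @ weight)  # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 nodes, 8-dim features
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = gnn_layer(X, A, rng.normal(size=(8, 16)))             # (4, 16) updated node features
```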

16 citations

Proceedings ArticleDOI
Hao Li, Xu Li, Belhal Karimi, Jie Chen, Mingming Sun 
09 May 2022
TL;DR: A novel Dual Message-passing enhanced Graph Neural Network (DM-GNN) is introduced, which can obtain a balanced representation by properly encoding multi-scale scene graph information.
Abstract: Modeling visual question answering (VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which causes false attribute selection or missing relations (Figure 1(a) of that paper). This is because these models cannot balance all kinds of information in scene graphs, neglecting relation and attribute information. In this paper, we introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN), which can obtain a balanced representation by properly encoding multi-scale scene graph information. Specifically, we (i) transform the scene graph into two graphs with diversified focuses on objects and relations, and design a dual structure to encode them, which increases the weights from relations; (ii) fuse the encoder output with attribute features, which increases the weights from attributes; and (iii) propose a message-passing mechanism to enhance the information transfer between objects, relations and attributes. We conduct extensive experiments on datasets including GQA, VG, motif-VG and achieve new state of the art.
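As an illustration of the "two graphs" idea (our own simplified reading, not the authors' code), a scene graph can be re-expressed in an object-centric view and a relation-centric view in which each relation is promoted to a node:

```python
# Illustrative sketch: an object-centric graph keeps objects as nodes with
# predicates as edge labels, while a relation-centric dual graph promotes
# each relation to a node so message passing can also weight relation and
# attribute information.

scene_graph = {
    "objects": {"man": ["tall"], "horse": ["brown"]},        # object -> attributes
    "relations": [("man", "riding", "horse")],               # (subject, predicate, object)
}

def object_view(sg):
    """Edges connect objects directly; predicates become edge labels."""
    return [(s, p, o) for s, p, o in sg["relations"]]

def relation_view(sg):
    """Each relation becomes its own node linked to its subject and object."""
    edges = []
    for i, (s, p, o) in enumerate(sg["relations"]):
        rel_node = f"{p}#{i}"
        edges += [(s, "subj_of", rel_node), (rel_node, "obj_is", o)]
    return edges

print(object_view(scene_graph))     # [('man', 'riding', 'horse')]
print(relation_view(scene_graph))   # [('man', 'subj_of', 'riding#0'), ('riding#0', 'obj_is', 'horse')]
```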

9 citations

References
Proceedings ArticleDOI
01 Apr 2017
TL;DR: This paper explores a simple and efficient baseline for text classification, fastText, which is often on par with deep learning classifiers in terms of accuracy and many orders of magnitude faster for training and evaluation.
Abstract: This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
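The core of the baseline is easy to state in code: average the embeddings of a sentence's words (and character n-grams, omitted here) and apply a linear classifier. The sketch below uses random stand-ins for the learned embedding and classifier parameters.

```python
import numpy as np

# Minimal sketch of the fastText-style classifier: a sentence is represented
# as the mean of its word embeddings and scored by a linear layer.
# Parameters below are random placeholders, not trained values.

def sentence_vector(tokens, embeddings, dim=10):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classify(tokens, embeddings, W, b):
    logits = W @ sentence_vector(tokens, embeddings) + b
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "great", "terrible"]
emb = {w: rng.normal(size=10) for w in vocab}
W, b = rng.normal(size=(2, 10)), np.zeros(2)     # 2 classes
print(classify("the movie was great".split(), emb, W, b))
```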

3,765 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The task of free-form and open-ended Visual Question Answering (VQA) is proposed: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Abstract: We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.
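The "automatic evaluation" mentioned above refers to the benchmark's consensus metric: an answer receives min(number of matching human answers / 3, 1) credit. The official protocol additionally averages over subsets of the ten human answers, which this simplified sketch omits.

```python
# Simplified VQA accuracy: an answer is counted fully correct if at least
# three of the ten human annotators gave it, and partially correct otherwise.
# (The official evaluation also averages over annotator subsets.)

def vqa_accuracy(predicted, human_answers):
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

humans = ["yes"] * 8 + ["no"] * 2
print(vqa_accuracy("yes", humans))   # 1.0
print(vqa_accuracy("no", humans))    # ~0.67
```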

3,513 citations

Journal ArticleDOI
TL;DR: The Places Database is described: a repository of 10 million scene photographs labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world; state-of-the-art Convolutional Neural Networks trained on it provide baselines that significantly outperform previous approaches.
Abstract: The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database along with the Places-CNNs offers a novel resource to guide future progress on scene recognition problems.
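In the context of the citing paper, a Places-style scene classifier supplies the "visual content proposals": the top-k scene categories by softmax score (compare the example proposals quoted in the citation contexts below). A toy sketch with made-up logits:

```python
import numpy as np

# Sketch of deriving visual content proposals from a scene classifier such
# as a Places-CNN: take the top-k categories by softmax score. The category
# list and logits here are invented for illustration.

categories = ["fast food restaurant", "shop front", "bookstore", "street"]
logits = np.array([3.1, 2.4, 0.2, 0.9])                  # from the CNN's final layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top_k = [categories[i] for i in np.argsort(-probs)[:2]]
print(top_k)   # ['fast food restaurant', 'shop front']
```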

3,215 citations


"From Strings to Things: Knowledge-E..." refers background or methods in this paper

  • ...Since category names in our dataset are not exactly the same in Places, we could not perform quantitative analysis on visual content evaluation of places....


  • ...To this end, we rely on Places [55] for scene recognition and a fine-tuned VGG-16 model for representing visual contents from movie posters and book covers....


  • ...Word proposals [16]: Subway, open Visual content proposals [55]: fast food restaurant, shop front...


  • ...We use Places [55] and a VGG-16 finetuned model for recognizing these visual contents for categories-(i) and (ii), respectively....


Journal ArticleDOI
TL;DR: This collaboratively edited knowledgebase provides a common source of data for Wikipedia, and everyone else, to help improve the quality of the encyclopedia.
Abstract: This collaboratively edited knowledgebase provides a common source of data for Wikipedia, and everyone else.
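Facts like those harvested for text-KVQA can be pulled from Wikidata as (entity, relation, value) triples through its public SPARQL endpoint. The query below (occupations of Douglas Adams, entity Q42, property P106) is only a generic example of that kind of access, not the paper's actual crawling pipeline.

```python
import requests

# Minimal example of retrieving triples from Wikidata's public SPARQL
# endpoint; the entity (Q42) and property (P106, occupation) are arbitrary
# examples, unrelated to the entities used to build text-KVQA.

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?occupation ?occupationLabel WHERE {
  wd:Q42 wdt:P106 ?occupation .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "kvqa-demo/0.1 (example script)"},
)
for row in response.json()["results"]["bindings"]:
    print(("Q42", "P106", row["occupationLabel"]["value"]))
```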

2,809 citations


"From Strings to Things: Knowledge-E..." refers background in this paper

  • ..., Wikidata [44], IMDb [1], a book catalogue [13]....


  • ...Our newly introduced dataset is much larger in scale as compared to the three aforementioned works [8, 32, 42] and more importantly, backed up by web-scale knowledge facts harvested from various sources, e.g., Wikidata [44], IMDb [1], a book catalogue [13]....


  • ...To construct these three knowledge bases, we crawl open-source world knowledge bases, e.g., Wikidata [3], IMDb [1] and book catalogue provided by [18] around the anchor entities.1 Each knowledge fact is a triplet connecting two entities with a relation....


  • ...Further, with the access to rich open-source knowledge graphs such as Wikidata [44], we could ask a series of natural questions, such as, Can I get a Sandwich here?, Is this a French brand?, and so on, which are not possible to ask in traditional VQA [5] as well as knowledge-enabled VQA models [47, 48]....


Proceedings Article
01 Apr 2016
TL;DR: This work studies feature learning techniques for graph-structured inputs and achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
Abstract: Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
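For intuition, a single GGNN propagation step can be sketched in NumPy as follows: messages are summed per relation type and each node state is updated with a GRU-style gate. Dimensions, adjacency structure and parameters are illustrative, not those of the cited implementation.

```python
import numpy as np

# Sketch of one gated graph neural network (GGNN) propagation step:
# per-relation message passing followed by a GRU-style state update.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, adj_per_relation, W_msg, Wz, Uz, Wr, Ur, Wh, Uh):
    """H: (N, d) node states; adj_per_relation: dict relation -> (N, N) adjacency."""
    # Message passing: sum over relation types of A_rel @ H @ W_rel
    M = sum(A @ H @ W_msg[rel] for rel, A in adj_per_relation.items())
    z = sigmoid(M @ Wz + H @ Uz)            # update gate
    r = sigmoid(M @ Wr + H @ Ur)            # reset gate
    h_tilde = np.tanh(M @ Wh + (r * H) @ Uh)
    return (1 - z) * H + z * h_tilde        # gated state update

rng = np.random.default_rng(0)
N, d = 4, 8
H = rng.normal(size=(N, d))
adj = {"is_a": np.eye(N, k=1), "has": np.eye(N, k=-1)}      # two relation types
W_msg = {rel: rng.normal(size=(d, d)) for rel in adj}
params = [rng.normal(size=(d, d)) for _ in range(6)]        # Wz, Uz, Wr, Ur, Wh, Uh
H_next = ggnn_step(H, adj, W_msg, *params)                  # (4, 8) updated states
```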

2,518 citations


"From Strings to Things: Knowledge-E..." refers background or methods in this paper

  • ...This motivated us to utilize the capability of graph representation learning in the form of gated graph neural networks (GGNN) [27]....


  • ...A natural choice for this is ‘gated graph neural network’ (GGNN) [27] which is emerging as a powerful tool to perform reasoning over graphs....


  • ...bolic QA [27] to more complex visual reasoning [30]....


  • ..., visual contents, recognized words, question and knowledge facts, and perform a reasoning on a multi-relational graph using a novel gated graph neural network [27] formulation....


  • ...We choose gated graph neural network (GGNN) [27] for this task....
