Proceedings ArticleDOI

Cognitive Attention Network (CAN) for Text and Image Multimodal Visual Dialog Systems

TL;DR: A cognitive attention network (CAN) is proposed to answer multiple user questions about an image and to identify similar images from past conversations and refer to them during an ongoing question-answering (Q&A) chat.
Abstract
Visual question answering and visual dialog systems are emerging research areas in natural language processing that exploit image and text modalities to convey an understanding of the contexts and attributes in a conversation, as humans do on online chat platforms. These multimodal dialog techniques are enabling the extended use of chatbots in many open and vertical domains. In this paper, we propose the cognitive attention network (CAN), a visual dialog system capable of answering multiple user questions about an image, and also able to identify similar images from past conversations and refer to them during an ongoing question-answering (Q&A) chat. Our model comprises Faster R-CNN, pre-trained BERT, late data fusion, and a memory network serving as a knowledge base for the temporary storage of previous visio-textual dialog data representations. Trained on the VisDial v1.0 benchmark dataset, we achieve a competitive result that outperforms some existing state-of-the-art models.
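The pipeline described above (per-modality encoders, late fusion, and a memory store of past visio-textual turns) can be sketched schematically. All names, dimensions, and the cosine-similarity retrieval below are illustrative assumptions, not the authors' implementation: `img_feat` stands in for a Faster R-CNN region feature and `txt_feat` for a BERT sentence embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs (dimensions are illustrative assumptions).
img_feat = rng.standard_normal(2048)  # e.g. a Faster R-CNN region feature
txt_feat = rng.standard_normal(768)   # e.g. a BERT [CLS] embedding

def late_fuse(img, txt):
    """Late data fusion: each modality is encoded independently,
    then the feature vectors are joined (here, by concatenation)."""
    return np.concatenate([img, txt])

class DialogMemory:
    """Toy memory store for fused visio-textual turn representations.
    Retrieval by cosine similarity is an assumption for this sketch,
    not CAN's exact addressing scheme."""
    def __init__(self):
        self.slots = []

    def write(self, vec):
        # Temporarily store one fused dialog-turn representation.
        self.slots.append(vec)

    def read(self, query):
        # Return the stored representation most similar to the query.
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        return max(self.slots, key=lambda s: cos(query, s))

mem = DialogMemory()
fused = late_fuse(img_feat, txt_feat)   # one visio-textual turn
mem.write(fused)
best = mem.read(fused)                  # retrieve the closest past turn
```

In a real system the memory would let the model resolve references such as "the dog in the earlier photo" by retrieving the fused representation of that past turn.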


References
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Faster R-CNN: towards real-time object detection with region proposal networks

TL;DR: Ren et al. propose a region proposal network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals.
Proceedings ArticleDOI

VQA: Visual Question Answering

TL;DR: The task of free-form and open-ended Visual Question Answering (VQA) is proposed: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Proceedings Article

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

TL;DR: ViLBERT extends the BERT architecture to a multimodal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Proceedings Article

BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

TL;DR: This paper explores multi-task approaches that share a single BERT model with a small number of additional task-specific parameters, obtaining state-of-the-art results on the Recognizing Textual Entailment dataset.