Proceedings ArticleDOI

Cognitive Attention Network (CAN) for Text and Image Multimodal Visual Dialog Systems

TL;DR: A cognitive attention network (CAN) is proposed to answer multiple user questions about an image and to identify similar images from past conversations and refer to them during an ongoing question-answering (Q&A) chat.
Abstract
Visual question answering and visual dialog systems are emerging research areas in natural language processing that exploit image and text modalities to convey an understanding of the contexts and attributes in a conversation, as humans do on online chat platforms. These multimodal dialog techniques are enabling the extended use of chatbots in many open and vertical domains. In this paper, we propose the cognitive attention network (CAN), a visual dialog system capable of answering multiple user questions about an image, and also able to identify similar images from past conversations and refer to them during an ongoing question-answering (Q&A) chat. Our model comprises Faster R-CNN, pre-trained BERT, late data fusion, and a memory network serving as a knowledge base for the temporary storage of previous visio-textual dialog data representations. Trained on the VisDial v1.0 benchmark dataset, we achieve a competitive result that outperforms some existing state-of-the-art models.
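The pipeline described above (per-modality encoders, late fusion, and a memory store of past visio-textual turns) can be sketched schematically. All names, dimensions, and the cosine-similarity retrieval below are illustrative assumptions, not the authors' implementation: `img_feat` stands in for a Faster R-CNN region feature and `txt_feat` for a BERT sentence embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs (dimensions are illustrative assumptions).
img_feat = rng.standard_normal(2048)  # e.g. a Faster R-CNN region feature
txt_feat = rng.standard_normal(768)   # e.g. a BERT [CLS] embedding

def late_fuse(img, txt):
    """Late data fusion: each modality is encoded independently,
    then the feature vectors are joined (here, by concatenation)."""
    return np.concatenate([img, txt])

class DialogMemory:
    """Toy memory store for fused visio-textual turn representations.
    Retrieval by cosine similarity is an assumption for this sketch,
    not CAN's exact addressing scheme."""
    def __init__(self):
        self.slots = []

    def write(self, vec):
        # Temporarily store one fused dialog-turn representation.
        self.slots.append(vec)

    def read(self, query):
        # Return the stored representation most similar to the query.
        def cos(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        return max(self.slots, key=lambda s: cos(query, s))

mem = DialogMemory()
fused = late_fuse(img_feat, txt_feat)   # one visio-textual turn
mem.write(fused)
best = mem.read(fused)                  # retrieve the closest past turn
```

In a real system the memory would let the model resolve references such as "the dog in the earlier photo" by retrieving the fused representation of that past turn.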


References
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Faster R-CNN: towards real-time object detection with region proposal networks

TL;DR: Ren et al. propose a region proposal network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals.
Proceedings ArticleDOI

VQA: Visual Question Answering

TL;DR: The task of free-form and open-ended Visual Question Answering (VQA) is proposed: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Proceedings Article

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

TL;DR: ViLBERT extends the BERT architecture to a multimodal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Proceedings Article

BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

TL;DR: This paper explores multi-task approaches that share a single BERT model with a small number of additional task-specific parameters, obtaining state-of-the-art results on the Recognizing Textual Entailment dataset.