
Dual Attention Networks for Multimodal Reasoning and Matching
Hyeonseob Nam
Search Solutions Inc.
hyeonseob.nam@navercorp.com
Jung-Woo Ha
NAVER Corp.
jungwoo.ha@navercorp.com
Jeonghee Kim
NAVER LABS Corp.
jeonghee.kim@naverlabs.com
Abstract
We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language. DANs attend to specific regions in images and words in text through multiple steps and gather essential information from both modalities. Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively. The reasoning model allows visual and textual attentions to steer each other during collaborative inference, which is useful for tasks such as Visual Question Answering (VQA). In addition, the matching model exploits the two attention mechanisms to estimate the similarity between images and sentences by focusing on their shared semantics. Our extensive experiments validate the effectiveness of DANs in combining vision and language, achieving the state-of-the-art performance on public benchmarks for VQA and image-text matching.
1. Introduction
Vision and language are two central parts of human intelligence for understanding the real world. They are also fundamental components in achieving artificial intelligence, and a tremendous amount of research has been done in each area for decades. Recently, dramatic advances in deep learning have broken the boundaries between vision and language, drawing growing interest in their intersection, such as visual question answering (VQA) [3, 37, 23, 35], image captioning [33, 2], image-text matching [8, 11, 20, 30], visual grounding [24, 9], etc.

One of the recent advances in neural networks is the attention mechanism [21, 4, 33]. It aims to focus on certain aspects of data sequentially and aggregate essential information over time to infer the results, and has been successfully applied to both vision and language. In computer vision, attention-based methods adaptively select a sequence of image regions to extract necessary features [21, 6, 33]. Similarly, attention models for natural language processing highlight specific words or sentences to distill information from input text [4, 25, 15]. These approaches have improved the performance of a wide range of applications in conjunction with deep architectures including convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

Figure 1: Overview of Dual Attention Networks (DANs) for multimodal reasoning and matching. (a) DAN for multimodal reasoning (r-DAN); (b) DAN for multimodal matching (m-DAN). The brightness of image regions and the darkness of words indicate the attention weights predicted by DANs.
Despite the effectiveness of attention in handling both visual and textual data, little work has attempted to establish a connection between visual and textual attention models, which can be highly beneficial in various scenarios. For example, the VQA problem in Figure 1a with the question "What color is the umbrella?" can be efficiently solved by simultaneously focusing on the image region of the umbrella and the word "color". In the example of image-text matching in Figure 1b, the similarity between the image and the sentence can be effectively measured by attending to the specific regions and words that share common semantics, such as "girl" and "pool".
In this paper, we propose Dual Attention Networks (DANs) which jointly learn visual and textual attention models to explore the fine-grained interaction between vision and language. We investigate two variants of DANs, illustrated in Figure 1, referred to as the reasoning-DAN (r-DAN) and the matching-DAN (m-DAN), respectively. The r-DAN collaboratively performs visual and textual attentions using a joint memory which assembles the previous attention results and guides the next attentions. It is suited to the tasks requiring multimodal reasoning such as VQA. On the other hand, the m-DAN separates visual and textual attention models with distinct memories but jointly trains them to capture the shared semantics between images and sentences. This approach eventually finds a joint embedding space which facilitates efficient cross-modal matching and retrieval. Both proposed algorithms closely connect visual and textual attention mechanisms into a unified framework, achieving outstanding performance in VQA and image-text matching problems.
To summarize, the main contributions of our work are as follows:

- We propose an integrated framework of visual and textual attentions, where critical regions and words are jointly located through multiple steps.
- Two variants of the proposed framework are implemented for multimodal reasoning and matching, and applied to VQA and image-text matching.
- Detailed visualization of the attention results validates that our models effectively focus on vital portions of visual and textual data for the given task.
- Our framework demonstrates the state-of-the-art performance on the VQA dataset [3] and the Flickr30K image-text matching dataset [36].
2. Related Work
2.1. Attention Mechanisms
Attention mechanisms allow models to focus on necessary parts of visual or textual inputs at each step of a task. Visual attention models selectively pay attention to small regions in an image to extract core features as well as reduce the amount of information to process. A number of methods have recently adopted visual attention to benefit image classification [21, 28], image generation [6], image captioning [33], visual question answering [35, 26, 32], etc. On the other hand, textual attention mechanisms generally aim to find semantic or syntactic input-output alignments under an encoder-decoder framework, which is especially effective in handling long-term dependency. This approach has been successfully applied to various tasks including machine translation [4], text generation [16], sentence summarization [25], and question answering [15, 32].
2.2. Visual Question Answering (VQA)
VQA is the task of answering a natural-language question about a given image, which requires multimodal reasoning over visual and textual data. It has received a surge of interest since Antol et al. [3] presented a large-scale dataset with free-form and open-ended questions. A simple baseline by Zhou et al. [37] predicts the answer from a concatenation of CNN image features and bag-of-words question features. Several methods adaptively construct a deep architecture depending on the given question. For example, Noh et al. [23] impose a dynamic parameter layer on a CNN which is learned from the question, while Andreas et al. [1] utilize the compositional structure of the question to assemble a collection of neural modules.

One limitation of the above approaches is that they resort to a global image representation which contains noisy or unnecessary information. To address this problem, Yang et al. [35] propose stacked attention networks which perform multi-step visual attention, and Shih et al. [26] use object proposals to identify regions relevant to the given question. Recently, dynamic memory networks [32] integrate an attention mechanism with a memory module, and multimodal compact bilinear pooling [5] is exploited to expressively combine multimodal features and predict attention over the image. These methods commonly employ visual attention to find critical regions, but textual attention has rarely been incorporated into VQA. Although HieCoAtt [18] applies both visual and textual attentions, it performs each step of co-attention independently, without reasoning over previous co-attention outputs. In contrast, our method moves and refines both attentions via multiple reasoning steps based on the memory of previous attentions, which facilitates close interplay between visual and textual data.
2.3. Image-Text Matching
The core issue in image-text matching is measuring the semantic similarity between visual and textual inputs. It is commonly addressed by learning a joint space where image and sentence feature vectors are directly comparable. Hodosh et al. [8] apply canonical correlation analysis (CCA) to find embeddings that maximize the correlation between images and sentences, which is further improved by incorporating deep neural networks [14, 34]. A recent approach by Wang et al. [30] includes structure-preserving constraints within a bidirectional loss function to make the joint space more discriminative. In contrast, Ma et al. [19] construct a CNN to combine an image and sentence fragments into a joint representation, from which the matching score is directly inferred. Image captioning frameworks have also been exploited to estimate the similarity based on the inverse probability of sentences given a query image [20, 29].

To the best of our knowledge, no study has attempted to learn multimodal attention models for image-text matching. Even though Karpathy et al. [11, 10] propose to find the alignments between image regions and sentence fragments, they explicitly compute all pairwise distances between them and estimate the average or best alignment score, which leads to inefficiency. On the other hand, our method automatically attends to the shared concepts between images and sentences while embedding them into a joint space, where cross-modal similarity is directly obtained by a single inner product operation.
3. Dual Attention Networks (DANs)
We present two structures of DANs to consolidate visual and textual attention mechanisms: the r-DAN for multimodal reasoning and the m-DAN for multimodal matching. They share a common framework but differ in their ways of associating visual and textual attentions. We first describe the common framework, including the input representation (Section 3.1) and the attention mechanisms (Section 3.2). Then we illustrate the details of the r-DAN (Section 3.3) and the m-DAN (Section 3.4), applied to VQA and image-text matching, respectively.
3.1. Input Representation
Image representation. The image features are extracted from the 19-layer VGGNet [27] or the 152-layer ResNet [7]. We first rescale images to 448×448 and feed them into the CNNs. In order to obtain feature vectors for different regions, we take the last pooling layer of VGGNet (pool5) or the layer beneath the last pooling layer of ResNet (res5c). Finally, the input image is represented by $\{v_1, \cdots, v_N\}$, where $N$ is the number of image regions and $v_n$ is a 512-dimensional (VGGNet) or 2048-dimensional (ResNet) feature vector corresponding to the n-th region.
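For concreteness, the sketch below extracts such region features from the last pooling layer of a pretrained VGG-19 using PyTorch and torchvision. The preprocessing constants, the weights argument, and the helper name image_regions are illustrative assumptions, not the authors' released pipeline.

```python
# Hypothetical region-feature extraction: the last pooling layer (pool5) of
# VGG-19 on a 448x448 input yields a 14x14x512 map, i.e. N = 196 regions of dim 512.
import torch
import torchvision
from torchvision import transforms

vgg = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.DEFAULT)
vgg.eval()
pool5 = vgg.features  # convolutional blocks up to and including the final max-pool

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def image_regions(pil_image):
    """Return a (196, 512) tensor of region features {v_1, ..., v_N}."""
    x = preprocess(pil_image).unsqueeze(0)     # (1, 3, 448, 448)
    with torch.no_grad():
        fmap = pool5(x)                        # (1, 512, 14, 14)
    return fmap.flatten(2).squeeze(0).t()      # (196, 512): one row per region
```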
Text representation. We employ bidirectional LSTMs to generate text features, as depicted in Figure 2. Given the one-hot encodings of $T$ input words $\{w_1, \cdots, w_T\}$, we first embed the words into a vector space by $x_t = M w_t$, where $M$ is an embedding matrix. Then we feed the vectors into the bidirectional LSTMs:

$$ h_t^{(f)} = \mathrm{LSTM}^{(f)}(x_t, h_{t-1}^{(f)}), \qquad (1) $$
$$ h_t^{(b)} = \mathrm{LSTM}^{(b)}(x_t, h_{t+1}^{(b)}), \qquad (2) $$

where $h_t^{(f)}$ and $h_t^{(b)}$ represent the hidden states at time $t$ from the forward and backward LSTMs, respectively. By adding the two hidden states at each time step, i.e. $u_t = h_t^{(f)} + h_t^{(b)}$, we construct a set of feature vectors $\{u_1, \cdots, u_T\}$, where $u_t$ encodes the semantics of the t-th word in the context of the entire sentence. Note that the models discussed here, including the word embedding matrix and the LSTMs, are trained end-to-end.

Figure 2: Bidirectional LSTMs for text encoding.
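A minimal PyTorch sketch of this text encoder, assuming a padded batch of integer word ids; the class name and vocabulary handling are illustrative rather than the authors' code.

```python
# Bidirectional LSTM encoder: embed the words (x_t = M w_t) and sum the forward
# and backward hidden states to obtain u_t (Equations 1-2).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # plays the role of the matrix M
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):          # word_ids: (B, T) integer tensor
        x = self.embed(word_ids)          # (B, T, dim)
        h, _ = self.lstm(x)               # (B, T, 2*dim), forward/backward concatenated
        h_fwd, h_bwd = h.chunk(2, dim=-1)
        return h_fwd + h_bwd              # u_t for every word, shape (B, T, dim)
```

In practice, padding positions would also need to be masked out of the attention weights; this is omitted here for brevity.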
3.2. Attention Mechanisms
Our method performs visual and textual attentions simultaneously through multiple steps and gathers necessary information from both modalities. In this section, we explain the underlying attention mechanisms employed at each step, which serve as building blocks to compose the entire DANs. For simplicity, we shall omit the bias term b in the following equations.
Visual Attention. Visual attention aims to generate a context vector by attending to certain parts of the input image. At step $k$, the visual context vector $v^{(k)}$ is given by

$$ v^{(k)} = \mathrm{V\_Att}\big(\{v_n\}_{n=1}^{N},\, m_v^{(k-1)}\big), \qquad (3) $$

where $m_v^{(k-1)}$ is a memory vector encoding the information that has been attended until step $k-1$. Specifically, we employ the soft attention mechanism, where the context vector is obtained from a weighted average of the input feature vectors. The attention weights $\{\alpha_{v,n}^{(k)}\}_{n=1}^{N}$ are computed by a 2-layer feed-forward neural network (FNN) and the softmax function:

$$ h_{v,n}^{(k)} = \tanh\big(W_v^{(k)} v_n\big) \odot \tanh\big(W_{v,m}^{(k)} m_v^{(k-1)}\big), \qquad (4) $$
$$ \alpha_{v,n}^{(k)} = \mathrm{softmax}\big(W_{v,h}^{(k)} h_{v,n}^{(k)}\big), \qquad (5) $$
$$ v^{(k)} = \tanh\Big(P^{(k)} \sum_{n=1}^{N} \alpha_{v,n}^{(k)} v_n\Big), \qquad (6) $$

where $W_v^{(k)}$, $W_{v,m}^{(k)}$, and $W_{v,h}^{(k)}$ are the network parameters, $h_{v,n}^{(k)}$ is a hidden state, and $\odot$ denotes element-wise multiplication. In Equation 6, we introduce an additional layer with the weight matrix $P^{(k)}$ in order to embed the visual context vectors into a space compatible with the textual context vectors, as we use pretrained image features $v_n$.
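A minimal PyTorch sketch of one such attention step, implementing Equations 4-6; instantiating it with project_output=False gives the textual attention of Equations 8-10 described next. The class and parameter names are our own assumptions, and bias terms are omitted as in the paper.

```python
# Soft attention over a set of feature vectors, guided by a memory vector.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim=512, project_output=True):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hid_dim, bias=False)    # W_v (or W_u)
        self.w_mem = nn.Linear(hid_dim, hid_dim, bias=False)      # W_{v,m} (or W_{u,m})
        self.w_score = nn.Linear(hid_dim, 1, bias=False)          # W_{v,h} (or W_{u,h})
        # P in Eq. (6); only needed for visual attention on pretrained image features
        self.proj = nn.Linear(feat_dim, hid_dim, bias=False) if project_output else None

    def forward(self, feats, memory):
        # feats: (B, N, feat_dim) candidate vectors; memory: (B, hid_dim)
        h = torch.tanh(self.w_feat(feats)) * torch.tanh(self.w_mem(memory)).unsqueeze(1)
        alpha = torch.softmax(self.w_score(h).squeeze(-1), dim=1)  # attention weights, Eq. (5)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)         # weighted average
        if self.proj is not None:
            context = torch.tanh(self.proj(context))               # Eq. (6)
        return context, alpha
```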
Textual Attention. Textual attention computes a textual context vector $u^{(k)}$ by focusing on specific words in the input sentence at every step:

$$ u^{(k)} = \mathrm{T\_Att}\big(\{u_t\}_{t=1}^{T},\, m_u^{(k-1)}\big), \qquad (7) $$

where $m_u^{(k-1)}$ is a memory vector. The textual attention mechanism is almost identical to the visual attention mechanism. In other words, the attention weights $\{\alpha_{u,t}^{(k)}\}_{t=1}^{T}$ are obtained from a 2-layer FNN and the context vector $u^{(k)}$ is calculated by weighted averaging:

$$ h_{u,t}^{(k)} = \tanh\big(W_u^{(k)} u_t\big) \odot \tanh\big(W_{u,m}^{(k)} m_u^{(k-1)}\big), \qquad (8) $$
$$ \alpha_{u,t}^{(k)} = \mathrm{softmax}\big(W_{u,h}^{(k)} h_{u,t}^{(k)}\big), \qquad (9) $$
$$ u^{(k)} = \sum_t \alpha_{u,t}^{(k)} u_t, \qquad (10) $$

where $W_u^{(k)}$, $W_{u,m}^{(k)}$, and $W_{u,h}^{(k)}$ are the network parameters and $h_{u,t}^{(k)}$ is a hidden state. Unlike the visual attention, it does not need an additional layer after the last weighted averaging because the text features $u_t$ are already trained end-to-end.
3.3. r-DAN for Visual Question Answering
VQA is a representative problem which requires joint reasoning over multimodal data. For this purpose, the r-DAN maintains a joint memory vector $m^{(k)}$ which accumulates the visual and textual information that has been attended until step $k$. It is recursively updated by

$$ m^{(k)} = m^{(k-1)} + v^{(k)} \odot u^{(k)}, \qquad (11) $$

where $v^{(k)}$ and $u^{(k)}$ are the visual and textual context vectors obtained from Equations 6 and 10, respectively. This joint representation concurrently guides the visual and textual attentions, i.e. $m^{(k)} = m_v^{(k)} = m_u^{(k)}$, which allows the two attention mechanisms to closely cooperate with each other. The initial memory vector $m^{(0)}$ is defined based on the global context vectors $v^{(0)}$ and $u^{(0)}$ as

$$ m^{(0)} = v^{(0)} \odot u^{(0)}, \qquad (12) $$
$$ \text{where}\quad v^{(0)} = \tanh\Big(P^{(0)} \frac{1}{N} \sum_n v_n\Big), \qquad (13) $$
$$ u^{(0)} = \frac{1}{T} \sum_t u_t. \qquad (14) $$

By repeating the dual attention (Equations 3 and 7) and the memory update (Equation 11) for $K$ steps, we effectively focus on the key portions of the image and question, and gather relevant information for answering the question. Figure 3 illustrates the overall architecture of the r-DAN in the case of $K = 2$.

Figure 3: r-DAN in case of K = 2.
The final answer is predicted by multi-way classification over the top $C$ frequent answers. We employ a single-layer softmax classifier with a cross-entropy loss, whose input is the final memory $m^{(K)}$:

$$ p_{\mathrm{ans}} = \mathrm{softmax}\big(W_{\mathrm{ans}}\, m^{(K)}\big), \qquad (15) $$

where $p_{\mathrm{ans}}$ represents the probability distribution over the candidate answers.
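Putting the pieces together, the r-DAN reasoning loop of Equations 11-15 can be sketched as follows, reusing the hypothetical SoftAttention module from Section 3.2; the wiring, dimensions, and class names are assumptions rather than the authors' implementation.

```python
# r-DAN: a single joint memory guides both attention mechanisms at every step.
import torch
import torch.nn as nn

class RDAN(nn.Module):
    def __init__(self, img_dim=512, dim=512, num_answers=2000, steps=2):
        super().__init__()
        self.steps = steps
        self.img_proj0 = nn.Linear(img_dim, dim, bias=False)   # P^(0) in Eq. (13)
        self.v_att = nn.ModuleList([SoftAttention(img_dim, dim) for _ in range(steps)])
        self.u_att = nn.ModuleList([SoftAttention(dim, dim, project_output=False)
                                    for _ in range(steps)])
        self.classifier = nn.Linear(dim, num_answers)           # W_ans in Eq. (15)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N, img_dim) region features; txt_feats: (B, T, dim) word features
        v0 = torch.tanh(self.img_proj0(img_feats.mean(dim=1)))  # global visual context, Eq. (13)
        u0 = txt_feats.mean(dim=1)                              # global textual context, Eq. (14)
        m = v0 * u0                                             # joint memory, Eq. (12)
        for k in range(self.steps):
            v_k, _ = self.v_att[k](img_feats, m)                # Eqs. (3)-(6)
            u_k, _ = self.u_att[k](txt_feats, m)                # Eqs. (7)-(10)
            m = m + v_k * u_k                                   # memory update, Eq. (11)
        return self.classifier(m)                               # answer logits for Eq. (15)
```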
3.4. m-DAN for Image-Text Matching
Image-text matching tasks usually involve comparison between numerous images and sentences, where effective and efficient computation of cross-modal similarities is crucial. To achieve this, we aim to learn a joint embedding space which satisfies the following properties. First, the embedding space encodes the shared concepts that frequently co-occur in the image and sentence domains. Moreover, images and sentences are autonomously embedded into the joint space without being paired, so that arbitrary image and sentence vectors in the space are directly comparable.

Our m-DAN jointly learns visual and textual attention models to capture the shared concepts between the two modalities, but separates them at inference time to provide generally comparable representations in the embedding space. Contrary to the r-DAN, which uses a joint memory, the m-DAN maintains separate memory vectors for visual and textual attentions as follows:

$$ m_v^{(k)} = m_v^{(k-1)} + v^{(k)}, \qquad (16) $$
$$ m_u^{(k)} = m_u^{(k-1)} + u^{(k)}, \qquad (17) $$

which are initialized to $v^{(0)}$ and $u^{(0)}$ defined in Equations 13 and 14, respectively. At each step, we compute the similarity $s^{(k)}$ between the visual and textual context vectors by their inner product:

$$ s^{(k)} = v^{(k)} \cdot u^{(k)}. \qquad (18) $$

After performing $K$ steps of the dual attention and memory update, the final similarity $S$ between the given image and sentence becomes

$$ S = \sum_{k=0}^{K} s^{(k)}. \qquad (19) $$

The overall architecture of this model when $K = 2$ is depicted in Figure 4.

Figure 4: m-DAN in case of K = 2.
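A corresponding sketch of the m-DAN with separate memories (Equations 16-19), again reusing the hypothetical SoftAttention module; as above, the wiring and names are assumptions.

```python
# m-DAN: separate visual and textual memories; the similarity is the sum of
# per-step inner products between the two context vectors.
import torch
import torch.nn as nn

class MDAN(nn.Module):
    def __init__(self, img_dim=512, dim=512, steps=2):
        super().__init__()
        self.steps = steps
        self.img_proj0 = nn.Linear(img_dim, dim, bias=False)     # P^(0)
        self.v_att = nn.ModuleList([SoftAttention(img_dim, dim) for _ in range(steps)])
        self.u_att = nn.ModuleList([SoftAttention(dim, dim, project_output=False)
                                    for _ in range(steps)])

    def contexts(self, img_feats, txt_feats):
        """Return the context vectors [v^(0)..v^(K)] and [u^(0)..u^(K)]."""
        v = [torch.tanh(self.img_proj0(img_feats.mean(dim=1)))]  # Eq. (13)
        u = [txt_feats.mean(dim=1)]                              # Eq. (14)
        m_v, m_u = v[0], u[0]                                    # separate memories
        for k in range(self.steps):
            v_k, _ = self.v_att[k](img_feats, m_v)
            u_k, _ = self.u_att[k](txt_feats, m_u)
            m_v = m_v + v_k                                      # Eq. (16)
            m_u = m_u + u_k                                      # Eq. (17)
            v.append(v_k)
            u.append(u_k)
        return v, u

    def forward(self, img_feats, txt_feats):
        v, u = self.contexts(img_feats, txt_feats)
        # S(v, u) = sum_k  v^(k) . u^(k)   (Equations 18-19)
        return sum((vk * uk).sum(dim=-1) for vk, uk in zip(v, u))
```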
This network is trained with a bidirectional max-margin ranking loss, which is widely adopted for multimodal similarity learning [11, 10, 13, 30]. For each correct pair of an image and a sentence $(v, u)$, we additionally sample a negative image $v^-$ and a negative sentence $u^-$ to construct two negative pairs $(v^-, u)$ and $(v, u^-)$. Then the loss function becomes

$$ L = \sum_{(v,u)} \Big\{ \max\big(0,\, m - S(v, u) + S(v^-, u)\big) + \max\big(0,\, m - S(v, u) + S(v, u^-)\big) \Big\}, \qquad (20) $$

where $m$ is a margin constraint. By minimizing this function, the network is trained to focus, through the visual and textual attention mechanisms, on the common semantics that appear only in correct image-sentence pairs.
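A minimal sketch of this loss for a batch in which one negative image and one negative sentence have already been sampled per positive pair; the batching and sampling strategy are assumptions, and model refers to the hypothetical MDAN sketch above.

```python
# Bidirectional max-margin ranking loss of Equation (20).
import torch

def ranking_loss(model, img, txt, img_neg, txt_neg, margin=100.0):
    """img/txt: features of matching pairs; img_neg/txt_neg: sampled negatives."""
    s_pos = model(img, txt)             # S(v, u)
    s_neg_img = model(img_neg, txt)     # S(v^-, u)
    s_neg_txt = model(img, txt_neg)     # S(v, u^-)
    loss = (torch.clamp(margin - s_pos + s_neg_img, min=0.0) +
            torch.clamp(margin - s_pos + s_neg_txt, min=0.0))
    return loss.sum()
```

The default margin of 100 follows the experimental setup described in Section 4.1.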
At inference time, an arbitrary image or sentence is embedded into the joint space by concatenating its context vectors:

$$ z_v = [v^{(0)}; \cdots; v^{(K)}], \qquad (21) $$
$$ z_u = [u^{(0)}; \cdots; u^{(K)}], \qquad (22) $$

where $z_v$ and $z_u$ are the representations for image $v$ and sentence $u$, respectively. Note that these vectors are obtained via separate pipelines of visual and textual attentions, i.e. the learned shared concepts are revealed from an image or sentence itself, not from an image-sentence pair. The similarity between two vectors in the joint space is simply computed by their inner product, e.g. $S(v, u) = z_v \cdot z_u$, which is equivalent to the output of the network in Equation 19.
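The separate inference pipelines can be sketched as below, on top of the hypothetical MDAN module: each modality runs only its own attention branch, and the inner product of the concatenated embeddings reproduces the summed per-step similarities of Equation 19.

```python
# Inference-time embeddings z_v and z_u (Equations 21-22), each computed from a
# single modality; z_v . z_u equals the S(v, u) returned by MDAN.forward.
import torch

def embed_image(model, img_feats):
    v = [torch.tanh(model.img_proj0(img_feats.mean(dim=1)))]   # v^(0)
    m_v = v[0]
    for k in range(model.steps):
        v_k, _ = model.v_att[k](img_feats, m_v)
        m_v = m_v + v_k
        v.append(v_k)
    return torch.cat(v, dim=-1)          # z_v = [v^(0); ...; v^(K)]

def embed_text(model, txt_feats):
    u = [txt_feats.mean(dim=1)]          # u^(0)
    m_u = u[0]
    for k in range(model.steps):
        u_k, _ = model.u_att[k](txt_feats, m_u)
        m_u = m_u + u_k
        u.append(u_k)
    return torch.cat(u, dim=-1)          # z_u = [u^(0); ...; u^(K)]
```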
4. Experiments
4.1. Experimental Setup
We fix all the hyper-parameters applied to both the r-DAN and the m-DAN. The number of attention steps K is set to 2, which empirically shows the best performance. The dimension of every hidden layer, including the word embeddings, LSTMs, and attention models, is set to 512. We train our networks by stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, weight decay of 0.0005, dropout ratio of 0.5, and gradient clipping at 0.1. The network is trained for 60 epochs, where the learning rate is dropped to 0.01 after 30 epochs. A minibatch for the r-DAN consists of 128 ⟨image, question⟩ pairs, and a minibatch for the m-DAN consists of 128 ⟨positive image, positive sentence, negative image, negative sentence⟩ quadruplets. The number of possible answers C for VQA is set to 2000, and the margin m for the loss function in Equation 20 is set to 100.
4.2. Evaluation on Visual Question Answering
4.2.1 Dataset and Evaluation Metric
We evaluate the r-DAN on the Visual Question Answering (VQA) dataset [3], which contains approximately 200K real images from the MSCOCO dataset [17]. Each image is associated with three questions, and each question is labeled with ten answers by human annotators. The dataset is typically divided into four splits: train (80K images), val (40K images), test-dev (20K images), and test-std (20K images). We train our model using train and val, validate with test-dev, and evaluate on test-std. There are two forms of tasks, open-ended and multiple-choice, which require answering each question without and with a set of candidate answers, respectively. For both tasks, we follow the evaluation metric used in [3]:

$$ \mathrm{Acc}(\hat{a}) = \min\left(\frac{\#\text{humans that labeled } \hat{a}}{3},\, 1\right), \qquad (23) $$

where $\hat{a}$ is a predicted answer.
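Equation 23 can be computed directly as below; the answer-string normalization and the averaging over annotator subsets performed by the official VQA evaluation script are omitted here.

```python
# Consensus accuracy of Equation (23): an answer counts as fully correct when at
# least three of the ten human annotators gave it.
def vqa_accuracy(predicted, human_answers):
    """predicted: answer string; human_answers: list of ten annotator strings."""
    return min(human_answers.count(predicted) / 3.0, 1.0)

# Example: two of ten annotators answered "blue"  ->  accuracy 2/3
assert abs(vqa_accuracy("blue", ["blue"] * 2 + ["red"] * 8) - 2 / 3) < 1e-9
```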
4.2.2 Results and Analysis
The performance of r-DAN compared with state-of-the-art VQA systems is presented in Table 1, where our method
References

[4] Neural Machine Translation by Jointly Learning to Align and Translate.
[7] Deep Residual Learning for Image Recognition.
[17] Microsoft COCO: Common Objects in Context.
[27] Very Deep Convolutional Networks for Large-Scale Image Recognition.
[33] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.