Open Access
Posted Content
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
Citations
Posted Content
Attention U-Net: Learning Where to Look for the Pancreas
Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew C. H. Lee, Mattias P. Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, Ben Glocker, Daniel Rueckert
TL;DR: A novel attention gate (AG) model for medical imaging that automatically learns to focus on target structures of varying shapes and sizes is proposed to eliminate the necessity of using explicit external tissue/organ localisation modules of cascaded convolutional neural networks (CNNs).
Proceedings Article
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TL;DR: The authors balance the VQA dataset by collecting complementary images such that every question in the balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the same question.
Book Chapter
UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
TL;DR: UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Posted Content
VisualBERT: A Simple and Performant Baseline for Vision and Language.
TL;DR: Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Proceedings Article
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
TL;DR: ViLBERT extends the BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; residual networks built on it won 1st place in the ILSVRC 2015 classification task.
Journal Article
Long short-term memory
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Journal Article
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings Article
GloVe: Global Vectors for Word Representation
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.