Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
- pp 6077-6086
TL;DR: This paper proposes a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, achieving state-of-the-art results on the MSCOCO test server.
Citations
Proceedings Article
Context-Aware Group Captioning via Self-Attention and Contrastive Features
TL;DR: This paper introduces a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images. Its framework combines a self-attention mechanism with contrastive feature construction to summarize the common information in each image group while capturing the discriminative information between them.
Proceedings Article
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
TL;DR: The authors propose Text-Aware Pre-training (TAP) for the Text-VQA and Text-Caption tasks, which explicitly incorporates scene text (generated by OCR engines) during pre-training.
Proceedings Article
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
TL;DR: This work proposes “mask-and-predict” pre-training on text-only and image-only corpora, introducing object tags detected by an object recognition model as anchor points to bridge the two modalities. This simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks.
Posted Content
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
Yikuan Li, Hanyin Wang, Yuan Luo
TL;DR: External evaluation on the OpenI dataset shows that the joint embeddings learned by the pre-trained LXMERT, VisualBERT, UNITER and PixelBERT models improve performance by 1.4% in thoracic finding classification tasks compared to a pioneering CNN + RNN model.
Journal Article
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
TL;DR: The authors study the differences between single-stream and dual-stream V&L BERT encoders, show how the two categories can be unified under a single theoretical framework, and conduct controlled experiments to discern the empirical differences between five V&L BERTs.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: This paper proposes a residual learning framework that eases the training of networks substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
Journal Article
Long short-term memory
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal Article
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Book Chapter
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
TL;DR: This paper presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding. To that end, it gathers images of complex everyday scenes containing common objects in their natural context.
Proceedings Article
You Only Look Once: Unified, Real-Time Object Detection
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.