Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
- pp 6077-6086
TL;DR: This paper proposes a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, achieving state-of-the-art results on the MSCOCO test server.
Citations
Proceedings Article
Domain Decluttering: Simplifying Images to Mitigate Synthetic-Real Domain Shift and Improve Depth Estimation
TL;DR: In this article, an attention module is proposed to identify and remove difficult out-of-domain regions in real images in order to improve depth prediction for a model trained primarily on synthetic data.
Posted Content
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
TL;DR: A detailed analysis of the Auto-captions on GIF dataset in comparison to existing video-sentence datasets is presented, along with an evaluation of a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to the video captioning downstream task and shows compelling generalizability on MSR-VTT.
Journal Article
Noise Augmented Double-Stream Graph Convolutional Networks for Image Captioning
TL;DR: Noise Augmented Double-stream Graph Convolutional Networks (NADGCN) are proposed, which exploit additional background context and enhance the generalization of the language model.
Posted Content
Multimodal Learning for Hateful Memes Detection
Yi Zhou, Zhenhao Chen
TL;DR: In this article, the authors focus on multimodal hateful meme detection and propose a novel method that incorporates image captioning into the meme detection process, achieving promising results on the Hateful Memes Detection Challenge.
Posted Content
Spatial-Temporal Transformer for Dynamic Scene Graph Generation
TL;DR: A spatial-temporal transformer (STTran) is proposed, consisting of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within that frame, and (2) a temporal decoder that takes the output of the spatial encoder as input in order to capture the temporal dependencies between frames and infer the dynamic relationships.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won 1st place on the ILSVRC 2015 classification task.
Journal Article
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal Article
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Book Chapter
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
TL;DR: A new dataset is presented with the goal of advancing the state of the art in object recognition by placing it in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings Article
You Only Look Once: Unified, Real-Time Object Detection
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.