Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
pp. 6077–6086
TL;DR
In this paper, a bottom-up and top-down attention mechanism is proposed to enable attention to be calculated at the level of objects and other salient image regions, achieving state-of-the-art results on the MSCOCO test server.

Citations
Journal Article
The Benefit of Distraction: Denoising Remote Vitals Measurements Using Inverse Attention
TL;DR: This work uses the inverse of an attention mask to generate a noise estimate, which is then used to denoise temporal observations. It produces state-of-the-art results, increasing the signal-to-noise ratio by up to 5.8 dB, reducing heart-rate and breathing-rate estimation error by as much as 30%, recovering subtle pulse-waveform dynamics, and generalizing from RGB to NIR videos without retraining.
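The inverse-attention idea summarized above can be sketched in a few lines: pixels outside the attention mask serve as a noise estimate that is subtracted from the attended signal. The array shapes and function name here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def inverse_attention_denoise(frames, mask):
    """Sketch: use regions OUTSIDE the attention mask as a noise estimate.

    frames: (T, H, W) array of intensity values over time.
    mask:   (H, W) attention mask in [0, 1] highlighting signal regions.
    Returns a denoised 1-D temporal signal of length T.
    """
    inv = 1.0 - mask                      # inverse attention: background
    eps = 1e-8
    # Attention-weighted spatial averages per frame.
    signal = (frames * mask).sum(axis=(1, 2)) / (mask.sum() + eps)
    noise = (frames * inv).sum(axis=(1, 2)) / (inv.sum() + eps)
    # Subtract the zero-mean noise estimate from the observation.
    return signal - (noise - noise.mean())
```

If a shared noise term contaminates both regions equally, subtracting the background estimate recovers the masked signal up to a constant offset.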
Journal Article
A comparative study of language transformers for video question answering
TL;DR: Unlike previous models, which represent visual features with recurrent neural networks, this model encodes visual concept sequences with a pre-trained language Transformer to capture the complex semantics of video clips.
Journal Article
Human gaze-aware attentive object detection for ambient intelligence
Dae-Yong Cho, Min-Koo Kang +1 more
TL;DR: In this article, a human gaze-aware attentive object detection framework is proposed to detect a user's attentive objects; it shows more precise and robust performance against object-scale variations. The framework detects only the user's single object-of-interest, even when the target object is occluded or extremely small.
Posted Content
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
TL;DR: The visual capabilities of out-of-the-box BERT are evaluated, indicating an innate capacity for BERT-gen to adapt to multi-modal data and text generation, even with little data available, avoiding expensive pre-training.
Journal Article
Vision and Language: from Visual Perception to Content Creation
Tao Mei, Wei Zhang, Ting Yao +2 more
TL;DR: In this paper, the authors review the recent advances along these two dimensions: "vision to language" and "language to vision." More concretely, the former mainly focuses on the development of image/video captioning, as well as typical encoder-decoder structures and benchmarks, while the latter summarizes the technologies of visual content creation.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
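The residual learning idea referenced above reformulates each block to learn F(x) = H(x) - x rather than the mapping H(x) directly, so the identity is easy to represent. A minimal sketch, with fully connected layers standing in for the paper's convolutions and illustrative names:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual block computing relu(F(x) + x).

    Instead of learning a direct mapping H(x), the block learns the
    residual F(x) = H(x) - x, which is easier to optimize when the
    identity mapping is close to optimal.
    """
    relu = lambda z: np.maximum(z, 0.0)
    f = relu(x @ w1) @ w2          # two-layer residual branch F(x)
    return relu(f + x)             # skip connection adds the identity
```

With zero weights the residual branch vanishes and the block passes its (rectified) input through unchanged, which is why very deep stacks of such blocks remain trainable.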
Journal Article
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
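The constant error carousel mentioned above is the additively updated cell state: gradients flow through it across many steps because it is modified only by elementwise gating. A sketch of a single LSTM step, with stacked gate parameters as an illustrative layout:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. The cell state c is the 'constant error carousel':
    it is updated only by elementwise gating, so error can flow across
    many time steps without vanishing.

    x: input (d_in,), h: hidden (d,), c: cell (d,)
    W: (4d, d_in), U: (4d, d), b: (4d,) stacked gate parameters.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h + b
    d = h.shape[0]
    i, f, o, g = z[:d], z[d:2*d], z[2*d:3*d], z[3*d:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated carousel update
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new
```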
Journal Article
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei +11 more
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object-category classification and detection over hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Book Chapter
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick +7 more
TL;DR: A new dataset is presented with the goal of advancing the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
Proceedings Article
You Only Look Once: Unified, Real-Time Object Detection
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
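The unified, single-pass design behind YOLO predicts all boxes at once from one S×S grid tensor. A sketch of decoding such an output tensor; the tensor layout follows the S×S×(B·5+C) shape described in the paper, while the function name and loop structure are illustrative:

```python
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, C=20):
    """Decode a YOLO-style output tensor into boxes and class scores.

    pred: (S, S, B*5 + C) grid where each cell predicts B boxes
    (x, y, w, h, confidence) plus C shared class probabilities.
    Returns (boxes, scores) with one row per predicted box.
    """
    boxes, scores = [], []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            class_probs = cell[B * 5:]
            for k in range(B):
                x, y, w, h, conf = cell[k * 5:(k + 1) * 5]
                # Cell-relative center -> image-relative coordinates.
                cx, cy = (j + x) / S, (i + y) / S
                boxes.append([cx, cy, w, h])
                scores.append(conf * class_probs)  # class-specific confidence
    return np.array(boxes), np.array(scores)
```

A detection pipeline would then threshold these class-specific scores and apply non-maximum suppression to the surviving boxes.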