Open Access Book Chapter

TextCaps: A Dataset for Image Captioning with Reading Comprehension.

TL;DR
TextCaps is a large dataset with 145k captions for 28k images, collected to study how to comprehend text in the context of an image; the task requires spatial, semantic, and visual reasoning between multiple text tokens and visual entities such as objects.
Abstract
Image descriptions can help visually impaired people quickly understand image content. While we have made significant progress in automatically describing images and in optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understanding our surroundings. To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.
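
To make the copy-or-paraphrase decision concrete, here is a minimal sketch in the spirit of the pointer-style decoders used as baselines for this task (e.g., the M4C adaptation studied in the paper). All names and shapes are illustrative assumptions, not the paper's exact method:

```python
import torch
import torch.nn.functional as F

def decode_step(decoder_state, vocab_logits, ocr_token_feats):
    """One decoding step that chooses between generating a vocabulary word
    and copying a detected OCR token. Shapes are illustrative:
      decoder_state:   (d,)        current decoder hidden state
      vocab_logits:    (V,)        scores over the fixed vocabulary
      ocr_token_feats: (N_ocr, d)  features of OCR tokens found in the image
    """
    # Dynamic pointer: score each OCR token against the decoder state.
    copy_logits = ocr_token_feats @ decoder_state  # (N_ocr,)
    # A single softmax over vocabulary + OCR scores lets the model decide,
    # per step, whether to paraphrase (generate) or copy scene text.
    return F.softmax(torch.cat([vocab_logits, copy_logits]), dim=0)
```
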


Citations
Book Chapter

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

TL;DR: This paper proposes TILT, a neural network architecture that simultaneously learns layout information, visual features, and textual semantics, achieving state-of-the-art results in extracting information from documents and answering questions that demand layout understanding.
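
As a rough illustration of that "simultaneous" fusion idea, the sketch below sums per-token text, layout, and visual embeddings before a standard Transformer encoder; module names and dimensions are assumptions, not TILT's actual configuration:

```python
import torch.nn as nn

class TiltStyleEmbedding(nn.Module):
    """Fuses text, 2D layout, and visual features into one token embedding
    that a standard Transformer encoder can then process."""
    def __init__(self, vocab_size=30000, dim=256, roi_dim=2048):
        super().__init__()
        self.text = nn.Embedding(vocab_size, dim)  # token semantics
        self.layout = nn.Linear(4, dim)            # (x0, y0, x1, y1) box per token
        self.visual = nn.Linear(roi_dim, dim)      # RoI feature of the token region

    def forward(self, token_ids, boxes, roi_feats):
        # Summing the three modalities lets every encoder layer see textual,
        # spatial, and visual evidence at once.
        return self.text(token_ids) + self.layout(boxes) + self.visual(roi_feats)
```
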
Proceedings Article

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

TL;DR: TextOCR is a dataset for arbitrary-shaped scene text detection and recognition, with 900k words annotated on real images from the TextVQA dataset, enabling end-to-end scene-text-based reasoning on an image.
Proceedings Article

Towards Accurate Text-based Image Captioning with Content Diversity Exploration

TL;DR: This paper proposes an anchor-centred graph (ACG) based method for multi-view caption generation to improve the content diversity of generated captions.
Posted Content

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

TL;DR: This paper argues that a simple attention mechanism can do the same or even a better job without the bells and whistles of multimodal encoder design, and finds that this simple baseline consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks: TextVQA and all three tasks of ST-VQA.
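
For reference, a minimal sketch of the kind of plain scaled dot-product attention such a baseline builds on; the shapes and input choices here are illustrative, not the paper's exact design:

```python
import torch.nn.functional as F

def simple_attention(query, keys, values):
    """Plain scaled dot-product attention. Illustrative shapes:
      query:  (d,)     e.g., a question embedding
      keys:   (n, d)   e.g., visual-object and OCR-token features
      values: (n, d)
    """
    scores = keys @ query / keys.shape[-1] ** 0.5  # (n,) similarity scores
    weights = F.softmax(scores, dim=0)             # attention distribution
    return weights @ values                        # attended feature, (d,)
```
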
Posted Content

Structured Multimodal Attentions for TextVQA

TL;DR: An end-to-end structured multimodal attention (SMA) neural network is proposed to address key challenges in TextVQA.
References
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: This paper introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pretrained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
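
For context, a minimal fine-tuning sketch using the Hugging Face transformers library (not part of the original paper); the sequence-classification variant adds exactly the single extra output layer the summary mentions:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One linear classification head is added on top of the pretrained
# bidirectional encoder; the whole model is then fine-tuned end to end.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

inputs = tokenizer("Text is omnipresent in human environments.",
                   return_tensors="pt")
logits = model(**inputs).logits  # (1, 2) task scores before fine-tuning
```
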
Proceedings Article

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes BLEU, a method for automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
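
A simplified single-reference implementation of the idea: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. Real BLEU is computed at corpus level with multiple references, and the 1e-9 smoothing below is an assumption to avoid log(0):

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU sketch. Inputs are token lists."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        log_prec += math.log(max(clipped, 1e-9) / max(sum(cand.values()), 1))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_prec / max_n)

print(bleu("a sign that reads stop".split(), "a sign that says stop".split()))
```
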
Proceedings Article

Faster R-CNN: towards real-time object detection with region proposal networks

TL;DR: This paper proposes a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals.
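
Faster R-CNN is commonly used off the shelf, e.g., to extract the visual-object features that TextCaps models attend to. A minimal usage sketch, assuming a recent torchvision with the weights API:

```python
import torch
import torchvision

# The RPN inside shares the backbone's convolutional features with the
# detection head, which is what makes region proposals nearly cost-free.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # one RGB image with values in [0, 1]
with torch.no_grad():
    (pred,) = model([image])     # dict with "boxes", "labels", "scores"
print(pred["boxes"].shape, pred["scores"].shape)
```
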
Journal Article

Enriching Word Vectors with Subword Information

TL;DR: This paper proposes an extension of the skip-gram model in which each word is represented as a bag of character n-grams and a word's vector is the sum of these n-gram representations, allowing models to be trained quickly on large corpora and word representations to be computed for words that did not appear in the training data.
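
A toy sketch of the subword idea (this is not fastText's training code; real fastText hashes n-grams into a fixed number of buckets and learns their vectors with a skip-gram objective):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with the boundary markers fastText uses."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, ngram_table, dim=100):
    """A word's vector is the sum of its character n-gram vectors, so even
    words unseen in training (e.g., rare OCR tokens) get a representation."""
    vecs = [ngram_table.setdefault(g, np.random.randn(dim) * 0.01)
            for g in char_ngrams(word)]
    return np.sum(vecs, axis=0)

table = {}
print(word_vector("stopsign", table).shape)  # (100,)
```
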
Proceedings Article

CIDEr: Consensus-based image description evaluation

TL;DR: This paper proposes a novel paradigm for evaluating image descriptions based on human consensus, along with a new automated metric (CIDEr) that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
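
A simplified sketch of the consensus idea for a single n-gram order: TF-IDF weight each sentence's n-grams, then average the cosine similarity against every reference. The full CIDEr averages over n = 1..4 and scales by 10; how the document frequencies `df` and `num_docs` are supplied here is an assumption:

```python
from collections import Counter
import math

def cider_n(candidate, references, df, num_docs, n=4):
    """CIDEr-style score for one n-gram order: mean cosine similarity of
    TF-IDF weighted n-gram vectors between candidate and each reference.
    `df` maps n-grams to document frequencies over the reference corpus."""
    def tfidf(tokens):
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        total = max(sum(counts.values()), 1)
        return {g: (c / total) * math.log(num_docs / max(df.get(g, 1), 1))
                for g, c in counts.items()}

    def cosine(a, b):
        dot = sum(v * b.get(g, 0.0) for g, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na > 0 and nb > 0 else 0.0

    cand = tfidf(candidate)
    return sum(cosine(cand, tfidf(r)) for r in references) / len(references)
```
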