TextCaps: A Dataset for Image Captioning with Reading Comprehension.
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh, et al.
pp. 742–758
TL;DR: TextCaps is a large dataset with 145k captions for 28k images, collected to study how to comprehend text in the context of an image; the task requires spatial, semantic, and visual reasoning between multiple text tokens and visual entities such as objects.
Citations
Book Chapter
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafal Powalski, Lukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka, et al.
TL;DR: Proposes TILT, a neural network architecture that simultaneously learns layout information, visual features, and textual semantics, achieving state-of-the-art results in extracting information from documents and answering questions that demand layout understanding.
Proceedings Article
TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
TL;DR: TextOCR is a dataset for arbitrary-shaped scene text detection and recognition, with 900k annotated words collected on real images from the TextVQA dataset, enabling end-to-end scene-text-based reasoning on an image.
Proceedings Article
Towards Accurate Text-based Image Captioning with Content Diversity Exploration
TL;DR: Zhang et al. propose an anchor-centred graph (ACG) based method for multi-view caption generation that improves the content diversity of generated captions.
Posted Content
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps
TL;DR: Argues that a simple attention mechanism can do the same or even a better job without any bells and whistles of multi-modality encoder design, and finds that this simple baseline model consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks: TextVQA and all three tasks of ST-VQA.
Posted Content
Structured Multimodal Attentions for TextVQA
TL;DR: Proposes an end-to-end structured multimodal attention (SMA) neural network for the TextVQA task.
References
Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: Introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article
Bleu: a Method for Automatic Evaluation of Machine Translation
TL;DR: Proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
Proceedings Article
Faster R-CNN: towards real-time object detection with region proposal networks
TL;DR: Ren et al. propose a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals.
Journal Article
Enriching Word Vectors with Subword Information
TL;DR: Proposes a new approach based on the skip-gram model in which each word is represented as a bag of character n-grams and a word vector is the sum of these representations, allowing models to be trained quickly on large corpora and word representations to be computed for words that did not appear in the training data.
Proceedings Article
CIDEr: Consensus-based image description evaluation
TL;DR: Proposes a novel paradigm for evaluating image descriptions that uses human consensus, along with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.