FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents
Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran
- Vol. 2, pp 1-6
TLDR
This work presents FUNSD, a new dataset for form understanding in noisy scanned documents that aims at extracting and structuring the textual content of forms; it is the first publicly available dataset with comprehensive annotations for the FoUn task.
Abstract:
We present a new dataset for form understanding in noisy scanned documents (FUNSD) that aims at extracting and structuring the textual content of forms. The dataset comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking. To the best of our knowledge, this is the first publicly available dataset with comprehensive annotations to address the FoUn task. We also present a set of baselines and introduce metrics to evaluate performance on the FUNSD dataset, which can be downloaded at https://guillaumejaume.github.io/FUNSD.
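To make the entity labeling/linking task concrete, here is a minimal sketch of how a FUNSD-style annotation could be consumed. It assumes the publicly documented schema (a `form` list of entities, each with an `id`, a semantic `label` such as question/answer, a bounding `box`, word-level boxes, and `linking` pairs of entity ids); the sample annotation below is synthetic, not taken from the dataset.

```python
# Sketch of resolving question-answer links in a FUNSD-style annotation.
# Assumed schema: {"form": [entity, ...]}, entity = {"id", "label", "text",
# "box", "words", "linking"}; the annotation dict here is made up for
# illustration.
from typing import Dict, List, Tuple

annotation = {
    "form": [
        {"id": 0, "label": "question", "text": "Date:",
         "box": [84, 109, 136, 119],
         "words": [{"text": "Date:", "box": [84, 109, 136, 119]}],
         "linking": [[0, 1]]},
        {"id": 1, "label": "answer", "text": "03/04/92",
         "box": [145, 107, 205, 119],
         "words": [{"text": "03/04/92", "box": [145, 107, 205, 119]}],
         "linking": [[0, 1]]},
    ]
}

def entity_pairs(ann: Dict) -> List[Tuple[str, str]]:
    """Resolve 'linking' id pairs into (source_text, target_text) tuples."""
    by_id = {e["id"]: e for e in ann["form"]}
    pairs = set()  # linking pairs are repeated on both entities; dedupe
    for ent in ann["form"]:
        for src, dst in ent.get("linking", []):
            pairs.add((by_id[src]["text"], by_id[dst]["text"]))
    return sorted(pairs)

print(entity_pairs(annotation))  # -> [('Date:', '03/04/92')]
```

Evaluation of entity linking on FUNSD compares exactly such recovered pairs against the annotated ones, so a parser along these lines is typically the first step of any baseline.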
Citations
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
TL;DR: LayoutLMv3 pre-trains multimodal Transformers for Document AI with unified text and image masking, plus a word-patch alignment objective that learns cross-modal alignment by predicting whether the image patch corresponding to a text word is masked.
SelfDoc: Self-Supervised Document Representation Learning
Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, Hongfu Liu
TL;DR: SelfDoc is a task-agnostic pre-training framework for document image understanding that exploits the positional, textual, and visual information of every semantically meaningful component in a document and models the contextualization between blocks of content.
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Rafal Powalski, Lukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka
TL;DR: This paper proposes TILT, a neural network architecture that simultaneously learns layout information, visual features, and textual semantics, achieving state-of-the-art results in extracting information from documents and answering questions that demand layout understanding.
LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
TL;DR: LayoutLMv2 introduces a two-stream multi-modal Transformer encoder that models the interaction among text, layout, and image in a single multimodal framework.
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
TL;DR: LayoutLM is a pre-training model for document image understanding in which text and layout information are jointly learned in a single framework for document-level pre-training, achieving new state-of-the-art results on several downstream tasks.
References
Deep Residual Learning for Image Recognition
TL;DR: This paper proposes a residual learning framework that eases the training of networks substantially deeper than those used previously, and which won 1st place in the ILSVRC 2015 classification task.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
An Overview of the Tesseract OCR Engine
TL;DR: A comprehensive overview of the Tesseract OCR engine, the HP research prototype featured in the UNLV Fourth Annual Test of OCR Accuracy.
EAST: An Efficient and Accurate Scene Text Detector
TL;DR: This work proposes a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes and significantly outperforms state-of-the-art methods in both accuracy and efficiency.