MIDV-LAIT: A Challenging Dataset for Recognition of IDs with Perso-Arabic, Thai, and Indian Scripts.
05 Sep 2021, pp. 258-272
TL;DR: In this article, the authors present a new dataset for identity document (ID) recognition, MIDV-LAIT, whose textual fields are in Perso-Arabic, Thai, and Indian scripts.
Abstract: In this paper, we present a new dataset for identity document (ID) recognition called MIDV-LAIT. The main feature of the dataset is its textual fields in Perso-Arabic, Thai, and Indian scripts. Since open datasets with real IDs may not be published, we synthetically generated all the images and data; even the faces are generated and do not belong to any particular person. Several datasets have recently appeared for evaluating ID detection, type identification, and recognition, but they cover only Latin-based and Cyrillic-based languages. The proposed dataset is intended to fill this gap and make it easier to evaluate and compare various methods. As a baseline, we process all the textual field images in MIDV-LAIT with Tesseract OCR. The resulting recognition accuracy shows that the dataset is challenging and useful for further research.
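Baseline OCR accuracy on textual fields is commonly reported as a character-level edit-distance metric. A minimal sketch of such a per-field character error rate (CER), assuming this metric; the paper's own evaluation code and exact metric may differ:

```python
def edit_distance(ref, hyp):
    # classic Levenshtein dynamic program over characters
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    # character error rate: edits normalized by reference length
    return edit_distance(ref, hyp) / max(len(ref), 1)

# works on any Unicode script, e.g. Thai ground truth vs. OCR output
print(cer("สวัสดี", "สวัสด"))  # one deletion over six reference characters
```

Because the comparison is over Unicode code points, the same function applies unchanged to Perso-Arabic, Thai, and Indian-script fields.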
TL;DR: The DLC-2021 dataset is presented, which consists of 1424 video clips captured in a wide range of real-world conditions, focused on tasks relating to ID document forensics, and contains images of synthetic IDs with generated owner photos and artificial personal information.
Abstract: Various government and commercial services, including, but not limited to, e-government, fintech, banking, and sharing-economy services, widely use smartphones to simplify service access and user authorization. Many organizations in these areas use identity document analysis systems to improve user personal-data-input processes. The tasks of such systems are not only ID document data recognition and extraction but also fraud prevention, by detecting document forgery or by checking whether the document is genuine. Modern systems of this kind are often expected to operate in unconstrained environments. A significant amount of research has been published on mobile ID document analysis, but the main difficulty for such research is the lack of public datasets, since the subject matter is protected by security requirements. In this paper, we present the DLC-2021 dataset, which consists of 1424 video clips captured in a wide range of real-world conditions and is focused on tasks relating to ID document forensics. The novelty of the dataset is that it contains video frames of color laminated mock ID documents, color unlaminated copies, grayscale unlaminated copies, and screen recaptures of the documents. The proposed dataset complies with the GDPR because it contains images of synthetic IDs with generated owner photos and artificial personal information. For the presented dataset, benchmark baselines are provided for tasks such as screen recapture detection and glare detection. The data presented are openly available on Zenodo.
TL;DR: In this paper, a data-driven approach is proposed for training a memory-efficient local feature descriptor for identity document location and classification on mobile and embedded devices, based on the specifics of document detection in smartphone camera-captured images with a template matching approach.
Abstract: In this paper, we propose a data-driven approach to training a memory-efficient local feature descriptor for identity document location and classification on mobile and embedded devices. The proposed algorithm for retrieving a dataset of patches is based on the specifics of document detection in smartphone camera-captured images with a template matching approach. The retrieved dataset of patches relevant to the domain, which includes splits for feature training, feature selection, and testing, is made public. We train a binary descriptor using the retrieved dataset of patches; each bit of the descriptor relies on a single computationally efficient feature. To estimate the influence of different feature spaces on descriptor performance, we perform descriptor training experiments using gradient-based and intensity-based features. Extensive experiments on identity document location and classification benchmarks showed that the resulting 128- and 192-bit descriptors using gradient-based features outperformed a state-of-the-art 512-bit BEBLID descriptor for arbitrary keypoint matching in all cases except those of extreme projective distortion, and were significantly more efficient under low lighting. The 64-bit gradient-based descriptor obtained within the approach showed better quality than 128- and 256-bit BinBoost descriptors on scanned document images. To evaluate the influence of descriptor size on matching speed, we propose a model based on the required number of processor instructions for computing the Hamming distance between a pair of descriptors on various energy-efficient processor architectures.
20 Apr 2019
26 Jul 2009
TL;DR: The purpose of this database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic.
Abstract: We report on the creation of a database composed of images of Arabic printed words. The purpose of this database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic. The challenges addressed by the database lie in the variability of the sizes, fonts and styles used to generate the images. A focus is also given to low-resolution images, where anti-aliasing generates noise on the characters to be recognized. The database is synthetically generated using a lexicon of 113,284 words, 10 Arabic fonts, 10 font sizes and 4 font styles. The database contains 45,313,600 single-word images totaling more than 250 million characters. Ground truth annotation is provided for each image. The database is called APTI, for Arabic Printed Text Images.
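The stated image count follows directly from the generation grid described above (a quick consistency check, not code from the paper):

```python
# one rendered image per (word, font, size, style) combination
words, fonts, sizes, styles = 113_284, 10, 10, 4
images = words * fonts * sizes * styles
print(images)  # 45313600, matching the 45,313,600 single-word images reported
```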
04 Feb 2013
TL;DR: A generic Optical Character Recognition system for Arabic script languages called Nabocr is presented; it is initially trained to recognize both Urdu Nastaleeq and Arabic Naskh fonts, but can be trained by users for other Arabic script languages.
Abstract: In this paper, we present a generic Optical Character Recognition system for Arabic script languages called Nabocr. Nabocr uses OCR approaches specific to Arabic script recognition. Recognizing Arabic script text is more difficult than Latin text due to the nature of Arabic script, which is cursive and context sensitive. Moreover, Arabic script has different writing styles that vary in complexity. Nabocr is initially trained to recognize both Urdu Nastaleeq and Arabic Naskh fonts; however, it can be trained by users for other Arabic script languages. We have evaluated our system's performance for both Urdu and Arabic. To evaluate Urdu recognition, we have generated a dataset of Urdu text called UPTI (Urdu Printed Text Image Database), which measures different aspects of a recognition system. The performance of our system on clean Urdu text is 91%; on clean Arabic text, it is 86%. Moreover, we have compared our system against Tesseract's newly released Arabic recognition, and the performance of both systems on clean images is almost the same.
TL;DR: The Mobile Identity Document Video dataset (MIDV-500) is presented: 500 video clips for 50 different identity document types with ground truth, enabling research on a wide scope of document analysis problems.
Abstract: A lot of research has been devoted to identity document analysis and recognition on mobile devices. However, no publicly available datasets designed for this particular problem currently exist. A few datasets are useful for associated subtasks, but to facilitate a more comprehensive scientific and technical approach to identity document recognition, more specialized datasets are required. In this paper, we present the Mobile Identity Document Video dataset (MIDV-500), consisting of 500 video clips for 50 different identity document types with ground truth, which enables research on a wide scope of document analysis problems. The paper presents characteristics of the dataset and evaluation results for existing methods of face detection, text line recognition, and document field data extraction. Since an important feature of identity documents is their sensitivity, as they contain personal data, all source document images used in MIDV-500 are either in the public domain or distributed under public copyright licenses. The main goal of this paper is to present the dataset; in addition, and as a baseline, we provide evaluation results for existing methods of face detection, text line recognition, and document data extraction on it.
TL;DR: A page-level handwritten document image dataset of 11 official Indic scripts, composed of 1458 document text-pages written by 463 individuals from various parts of India, is presented and the benchmark results for handwritten script identification (HSI) are reported.
Abstract: Without publicly available datasets, specifically in handwritten document recognition (HDR), we cannot make a fair and/or reliable comparison between methods. In HDR, Indic script document recognition is still at an early stage compared to scripts such as Roman and Arabic. In this paper, we present a page-level handwritten document image dataset (PHDIndic_11) of 11 official Indic scripts: Bangla, Devanagari, Roman, Urdu, Oriya, Gurumukhi, Gujarati, Tamil, Telugu, Malayalam and Kannada. PHDIndic_11 is composed of 1458 document text-pages written by 463 individuals from various parts of India. Further, we report benchmark results for handwritten script identification (HSI). Besides script identification, the dataset can be effectively used in many other applications of document image analysis, such as script sentence recognition/understanding, text-line segmentation, word segmentation/recognition, word spotting, handwritten and machine-printed text separation, and writer identification.