
Showing papers in "International Journal on Document Analysis and Recognition in 2021"


Journal ArticleDOI
TL;DR: A deep neural network model with an encoder-decoder architecture that translates images of math formulas into their LaTeX markup sequences and shows state-of-the-art performance on both sequence-based and image-based evaluation metrics.
Abstract: In this paper, we propose a deep neural network model with an encoder–decoder architecture that translates images of math formulas into their LaTeX markup sequences. The encoder is a convolutional neural network that transforms images into a group of feature maps. To better capture the spatial relationships of math symbols, the feature maps are augmented with 2D positional encoding before being unfolded into a vector. The decoder is a stacked bidirectional long short-term memory model integrated with the soft attention mechanism, which works as a language model to translate the encoder output into a sequence of LaTeX tokens. The neural network is trained in two steps. The first step is token-level training using the maximum likelihood estimation as the objective function. At completion of the token-level training, the sequence-level training objective function is employed to optimize the overall model based on the policy gradient algorithm from reinforcement learning. Our design also overcomes the exposure bias problem by closing the feedback loop in the decoder during sequence-level training, i.e., feeding in the predicted token instead of the ground truth token at every time step. The model is trained and evaluated on the IM2LATEX-100 K dataset and shows state-of-the-art performance on both sequence-based and image-based evaluation metrics.
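As an illustration of the 2D positional-encoding step described in this abstract, here is a minimal PyTorch-style sketch (not the authors' code); the sinusoidal scheme and the split of channels between the x and y axes are assumptions in line with common practice.

```python
import math
import torch

def positional_encoding_2d(d_model: int, height: int, width: int) -> torch.Tensor:
    """Sinusoidal 2D positional encoding: half the channels encode the x position,
    half encode the y position. Returns (d_model, height, width) to be added to
    the CNN feature maps before they are unfolded for the decoder."""
    assert d_model % 4 == 0, "d_model must be divisible by 4"
    pe = torch.zeros(d_model, height, width)
    d = d_model // 2
    div_term = torch.exp(torch.arange(0.0, d, 2) * -(math.log(10000.0) / d))
    pos_w = torch.arange(width).float().unsqueeze(1)   # (W, 1)
    pos_h = torch.arange(height).float().unsqueeze(1)  # (H, 1)
    pe[0:d:2] = torch.sin(pos_w * div_term).t().unsqueeze(1).repeat(1, height, 1)
    pe[1:d:2] = torch.cos(pos_w * div_term).t().unsqueeze(1).repeat(1, height, 1)
    pe[d::2] = torch.sin(pos_h * div_term).t().unsqueeze(2).repeat(1, 1, width)
    pe[d + 1::2] = torch.cos(pos_h * div_term).t().unsqueeze(2).repeat(1, 1, width)
    return pe

# usage (hypothetical shapes): features = features + positional_encoding_2d(C, H, W)
```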

47 citations


Journal ArticleDOI
TL;DR: In this article, deep neural networks are utilized through different DenseNet and Xception architectures, being further boosted by means of data augmentation and test time augmentation, and the comparison of the proposed method with various state-of-the-art alternatives is performed.
Abstract: In spite of the various applications of digit, letter and word recognition, only a few studies have dealt with Persian scripts. In this paper, deep neural networks are utilized through different DenseNet and Xception architectures, further boosted by means of data augmentation and test-time augmentation. Dividing the datasets into training, validation and test sets, and utilizing k-fold cross-validation, the proposed method is compared with various state-of-the-art alternatives. Three datasets, HODA, Sadri and Iranshahr, are used, which offer the most comprehensive collections of samples in terms of handwriting styles and the forms each letter may take depending on its position within a word. On the HODA dataset, we achieve recognition rates of 99.49% and 98.10% for digits and characters, with 99.72%, 89.99% and 98.82% for digits, characters and words from the Sadri dataset, respectively, as well as 98.99% for words from the Iranshahr dataset, each of which outperforms the most advanced alternative networks, namely ResNet50 and VGG16. An additional contribution of the paper is its capability of recognizing words as a holistic image classification task. This improves the resulting speed and versatility significantly, as it does not require explicit character models, unlike earlier alternatives such as hidden Markov models and convolutional recursive neural networks. In addition, computation times have been compared with alternative state-of-the-art models and better performance has been observed.
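The test-time augmentation mentioned above can be sketched as follows; this is a generic illustration, not the authors' pipeline, and the transform ranges and the helper name predict_with_tta are assumptions.

```python
import torch
import torchvision.transforms as T

def predict_with_tta(model: torch.nn.Module, image: torch.Tensor, n_aug: int = 8) -> torch.Tensor:
    """Average class probabilities over randomly augmented copies of one test image (C, H, W)."""
    augment = T.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.9, 1.1))
    model.eval()
    with torch.no_grad():
        batch = torch.stack([augment(image) for _ in range(n_aug)])  # one random copy per pass
        probs = torch.softmax(model(batch), dim=1)
    return probs.mean(dim=0)  # averaged class probabilities
```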

12 citations


Journal ArticleDOI
TL;DR: This paper develops a method for Vietnamese identity card recognition based on a deep feature network that achieves an accuracy of more than 96.7% and 89.7% at the character and word levels, respectively.
Abstract: Optical character recognition (OCR) is a technology for automatically converting text in images into data strings for editing, indexing, and searching. The strings can be applied to many tasks, such as digitizing old documents, translating into other languages, or testing and verifying text positions. Recently, Know Your Customer (KYC) has become an industry standard for making sure that people are who they say they are. While the scope of Know Your Customer is constantly expanding, ID verification is still a crucial first step in KYC processes. Mobile OCR is one of the technological solutions that is making this part of KYC easier than ever for customers to comply with. KYC processes require financial services companies to verify the identities of their customers, using OCR to extract data by reading IDs, bank cards, and documents. In this paper, we develop a method for Vietnamese identity card recognition based on a deep feature network. On several major data fields of identity cards, it achieves an accuracy of more than 96.7% and 89.7% at the character and word levels, respectively.

11 citations


Journal ArticleDOI
TL;DR: Arrow R-CNN as mentioned in this paper extends the Faster R-CNN object detector with an arrow head and tail keypoint predictor and a diagram-aware postprocessing method to improve the recognition of handwritten diagrams.
Abstract: We address the problem of offline handwritten diagram recognition. Recently, it has been shown that diagram symbols can be directly recognized with deep learning object detectors. However, object detectors are not able to recognize the diagram structure. We propose Arrow R-CNN, the first deep learning system for joint symbol and structure recognition in handwritten diagrams. Arrow R-CNN extends the Faster R-CNN object detector with an arrow head and tail keypoint predictor and a diagram-aware postprocessing method. We propose a network architecture and data augmentation methods targeted at small diagram datasets. Our diagram-aware postprocessing method addresses the insufficiencies of standard Faster R-CNN postprocessing. It reconstructs a diagram from a set of symbol detections and arrow keypoints. Arrow R-CNN improves state-of-the-art substantially: on a scanned flowchart dataset, we increase the rate of recognized diagrams from 37.7 to 78.6%.
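The abstract does not detail the diagram-aware postprocessing; the sketch below only illustrates the general idea of attaching arrow head/tail keypoints to the nearest detected symbol boxes. The distance heuristic and function names are assumptions, not the paper's exact procedure.

```python
import numpy as np

def point_to_box_distance(point, box):
    """Distance from a keypoint (x, y) to the closest point of a box (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    dx = max(x1 - x, 0.0, x - x2)
    dy = max(y1 - y, 0.0, y - y2)
    return float(np.hypot(dx, dy))

def reconstruct_edges(symbol_boxes, arrows):
    """For each arrow (head_keypoint, tail_keypoint), return (source, target) symbol indices
    by snapping the tail and the head to their nearest symbol boxes."""
    edges = []
    for head, tail in arrows:
        src = int(np.argmin([point_to_box_distance(tail, b) for b in symbol_boxes]))
        dst = int(np.argmin([point_to_box_distance(head, b) for b in symbol_boxes]))
        edges.append((src, dst))
    return edges
```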

11 citations


Journal ArticleDOI
TL;DR: In this article, the task of instance segmentation in the document image domain is defined, which is especially important in complex layouts whose contents should interact for the proper rendering of the page, e.g., the proper text wrapping around an image.
Abstract: Information extraction is a fundamental task of many business intelligence services that entail massive document processing. Understanding a document page structure in terms of its layout provides contextual support which is helpful in the semantic interpretation of the document terms. In this paper, inspired by the progress of deep learning methodologies applied to the task of object recognition, we transfer these models to the specific case of document object detection, reformulating the traditional problem of document layout analysis. Moreover, we make an important contribution over prior art by defining the task of instance segmentation in the document image domain. An instance segmentation paradigm is especially important in complex layouts whose contents should interact for the proper rendering of the page, e.g., the proper text wrapping around an image. Finally, we provide an extensive evaluation, both qualitative and quantitative, that demonstrates the superior performance of the proposed methodology over the current state of the art.
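The abstract does not name a specific detector; purely as an illustration of applying off-the-shelf instance segmentation to document objects, a torchvision Mask R-CNN can have its heads replaced for document classes. A sketch under that assumption, not the authors' exact model:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def document_maskrcnn(num_classes: int) -> torch.nn.Module:
    """Mask R-CNN with its box and mask heads replaced for document-object classes
    (e.g. text block, figure, table, plus background)."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    in_feats = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)
    in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, num_classes)
    return model
```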

10 citations


Journal ArticleDOI
TL;DR: In this paper, four data augmentation strategies are employed to generate more shape and dynamic variations and improve the performance of recognition systems trained on small datasets, addressing the lack of large training data, a serious issue investigated by many studies.
Abstract: The lack of large training data in the context of deep learning applications is a serious issue investigated by many studies that deal with this challenge. In this paper, we introduce new data augmentation methods that generate more shape and dynamic variations to improve the performance of recognition systems using small datasets. Four data augmentation strategies are employed in our work. The first strategy employs geometric methods that include the italicity angle, the change of magnitude ratio, and the baseline inclination angle. The second strategy applies a frequency treatment that attenuates or amplifies the high harmonics of the trajectory to generate modified handwriting styles. The third strategy employs the beta-elliptic model to extract a combined static and dynamic representation of the handwritten trajectory, which undergoes limited random changes around its parameters in order to generate more modified samples. The hybrid strategy combines these strategies to maximize variations of the online handwriting trajectory (OHT). We evaluated our data augmentation approach in the context of multi-lingual online handwriting recognition (OHR) tasks using an end-to-end CNN architecture. Four databases, ADAB, ALTEC-OnDB and Online_KHATT for Arabic script, and UNIPEN for Latin characters, are used to validate the proposed strategy. The obtained results show the effectiveness and advantage of the adopted strategies compared with those registered before database extension or reported by state-of-the-art systems.
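A minimal numpy sketch of the geometric strategy (italicity angle via shear, magnitude ratio via scaling, baseline inclination via rotation) applied to an online trajectory of (x, y) points; the parameter ranges and the function name are assumptions, not the paper's settings.

```python
import numpy as np

def augment_trajectory(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random shear (italicity angle), scaling (magnitude ratio) and rotation
    (baseline inclination) to an online handwriting trajectory of shape (N, 2)."""
    slant = np.deg2rad(rng.uniform(-10, 10))    # italicity angle
    scale = rng.uniform(0.9, 1.1)               # magnitude ratio
    incline = np.deg2rad(rng.uniform(-5, 5))    # baseline inclination angle

    shear = np.array([[1.0, np.tan(slant)], [0.0, 1.0]])
    rot = np.array([[np.cos(incline), -np.sin(incline)],
                    [np.sin(incline),  np.cos(incline)]])
    centre = points.mean(axis=0)
    return (points - centre) @ shear.T @ rot.T * scale + centre

# usage: extra_sample = augment_trajectory(trajectory, np.random.default_rng(0))
```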

10 citations


Journal ArticleDOI
TL;DR: Qualitative and quantitative comparisons with the state-of-the-art methods demonstrate the superiority of the proposed SKFont method, which resolves long overdue shortfalls such as blurriness, breaking, and a lack of delivery of delicate shapes and styles by using the ‘skeleton-driven’ conditional deep adversarial network.
Abstract: In our research, we study the problem of font synthesis using an end-to-end conditional deep adversarial network with a small sample of Korean characters (Hangul). Hangul comprises 11,172 characters, which are composed by writing components in multiple placement patterns. Traditionally, font design has required heavy human labor, easily taking one year to finish one style set. Even with the help of programmable approaches, it still takes a long time and cannot escape the limitations on the freedom to change parameters. Many attempts have been made with deep neural networks to generate characters without any human intervention. Our research focuses on an end-to-end deep learning model, the Skeleton-Driven Font generator (SKFont): when given 114 samples, the system automatically generates the rest of the characters in the same given font style. SKFont involves three steps: First, it generates complete target font characters by observing 114 target characters. Then, it extracts the skeletons (structures) of the synthesized characters obtained from the first step. This process drives the system to sustain the main structure of the characters throughout the whole generation process. Finally, it transfers the style of the target font onto these learned structures. Our study resolves long overdue shortfalls such as blurriness, breaking, and a lack of delivery of delicate shapes and styles by using the ‘skeleton-driven’ conditional deep adversarial network. Qualitative and quantitative comparisons with the state-of-the-art methods demonstrate the superiority of the proposed SKFont method.
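The skeleton-extraction step is not specified in detail here; the following generic scikit-image sketch merely illustrates how a one-pixel-wide skeleton can be obtained from a binary glyph image (the tooling and threshold are assumptions, not the authors' implementation).

```python
import numpy as np
from skimage.morphology import skeletonize

def glyph_skeleton(glyph: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Binarize a grayscale glyph image (H, W) and return its one-pixel-wide skeleton."""
    binary = glyph < threshold   # assume dark strokes on a light background
    return skeletonize(binary)   # boolean (H, W) skeleton mask
```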

7 citations


Journal ArticleDOI
TL;DR: The release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books, and a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA), making it an outstanding baseline model to challenge.
Abstract: Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.

7 citations


Journal ArticleDOI
TL;DR: A comprehensive survey of various techniques available for identification and recognition of multilingual scripts from the last few decades that are mainly focused on Indic scripts is presented in this article, where some potential non-Indic script identification works are also incorporated for ease of understanding.
Abstract: Script recognition has many real-life applications like optical character recognition, document archiving, writer identification, searching within the documents, etc. Automatic script recognition from multilingual documents is a stimulating task, where the system must identify and recognize several types of scripts that can be available on a single page. In offline script recognition, printed or handwritten documents are firstly scanned followed by the process of script recognition, whereas in online script recognition documents are already in soft-copy form. Most of the script recognition techniques presented by researchers so far are based on traditional image processing frameworks. But nowadays, it is observed that Deep Learning-based techniques are more capable of achieving a script recognition task efficiently as well as accurately. This paper provides a comprehensive survey of various techniques available for identification and recognition of multilingual scripts from the last few decades that are mainly focused on Indic scripts. However, some potential non-Indic script identification works are also incorporated for ease of understanding. We hope that this survey can act as a compendium as well as provide future directions to researchers for developing generic OCRs.

7 citations


Journal ArticleDOI
TL;DR: A complete framework for text line segmentation in historical Arabic or Latin document images and a post-processing step based on topological structure analysis is introduced to extract complete text lines (including the ascender and descender components).
Abstract: One of the most important preliminary tasks in a transcription system for historical document images is text line segmentation. Nevertheless, this task remains complex due to the idiosyncrasies of ancient document images. In this article, we present a complete framework for text line segmentation in historical Arabic or Latin document images. A two-step procedure is described. First, a deep fully convolutional network (FCN) architecture is applied to extract the main area covering the text core. In order to select the highest performing FCN architecture, a thorough performance benchmarking of the most recent and widely used FCN architectures for segmenting text lines in historical Arabic or Latin document images has been conducted. Then, a post-processing step, which is based on topological structure analysis, is introduced to extract complete text lines (including the ascender and descender components). This second step aims at refining the obtained FCN results and at providing sufficient information for text recognition. Our experiments have been carried out using a large number of Arabic and Latin document images collected from the Tunisian national archives as well as other benchmark datasets. Quantitative and qualitative assessments are reported in order to, firstly, pinpoint the strengths and weaknesses of the different FCN architectures and, secondly, illustrate the effectiveness of the proposed post-processing method.

7 citations


Journal ArticleDOI
TL;DR: In this article, a self-supervised approach is proposed for document fragment association using deep metric learning methods, addressing the reconstruction of ancient papyri, a task that is extremely time-consuming and tedious to perform by hand.
Abstract: This work focuses on document fragment association using deep metric learning methods. More precisely, we are interested in ancient papyri fragments that need to be reconstructed prior to their analysis by papyrologists. This is a challenging task to automate using machine learning algorithms because labeled data is rare, often incomplete, imbalanced and of inconsistent conservation states. However, there is a real need for such software in the papyrology community, as the process of reconstructing the papyri by hand is extremely time-consuming and tedious. In this paper, we explore ways in which papyrologists can obtain useful matching suggestions on new data using deep convolutional Siamese networks. We emphasize low-to-no human intervention for annotating images. We show that the from-scratch self-supervised approach we propose is more effective than using knowledge transfer from a large dataset, the former achieving a top-1 accuracy score of 0.73 on a retrieval task involving 800 fragments.
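A minimal PyTorch sketch in the spirit of the deep convolutional Siamese setup described above, pairing a shared embedding network with a contrastive loss; the backbone, embedding size and margin are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torchvision

class FragmentEmbedder(nn.Module):
    """Shared CNN that maps a papyrus fragment image to an L2-normalized embedding."""
    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.normalize(self.backbone(x), dim=1)

def contrastive_loss(z1, z2, same, margin: float = 0.5) -> torch.Tensor:
    """same = 1 if the two fragments come from the same papyrus, else 0."""
    d = (z1 - z2).pow(2).sum(dim=1).sqrt()
    return (same * d.pow(2) + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()
```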

Journal ArticleDOI
TL;DR: A shape-aware dual-stream convolutional neural network for the segmentation of narrative text boxes and speech balloons of various shapes is presented and a method for the development of ground-truth images in a semiautomatic way is proposed.
Abstract: Most recent research on comic document images has focused on reading and distributing comics digitally, owing to the evolution of technology. In this work, the extraction of narrative text boxes and speech balloons, which contain the conversations among comic characters along with their feelings, is presented. Due to the huge variety of drawing styles, the shape of these speech balloons is complex, and extraction is difficult. We present a shape-aware dual-stream convolutional neural network for the segmentation of narrative text boxes and speech balloons of various shapes. In our dual-stream architecture, an added shape module processes edge information of the speech balloons and narrative texts alongside the main module. Later, the concatenation of these two modules produces more accurate segmentation of speech balloons and narrative text boxes. The proposed method achieves significant performance improvements in terms of both region accuracy (mIoU) and boundary accuracy (F-measure and Hausdorff distance) compared to other state-of-the-art methods on various publicly available comic datasets (namely the eBDtheque, DCM and Manga 109 dataset subsets) in different languages. In addition, we have developed a new dataset (BCBId) for comics in Bangla, the eighth most spoken language in the world, and propose a method for developing ground-truth images in a semiautomatic way.

Journal ArticleDOI
TL;DR: In this paper, a model based on convolutional neural networks was proposed to extract the machine-readable zone (MRZ) information from digital images of passports of arbitrary orientation and size.
Abstract: Detecting and extracting information from the machine-readable zone (MRZ) on passports and visas is becoming increasingly important for verifying document authenticity. However, computer vision methods for performing similar tasks, such as optical character recognition, fail to extract the MRZ from digital images of passports with reasonable accuracy. We present a specially designed model based on convolutional neural networks that is able to successfully extract MRZ information from digital images of passports of arbitrary orientation and size. Our model achieves 100% MRZ detection rate and 99.25% character recognition macro-f1 score on a passport and visa dataset.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a method to reveal the under-text completely using deep generative networks, leveraging prior spatial information of the under-text script: the under-text is generated by a separately trained generative network and matched to the original palimpsest image after mixing it with the foreground text.
Abstract: A palimpsest is a historical manuscript in which the original text (termed under-text) was erased and overwritten with another script in order to recycle the parchment. One of the main challenges in studying palimpsests is to reveal the under-text. Due to the development of multi-spectral imaging, the original text can sometimes be recovered through material differences of inks and parchment (Easton et al., in: 2011 19th European signal processing conference, IEEE, 2011). However, generally, the revealed text can be observed only partially due to the overlap with newer text and degradation of the material. In this work, we propose revealing the under-text completely using deep generative networks, by leveraging prior spatial information of the under-text script. To optimize the under-text, we mimic the process of palimpsest creation. This is done by generating the under-text from a separately trained generative network to match it to the palimpsest image after mixing it with foreground text. The mixing process is represented by a separate neural network, that is optimized with the under-text image to match the original palimpsest. We also add an additional background generative network to compensate for the unevenness of the background. We propose a novel way of training the background generative network, that does not require isolated background samples and can use any region with layers of text. This paper illustrates the first known attempt to solve palimpsest text layer separation with deep generative networks. We evaluate our method performance on artificial and real palimpsest manuscripts by measuring character recognition and pixel-wise accuracy of the reconstructed under-text.

Journal ArticleDOI
TL;DR: This work uses the concept of semantic segmentation with the help of a multi-scale convolutional neural network to segment text lines from both flatbed scanned/camera-captured heavily warped printed and handwritten documents.
Abstract: Paper documents are ideal sources of useful information and have a profound impact on every aspect of human lives. These documents may be printed or handwritten and contain information as combinations of texts, figures, tables, charts, etc. This paper proposes a method to segment text lines from both flatbed scanned/camera-captured heavily warped printed and handwritten documents. This work uses the concept of semantic segmentation with the help of a multi-scale convolutional neural network. The results of line segmentation using the proposed method outperform a number of similar proposals already reported in the literature. The performance and efficacy of the proposed method have been corroborated by the test result on a variety of publicly available datasets, including ICDAR, Alireza, IUPR, cBAD, Tobacco-800, IAM, and our dataset.

Journal ArticleDOI
TL;DR: Cui et al. as mentioned in this paper proposed a triplet attention Mogrifier network (TAMN) for printed Mongolian text recognition, which uses a spatial transformation network to correct deformed Mongolian images.
Abstract: Mongolian is a language spoken in Inner Mongolia, China. In the recognition process, the image and text can be deformed due to the shooting angle and other reasons, which causes certain difficulties in recognition. This paper proposes a triplet attention Mogrifier network (TAMN) for printed Mongolian text recognition. The network uses a spatial transformation network to correct deformed Mongolian images. It uses gated recurrent convolution layers (GRCL) combined with a triplet attention module to extract image features from the corrected images. A Mogrifier long short-term memory (LSTM) network captures the contextual sequence information in the features, and finally the decoder's LSTM attention produces the prediction result. Experimental results show that, with the spatial transformation network, deformed Mongolian images can be effectively recognized, and the recognition accuracy can reach 90.30%. The network achieves good performance in Mongolian text recognition compared with current mainstream text recognition networks. The dataset has been made publicly available at https://github.com/ShaoDonCui/Mongolian-recognition.

Journal ArticleDOI
TL;DR: Deep&Syntax, a hybrid system that takes advantage of recurrent patterns to delimit each record by combining u-shaped networks and logical rules, can be used for massive parish register processing, as collecting and annotating a sufficiently large and representative set of training data is not always achievable.
Abstract: This work focuses on the layout analysis of historical handwritten registers, in which local religious ceremonies were recorded. The aim of this work is to delimit each record using the few training data available. To this end, two approaches are proposed. Firstly, three state-of-the-art object detection networks are explored and compared. Further experiments are then conducted on Mask R-CNN, as it yields the best performance. Secondly, we introduce and investigate Deep&Syntax, a hybrid system that takes advantage of recurrent patterns to delimit each record by combining u-shaped networks and logical rules. Finally, these two approaches are evaluated on 3708 French records (sixteenth–eighteenth centuries), as well as on the Esposalles public database, containing 253 Spanish records (seventeenth century). While both systems perform well on homogeneous documents, we observe a significant drop in performance with Mask R-CNN on more challenging documents, especially when trained on a small, non-representative subset. By contrast, Deep&Syntax relies on steady patterns and is therefore able to process a wider range of documents with less training data. When both systems are trained on 120 documents, Deep&Syntax produces 15% more match configurations and reduces the ZoneMap surface error metric by 30%. It also outperforms Mask R-CNN when trained on a database three times smaller. As Deep&Syntax generalizes better, we believe it can be used for massive parish register processing, as collecting and annotating a sufficiently large and representative set of training data is not always achievable.

Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors designed a self-attention-based fusion module that serves as a block in their ensemble trainable network, which allows to simultaneously learn the discriminant features of image and text modalities throughout the training stage.
Abstract: In the recent past, complex deep neural networks have received huge interest in various document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images has encountered the problem of low inter-class discrimination and high intra-class structural variation between categories. In parallel, text-level understanding jointly learned with the corresponding visual properties within a given document image has considerably improved the classification performance in terms of accuracy. In this paper, we design a self-attention-based fusion module that serves as a block in our ensemble trainable network. It allows the network to simultaneously learn the discriminant features of the image and text modalities throughout the training stage. In addition, we encourage mutual learning by transferring positive knowledge between the image and text modalities during the training stage. This constraint is realized by adding a truncated Kullback–Leibler divergence loss (Tr- $$\hbox {KLD}_{{\mathrm{Reg}}}$$ ) as a new regularization term to the conventional supervised setting. To the best of our knowledge, this is the first work to leverage a mutual learning approach along with a self-attention-based fusion module to perform document image classification. The experimental results illustrate the effectiveness of our approach in terms of accuracy for both single-modal and multi-modal settings. Thus, the proposed ensemble self-attention-based mutual learning model outperforms the state-of-the-art classification results on the benchmark RVL-CDIP and Tobacco-3482 datasets.
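The truncated KL regularizer is only named in the abstract; the sketch below shows one plausible reading in which knowledge is transferred from the peer modality only on samples it classifies correctly. The truncation criterion and the weighting are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def truncated_mutual_kl(logits_a: torch.Tensor, logits_b: torch.Tensor,
                        targets: torch.Tensor) -> torch.Tensor:
    """KL term pushing modality A's predictions toward modality B's.
    Truncation here = transfer only on samples that B classifies correctly
    (an assumed criterion; the paper's exact truncation rule may differ)."""
    log_p_a = F.log_softmax(logits_a, dim=1)
    p_b = F.softmax(logits_b, dim=1).detach()             # peer acts as a soft teacher
    keep = (logits_b.argmax(dim=1) == targets).float()    # mask of trusted peer predictions
    kl = F.kl_div(log_p_a, p_b, reduction="none").sum(dim=1)
    return (kl * keep).sum() / keep.sum().clamp(min=1.0)

# per-modality loss (sketch): F.cross_entropy(logits_a, targets) + lam * truncated_mutual_kl(logits_a, logits_b, targets)
```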

Journal ArticleDOI
TL;DR: In this paper, the authors proposed two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images.
Abstract: Image captioning is a widely known problem in the area of AI. Caption generation from floor plan images has applications in indoor path planning, real estate, and providing architectural solutions. Several methods have been explored in the literature for generating captions or semi-structured descriptions from floor plan images. Since a caption alone is insufficient to capture fine-grained details, researchers have also proposed descriptive paragraphs from images. However, these descriptions have a rigid structure and lack flexibility, making them difficult to use in real-time scenarios. This paper offers two models, description synthesis from image cue (DSIC) and transformer-based description generation (TBDG), for text generation from floor plan images. These two models take advantage of modern deep neural networks for visual feature extraction and text generation. The difference between the two models lies in the way they take input from the floor plan image. The DSIC model takes only visual features automatically extracted by a deep neural network, while the TBDG model learns textual captions extracted from input floor plan images together with paragraphs. The specific keywords generated in TBDG, and understanding them through paragraphs, make it more robust for general floor plan images. Experiments were carried out on a large-scale publicly available dataset and compared with state-of-the-art techniques to show the proposed models' superiority.

Journal ArticleDOI
TL;DR: This work proposes a learning-free approach based on a state-of-the-art Naive Bayes Nearest-Neighbour classifier for the task of pattern detection in manuscript images that has already been successfully applied to an actual research question from South Asian studies about palm-leaf manuscripts.
Abstract: Automatic pattern detection has become increasingly important for scholars in the humanities as the number of manuscripts that have been digitised has grown. Most of the state-of-the-art methods used for pattern detection depend on the availability of a large number of training samples, which are typically not available in the humanities as they involve tedious manual annotation by researchers (e.g. marking the location and size of words, drawings, seals and so on). This makes the applicability of such methods very limited within the field of manuscript research. We propose a learning-free approach based on a state-of-the-art Naive Bayes Nearest-Neighbour classifier for the task of pattern detection in manuscript images. The method has already been successfully applied to an actual research question from South Asian studies about palm-leaf manuscripts. Furthermore, state-of-the-art results have been achieved on two extremely challenging datasets, namely the AMADI_LontarSet dataset of handwriting on palm leaves for word-spotting and the DocExplore dataset of medieval manuscripts for pattern detection. A performance analysis is provided as well in order to facilitate later comparisons by other researchers. Finally, an easy-to-use implementation of the proposed method is developed as a software tool and made freely available.
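A minimal sketch of the Naive Bayes Nearest-Neighbour decision rule that the method builds on (generic NBNN, not the authors' detection pipeline); the descriptor type and class setup are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def nbnn_classify(query_descriptors: np.ndarray, class_descriptors: dict) -> str:
    """Naive Bayes Nearest-Neighbour: for each class, sum the squared distances from every
    query descriptor to its nearest descriptor of that class; return the minimizing class."""
    scores = {}
    for label, descs in class_descriptors.items():
        tree = cKDTree(descs)                      # class descriptors, shape (M, D)
        dists, _ = tree.query(query_descriptors)   # nearest neighbour per query descriptor
        scores[label] = float(np.sum(dists ** 2))
    return min(scores, key=scores.get)

# usage (hypothetical): nbnn_classify(window_descriptors, {"pattern": descs_p, "background": descs_b})
```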

Journal ArticleDOI
TL;DR: This work proposes a novel generative document restoration method which allows conditioning the restoration on a guiding signal in the form of target text transcription and which does not need paired high- and low-quality images for training.
Abstract: Most image enhancement methods focused on restoration of digitized textual documents are limited to cases where the text information is still preserved in the input image, which may often not be the case. In this work, we propose a novel generative document restoration method which allows conditioning the restoration on a guiding signal in the form of a target text transcription and which does not need paired high- and low-quality images for training. We introduce a neural network architecture with an implicit text-to-image alignment module. We demonstrate good results on inpainting, debinarization and deblurring tasks, and we show that the trained models can be used to manually alter text in document images. A user study shows that human observers confuse the outputs of the proposed enhancement method with reference high-quality images in as many as 30% of cases.

Journal ArticleDOI
TL;DR: The authors proposed TableSegNet, a compact architecture of a fully convolutional network to detect and separate tables simultaneously, which consists of a deep convolution path to detect table regions in low resolution and a shallower path to locate table locations in high resolution.
Abstract: Advances in image object detection lead to applying deep convolution neural networks in the document image analysis domain. Unlike general colorful and pattern-rich objects, tables in document images have properties that limit the capacity of deep learning structures. Significant variation in size and aspect ratio and the local similarity among document components are the main challenges that require both global features for detection and local features for the separation of nearby objects. To deal with these challenges, we present TableSegNet, a compact architecture of a fully convolutional network to detect and separate tables simultaneously. TableSegNet consists of a deep convolution path to detect table regions in low resolution and a shallower path to locate table locations in high resolution and split the detected regions into individual tables. To improve the detection and separation capacity, TableSegNet uses convolution blocks of wide kernel sizes in the feature extraction process and an additional table-border class in the main output. With only 8.1 million parameters and trained purely on document images from the beginning, TableSegNet has achieved state-of-the-art F1 score at the IoU threshold of 0.9 on the ICDAR2019 and the highest number of correctly detected tables on the ICDAR2013 table detection datasets.

Journal ArticleDOI
TL;DR: TextPolar as discussed by the authors predicts the text center line via pixel-level segmentation and adopts polar coordinates instead of Euclidean coordinates to precisely depict the contour of text regions.
Abstract: How to precisely detect arbitrary-shaped texts in natural images has recently become a new hot topic in areas of computer vision and pattern recognition. However, the performance of most existing methods is still unsatisfactory mainly due to the intrinsic drawback of their representations for text instances. In this paper, we propose a segmentation-based method, TextPolar, for irregular scene text detection by using a novel text representation. Specifically, we predict the text center line via pixel-level segmentation and adopt polar coordinates instead of Euclidean coordinates to precisely depict the contour of text regions. Moreover, the whole detection network is also carefully designed by integrating the specific dilated convolution for multi-scale feature maps to extract rich context features. Experiments conducted on several popular scene text benchmarks, including both curved and multi-oriented text datasets, demonstrate that the proposed TextPolar obtains superior or competitive performance compared to the state of the art, e.g., 83.0% F-score for SCUT-CTW1500, 72.6% F-score for ICDAR2017-MLT, etc.
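The exact parameterization is not given in the abstract; the sketch below only illustrates the general idea of a polar contour representation, decoding radii predicted at fixed angles around a text-center point back into contour coordinates (the sampling scheme is an assumption).

```python
import numpy as np

def polar_to_contour(center: np.ndarray, radii: np.ndarray) -> np.ndarray:
    """Decode a polar contour: radii sampled at K evenly spaced angles around a
    text-center point (x, y) are converted back into contour coordinates."""
    k = len(radii)
    angles = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    xs = center[0] + radii * np.cos(angles)
    ys = center[1] + radii * np.sin(angles)
    return np.stack([xs, ys], axis=1)   # (K, 2) polygon approximating the text contour
```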

Journal ArticleDOI
TL;DR: In this article, a dynamic multi-task learning module is proposed to automatically generate/learn task weights according to the training importance of tasks, which enables the network to focus on training the hard tasks instead of being stuck in the overtraining of easy tasks.
Abstract: Face recognition of realistic visual images (e.g., photos) has been well studied and has made significant progress in the recent decade. However, face recognition between realistic visual images/photos and caricatures is still a challenging problem. Unlike photos, the different artistic styles of caricatures introduce extreme non-rigid distortions. The great representational gap between the different modalities of photos and caricatures is a big challenge for photo-caricature face recognition. In this paper, we propose to conduct cross-modal photo-caricature face recognition via multi-task learning, which can learn the features of different modalities with different tasks. Instead of manually setting the task weights as in conventional multi-task learning, this work proposes a dynamic weight learning module which can automatically generate/learn task weights according to the training importance of the tasks. The learned task weights enable the network to focus on training the hard tasks instead of being stuck in overtraining the easy tasks. The experimental results demonstrate the effectiveness of the proposed dynamic multi-task learning for cross-modal photo-caricature face recognition. The performance on the CaVI and WebCaricature datasets shows its superiority over state-of-the-art methods. The implementation code is provided at https://github.com/hengxyz/cari-visual-recognition-via-multitask-learning.git.
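The paper's dynamic-weight module generates task weights from training importance; as that module is not specified here, the sketch below shows one common learnable-weight scheme (homoscedastic-uncertainty weighting) purely for illustration.

```python
import torch
import torch.nn as nn

class LearnableTaskWeights(nn.Module):
    """Homoscedastic-uncertainty task weighting: one learnable log-variance per task.
    The paper's dynamic-weight module derives weights from task training importance
    and may differ from this common scheme."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses) -> torch.Tensor:
        total = torch.zeros((), device=self.log_vars.device)
        for i, loss in enumerate(task_losses):
            weight = torch.exp(-self.log_vars[i])             # learned weight for task i
            total = total + weight * loss + self.log_vars[i]  # regularizer keeps weights finite
        return total

# usage (hypothetical task losses): total = weighter([photo_loss, caricature_loss, cross_modal_loss])
```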

Journal ArticleDOI
TL;DR: In this paper, the authors used Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness to improve the performance of document classification in the huge real-world document dataset.
Abstract: The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates aids companies working with business documents because lowering reliance on cost-heavy and error-prone human work significantly improves the revenue. Neural networks have been applied to this area before, but they have been trained only on relatively small datasets with hundreds of documents so far. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset with more than 25,000 documents. We expand on our previous work in which we proved that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve micro $$F_{1}$$ of per-word classification in the huge real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves the information extraction. The experiments confirm that all proposed architecture parts (Siamese networks, employing class information, query-answer attention module and skip connections to a similar page) are all required to beat the previous results. The best model yields an 8.25% gain in the $$F_{1}$$ score over the previous state-of-the-art results. Qualitative analysis verifies that the new model performs better for all target classes. Additionally, multiple structural observations about the causes of the underperformance of some architectures are revealed, since all the techniques used in this work are not problem-specific and can be generalized for other tasks and contexts.

Journal ArticleDOI
TL;DR: A hybrid model is proposed, which considers both section headers and body texts, to recognize generic sections in scholarly documents automatically, and achieves 91.67% $$F_{1}$$ -value in generic section recognition, which is better than the baseline.
Abstract: Discourse parsing of scholarly documents is the premise and basis for standardizing the writing of scholarly documents, understanding their content, and quickly locating and extracting specific information from them. With the continuous emergence of a large number of scholarly documents, how to automatically analyze scholarly documents quickly and effectively has become a research hotspot. In this paper, we propose a hybrid model, which considers both section headers and body texts, to recognize generic sections in scholarly documents automatically. We conduct a comprehensive analysis of the semantic difference between short phrases and long narrative text chunks on the SectLabel dataset. The experimental results show that our model achieves a 91.67% $$F_{1}$$ -value in generic section recognition, which is better than the baseline.

Journal ArticleDOI
TL;DR: In this paper, an off-the-shelf deep embedding network is used to project both textual words and word images into a common sub-space and retrieve document snippets that potentially answer a question.
Abstract: This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate results of the proposed approach on two new datasets: (i) HW-SQuAD: a synthetic, handwritten document image counterpart of SQuAD1.0 dataset and (ii) BenthamQA: a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach which uses text recognized from the images using an OCR. Datasets presented in this work are available to download at docvqa.org.
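A minimal sketch of recognition-free retrieval in a common subspace: query-word embeddings are matched to word-image embeddings by cosine similarity and snippets are ranked by an aggregate score. The scoring and aggregation shown are assumptions, not necessarily the paper's scheme.

```python
import torch
import torch.nn.functional as F

def rank_snippets(query_word_embs: torch.Tensor, snippet_word_embs: list, top_k: int = 5):
    """Rank document snippets by cosine similarity between embedded query words (Q, D)
    and embedded word images (W, D) living in the same learned subspace."""
    query = F.normalize(query_word_embs, dim=1)
    scores = []
    for idx, word_images in enumerate(snippet_word_embs):
        emb = F.normalize(word_images, dim=1)
        sim = query @ emb.t()                                      # (Q, W) cosine similarities
        scores.append((idx, sim.max(dim=1).values.mean().item()))  # best match per query word, averaged
    scores.sort(key=lambda t: t[1], reverse=True)
    return scores[:top_k]
```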