
Showing papers on "Document processing" published in 2021


Book ChapterDOI
05 Sep 2021
TL;DR: The LayoutParser library as mentioned in this paper is an open-source library for streamlining the usage of deep learning in document image analysis research and applications, which includes a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks.
Abstract: Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-world use cases. The library is publicly available at https://layout-parser.github.io.

51 citations


Journal ArticleDOI
TL;DR: In this article, a small-sample training framework based on natural language data augmentation is proposed for automatic information extraction modeling, where a cross-combination-based text augmentation algorithm is employed to build automatic information extraction models without large-scale raw data and manual annotations.

30 citations


Journal ArticleDOI
TL;DR: The model takes into account the contextual aspect of pre-trained language models trained on a huge amount of data on general domains for word representation and employs transfer learning by stacking Convolutional Neural Networks to learn hidden representation for classification.

19 citations


Journal ArticleDOI
TL;DR: In this article, the task of instance segmentation on the document image domain is defined, which is especially important in complex layouts whose contents should interact for the proper rendering of the page, e.g., the proper text wrapping around an image.
Abstract: Information extraction is a fundamental task of many business intelligence services that entail massive document processing. Understanding a document page structure in terms of its layout provides contextual support which is helpful in the semantic interpretation of the document terms. In this paper, inspired by the progress of deep learning methodologies applied to the task of object recognition, we transfer these models to the specific case of document object detection, reformulating the traditional problem of document layout analysis. Moreover, we contribute to the prior art by defining the task of instance segmentation on the document image domain. An instance segmentation paradigm is especially important in complex layouts whose contents should interact for the proper rendering of the page, e.g., the proper text wrapping around an image. Finally, we provide an extensive evaluation, both qualitative and quantitative, that demonstrates the superior performance of the proposed methodology over the current state of the art.

10 citations


Posted Content
TL;DR: In this paper, the authors evaluated the effect of image preprocessing, NLP model selection, and document layout on extracting two key indicators for sleep apnea, apnea hypopnea index (AHI) and oxygen saturation (SaO2), from scanned sleep study reports.
Abstract: Scanned documents in electronic health records (EHR) have been a challenge for decades, and are expected to stay in the foreseeable future. Current approaches for processing often include image preprocessing, optical character recognition (OCR), and text mining. However, there is limited work that evaluates the choice of image preprocessing methods, the selection of NLP models, and the role of document layout. The impact of each element remains unknown. We evaluated this method on a use case of two key indicators for sleep apnea, apnea hypopnea index (AHI) and oxygen saturation (SaO2) values, from scanned sleep study reports. Our data, which included 955 manually annotated reports, were reused from a previous study at the University of Texas Medical Branch. We performed image preprocessing: gray-scaling followed by 1 iteration of dilation and erosion, and a 20% contrast increase. The OCR was implemented with the Tesseract OCR engine. A total of seven Bag-of-Words models (Logistic Regression, Ridge Regression, Lasso Regression, Support Vector Machine, k-Nearest Neighbor, Naïve Bayes, and Random Forest) and three deep learning-based models (BiLSTM, BERT, and Clinical BERT) were evaluated. We also evaluated the combinations of image preprocessing methods (gray-scaling, dilate & erode, increased contrast by 20%, increased contrast by 60%), and two deep learning architectures (with and without structured input that provides document layout information). Our proposed method using Clinical BERT reached an AUROC of 0.9743 and document accuracy of 94.76% for AHI, and an AUROC of 0.9523 and document accuracy of 91.61% for SaO2. We demonstrated that the proper use of image preprocessing and document layout could be beneficial to scanned document processing.

10 citations
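
The preprocessing chain named in the sleep-study abstract above (gray-scaling, one iteration of dilation and erosion, a 20% contrast increase) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the 3x3 window, the gray-scale weights, and the contrast formula (scaling each pixel's distance from mid-gray) are all assumptions chosen for illustration.

```python
# Illustrative sketch only; names, kernel size and contrast formula are assumptions.

def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to 0-255 gray values (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def _window(img, y, x):
    """Values in the 3x3 window around (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]

def dilate(img):
    """One iteration of gray-scale dilation: each pixel becomes its window maximum."""
    return [[max(_window(img, y, x)) for x in range(len(img[0]))]
            for y in range(len(img))]

def erode(img):
    """One iteration of gray-scale erosion: each pixel becomes its window minimum."""
    return [[min(_window(img, y, x)) for x in range(len(img[0]))]
            for y in range(len(img))]

def increase_contrast(img, factor=1.2):
    """Scale each pixel's distance from mid-gray (128) by `factor`, clamped to 0-255."""
    return [[min(255, max(0, round(128 + factor * (v - 128)))) for v in row]
            for row in img]

def preprocess(rgb_image):
    """Gray-scale, then dilate and erode once, then raise contrast by 20%."""
    gray = to_gray(rgb_image)
    return increase_contrast(erode(dilate(gray)), 1.2)
```

In practice an image library would replace these loops; the sketch only makes the order of operations concrete before the result is passed to an OCR engine.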


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, a hierarchical recurrent neural network (RNN) architecture is proposed to address the hierarchical structure inherent to the handwritten document, and a novel feature aggregation pooling technique for transferring data between hierarchical levels yields higher computational efficiency when the approach is used in on-device mobile computing.
Abstract: The paper presents an original solution to online processing of free-form handwritten documents, which is aimed at separating multi-class handwritten documents into texts, tables, formulas, drawings, etc. Stroke classification is an important step in automatic document layout analysis (DLA) in handwritten document recognition systems. Major DLA challenges arise due to a wide diversity of handwritten content, various writing styles, a lack of contextual knowledge, and the complicated structure of freeform handwritten documents. In this paper, we propose the hierarchical recurrent neural network (RNN) architecture to address the hierarchical structure inherent to the handwritten document. The novel feature aggregation pooling technique for transferring data between hierarchical levels yields higher computational efficiency when the suggested approach is used in on-device mobile computing. The presented approach achieves new state-of-the-art results in the task of multi-class classification with an accuracy of 97.25% on the IAMonDo dataset. This result can serve as the basis for efficient mobile applications for freeform handwriting document recognition.

9 citations


Book ChapterDOI
05 Sep 2021
TL;DR: In this paper, a hierarchical deep neural network (HDNN) architecture with high computational efficiency is proposed for handwritten document processing and particularly for multi-class stroke classification, which uses a stack of 1D convolutional neural networks (CNN) on the lower level and a stacked RNN on the upper level.
Abstract: Stroke classification is an essential task for applications with free-form handwriting input. Implementation of this type of application for mobile devices places stringent requirements on different aspects of embedded machine learning models, which results in finding a trade-off between model performance and model complexity. In this work, a novel hierarchical deep neural network (HDNN) architecture with high computational efficiency is proposed. It is adopted for handwritten document processing and particularly for multi-class stroke classification. The architecture uses a stack of 1D convolutional neural networks (CNN) on the lower (point) hierarchical level and a stack of recurrent neural networks (RNN) on the upper (stroke) level. The novel fragment pooling techniques for feature transition between hierarchical levels are presented. On-device implementation of the proposed architecture establishes new state-of-the-art results in the multi-class handwritten document processing with a classification accuracy of 97.58% on the IAMonDo dataset. Our method is also more efficient in both processing time and memory consumption than the previous state-of-the-art RNN-based stroke classifier.

3 citations


Posted Content
TL;DR: The layoutparser library as mentioned in this paper provides a set of simple and intuitive interfaces for applying and customizing deep learning models for layout detection, character recognition, and many other document processing tasks.
Abstract: Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces layoutparser, an open-source library for streamlining the usage of DL in DIA research and applications. The core layoutparser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, layoutparser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that layoutparser is helpful for both lightweight and large-scale digitization pipelines in real-world use cases. The library is publicly available at https://layout-parser.github.io.

3 citations


Book ChapterDOI
05 Sep 2021
TL;DR: In this article, a system for simultaneous detection of struck-out words and localisation of the struck out strokes using a single network architecture based on Generative Adversarial Network (GAN) is introduced.
Abstract: The presence of struck-out texts in handwritten manuscripts adversely affects the performance of state-of-the-art automatic handwritten document processing systems. Information about struck-out words (STW) is often important for real-time applications like handwritten character recognition, writer identification, digital transcription, forensic applications, historical document analysis etc. Hence, the detection of STW and localisation of struck-out strokes (SS) are crucial tasks. In this paper, we introduce a system for simultaneous detection of STWs and localisation of the SS using a single network architecture based on Generative Adversarial Network (GAN). The system requires no prior information about the type of SS and is also able to robustly handle stroke variants such as straight, slanted, criss-cross, multiple-lines, underlines and partial STW. We also present a methodology to generate STW with high variability of SS for network learning. We have evaluated the proposed pipeline on the publicly available IAM dataset and also on struck-out words collected from real-world writers with high variability factors like age, gender, stroke-width, stroke-type etc. The evaluation metrics show robustness and applicability in real-world scenarios.

2 citations


Proceedings ArticleDOI
19 Sep 2021
TL;DR: A novel Line Counting formulation for HTLS – that involves counting the number of text lines from the top at every pixel location – is proposed that helps learn an end-to-end HTLS solution that directly predicts per-pixel line number for a given document image.
Abstract: Handwritten Text Line Segmentation (HTLS) is a low-level but important task for many higher-level document processing tasks like handwritten text recognition. It is often formulated in terms of semantic segmentation or object detection in deep learning. However, both formulations have serious shortcomings. The former requires heavy post-processing of splitting/merging adjacent segments, while the latter may fail on dense or curved texts. In this paper, we propose a novel Line Counting formulation for HTLS -- that involves counting the number of text lines from the top at every pixel location. This formulation helps learn an end-to-end HTLS solution that directly predicts per-pixel line number for a given document image. Furthermore, we propose a deep neural network (DNN) model LineCounter to perform HTLS through the Line Counting formulation. Our extensive experiments on the three public datasets (ICDAR2013-HSC, HIT-MW, and VML-AHTE) demonstrate that LineCounter outperforms state-of-the-art HTLS approaches. Source code is available at this https URL.

2 citations
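
The Line Counting formulation described in the abstract above predicts, for every pixel, a line number counted from the top of the page. The sketch below shows one plausible reading of how such a per-pixel target could be built from a labelled line mask: each column is scanned downward and the number of distinct text lines met so far is recorded. The function name `line_count_map`, the input convention (0 for background, a positive id per text line), and the column-wise counting rule are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical helper: builds a per-pixel "lines seen from the top" map.
# Input convention (0 = background, k > 0 = id of a text line) is an assumption.

def line_count_map(line_ids):
    """Return a map where each pixel holds the number of distinct text lines
    encountered in its column from the top row down to that pixel."""
    height, width = len(line_ids), len(line_ids[0])
    counts = [[0] * width for _ in range(height)]
    for x in range(width):
        seen = set()  # ids of the text lines already met in this column
        for y in range(height):
            if line_ids[y][x] != 0:
                seen.add(line_ids[y][x])
            counts[y][x] = len(seen)
    return counts


# Tiny page with two text lines; the second line extends further to the right.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [2, 2, 2],
]
count_map = line_count_map(page)
```

Under this reading the count is local to each column, so a pixel of the second line that has no first-line pixel above it receives the value 1 rather than 2.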


Patent
07 Jan 2021
TL;DR: In this patent, a device uses machine learning models to detect the languages used in digitized documents and to translate digitized documents written in languages other than a common language into the common language, generating translated digitized documents.
Abstract: A device receives documents from various sources, and processes the documents, with an optical character recognition engine, to generate digitized documents. The device processes the digitized documents, with a first machine learning model, to detect languages utilized in the digitized documents, and processes the digitized documents, in other languages that are different than a common language and with a second machine learning model, to translate the digitized documents, in the other languages, into the common language and to generate translated digitized documents. The device processes the translated digitized documents and untranslated digitized documents, with a classification model, to generate classified documents, and processes the classified documents, with a third machine learning model, to generate extracted information from the classified documents. The device validates the extracted information based on business rules to generate validated extracted information, and generates a smart contract for a transaction based on the validated extracted information.

Proceedings ArticleDOI
16 Aug 2021
TL;DR: This research focuses on the reverse engineering of SDS document types to adapt to various layouts and the harnessing of meta-algorithmic and neural network approaches to provide a means of moving industrial institutions towards a digital universal SDS processing methodology.
Abstract: Chemical Safety Data Sheets (SDS) are the primary method by which chemical manufacturers communicate the ingredients and hazards of their products to the public. These SDSs are used for a wide variety of purposes ranging from environmental calculations to occupational health assessments to emergency response measures. Although a few companies have provided direct digital data transfer platforms using xml or equivalent schemata, the vast majority of chemical ingredient and hazard communication to product users still occurs through the use of millions of PDF documents that are largely loaded through manual data entry into downstream user databases. This research focuses on the reverse engineering of SDS document types to adapt to various layouts and the harnessing of meta-algorithmic and neural network approaches to provide a means of moving industrial institutions towards a digital universal SDS processing methodology. The complexities of SDS documents including the lack of format standardization, text and image combinations, and multi-lingual translation needs, combined, limit the accuracy and precision of optical character recognition tools. The approach in this document is to translate entire SDSs from thousands of chemical vendors, each with distinct formatting, to machine-encoded text with a high degree of accuracy and precision. Then the system will "read" and assess these documents as a human would; that is, ensuring that the documents are compliant, determining whether chemical formulations have changed, ensuring reported values are within expected thresholds, and comparing them to similar products for more environmentally friendly alternatives.

Journal ArticleDOI
20 Jul 2021
TL;DR: In this article, the authors provide a high-quality, highly diverse, multi-layout, and annotated invoice documents dataset for extracting key information from unstructured documents and develop an artificial intelligence (AI)-based tool to identify and extract named entities in the invoice documents.
Abstract: The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, the research in this area is encouraging the development of novel frameworks and tools that can automate the key information extraction from unstructured documents. However, the availability of standard, best-quality, and annotated unstructured document datasets is a serious challenge for accomplishing the goal of extracting key information from unstructured documents. This work expedites the researcher’s task by providing a high-quality, highly diverse, multi-layout, and annotated invoice documents dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop an artificial intelligence (AI)-based tool to identify and extract named entities in the invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.

Proceedings ArticleDOI
16 Aug 2021
TL;DR: In this article, the authors evaluated the quality and time performance of 13 new algorithms and 50 existing algorithms for document binarization using a dataset of offset, laser, and deskjet printed documents, photographed using four widely used mobile devices with the strobe flash on and off, under two different angles and places of capture.
Abstract: Smartphones with cameras are omnipresent in today's world and are very often used to photograph documents. Document binarization is a key process in many document processing platforms. This competition on binarizing photographed documents assessed the quality and time performance of 13 new algorithms and 50 existing algorithms. The evaluation dataset is composed of offset, laser, and deskjet printed documents, photographed using four widely-used mobile devices with the strobe flash on and off, under two different angles and places of capture.
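
For readers unfamiliar with the task assessed in the competition above, document binarization maps a gray-scale image to black and white. The sketch below uses Otsu's global threshold, a classic baseline picked here purely for illustration; the abstract does not say which algorithms were entered, and the function names are assumptions.

```python
# Illustrative baseline only: Otsu's global threshold in pure Python.
# It is not claimed to be one of the algorithms evaluated in the competition.

def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximizes between-class variance,
    treating pixels <= t as one class and pixels > t as the other."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0:
            continue  # no pixels at or below t yet
        if w_high == 0:
            break     # every pixel is at or below t; nothing left to split
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

def binarize(gray):
    """Map pixels above the Otsu threshold to 255 (paper) and the rest to 0 (ink)."""
    t = otsu_threshold(gray)
    return [[255 if v > t else 0 for v in row] for row in gray]


sample = [[10, 10, 200], [12, 210, 205]]
binary = binarize(sample)
```

A single global threshold is known to struggle with uneven lighting, which is one reason photographed documents, with flash glare and shadows, are a harder benchmark than flatbed scans.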

Posted Content
TL;DR: Vec2GC (Vector to Graph Communities) as mentioned in this paper is an end-to-end pipeline to cluster terms or documents for any given text corpus using community detection on a weighted graph, created using text representation learning.
Abstract: NLP pipelines with limited or no labeled data rely on unsupervised methods for document processing. Unsupervised approaches typically depend on clustering of terms or documents. In this paper, we introduce a novel clustering algorithm, Vec2GC (Vector to Graph Communities), an end-to-end pipeline to cluster terms or documents for any given text corpus. Our method uses community detection on a weighted graph of the terms or documents, created using text representation learning. The Vec2GC clustering algorithm is a density-based approach that also supports hierarchical clustering.

Proceedings ArticleDOI
25 Jun 2021
TL;DR: In this paper, the authors compared several neural networks viz: Simple (Artificial) Neural Network, Convolutional Neural Network and Recurrent Neural Network that use deep learning to implement Handwritten Character Recognition.
Abstract: In the current tech-savvy world, there is a rising challenge for software systems to recognize characters. A lot of crucial and sensitive data exists only in paper-based form, such as newspapers, books, theses, articles, and other printed documents. Nowadays, there is an ever-increasing demand for storing the crucial data present in these paper-based documents on digital storage and then reusing it whenever necessary through a predefined search process. A simple way to transfer data from these paper documents into digital storage systems is to first scan those documents and then store them as images. But a challenge arises when we need to reuse this data, as it is difficult to read specific data from these documents. A major cause of this challenge is that the font properties of the characters that appear in paper documents differ from the fonts of the characters in computing systems. Hence, a computer fails to recognize these characters while reading them. This concept of transferring data from hard-copy paper documents to digital storage and then reading it is called Document Processing. In Document Processing, we make use of a system called Optical Character Recognition to achieve this. To further expand our understanding of how these systems work, this paper analyzes and compares several neural networks, viz. Simple (Artificial) Neural Network, Convolutional Neural Network and Recurrent Neural Network, that use Deep Learning to implement Handwritten Character Recognition.

Journal ArticleDOI
TL;DR: In this paper, the authors used Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness to improve the performance of document classification in the huge real-world document dataset.
Abstract: The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any improvement in information extraction systems or reduction in their error rates aids companies working with business documents because lowering reliance on cost-heavy and error-prone human work significantly improves the revenue. Neural networks have been applied to this area before, but they have been trained only on relatively small datasets with hundreds of documents so far. To successfully explore deep learning techniques and improve information extraction, we compiled a dataset with more than 25,000 documents. We expand on our previous work in which we proved that convolutions, graph convolutions, and self-attention can work together and exploit all the information within a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using Siamese networks, concepts of similarity, one-shot learning, and context/memory awareness. The aim is to improve micro $F_1$ of per-word classification in the huge real-world document dataset. The results verify that trainable access to a similar (yet still different) page, together with its already known target information, improves the information extraction. The experiments confirm that all proposed architecture parts (Siamese networks, employing class information, query-answer attention module and skip connections to a similar page) are all required to beat the previous results. The best model yields an 8.25% gain in the $F_1$ score over the previous state-of-the-art results. Qualitative analysis verifies that the new model performs better for all target classes. Additionally, multiple structural observations about the causes of the underperformance of some architectures are revealed, since all the techniques used in this work are not problem-specific and can be generalized for other tasks and contexts.
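
The abstract above reports its results as micro $F_1$ over per-word classification. As a reminder of how that metric aggregates, the sketch below pools true positives, false positives and false negatives across all target classes before computing precision and recall. The function name, the example labels, and the choice to exclude a background label `"O"` from the counts are assumptions for illustration; the abstract does not specify how background words are treated.

```python
# Minimal sketch of micro-averaged F1 for per-word labels.
# The background label "O" and its exclusion from the counts are assumptions.

def micro_f1(gold, pred, background="O"):
    """Pool TP/FP/FN over all non-background classes, then compute F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p and g != background:
            tp += 1          # correct non-background prediction
        else:
            if p != background and p != g:
                fp += 1      # predicted a class the word does not have
            if g != background and p != g:
                fn += 1      # missed the word's true class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-word labels for five words on an invoice page.
gold = ["O", "total", "date", "O", "total"]
pred = ["O", "total", "O", "date", "date"]
score = micro_f1(gold, pred)  # 1 TP, 2 FP, 2 FN
```

Because the counts are pooled before averaging, frequent classes dominate the score, unlike macro $F_1$, which averages per-class scores with equal weight.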

Patent
10 Jun 2021
TL;DR: In this patent, a device for processing value documents, more particularly bank notes, is proposed, having at least one image capture unit configured to capture at least one image (4) of at least two character strings (2, 3) located on a value document, and an evaluation unit which detects, in the at least one image (4), one or more first characters contained in the at least one first character string (2) and one or more second characters contained in the at least one second character string (3), and forms, from at least some of the first characters and/or at least some of the second characters, a concatenated character string (5).
Abstract: The invention relates to a device for processing value documents, more particularly bank notes, having at least one image capture unit which is configured to capture at least one image (4) of at least two character strings (2, 3) located on a value document, and an evaluation unit which is configured to detect, in the at least one image (4), one or more first characters contained in the at least one first character string (2) and one or more second characters contained in at least one second character string (3), to form, from at least some of the first characters and/or at least some of the second characters, a concatenated character string (5) and to store, in a storage unit (30), image sections (7, 8) of the at least one image (4), said image sections showing the first and/or second characters contained in the concatenated character string (5), together with the concatenated character string (5). The invention also relates to a corresponding method for processing value documents, and a value document processing system.

Patent
21 Jan 2021
TL;DR: In this patent, a biometric user characteristic associated with a request to perform a document processing job is captured via a biometric authentication component, and a log entry comprising the biometric user characteristic and a plurality of details associated with the job is created.
Abstract: Examples disclosed herein relate to receiving a request to perform a document processing job, capturing a biometric user characteristic associated with the request via a biometric authentication component, and creating a log entry comprising the biometric user characteristic and a plurality of details associated with the document processing job.

Posted Content
TL;DR: In this article, the authors propose a novel VDU model that is end-to-end trainable without an underpinning OCR framework, together with a synthetic document image generator used to pre-train the model and mitigate the dependency on large-scale real document images.
Abstract: Understanding document images (e.g., invoices) has been an important research topic and has many applications in document processing automation. Through the latest advances in deep learning-based Optical Character Recognition (OCR), current Visual Document Understanding (VDU) systems have come to be designed based on OCR. Although such OCR-based approaches promise reasonable performance, they suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to the OCR error propagation. In this paper, we propose a novel VDU model that is end-to-end trainable without an underpinning OCR framework. To this end, we propose a new task and a synthetic document image generator to pre-train the model to mitigate the dependencies on large-scale real document images. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed model especially with consideration for a real-world application.

Posted Content
TL;DR: The authors proposed unsupervised Language-Agnostic Weighted Document Representations (LAWDR), which leverage the geometry of pre-trained sentence embeddings to derive document representations without fine-tuning.
Abstract: Cross-lingual document representations enable language understanding in multilingual contexts and allow transfer learning from high-resource to low-resource languages at the document level. Recently large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks. It is tempting to apply these cross-lingual models to document representation learning. However, there are two challenges: (1) these models impose high costs on long document processing and thus many of them have strict length limit; (2) model fine-tuning requires extra data and computational resources, which is not practical in resource-limited settings. In this work, we address these challenges by proposing unsupervised Language-Agnostic Weighted Document Representations (LAWDR). We study the geometry of pre-trained sentence embeddings and leverage it to derive document representations without fine-tuning. Evaluated on cross-lingual document alignment, LAWDR demonstrates comparable performance to state-of-the-art models on benchmark datasets.


Patent
Murakami Takashi
01 Jun 2021
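TL;DR: An image processing apparatus includes a camera configured to capture an image of a face of a person, an image reading unit configured to read a document and to output a document image, and a controller that permits certain processing on the document image if the captured face matches a face image extracted from the document image.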
Abstract: An image processing apparatus includes a camera, an image reading unit, and a controller. The camera is configured to capture an image of a face of a person. The image reading unit is configured to read a document and to output a document image. The controller is configured to perform control to permit certain processing on the document image if the image of the face captured by the camera matches a face image extracted from the document image.

Patent
18 Mar 2021
TL;DR: In this patent, a document processing system using augmented reality and virtual reality, and a processing method therefor, is presented, where one user can write and store various types of virtual documents and allow other users to view the virtual documents in the augmented reality or virtual reality image.
Abstract: The present invention relates to a document processing system using augmented reality and virtual reality, and a processing method therefor. The document processing system of the present invention shares contents of an object so that one user can write the contents of the object at the location of the object displayed in an augmented reality or virtual reality image to write and store various types of virtual documents and allow other users to view the virtual documents in the augmented reality or virtual reality image. The document processing system writes and shares the virtual documents by using a mobile terminal capable of expressing augmented reality or virtual reality. The document processing system shares the virtual documents between mobile terminals in a P2P method or a method of using a server. According to the present invention, a new document sharing platform can be implemented to provide a differentiated service to a user by displaying a virtual document written by using augmented reality or virtual reality to be shared in real time.

Posted Content
TL;DR: Li et al. propose a novel Line Counting formulation for handwritten text line segmentation that counts the number of text lines from the top at every pixel location, which helps learn an end-to-end solution that directly predicts a per-pixel line number.
Abstract: Handwritten Text Line Segmentation (HTLS) is a low-level but important task for many higher-level document processing tasks like handwritten text recognition. It is often formulated in terms of semantic segmentation or object detection in deep learning. However, both formulations have serious shortcomings. The former requires heavy post-processing of splitting/merging adjacent segments, while the latter may fail on dense or curved texts. In this paper, we propose a novel Line Counting formulation for HTLS that involves counting the number of text lines from the top at every pixel location. This formulation helps learn an end-to-end HTLS solution that directly predicts a per-pixel line number for a given document image. Furthermore, we propose a deep neural network (DNN) model, LineCounter, to perform HTLS through the Line Counting formulation. Our extensive experiments on three public datasets (ICDAR2013-HSC, HIT-MW, and VML-AHTE) demonstrate that LineCounter outperforms state-of-the-art HTLS approaches. Source code is available at this https URL.
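To make the Line Counting target concrete, the sketch below builds a per-pixel count map from a labelled line mask. It is one plausible reading of "counting the number of text lines from the top at every pixel location", not the authors' implementation: the function name `line_count_map` and the per-column counting rule are assumptions made for illustration.

```python
import numpy as np


def line_count_map(labels: np.ndarray) -> np.ndarray:
    """Return, for each pixel, how many text lines start at or above it in its column.

    labels: (H, W) integer array, 0 = background, k > 0 = pixels of text line k.
    Assumption: counting is done independently per column, so a line that does
    not reach a given column is not counted there. A line that leaves and
    re-enters a column would be counted twice by this simple rule.
    """
    # Label of the pixel directly above each pixel (background above the first row).
    above = np.zeros_like(labels)
    above[1:] = labels[:-1]
    # A "line start" is a text pixel whose upper neighbour has a different label.
    starts = (labels != 0) & (labels != above)
    # Accumulating starts down each column gives the count from the top.
    return np.cumsum(starts, axis=0)
```

For a page with two lines where only the second reaches the rightmost column, the left columns end at a count of 2 while the rightmost column ends at 1, which shows that this per-column rule is only an approximation of a global line number.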

Patent
18 Feb 2021
TL;DR: The disclosed technologies include a method or a system able to process documents to extract features, predict outcomes, and visualize feature relations.
Abstract: Among other things, technologies disclosed herein include a method or a system able to process documents to extract features, predict outcomes and visualize feature relations.

Patent
18 Feb 2021
TL;DR: A document processing application segments an electronic document image into overlapping strips, generates a mask of elements and element types for each strip using a predictive model network and the preceding mask, and computes from the combined mask an output electronic document that identifies the elements and their respective element types.
Abstract: Techniques for document segmentation. In an example, a document processing application segments an electronic document image into strips. A first strip overlaps a second strip. The application generates a first mask indicating one or more elements and element types in the first strip by applying a predictive model network to image content in the first strip and a prior mask generated from image content of the first strip. The application generates a second mask indicating one or more elements and element types in the second strip by applying the predictive model network to image content in the second strip and the first mask. The application computes, from a combined mask derived from the first mask and the second mask, an output electronic document that identifies elements in the electronic document and the respective element types.
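The strip-by-strip flow described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: `strip_bounds`, `segment_by_strips`, the `predict` callable standing in for the predictive model network, the all-zero prior for the first strip, and the element-wise maximum used to combine overlapping masks are all hypothetical choices.

```python
from typing import Callable, List, Tuple

import numpy as np


def strip_bounds(height: int, strip_h: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of horizontal strips that overlap by `overlap` rows."""
    if overlap >= strip_h:
        raise ValueError("overlap must be smaller than strip_h")
    bounds, top = [], 0
    while True:
        bottom = min(top + strip_h, height)
        bounds.append((top, bottom))
        if bottom == height:
            return bounds
        top += strip_h - overlap


def segment_by_strips(
    image: np.ndarray,
    predict: Callable[[np.ndarray, np.ndarray], np.ndarray],
    strip_h: int,
    overlap: int,
) -> np.ndarray:
    """Run `predict(strip, prior_mask)` on each strip and merge the masks.

    `predict` is a placeholder for the model; it must return an integer mask
    of element-type ids with the same height and width as the strip.
    """
    h, w = image.shape[:2]
    combined = np.zeros((h, w), dtype=np.int64)
    # Assumption: the first strip gets an empty prior mask.
    prior = np.zeros((min(strip_h, h), w), dtype=np.int64)
    for top, bottom in strip_bounds(h, strip_h, overlap):
        mask = predict(image[top:bottom], prior[: bottom - top])
        # Assumption: overlapping rows are merged with an element-wise maximum.
        combined[top:bottom] = np.maximum(combined[top:bottom], mask)
        prior = mask  # the next strip is conditioned on this strip's mask
    return combined
```

Passing each strip's mask forward is what lets an element that straddles a strip boundary keep a consistent label; how the masks are actually combined is not specified in the abstract.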

Patent
18 Mar 2021
TL;DR: A document extraction system routes tasks to manual and automated systems based on a predicted probability that the results generated by the automated system meet a baseline level of accuracy.
Abstract: A document extraction system, executed by a processor, may process documents using manual and automated systems. The document extraction system may efficiently route tasks to the manual and automated systems based on a predicted probability that the results generated by the automated system meet some baseline level of accuracy. To increase document processing speed, documents having a high likelihood of accurate automated processing may be routed to an automated system. To ensure a baseline level of accuracy, documents having a smaller likelihood of accurate automated processing may be routed to a manual system.
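The routing rule described here amounts to a threshold on a predicted probability. The sketch below shows that idea only; the function name `route_documents`, the document ids, and the 0.9 default threshold are hypothetical, and the abstract does not say how the probability is predicted or where the cut-off lies.

```python
from typing import Dict, List


def route_documents(
    p_accurate: Dict[str, float], threshold: float = 0.9
) -> Dict[str, List[str]]:
    """Split documents between the automated and manual systems.

    p_accurate maps a document id to the predicted probability that automated
    extraction meets the baseline accuracy. The threshold is an assumed,
    tunable value: raising it sends more documents to manual review.
    """
    queues: Dict[str, List[str]] = {"automated": [], "manual": []}
    for doc_id, p in p_accurate.items():
        target = "automated" if p >= threshold else "manual"
        queues[target].append(doc_id)
    return queues
```

The threshold trades processing speed against the accuracy guarantee: a lower value speeds up processing, while a higher value routes more documents to the manual system.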

Posted Content
TL;DR: This paper presents a novel data generation tool for document processing that provides a maximal level of visual information for a normal-type document, from character position to paragraph-level position, and enables work with large datasets in low-resource languages.
Abstract: We present a novel data generation tool for document processing. The tool focuses on providing a maximal level of visual information in a normal-type document, ranging from character position to paragraph-level position. It also enables working with large datasets in low-resource languages and provides a means of processing the full-level information of the document text. The data generation tool comes with a dataset of 320,000 Vietnamese synthetic document images and instructions for generating a dataset of similar size in other languages. The repository can be found at: this https URL

Proceedings ArticleDOI
24 Apr 2021
TL;DR: Wang et al. propose GovAlbert, a pre-trained language model based on Albert for processing Chinese government official documents, and report that the combined GovAlbert-CRF model achieves the best F1 value on named entity recognition tasks.
Abstract: The automated processing of Chinese government documents is in its early stages, and information extraction based on Named Entity Recognition (NER) plays an important role in the automated processing and analysis of these documents. This paper proposes and implements GovAlbert, a pre-trained language model based on Albert for the processing of Chinese government official documents. We study and analyze NER tasks on Chinese government official documents based on the pre-trained language model, annotate an entity recognition corpus of Chinese government official documents, and construct four named entity recognition models based on GovAlbert. The experimental results show that the GovAlbert model achieves a higher macro-average F1 value (the harmonic mean of precision and recall) than Albert on government official document processing. The four named entity recognition models based on GovAlbert all outperform the public pre-trained model in multiple NER tasks on government official documents, and the experiments show that the combined GovAlbert-CRF model achieves the best F1 value, making it better suited to NER tasks on government official documents.
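For readers unfamiliar with the metric, the following sketch computes a macro-average F1 from per-entity-type counts. F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean of the per-type F1 values. The entity type names and counts in the usage note are made up for illustration and are not results from the paper.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one entity type."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-type F1; counts maps type -> (tp, fp, fn)."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)
```

With hypothetical counts `{"ORG": (8, 2, 2), "LOC": (5, 5, 0)}`, the per-type F1 values are 0.8 and 2/3, so every entity type contributes equally to the macro average regardless of how often it occurs.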