Showing papers on "Document processing" published in 2017


Journal ArticleDOI
TL;DR: The most vital processes in script identification are addressed in detail: identification and discrimination methods, feature extraction (local and global), and classification.
Abstract: In recent years, with the spread of the Internet and the digitized processing of multi-script documents worldwide, script identification techniques have become more important in the pattern recognition field. Script identification concerns methods for identifying different scripts in multi-lingual, multi-script documents. This paper presents a comprehensive overview of research activities in the field and focuses on the most valuable results obtained so far. The most vital processes in script identification are addressed in detail: identification and discrimination methods, feature extraction (local and global), and classification. Different kinds of approaches have been developed and promising results have been achieved. This paper reports state-of-the-art performance results and covers methods for handwritten, printed, and hybrid document processing. More research is necessary to meet the performance levels essential for everyday applications.

49 citations


Proceedings ArticleDOI
01 Mar 2017
TL;DR: The proposed system uses a convolutional neural network to extract features and is tested against a newly constructed dataset of six Malayalam characters.
Abstract: Optical Character Recognition (OCR) is the process of converting an input text image into a machine-encoded format. Different methods are used in OCR for different languages. The main steps of optical character recognition are pre-processing, segmentation and recognition. Recognizing handwritten text is harder than recognizing printed text. Convolutional neural networks (CNNs) have shown remarkable improvement in recognizing characters of other languages, but they have not yet been implemented for Malayalam handwritten characters. The proposed system uses a convolutional neural network to extract features. This differs from the conventional approach, which requires handcrafted features for finding features in the text. We have tested the network against a newly constructed dataset of six Malayalam characters.

28 citations


Journal ArticleDOI
TL;DR: This paper attempts to analyze and classify the various text extraction schemes for the scene-text and document images, and compares different approaches of these images based on common problems and discusses their merits and demerits.
Abstract: One of the major applications of text retrieval from images is to extract the text information and then recognize its characters. This is helpful for indexing the images within storage media. When we want to search for a particular image or document, there is no need to go through a large collection of images. We go only through the group of indexed images, so that the task of finding the particular image becomes easy. Extracting text lines from scanned document images presents a major problem in the optical character recognition process, as skewed text lines raise the complexity. The problem gets even worse with text lines of different orientations. Such lines are called multi-skewed lines. These multi-skewed lines are easily observed in both printed and handwritten documents. It is a challenging task to design a real-time system that can maintain a high recognition rate with good accuracy and is independent of the type of documents and character fonts. In this paper, we attempt to analyze and classify...

27 citations



Proceedings ArticleDOI
22 Mar 2017
TL;DR: The general difficulties in Arabic language text, the main process of a typical OCR system and some enhancements to Arabic OCR systems are described, and a novel approach for identifying handwritten isolated Arabic characters using encoded Freeman chain code is described.
Abstract: Optical Character Recognition (OCR) is the process of identifying text in an image and converting it into a digital form. Several approaches have been attempted to accurately recognize characters in printed Arabic. This survey focuses on OCR for handwritten Arabic. We will describe the general difficulties in Arabic text, the main process of a typical OCR system and some enhancements to Arabic OCR systems. We will also describe a novel approach for identifying handwritten isolated Arabic characters using encoded Freeman chain code. Several handwritten Arabic characters were trained and tested, and the preliminary experimental results are promising.

24 citations
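
The abstract above mentions Freeman chain codes without giving details. As a rough illustration only, the following is a minimal sketch of a plain 8-direction Freeman chain code over an ordered list of contour points. The direction numbering (0 = east, counter-clockwise, y pointing up) and the function name `freeman_chain_code` are assumptions made for this example; the "encoded" variant used by the authors is not described in the abstract and is not reproduced here.

```python
# 8-connected Freeman directions, numbered counter-clockwise from east,
# with y pointing up. Conventions vary; this numbering is an assumption.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}

def freeman_chain_code(points):
    """Return the chain code for an ordered list of (x, y) contour points."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"{(x0, y0)} and {(x1, y1)} are not 8-neighbours")
        codes.append(DIRECTIONS[step])
    return codes

# A unit square traced counter-clockwise (hypothetical contour).
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(freeman_chain_code(square))  # [0, 2, 4, 6]
```

A chain code of this kind turns a character outline into a short sequence of direction symbols, which can then be fed to a classifier.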


Proceedings ArticleDOI
01 Nov 2017
TL;DR: The current state of the anyOCR system, its architecture, and its major features are described; the system mainly emphasizes the techniques required for digitizing a historical archive with high accuracy.
Abstract: Currently an intensive amount of research is going on in the field of digitizing historical archives for converting scanned document images into searchable full text. This paper presents the "anyOCR" system, which mainly emphasizes the techniques required for digitizing a historical archive with high accuracy. It is an open-source system for the research community, who can easily apply the anyOCR system for digitizing historical archives. The anyOCR system supports a complete document processing pipeline, which includes layout analysis, training OCR models and text line prediction, with the addition of intelligent and interactive web applications for layout and OCR error correction. The anyOCR system can also be used for contemporary document images containing diverse, simple to complex, layouts. This paper describes the current state of the anyOCR system, its architecture, as well as its major features. This paper also provides information about the availability, documentation, and tutorials of the anyOCR system.

23 citations


Proceedings ArticleDOI
01 Aug 2017
TL;DR: The aim is to develop an efficient method that uses a custom image to train the classifier, which extracts distinct features from the input image to classify its contents as characters, specifically letters and digits.
Abstract: The aim is to develop an efficient method which uses a custom image to train the classifier. This OCR extracts distinct features from the input image for classifying its contents as characters, specifically letters and digits. Input to the system is digital images containing the patterns to be classified. The analysis and recognition of the patterns in images are becoming more complex, yet easier with advances in technological knowledge. Therefore it is proposed to develop sophisticated strategies of pattern analysis to cope with these difficulties. The present work involves application of pattern recognition using KNN to recognize handwritten or printed text.

20 citations
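
The abstract above names KNN as the classifier but gives no implementation details. The following is a minimal sketch of k-nearest-neighbour classification over flattened glyph bitmaps, assuming Euclidean distance and majority voting; the 3x3 toy bitmaps, labels, and the function name `knn_predict` are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training samples by Euclidean distance to the query vector.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count labels among the k closest samples and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 3x3 bitmaps flattened row by row (1 = ink, 0 = background).
train = [
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "I"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 1), "I"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 1), "L"),
    ((1, 0, 0, 1, 0, 0, 0, 1, 1), "L"),
]

noisy_l = (1, 0, 0, 1, 0, 0, 1, 1, 0)  # an "L" with one pixel missing
print(knn_predict(train, noisy_l, k=3))  # L
```

In a real system the feature vectors would come from a feature-extraction step on segmented character images rather than raw toy bitmaps.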


Patent
09 Jun 2017
TL;DR: In this article, an accounting document processing system comprising a finance management system and an original bill storage cabinet is described; the finance management system includes a reimbursement module, a checking module, and a bill true and false identification module, while a paper bill conveyer, a subject coding printer, a scanner, a printer, and an electronic accounting document storage device are arranged in the original bill storage cabinet.
Abstract: The invention discloses an accounting document processing system comprising a finance management system and an original bill storage cabinet; the finance management system comprises a reimbursement module, a checking module, and a bill true and false identification module; a paper bill conveyer, a subject coding printer, a scanner, a printer, an electronic accounting document storage device and a paper accounting document storage device are arranged in the original bill storage cabinet. The accounting document processing system can automatically realize the following functions: cost reimbursement, cost reimbursement examination, original bill checking, bill true and false identification, original bill coding, original bill scanning, original bill video image file storage, accounting document automatic classification and storage, and accounting document query; the accounting document processing system can reduce the workload of finance personnel and improve economic benefits.

14 citations


Proceedings ArticleDOI
10 Nov 2017
TL;DR: This work uses fully supervised Deep CNN semantic segmentation to separate content layers from historical document images containing diverse content types, including handwriting, machine print, form lines, and stamps, using CNNs for semantic pixel labeling.
Abstract: Convolutional Neural Networks (CNNs) have produced excellent results in natural scene semantic pixel labeling tasks. We examine the application of this idea to document processing, using fully supervised Deep CNN semantic segmentation to separate content layers from historical document images containing diverse content types, including handwriting, machine print, form lines, and stamps. For efficiency, we employ a downsampling-upsampling network to make dense pixel predictions. CNNs achieve high generalization accuracy on document images with interleaved, overlapping strokes, even when trained on a solitary pixel-labeled document image. We also show a proof-of-concept extension of the semantic segmentation task to handwritten cursive character recognition, enabling a new "segmentation-free" approach to handwriting transcription.

13 citations


Journal ArticleDOI
TL;DR: This paper attempts to combine neural network technology with character recognition technology and proposes an effective new method of handwritten character recognition, offering a feasible way to solve practical difficulties in handwritten character recognition.
Abstract: Offline handwritten character recognition is an important branch of character recognition, which involves character recognition, image processing, digital signal processing, artificial intelligence, fuzzy mathematics, information theory, computer science and other disciplines. Offline handwritten character recognition processes only two-dimensional character dot images, and it faces problems such as too many character classes, complex font structure, and large deformation of handwritten characters. Currently, offline handwritten character recognition technology is still immature and remains at the laboratory research stage. This paper attempts to combine neural network technology with character recognition technology and proposes an effective new method of handwritten character recognition, offering a feasible way to solve practical difficulties in handwritten character recognition. In this paper, according to the building process of the multi-level...

12 citations


Proceedings ArticleDOI
01 Feb 2017
TL;DR: A method for handwritten text recognition (HWR) of the Comenia script is proposed, including preprocessing and normalization of data and optical character recognition based on an SVM classifier.
Abstract: Comenia script is a novel handwritten text introduced at primary schools in the Czech Republic. This paper describes a method for handwritten text recognition (HWR) of this font. In particular, it proposes a method for preprocessing and normalization of data and optical character recognition based on an SVM classifier. We have trained and statistically evaluated several models, where we have focused on recognition of different styles of writing of the same characters, for forensic purposes and identification of the author of a document. The best model has achieved 92.86 % accuracy without any further postprocessing, e.g. a spellchecker. We also proposed using more than one classification model for character recognition, which has been shown to increase accuracy when compared to a single-model approach.

Proceedings ArticleDOI
01 Nov 2017
TL;DR: The presented paper investigates the problems of receipt image analysis and describes approaches for receipt image pre-processing, receipt text detection, receipt text recognition and receipt text analysis that make a receipt analysis system adaptable to a real-life environment.
Abstract: The automatic receipt analysis problem is very relevant due to the high cost of manual document processing. Therefore, the presented paper investigates the problems of receipt image analysis. It describes approaches for receipt image pre-processing, receipt text detection, receipt text recognition and receipt text analysis. These approaches make it possible to adapt a receipt analysis system to a real-life environment and to convert the input information to a usable format for analysing information in the receipts. A pipeline for payment data processing, starting with image capture and ending with payment data posting, is defined and appropriate technologies for every stage of the process are proposed. Advantages and limitations of these technologies are reviewed and open research challenges are identified. The payment data processing is analyzed as an enabler of digital transformation of expense reporting processes.

Journal ArticleDOI
01 Jan 2017
TL;DR: The authors have developed a sequence of document processing and transmission of data during the examination, and propose a language for describing the construction of the facility, taking into account the classification criteria of the structures and construction works.
Abstract: When design documents are transferred to the examination office, the number of incompatible electronic documents increases dramatically. The article discusses a way to solve the problem of transferring the text and graphic data of design documentation for state and non-state expertise, as well as verification of estimates and requirement management. Methods for the recognition of system elements, and requirements for transferring text and graphic design documents, are provided. The need to use classification and coding of various elements of information systems (structures, objects, resources, requirements, contracts, etc.) in data transfer systems is indicated separately. The authors have developed a sequence of document processing and transmission of data during the examination, and propose a language for describing the construction of the facility, taking into account the classification criteria of the structures and construction works.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: The assessment of correlation of control of processes of relocation of documents and change of objects in the distributed large-scale systems is carried out and the direction of transition to the object-associated electronic document management systems is shown.
Abstract: Development of document processing systems is considered. The direction of transition to the object-associated electronic document management systems is shown. The assessment of correlation of control of processes of relocation of documents and change of objects in the distributed large-scale systems is carried out.

Proceedings ArticleDOI
31 Aug 2017
TL;DR: The use of an automatic baseline detection technique, based on interest point clustering, in Arabic handwritten documents is studied, revealing that this technique provides promising results for this task.
Abstract: Document processing comprises different steps depending on the nature of the documents. For text documents, especially handwritten documents, transcription of their contents is one of the main tasks. Handwritten Text Recognition (HTR) is the process of automatically obtaining the transcription of the content of a handwritten text document. In document processing, the basic unit for the acquisition process is the page image, whilst the line image is the basic unit for the HTR process. This mismatch is a bottleneck that holds back large-scale industrial document processing. Baseline detection can be used not only to segment page images into line images but also for many other document processing steps. The baseline detection problem can be formulated as a clustering problem over a set of interest points. In this work, we study the use of an automatic baseline detection technique, based on interest point clustering, in Arabic handwritten documents. The experiments reveal that this technique provides promising results for this task.

Journal ArticleDOI
TL;DR: Results show that small, computationally cheap WGs can be used without losing the excellent CATTI and KWS performance achieved with huge WGs.
Abstract: Two document processing applications are considered: computer-assisted transcription of text images (CATTI) and Keyword Spotting (KWS), for transcribing and indexing handwritten documents, respectively. Instead of working directly on the handwriting images, both of them employ meta-data structures called word graphs (WG), which are obtained using segmentation-free handwritten text recognition technology based on N-gram language models and hidden Markov models. A WG contains most of the relevant information of the original text (line) image required by CATTI and KWS but, if it is too large, the computational cost of generating and using it can become unaffordable. Conversely, if it is too small, relevant information may be lost, leading to a reduction of CATTI or KWS performance. We study the trade-off between WG size and performance in terms of effectiveness and efficiency of CATTI and KWS. Results show that small, computationally cheap WGs can be used without losing the excellent CATTI and KWS performance achieved with huge WGs.

01 Aug 2017
TL;DR: This paper deals with unsupervised classification of textual documents, also called text clustering, using Kohonen Self-Organizing Maps in two new situations: a conceptual representation of texts and a representation based on n-grams, instead of a representation based on words.
Abstract: With the great and rapidly growing number of documents available in digital form (Internet, library, CD-Rom…), the automatic classification of texts has become a significant research field and a fundamental task in document processing. This paper deals with unsupervised classification of textual documents also called text clustering using Self-Organizing Maps of Kohonen in two new situations: a conceptual representation of texts and a representation based on n-grams, instead of a representation based on words. The effects of these combinations are examined in several experiments using 4 measurements of similarity. The Reuters-21578 corpus is used for evaluation. The evaluation was done by using the F-measure and the entropy.
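
To make the n-gram representation mentioned above concrete, here is a minimal sketch that builds character n-gram profiles and compares two texts with cosine similarity. It illustrates only the representation and one possible similarity measure; the Self-Organizing Map itself is not implemented, and the choice of trigrams, cosine similarity, and the helper names `char_ngrams` and `cosine` are assumptions rather than the paper's configuration.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Return a bag (Counter) of character n-grams for a text."""
    # Lowercase and collapse whitespace so formatting does not affect the profile.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical short texts standing in for news documents.
doc_a = char_ngrams("oil prices rise")
doc_b = char_ngrams("oil price rises")
doc_c = char_ngrams("wheat harvest")

print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True
```

Because n-grams operate on character sequences, this representation needs no tokenizer or stemmer, which is one reason it is often tried as an alternative to word-based vectors.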

01 Jan 2017
TL;DR: Results from this study reveal the potential research issues, namely morphology analysis, question classification, and term weighting algorithm for question classification in Question Answering framework.
Abstract: A Question Answering System can automatically provide an answer to a question posed by a human in natural language. Such a system consists of question analysis, document processing, and answer extraction modules. The question analysis module translates the query into a form that can be processed by the document processing module. Document processing is a technique for identifying candidate documents containing answers relevant to the user query. The answer extraction module then receives the set of passages from the document processing module and determines the best answers for the user. The challenge in optimizing a Question Answering framework is to increase the performance of all modules in the framework. Modules that have not been optimized lead to less accurate answers from question answering systems. Based on these issues, the objective of this study is to review the current state of question analysis, document processing, and answer extraction techniques. Results from this study reveal the potential research issues, namely morphology analysis, question classification, and term weighting algorithms for question classification.

Patent
24 Oct 2017
TL;DR: In this paper, the content of a target document is converted into an object capable of being recognized by a first application program; a target folder is created, the target document and the object are added into the target folder, and a pointing record pointing to the object is added into a formatted file in the target folder according to preset rules.
Abstract: The invention relates to a document processing method and system, a readable storage medium and computer equipment. The method comprises the steps that content of a target document is converted into an object capable of being recognized by a first application program; a target folder is created, the target document and the object capable of being recognized by the first application program are added into the target folder, and a pointing record pointing to the object capable of being recognized by the first application program is added into a formatted file in the target folder according to preset rules; and the target folder is compressed to obtain a replacement document with a name the same as that of the target document. During practical application of the document processing method and system, the first application program is used to open the replacement document, the replacement document is displayed by analyzing the object capable of being recognized by the first application program, and the practical application demand is met.

Journal ArticleDOI
TL;DR: This paper presents various techniques presented by different researchers for Punjabi character recognition work, noting that recognition accuracy depends upon volume of training dataset and testing dataset and may be improved by using various optimized feature selection techniques.
Abstract: Objectives: A framework for character recognition is essentially used to convert a digital image of a character into a machine-coded character. This fundamental capability can be used in numerous real-life applications. Methods/ Statistical Analysis: When classifying handwritten documents, either offline or online, character recognition is strongly influenced by the variety of styles of the same writer in different circumstances, and across different writers. Distortion and noise introduced during digitization are an additional significant issue that adversely affects recognition/classification accuracy. Findings: Recognition of handwritten Gurmukhi characters has been found to be an exceptionally difficult task. There are enormous difficulties in handwritten character recognition because of the varied writing styles of writers. This paper presents various techniques proposed by different researchers for Punjabi character recognition. It has also been observed that recognition accuracy depends upon the volume of the training and testing datasets and may be improved by using various optimized feature selection techniques. Application/ Improvements: Many research papers have been surveyed, and it is seen that work on different strategies has been attempted.

Journal ArticleDOI
TL;DR: In this article, character recognition software is proposed to perform document image analysis, which converts documents from paper format to electronic format.
Abstract: There is growing demand for software systems that recognize characters when information is scanned from paper documents, since a large number of newspapers and books on various subjects exist in printed form. This process is also called document image analysis. To use Optical Character Recognition effectively for character recognition in document image analysis, the information is handled in grid format. Document processing requires a software system called a character recognition system. The need, therefore, is to create character recognition software that performs document image analysis, converting documents from paper format to electronic format. Sometimes this document processing must handle information in languages other than English. Handwritten character recognition is also performed by training a Kohonen neural network. Keywords: character recognition, document image analysis, grid format, electronic format, neural network, handwritten character recognition.

Patent
06 Jul 2017
TL;DR: In this article, a document processing method and apparatus are disclosed in embodiment of the present application, the document is encrypted and decrypted by using a target password generated based on a geographic location, which ensures the security of password transmission, and improves the convenience of authorization and decryption.
Abstract: A document processing method and apparatus are disclosed in embodiments of the present application. The method includes: obtaining a target geographic location where the first electronic device is currently located, when a predetermined operation for a document is detected; determining whether the target geographic location matches a target password that is used to encrypt the document in advance; and if so, allowing the predetermined operation to be performed on the document; and if not, preventing the predetermined operation from being performed on the document. The apparatus includes an obtaining module, a determining module, and a decision module. In the embodiment of the present application, the document is encrypted and decrypted by using a target password generated based on a geographic location, which ensures the security of password transmission, and improves the security of the use of the document and the convenience of authorization and decryption.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: The article further explains how signal rank processing algorithm can be used to improve text recognition accuracy and experimental evidence is provided of the expediency of this algorithm for use as a text recognition means on mobile devices.
Abstract: The article looks at the current state of the mobile apps market and its potential to generate solutions for B2C and B2B segments. It also explores the issue of how mobile devices can be made capable of recognizing printed text containing images. Technologies are described for mobile application implementation. The article further explains how signal rank processing algorithm can be used to improve text recognition accuracy. Experimental evidence is provided of the expediency of this algorithm for use as a text recognition means on mobile devices.

Patent
Michael L. Yeung, Chen Jia
26 Oct 2017
TL;DR: A document processing system includes an embedded controller that has both a local area network interface for receiving documents over an associated network and a wireless Personal Area Network Interface for communicating with mobile computing devices using BLUETOOTH low power communications as discussed by the authors.
Abstract: A document processing system includes an embedded controller that has both a local area network interface for receiving documents over an associated network and a wireless personal area network interface for communicating with mobile computing devices using BLUETOOTH low power communications. The embedded controller transmits broadcast messages to nearby mobile computing devices, which send responses to the broadcast messages. Based on the RSSI of a received response, the embedded controller can determine whether a particular mobile computing device is in close proximity, indicating that a user is present at the document processing system. The embedded controller sends identification data associated with the document processing system to the mobile computing device. The user of the mobile computing device selects the document processing system to print a document, and the mobile computing device uses the identification data to route the document to the document processing system over a wireless local area network.

Proceedings Article
Yufan Yang, Yi Feng, Jidong Ge, Yemao Zhou, Jin Zeng, Chuanyi Li, Bin Luo
01 Jan 2017
TL;DR: An automatic method based on the edit distance algorithm is devised, which constructs a disparity model between different statute strings to obtain standardized writing for data of the same type.
Abstract: With the continuous advancement of the informatization of the Chinese People's Court, the court's interest in the extraction and application of information has extended from structured data to semi-structured and unstructured data. In in-depth studies of judgment documents, many cases require the document result as an important data dimension, and the statute is the core of that result, so the integrity and correctness of the extracted statute play a key role in judgment document processing. However, when a specific judgment document is written, the same statute can take different string forms due to the diversity of writing, which leads directly to errors in the data source. Comparing the edit distance between strings can measure their similarity to a certain extent. Therefore, an automatic method based on the edit distance algorithm is devised, which constructs a disparity model between different statute strings to obtain standardized writing for data of the same type. This method can remove non-standard writing of statutes and ultimately yields a standard collection of statutes. It is more efficient than enumerating all writing variants, which requires manual participation, additional data storage and updates.
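
As an illustration of the edit distance idea in the abstract above, the following is a minimal sketch that computes the Levenshtein distance and maps a variant string to its closest canonical form. It uses plain edit distance only; the paper's "disparity model" is not described in the abstract and is not reproduced here. The statute strings and the helper name `normalize` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalize(variant, canonical_forms):
    """Return the canonical form with the smallest edit distance to variant."""
    return min(canonical_forms, key=lambda c: levenshtein(variant, c))

# Hypothetical canonical statute names and one non-standard spelling.
canonical = ["Criminal Law Article 264", "Contract Law Article 107"]
print(normalize("Criminal Law, Art. 264", canonical))
# Criminal Law Article 264
```

In practice a distance threshold would also be needed, so that strings far from every canonical form are flagged for review rather than silently mapped to the nearest one.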

Patent
29 Jun 2017
TL;DR: In this paper, a system and method for monitoring document processing device operations, gauging corresponding cost, including monetary or environmental cost, and facilitating review of actual and projected costs associated with usage levels.
Abstract: A system and method for monitoring document processing device operations, gauging corresponding cost, including monetary or environmental cost, and facilitating review of actual and projected costs associated with usage levels. The system receives usage data corresponding to metered use of an associated networked document processing device by each of a plurality of users. Memory associated with the processor stores received usage data and stores relationship data corresponding to a relationship between usage data and data corresponding to an associated environmental impact. The processor applies relationship data to usage data to compute environmental impact data representative of an environmental impact corresponding to the usage data. The display then generates an image corresponding to the environmental impact data. An administrator is enabled to view historic or projected costs and manage quotas accordingly.

Patent
08 Dec 2017
TL;DR: In this paper, a data access method and device for multiple subsystems of an internet platform, and an electronic device, are presented; the method includes the steps of obtaining the data from the multiple subsystems of the internet platform and conducting document processing to generate corresponding documents.
Abstract: The invention provides a data access method and device for multiple subsystems of an internet platform, and an electronic device. The data access method includes the steps of obtaining the data from the multiple subsystems of the internet platform, and conducting document processing to generate corresponding documents; building the data index of a search engine based on the generated documents; inputting the index content into the search engine, and retrieving in the data index of the search engine; displaying data, which is output by the search engine and matched with the retrieved content, corresponding to the data index. In a data layer, by an efficient and low-coupling method, the data access between the multiple subsystems is achieved so as to contribute to finding out cheating behaviors in the internet platform.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: A framework for efficient processing of survey documents that recognizes the users' answers in each document and then analyzes the answers across all documents to produce final survey results.
Abstract: Document processing means handling documents so that the information they contain can be rendered, retrieved, or presented in an understandable way. Survey document processing is the task of analyzing survey documents to obtain people's opinions about products or services. Nowadays, surveys are conducted in two forms: online, and on printed documents. Surveys conducted on printed documents are difficult to process because automated processing tools for such documents are lacking. Considering this, in this paper we present a framework for efficient processing of survey documents. Our system works in two major steps. First, we recognize the users' answers in each document. Then, we analyze the answers across all the documents to produce the final survey results. We have evaluated our system on a sufficient number of survey documents and found that it can perform nearly accurate analysis of survey documents.

Patent
24 Nov 2017
TL;DR: A document processing method, a mobile terminal, and a computer readable storage medium are described. The method classifies a document by creating a hard link to it in a marked document folder.
Abstract: The invention discloses a document processing method, a mobile terminal and a computer readable storage medium. The document processing method comprises the steps that when a demand for classification of a document is detected, a document folder is marked for the document in a preset catalog; and a hard link document corresponding to the document is established in the marked document folder. According to the embodiment of the invention, the document is classified through a hard link mode; efficiency of document classification is increased; and storage space is saved.
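
The hard-link classification idea can be sketched with Python's standard library. This is a minimal illustration, not the patented implementation: the function name `classify_by_hard_link`, the folder layout, and the file names are assumptions. It relies on `os.link`, which requires the document and the catalog to be on the same file system.

```python
import os
import tempfile


def classify_by_hard_link(document_path, catalog_dir, category):
    """File a document under a category folder by creating a hard link.

    The link is a second name for the same file data, so no content is copied.
    """
    folder = os.path.join(catalog_dir, category)
    os.makedirs(folder, exist_ok=True)
    link_path = os.path.join(folder, os.path.basename(document_path))
    if not os.path.exists(link_path):
        os.link(document_path, link_path)
    return link_path


def demo():
    """Classify one temporary document and report what the link looks like."""
    with tempfile.TemporaryDirectory() as root:
        document = os.path.join(root, "report.txt")
        with open(document, "w", encoding="utf-8") as handle:
            handle.write("quarterly figures")
        catalog = os.path.join(root, "catalog")
        link = classify_by_hard_link(document, catalog, "finance")
        with open(link, encoding="utf-8") as handle:
            content = handle.read()
        return {
            "same_file": os.path.samefile(document, link),
            "link_count": os.stat(document).st_nlink,
            "content": content,
        }


result = demo()
print(result)
```

Since both paths refer to the same underlying file, classifying a document this way adds only a directory entry, which is consistent with the storage saving the abstract claims.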

Proceedings Article
01 Jan 2017
TL;DR: This study proposes integrating Hadoop MapReduce and Spark's MLlib to parallelize data processing for authorship identification over massive collections of document text images.
Abstract: The era of Big Data has arrived, and on average quintillions of bytes of data are produced daily. Data can take many forms, such as images, documents, or movies. Document files include digitized documents and handwritten documents, and the latter often raise issues of copyright or ownership, because improper authentication leads to illegitimate authorship claims over a particular handwritten document. Authorship identification is a sub-area of Document Image Analysis and Identification (DIAR), whose aim is to analyze and identify document authorship. However, for large-scale collections of document text images, document processing time becomes crucial for better authorship identification. Therefore, in this study, we propose an alternative solution for dealing with massive amounts of document text images by integrating Hadoop MapReduce and Spark's MLlib for authorship identification through data processing parallelization. MapReduce processing is used as the platform to preprocess these huge document text images in the Hadoop Distributed File System (HDFS), followed by authorship identification through the Apache Spark machine learning library. The experiments show that the integration is successfully implemented for large document text images. However, further improvement is needed in the post-analytics of the reduced document text images for better identification.