Showing papers on "Document processing" published in 2013


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This work presents an approach for information extraction which purely builds on end-user provided training examples and intentionally omits efficient known extraction techniques like rule based extraction that require intense training and/or information extraction expertise.
Abstract: Automatic information extraction from scanned business documents is especially valuable in the application domain of document archiving. But current systems for automated document processing still require a lot of configuration work that can only be done by experienced users or administrators. We present an approach for information extraction which purely builds on end-user provided training examples and intentionally omits efficient known extraction techniques like rule based extraction that require intense training and/or information extraction expertise. Our evaluation on a large corpus of business documents shows competitive results of above 85% F1-measure on 10 commonly used fields like document type, sender, receiver and date. The system is deployed and used inside the commercial document management system DocuWare.

82 citations


Patent
Masahito Yamamoto1
08 Jan 2013
TL;DR: A document processing system stores a plurality of items of document data, each containing metadata pertaining to its contents, together with relation information representing the relations between the items of document data.
Abstract: This invention is directed to a document processing system and control method thereof. The system stores a plurality of items of document data each containing metadata pertaining to the contents of each item of document data, and relation information representing the relations between the plurality of items of document data. When scanned image data or facsimile-received image data is input, document data related to the input image data is specified among the plurality of items of stored document data, based on the metadata contained in each item of document data. Relation information representing the relation between the input image data and the specified related document data is stored. Even document data obtained from a paper document is able to be stored as document data subjected to search processing.

79 citations


Journal ArticleDOI
TL;DR: A novel summarizer, the Yago-based Summarizer, relies on an ontology-based evaluation and selection of the document sentences and integrates an established entity recognition and disambiguation step based on the Yago ontology into the summarization process.
Abstract: Sentence-based multi-document summarization is the task of generating a succinct summary of a document collection, which consists of the most salient document sentences. In recent years, the increasing availability of semantics-based models (e.g., ontologies and taxonomies) has prompted researchers to investigate their usefulness for improving summarizer performance. However, semantics-based document analysis is often applied as a preprocessing step, rather than integrating the discovered knowledge into the summarization process. This paper proposes a novel summarizer, namely Yago-based Summarizer, that relies on an ontology-based evaluation and selection of the document sentences. To capture the actual meaning and context of the document sentences and generate sound document summaries, an established entity recognition and disambiguation step based on the Yago ontology is integrated into the summarization process. The experimental results, which were achieved on the DUC'04 benchmark collections, demonstrate the effectiveness of the proposed approach compared to a large number of competitors as well as the qualitative soundness of the generated summaries.

59 citations


Proceedings Article
10 Jun 2013
TL;DR: A new database for handwritten Arabic characters (HACDB), designed to cover all shapes of Arabic characters including overlapping ones, is introduced, which contains 6,600 shapes of characters written by 50 writers.
Abstract: Automatic off-line Arabic handwriting recognition based on segmentation still faces big challenges. A database covering all shapes of handwritten Arabic characters is required to facilitate the recognition process. This paper introduces a new database for handwritten Arabic characters (HACDB), designed to cover all shapes of Arabic characters including overlapping ones. It contains 6,600 shapes of characters written by 50 writers. This database can be used for training and testing the recognition of words after segmentation. It also makes it possible to compare different approaches and evaluate their accuracy on a common base.

49 citations


Proceedings ArticleDOI
15 Oct 2013
TL;DR: This paper integrates character n-gram language models into the spotting system in order to provide an additional language context and demonstrates that character language models significantly improve the spotting performance.
Abstract: Facing high error rates and slow recognition speed for full text transcription of unconstrained handwriting images, keyword spotting is a promising alternative to locate specific search terms within scanned document images. We have previously proposed a learning-based method for keyword spotting using character hidden Markov models that showed a high performance when compared with traditional template image matching. In the lexicon-free approach pursued, only the text appearance was taken into account for recognition. In this paper, we integrate character n-gram language models into the spotting system in order to provide an additional language context. On the modern IAM database as well as the historical George Washington database, we demonstrate that character language models significantly improve the spotting performance.

40 citations
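
The character n-gram language model mentioned in the keyword-spotting entry above can be illustrated with a minimal sketch. It covers only the language-model side (a character bigram model with add-one smoothing) and does not reproduce the paper's hidden Markov model spotting system. The boundary markers `^` and `$`, the function names, and the toy word list are assumptions made for illustration.

```python
import math
from collections import Counter

def train_char_bigram(corpus_words):
    """Count character bigrams; '^' and '$' are hypothetical start/end markers."""
    bigrams, contexts, alphabet = Counter(), Counter(), set()
    for word in corpus_words:
        padded = "^" + word + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(alphabet)

def log_prob(word, model):
    """Add-one smoothed log-probability of a word under the bigram model."""
    bigrams, contexts, vocab_size = model
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total

# Toy training data (illustrative only, not from the IAM or George Washington sets).
model = train_char_bigram(["the", "then", "there", "other", "that"])
plausible = log_prob("the", model)
implausible = log_prob("hte", model)
```

In a spotting system, a score of this kind would typically be combined with the appearance-based score, so that character sequences that are unlikely in the language are penalized. How the two are weighted is not specified here.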


Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper presents an automatic forgery detection method based on a document's intrinsic features at character level, relying on the one hand on outlier character detection in a discriminant feature space and on the other hand on the detection of strictly similar characters.
Abstract: Paper documents still account for a large share of the information media in use today and may contain critical data. Even though official documents are secured with techniques such as printed patterns or artwork, paper documents suffer from a lack of security. Moreover, the high availability of cheap scanning and printing hardware allows non-experts to easily create fake documents. As the use of a watermarking system added during the document production step is hardly possible, solutions have to be proposed to distinguish a genuine document from a forged one. In this paper, we present an automatic forgery detection method based on a document's intrinsic features at character level. This method is based on the one hand on outlier character detection in a discriminant feature space and on the other hand on the detection of strictly similar characters. To this end, a feature set is computed for all characters. Then, based on a distance between characters of the same class, each character is classified as genuine or fake.

39 citations


Journal Article
TL;DR: This paper focuses on recognition of the English alphabet in a given scanned text document with the help of neural networks and uses character extraction and edge detection algorithms for training the neural network to classify and recognize the handwritten characters.
Abstract: Recognition of handwritten text has been one of the active and challenging areas of research in the field of image processing and pattern recognition. It has numerous applications, including reading aids for the blind, bank cheque processing, and conversion of any handwritten document into structural text form. In this paper we focus on recognition of the English alphabet in a given scanned text document with the help of neural networks. Using the MATLAB Neural Network Toolbox, we tried to recognize handwritten characters by projecting them on different sized grids. The first step is image acquisition, which acquires the scanned image, followed by noise filtering, smoothing and normalization of the scanned image, rendering the image suitable for segmentation, where the image is decomposed into sub-images. Feature extraction improves the recognition rate and reduces misclassification. We use character extraction and edge detection algorithms for training the neural network to classify and recognize the handwritten characters.

34 citations


Patent
27 Nov 2013

30 citations


Patent
26 Sep 2013
TL;DR: Mechanisms are provided for generating section metadata for an electronic document by correlating concepts within its textual content to identify concept groups and determine sections of text.
Abstract: Mechanisms are provided for generating section metadata for an electronic document. These mechanisms receive a document and analyze the document to identify concepts present within textual content of the document. The mechanisms correlate concepts within the textual content with one another to identify concept groups based on the application of one or more rules defining related concepts or concept patterns. The mechanisms determine sections of text within the textual content based on the correlation of concepts within the textual content. Based on results of the determining, the mechanisms generate section metadata for the document and store the section metadata in association with the document for use by a document processing system.

30 citations


Proceedings ArticleDOI
25 Aug 2013
TL;DR: A system is proposed that identifies individual characters as English or Gurumukhi, and as character or numeral, using Gabor and gradient features with a multi-class SVM; promising results with both types of features are reported.
Abstract: Character recognition problems of distinct scripts have their own script-specific characteristics. State-of-the-art optical character recognition systems use different methodologies to recognize characters of different scripts, each being most effective for its corresponding script. The identification of the script of an individual character has not received much attention from researchers; most script identification work is at the document, line and word level. In this multilingual/multiscript world, the presence of characters of different scripts in a single document is very common. We propose a system to handle this situation in the context of English and Gurumukhi scripts. Experiments on multifont and multisized characters are reported, using Gabor features based on directional frequency and gradient features based on the gradient information of an individual character, to identify it as Gurumukhi or English and also as character or numeral. Treating this as a four-class classification problem, a multi-class Support Vector Machine (one-vs-one) has been used for classification. We obtained promising results with both types of features. The average identification rates obtained with Gabor and gradient features are 98.9% and 99.45% respectively.

28 citations


Journal ArticleDOI
TL;DR: An algorithm for segmentation of connected handwritten digits based on the selection of feature points, through a skeletonization process, and the clustering of the touching region via Self-Organizing Maps is presented.
Abstract: Segmentation is an important issue in document image processing systems as it can break a sequence of characters into its components. Its application over digits is common in bank checks, mail and historical document processing, among others. This paper presents an algorithm for segmentation of connected handwritten digits based on the selection of feature points, through a skeletonization process, and the clustering of the touching region via Self-Organizing Maps. The segmentation points are then found, leading to the final segmentation. The method can deal with several types of connection between the digits, having also the ability to map multiple touching. The proposed algorithm achieved encouraging results, both relating to other state-of-the-art algorithms and to possible improvements.

Patent
Chikashi Sugiura1
08 Feb 2013
TL;DR: In this article, an electronic apparatus includes a line recognition module, a character recognition module and a generator, and the generator generates, if the first and second lines satisfy a condition, document data using first character codes corresponding to the first line and second character codes corresponding to the second line.
Abstract: According to one embodiment, an electronic apparatus includes a line recognition module, a character recognition module and a generator. The line recognition module recognizes lines in a handwritten document. The character recognition module recognizes character codes corresponding to handwritten characters in a first line and a second line which follows the first line. The generator generates, if the first and second lines satisfy a condition, document data using first character codes corresponding to the first line and second character codes corresponding to the second line, the formed document data including either one of the first character codes at a position of the second line or including at least one of the second character codes at a position of the first line.

Patent
03 Apr 2013
TL;DR: An approach to managing electronic notes is presented that determines at least one particular action to be performed based on the content of at least one of the notes and automatically performs the action.
Abstract: Managing electronic notes includes storing data for at least one of the electronic notes, determining at least one particular action to be performed based on the content of the at least one of the notes, and automatically performing the at least one particular action. The at least one particular action may be determined automatically based on data stored in the at least one of the notes or may be determined by a user providing input to select an action. Actions may be recommended to a user based on at least one of: the extracted terms and the additional online information. Managing electronic notes may also include storing additional data and instructions to perform the at least one particular action.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper describes the Arabic Recognition Competition: Multi-font Multi-size Digitally Represented Text held in the context of the 12th International Conference on Document Analysis and Recognition (ICDAR'2013), during August 25-28, 2013, Washington DC, United States of America.
Abstract: This paper describes the Arabic Recognition Competition: Multi-font Multi-size Digitally Represented Text held in the context of the 12th International Conference on Document Analysis and Recognition (ICDAR'2013), during August 25-28, 2013, Washington DC, United States of America. This competition has used the freely available Arabic Printed Text Image (APTI) database. A first edition took place in ICDAR'2011. In this edition, four groups with six systems are participating in the competition. The systems are compared using the recognition rates at character and word levels. The systems were tested in a blind manner using set 6 of APTI database. A short description of the participating groups, their systems, the experimental setup, and the observed results are presented.

Proceedings ArticleDOI
26 Mar 2013
TL;DR: This paper compares different techniques that have been used to extract the features of Arabic handwriting scripts in online recognition systems and explains the structure and strategy of the reviewed techniques.
Abstract: Online recognition of Arabic handwritten text has been an ongoing research problem for many years. More generally, the field of online text recognition has been gaining interest lately due to the increasing popularity of hand-held computers, digital notebooks and advanced cellular phones. Most online text recognition systems consist of three main phases: preprocessing, feature extraction, and recognition. This paper compares different techniques that have been used to extract the features of Arabic handwriting scripts in online recognition systems. Those techniques attempt to extract the feature vector of Arabic handwritten words, characters, numbers or strokes. This vector is then fed into the recognition engine to recognize the pattern. The structure and strategy of the reviewed techniques are explained in this article. The strengths and weaknesses of these techniques are also discussed.

Patent
24 Jul 2013
TL;DR: A method and system implement storing one or more encrypted electronic documents and document information associated therewith, organizing the one or more electronic documents to facilitate access by a user, and enabling remote secure access to the one or more electronic documents through a user device.
Abstract: A method and system implement storing one or more encrypted electronic documents and document information associated therewith, organizing the one or more electronic documents to facilitate access by a user, and enabling remote secure access to the one or more electronic documents through a user device. The one or more electronic documents are a copy of one or more physical documents or a copy of a document that is not a physical document. The document information of an electronic document includes information on a location of the physical document. The electronic document(s) and the document information are stored in separate storage databases.

Proceedings ArticleDOI
04 Feb 2013
TL;DR: The advantages of the proposed adaptation of object recognition for image documents are that it does not use character recognition or segmentation and that it is robust to rotation, scale, illumination, blur, noise and local distortions.
Abstract: This article presents a method to recognize and to localize semi-structured documents such as ID cards, tickets, invoices, etc. Standard object recognition methods based on interest points work well on natural images but fail on document images because of repetitive patterns like text. In this article, we propose an adaptation of object recognition for image documents. The advantages of our method are that it does not use character recognition or segmentation and that it is robust to rotation, scale, illumination, blur, noise and local distortions. Furthermore, tests show that an average precision of 97.2% and recall of 94.6% are obtained for matching 7 different kinds of documents in a database of 2155 documents.

Patent
15 Mar 2013
TL;DR: A method for processing electronic documents is presented, which includes: receiving a plurality of electronic documents stored in a file container created based on a file system; retrieving metadata from the file container, the metadata indicating forensic information about the plurality of electronic documents; applying an interactive filtering to the metadata according to user inputs; and selectively extracting one or more electronic documents from the file container according to results of the interactive filtering.
Abstract: There is provided a method for processing electronic documents. The method includes: receiving a plurality of electronic documents stored in a file container created based on a file system; retrieving metadata from the file container, the metadata indicating forensic information about the plurality of electronic documents; applying an interactive filtering to the metadata according to user inputs; and selectively extracting one or more electronic documents from the file container according to results of the interactive filtering.

Patent
04 Dec 2013
TL;DR: A paperless conference system and method based on internal working automation is presented in this article, which consists of a conference server device, a number of conference end devices, a customer service end device, an electronic table card device, and an image acquiring and outputting device.
Abstract: The invention provides a paperless conference system and method based on internal working automation. The system comprises a conference server device, a number of conference end devices, a number of customer service end devices, a number of secretary end devices, a number of electronic table card devices and a number of image acquiring and outputting devices. The conference server device provides data encryption storing, interactive processing and interface on screen outputting. Each conference end device provides the functions of attendance, voting, questionnaire, conference data viewing, video synchronization and call service. Each customer service end device responds to the service call of a participant. Each secretary end device is provided with conference type, data inputting, control video connection, statistical statement checking and conference hall music controlling. Each electronic table card device displays the participant name, and has the functions of conference data querying, call service and desktop OA connection. Each image acquiring and outputting device provides high definition video inputting and projection outputting. The system and the method realize full range conference document processing, and have the characteristics of high security, simplified steps and high efficiency.

Patent
08 Mar 2013
TL;DR: A device, a system and a method for multi-modal identity recognition are presented. The device includes a face recognition unit, a voice recognition unit and a control unit.
Abstract: A device, a system and a method are provided for multi-modal identity recognition. The device includes a face recognition unit, a voice recognition unit, and a control unit. The face recognition unit is configured for generating a first recognition result by obtaining and processing face recognition information of a customer and by comparing the processed face recognition information with face recognition information stored in a facial feature database. The voice recognition unit is configured for generating a second recognition result by obtaining and processing voice recognition information of a customer and by comparing the processed voice recognition information with voice recognition information stored in an audio signature database. The control unit is configured for confirming an identity of the customer based on the first recognition result and the second recognition result.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A novel approach to evaluating reading order results generated by layout analysis methods is proposed, which incorporates region correspondence analysis; a reading order representation scheme that allows the grouping of objects with ordered and/or unordered relations is also presented.
Abstract: Reading order detection and representation is an important task in many digitisation scenarios involving the preservation of the logical structure of a document. The corresponding need for the evaluation of reading order results generated by layout analysis methods poses a particular challenge due to potential deviations between ground truth and actually detected segmentation of the page. To this end a novel evaluation approach that responds to this problem by incorporating region correspondence analysis is proposed. Furthermore, a sophisticated reading order representation scheme is presented and used by the system allowing the grouping of objects with ordered and/or unordered relations. This is a typical requirement for documents with complex layouts such as magazines and newspapers. The evaluation method has been validated using the results of two state-of-the-art OCR / layout analysis systems and a basic top-to-bottom reading order detection algorithm applied on representative samples from the PRImA contemporary and the IMPACT historical document datasets.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper deals with segmentation and recognition of online handwritten Bangla cursive text containing basic and compound characters and all types of modifiers, and discovered some rules analyzing different joining patterns of Bangla characters.
Abstract: Recognition of Bangla compound characters has rarely got attention from researchers. This paper deals with segmentation and recognition of online handwritten Bangla cursive text containing basic and compound characters and all types of modifiers. Here, at first, we segment cursive words into primitives. Next primitives are recognized. A primitive may represent a character/compound character or a part of a character/compound character having meaningful structural information or a part incurred while joining two characters. We manually analyzed all the input texts written by different groups of people to create a ground truth set of distinct classes of primitives for result verification and we obtained 251 valid primitive classes. For automatic segmentation of text into primitives, we discovered some rules analyzing different joining patterns of Bangla characters. Applying these rules and using combination of online and offline information the segmentation technique was proposed. We achieved correct primitive segmentation rate of 97.89% from the 4984 online words. Directional features were used in SVM for recognition and we achieved average primitive recognition rate of 97.45%.

Proceedings ArticleDOI
14 Nov 2013
TL;DR: A series of classifiers namely Logistic Model Tree, Random Forest, Multi Layer Perceptron, Sequential Minimal Optimization, LibLINEAR, RBFNetwork and Fuzzy Unordered Rule Induction Algorithm are applied on the feature set to classify among the six handwritten scripts and the results are compared.
Abstract: For a multi-script, multilingual country like India, script identification is a complex real-life problem for the automation of document processing. Handwritten script identification is much more complex than identification of printed script. Here, scripts from multi-script handwritten documents are identified, and performance is then compared using different well-known classifiers. We followed a two-stage approach. Firstly, we identified six scripts used for writing six official languages of India in the handwritten domain, which were easily available to us. Using some abstract/mathematical features, structure-based features and script-dependent features at document level, a 41-dimensional feature set is prepared. Then, a series of classifiers, namely Logistic Model Tree, Random Forest, Multi Layer Perceptron, Sequential Minimal Optimization, LibLINEAR, RBFNetwork and Fuzzy Unordered Rule Induction Algorithm, are applied on the feature set to classify among the six handwritten scripts, and the results are compared. Among all these classifiers, Logistic Model Tree shows the highest accuracy rate of 91.2% with 5-fold cross-validation, whereas the SMO model has the lowest convergence time of 0.05s.

Patent
19 Mar 2013
TL;DR: A PDF document recognition method is proposed that comprises the following steps: S1: analyzing path objects in a PDF document and recognizing forms in the PDF document; S2: analyzing text objects outside table areas in the PDF document and recognizing text content in the PDF document; S3: writing the recognition results into a temporary file, or writing the recognition results into a PDF file in the form of an attachment.
Abstract: The invention discloses a PDF document recognition method. The method comprises the following steps: S1: analyzing path objects in a PDF document, and recognizing forms in the PDF document; S2: analyzing text objects outside table areas in the PDF document, and recognizing text content in the PDF document; S3: writing recognition results into a temporary file, or writing the recognition results into a PDF file in the form of an attachment. With the aid of the PDF document recognition method, objects such as the forms, paragraphs, titles, lists and the like in the PDF document can be recognized, so that the PDF document can be edited with one paragraph as the unit, labels can be added to the PDF document conveniently, the reading order can be determined, and persons with dysopia can read conveniently; meanwhile, documents in other formats can be exported according to the recognition results, so that users can read and edit the PDF document conveniently.

Patent
29 Jul 2013
TL;DR: In this article, a document processing engine identifies a zone address token, which is matched to the corresponding geo-fenced zone and a device configured with the processing engine can then link to the target zone or target zone service via the address.
Abstract: Systems and methods of accessing and managing geo-fence zones are presented. Specific geo-fence zone addresses can be recognized by a document processing engine. Based on address identification rules, the processing engine identifies a zone address token, which is matched to the corresponding geo-fenced zone. A device configured with the processing engine can then link to the target zone or target zone service via the address.

Patent
Steven Sampson1
03 Apr 2013
TL;DR: In this paper, a matrix is created for words included in documents and distances are identified between pairs of the words, each distance is based on a number of the documents that differ in including a corresponding pair of words.
Abstract: Creating subgroups of documents using optical character recognition data is described. A matrix is created for words included in documents. Each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination. Distances are identified between pairs of the words. Each distance is based on a number of the documents that differ in including a corresponding pair of the words. Word clusters are created. Each word cluster includes pairs of words associated with a corresponding distance less than a distance threshold. Sets of word clusters are created. A set of word clusters includes word clusters that are not associated with any of the documents associated with other word clusters in the set. Subgroups of the digitized documents are created based on a set of word clusters with a highest word score.
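
The matrix and distance steps described in the abstract above can be sketched as follows. This is a minimal illustration and not the patented method: it covers only the binary word-by-document matrix, the pairwise distance (the number of documents that contain exactly one of the two words), and the thresholded pairing. Cluster sets, word scores and subgroup creation are omitted. The function names, the toy documents and the threshold value are assumptions.

```python
from itertools import combinations

def build_matrix(docs):
    """Binary word-by-document matrix: matrix[word][j] is 1 if word occurs in docs[j]."""
    doc_sets = [set(d.split()) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    return {w: [1 if w in s else 0 for s in doc_sets] for w in vocab}

def pair_distance(matrix, w1, w2):
    """Number of documents that contain exactly one of the two words."""
    return sum(a != b for a, b in zip(matrix[w1], matrix[w2]))

def close_pairs(matrix, threshold):
    """Word pairs whose distance is below the (hypothetical) threshold."""
    return [(a, b) for a, b in combinations(sorted(matrix), 2)
            if pair_distance(matrix, a, b) < threshold]

# Toy OCR output for four documents (illustrative only).
docs = [
    "invoice total amount",
    "invoice amount due",
    "resume skills education",
    "resume education",
]
matrix = build_matrix(docs)
pairs = close_pairs(matrix, threshold=1)
```

With this toy input, words that always occur in the same documents end up paired, which is the intuition behind grouping documents by shared vocabulary.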

Proceedings ArticleDOI
22 Jun 2013
TL;DR: This study investigates the recent work for character segmentation and challenges for segmentation for Arabic script based languages.
Abstract: Segmentation-based character recognition for Arabic script based languages has been a popular field of research for many years. The challenging nature of Arabic script recognition has attracted the attention of researchers from both industry and academic circles, but these efforts have not achieved good results until now. Segmentation of Urdu script written in the Nasta'liq writing style is a very difficult task due to the complexity of the writing style as compared to the Naskh writing style. Good segmentation is one of the reasons for high accuracy. Character segmentation has been a critical phase of the OCR process. The higher recognition rates for isolated characters as compared to the results for words or connected characters illustrate the importance of segmentation. The current study investigates recent work on character segmentation and the challenges of segmentation for Arabic script based languages.

01 Jan 2013
TL;DR: A system for English handwriting recognition based on 40-point feature extraction of the character and a multilayer feed-forward neural network, suitable for converting handwritten documents into structural text form and recognizing handwritten names.
Abstract: We present in this paper a system of English handwriting recognition based on 40-point feature extraction of the character. Basically an off-line handwritten alphabetical character recognition system using multilayer feed forward neural network has been described in our work. Firstly a new method, called, 40-point feature extraction is introduced for extracting the features of the handwritten alphabets. Secondly, we use the data to train the artificial neural network. In the end, we test the artificial neural network and conclude that this method has a good performance at handwritten character recognition. This system will be suitable for converting handwritten documents into structural text form and recognizing handwritten names.

Proceedings ArticleDOI
06 Aug 2013
TL;DR: A new multi-step skew detection technique for printed Arabic documents that exploits the unique property of the writing line of Arabic script and is based on connected component analysis and projection profiles is proposed.
Abstract: Document skew correction is one of the core preprocessing steps in document analysis systems. In this paper, the author proposes a new multi-step skew detection technique for printed Arabic documents. The technique exploits the unique property of the writing line of Arabic script and is based on connected component analysis and projection profiles. The proposed technique works for different types of Arabic documents having text and non-text zones with unrestricted layout. Moreover, due to the multi-step approach, it can detect skews with a resolution of ±0.05 degrees. Experiments conducted on different Arabic documents show the effectiveness of the technique.
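
The projection-profile idea mentioned in the abstract above can be illustrated with a generic sketch. It is not the paper's multi-step, Arabic-specific technique and it omits connected component analysis. It searches a coarse grid of candidate angles and keeps the one whose horizontal projection profile is sharpest. The point-based representation, the scoring function (sum of squared row counts), the angle range and the synthetic data are all assumptions.

```python
import math
from collections import Counter

def profile_score(points, angle_deg):
    """Sum of squared row counts after rotating foreground points by -angle."""
    a = math.radians(angle_deg)
    rows = Counter(round(y * math.cos(a) - x * math.sin(a)) for x, y in points)
    return sum(c * c for c in rows.values())

def estimate_skew(points, max_angle=5.0, step=0.5):
    """Coarse search for the angle whose projection profile is most peaked."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda ang: profile_score(points, ang))

# Synthetic page: five text baselines, 20 units apart, skewed by 2 degrees.
true_skew = 2.0
slope = math.tan(math.radians(true_skew))
points = [(x, k * 20 + x * slope) for k in range(5) for x in range(200)]
estimated = estimate_skew(points)
```

A finer second pass around the coarse estimate, with a smaller step, would be one way to approach a higher angular resolution. Whether that matches the multi-step scheme of the paper is not something the abstract confirms.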

Patent
07 May 2013
TL;DR: A method, a system, and a computer program product for evaluating a handwritten document comprising one or more text fields are provided; the method includes identifying a character in each of the one or more text fields in a digital image by applying a character recognition technique.
Abstract: A method, a system, and a computer program product for evaluating a handwritten document comprising one or more text fields are provided. The method includes identifying a character in each of the one or more text fields in a digital image by applying a character recognition technique. The character type of the identified character is then compared with a predefined character type corresponding to the associated text field of the one or more text fields. The character type in each of the one or more text fields is then validated based on the comparison. Thereafter, the identified character for each of the one or more text fields is recommended during digitalization of the handwritten document.
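
The comparison and validation steps in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented method: character recognition is taken as already done, the character types are reduced to `digit`, `letter` and `other`, and the field names and sample values are hypothetical.

```python
def char_type(ch):
    """Classify a recognized character as 'digit', 'letter', or 'other'."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"

def validate_fields(recognized, expected_types):
    """Map each field name to True if every recognized character has the expected type."""
    return {
        name: all(char_type(ch) == expected_types[name] for ch in text)
        for name, text in recognized.items()
    }

# Hypothetical recognizer output: the zip field contains the letter O instead of the digit 0.
recognized = {"name": "Alice", "zip": "9O210"}
expected_types = {"name": "letter", "zip": "digit"}
result = validate_fields(recognized, expected_types)
```

A field that fails validation could then be flagged for review, or a recognition candidate of the expected type could be suggested in its place.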