Showing papers on "Document processing published in 2011"

PDF

Open Access

Proceedings Article•DOI•

Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning

[...]

Adam Coates¹, Blake Carpenter¹, Carl Case¹, Sanjeev Satheesh¹, Bipin Suresh¹, Tao Wang¹, David J. Wu¹, Andrew Y. Ng¹ - Show less +4 more•Institutions (1)

Stanford University¹

18 Sep 2011

TL;DR: This paper applies large-scale algorithms for learning the features automatically from unlabeled data to construct highly effective classifiers for both detection and recognition to be used in a high accuracy end-to-end system.

...read moreread less

Abstract: Reading text from photographs is a challenging problem that has received a significant amount of attention. Two key components of most systems are (i) text detection from images and (ii) character recognition, and many recent methods have been proposed to design better feature representations and models for both. In this paper, we apply methods recently developed in machine learning -- specifically, large-scale algorithms for learning the features automatically from unlabeled data -- and show that they allow us to construct highly effective classifiers for both detection and recognition to be used in a high accuracy end-to-end system.

...read moreread less

402 citations

Patent•

Distributed document co-authoring and processing

[...]

Oudi Antebi, Roy Antebi, Reem Bensimhon, Lev Waisberg, Arthur Teplitzki - Show less +1 more

21 Jan 2011

TL;DR: In this article, a method and a device are disclosed including plug-in software components that are integrated with document processing software suites, which provide a set of integrated interfaces for collaborative document processing in conjunction with multiple remote file, data, and application service providers.

...read moreread less

Abstract: A method and a device are disclosed including plug-in software components that are integrated with document processing software suites. The plug-in software components provide a set of integrated interfaces for collaborative document processing in conjunction with multiple remote file, data, and application service providers. The set of interfaces enable coauthoring a document, document merging, discovering and displaying context-sensitive metadata on a software dashboard based on permissions associated with the metadata and/or a client computing device, caching, symmetric distributed document merge with the multiple service providers, and integrated search and insertion of multimedia data in documents, among others. The documents typically include, but are not limited to formatted text documents, spreadsheet documents, and slide presentation documents.

...read moreread less

131 citations

Patent•

Systems for mobile image capture and remittance processing

[...]

Grigori Nepomniachtchi, James DeBello, Josh Roach

17 Oct 2011

TL;DR: In this paper, a method and system for document image capture and processing using mobile devices is presented, where the image is optimized and enhanced for data extraction from the document as depicted.

...read moreread less

Abstract: The present invention relates to automated document processing and more particularly, to methods and systems for document image capture and processing using mobile devices. In accordance with various embodiments, methods and systems for document image capture on a mobile communication device are provided such that the image is optimized and enhanced for data extraction from the document as depicted. These methods and systems may comprise capturing an image of a document using a mobile communication device; transmitting the image to a server; and processing the image to create a bi-tonal image of the document for data extraction. Additionally, these methods and systems may comprise capturing a first image of a document using the mobile communication device; automatically detecting the document within the image; geometrically correcting the image; binarizing the image; correcting the orientation of the image; correcting the size of the image; and outputting the resulting image of the document.

...read moreread less

109 citations

Patent•

System and Method Implementing a Text Analysis Service

[...]

Greg Holmberg

15 Nov 2011

TL;DR: In this paper, a text analysis task object that includes instructions regarding a document processing pipeline and a document identifier is generated by accessing, by a worker system, the task object and generating the pipeline according to the instructions.

...read moreread less

Abstract: One embodiment includes a computer implemented method of processing documents. The method includes generating a text analysis task object that includes instructions regarding a document processing pipeline and a document identifier. The method further includes accessing, by a worker system, the text analysis task object and generating the document processing pipeline according to the instructions. The method further includes performing text analysis using the document processing pipeline on a document identified by the document identifier.

...read moreread less

78 citations

Posted Content•

A Review of Research on Devnagari Character Recognition

[...]

Vikas J. Dongre, Vijay H. Mankar

13 Jan 2011-arXiv: Computer Vision and Pattern Recognition

TL;DR: An overview of DOCR systems is presented and the available DOCR techniques are reviewed in this article, where the current status of the DOCR is discussed and directions for future research are suggested.

...read moreread less

Abstract: English Character Recognition (CR) has been extensively studied in the last half century and progressed to a level, sufficient to produce technology driven applications. But same is not the case for Indian languages which are complicated in terms of structure and computations. Rapidly growing computational power may enable the implementation of Indic CR methodologies. Digital document processing is gaining popularity for application to office and library automation, bank and postal services, publishing houses and communication technology. Devnagari being the national language of India, spoken by more than 500 million people, should be given special attention so that document retrieval and analysis of rich ancient and modern Indian literature can be effectively done. This article is intended to serve as a guide and update for the readers, working in the Devnagari Optical Character Recognition (DOCR) area. An overview of DOCR systems is presented and the available DOCR techniques are reviewed. The current status of DOCR is discussed and directions for future research are suggested.

...read moreread less

69 citations

Proceedings Article•DOI•

Neural network based handwritten character recognition system without feature extraction

[...]

J. Pradeep¹, E. Srinivasan¹, S. Himavathi¹•Institutions (1)

Pondicherry Engineering College¹

18 Mar 2011

TL;DR: An attempt is made to recognize handwritten characters for English alphabets without feature extraction using multilayer Feed Forward neural network, which yields good recognition rates which are comparable to that of feature extraction based schemes for handwritten character recognition.

...read moreread less

Abstract: Handwriting recognition has been one of the active and challenging research areas in the field of image processing and pattern recognition. It has numerous applications which include, reading aid for blind, bank cheques and conversion of any hand written document into structural text form. In this paper an attempt is made to recognize handwritten characters for English alphabets without feature extraction using multilayer Feed Forward neural network. Each character data set contains 26 alphabets. Fifty different character data sets are used for training the neural network. The trained network is used for classification and recognition. In the proposed system, each character is resized into 30×20 pixels, which is directly subjected to training. That is, each resized character has 600 pixels and these pixels are taken as features for training the neural network. The results show that the proposed system yields good recognition rates which are comparable to that of feature extraction based schemes for handwritten character recognition.

...read moreread less

53 citations

Proceedings Article•DOI•

Offline handwritten character recognition using neural network

[...]

Anshul Gupta¹, Manisha Srivastava¹, Chitralekha Mahanta¹•Institutions (1)

Indian Institute of Technology Guwahati¹

01 Dec 2011

TL;DR: This paper adopts segmentation based handwritten word recognition where neural networks are used to identify individual characters and post processing technique that uses lexicon is employed to improve the overall recognition accuracy.

...read moreread less

Abstract: Character Recognition (CR) has been an active area of research in the past and due to its diverse applications it continues to be a challenging research topic. In this paper, we focus especially on offline recognition of handwritten English words by first detecting individual characters. The main approaches for offline handwritten word recognition can be divided into two classes, holistic and segmentation based. The holistic approach is used in recognition of limited size vocabulary where global features extracted from the entire word image are considered. As the size of the vocabulary increases, the complexity of holistic based algorithms also increases and correspondingly the recognition rate decreases rapidly. The segmentation based strategies, on the other hand, employ bottom-up approaches, starting from the stroke or the character level and going towards producing a meaningful word. After segmentation the problem gets reduced to the recognition of simple isolated characters or strokes and hence the system can be employed for unlimited vocabulary. We here adopt segmentation based handwritten word recognition where neural networks are used to identify individual characters. A number of techniques are available for feature extraction and training of CR systems in the literature, each with its own superiorities and weaknesses.We explore these techniques to design an optimal offline handwritten English word recognition system based on character recognition. Post processing technique that uses lexicon is employed to improve the overall recognition accuracy.

...read moreread less

45 citations

Patent•

Document processing job control via a mobile device

[...]

Kaoru Watanabe¹, Guiluan Luo, Zhenyu Lu, Jiang Hong•Institutions (1)

Ricoh¹

24 Feb 2011

TL;DR: In this article, a mobile device scans identification information of a document processing device, generates job settings data and sends them to a network service, where the network service stores the job settings in a repository, generates a job identifier, associates the job identifier with the job sets, and sends the job identifiers to the mobile device.

...read moreread less

Abstract: Techniques are provided for creating a document processing job without manually inputting information to a document processing device. A mobile device scans identification information of a document processing device, generates job settings data and sends them to a network service. In response to receiving the job settings data, the network service stores the job settings data in a repository, generates a job identifier, associates the job identifier with the job settings, and sends the job identifier to the mobile device. The mobile device receives the job identifier and sends it to the document processing device. The document processing device uses the job identifier to retrieve the job settings data from the network service. In response to receiving the job settings data, the document processing device processes one or more documents according to the job settings.

...read moreread less

43 citations

Proceedings Article•DOI•

A Benchmark Kannada Handwritten Document Dataset and Its Segmentation

[...]

Alireza Alaei¹, P. Nagabhushan¹, Umapada Pal²•Institutions (2)

University of Mysore¹, Indian Statistical Institute²

18 Sep 2011

TL;DR: An unconstrained Kannada handwritten text database (KHTD) is introduced and two types of ground truths based on pixels information and content information are generated for the database, which can be utilized in many areas of document image processing such as sentence recognition/understanding, text-line segmentation, word segmentation; and character segmentation.

...read moreread less

Abstract: Research towards Indian handwritten document analysis achieved increasing attention in recent years In pattern recognition and especially in handwritten document recognition, standard databases play vital roles for evaluating performances of algorithms and comparing results obtained by different groups of researchers For Indian languages, there is a lack of standard database of handwritten texts to evaluate performance of different document recognition approaches and for comparison purpose In this paper, an unconstrained Kannada handwritten text database (KHTD) is introduced The KHTD contains 204 handwritten documents of four different categories written by 51 native speakers of Kannada Total number of text-lines and words in the dataset are 4298 and 26115, respectively In most of text-pages of the KHTD contains either an overlapping or a touching text-lines and the average number of text-lines in each document on the database is 21 Two types of ground truths based on pixels information and content information are generated for the database Providing these two types of ground truths for the KHTD, it can be utilized in many areas of document image processing such as sentence recognition/understanding, text-line segmentation, word segmentation, word recognition, and character segmentation To provide a framework for other researches, recent text-line segmentation results on this dataset are also reported The KHTD is available for research purposes

...read moreread less

42 citations

Patent•

Distributed document processing and management

[...]

Oudi Antebi, Roy Antebi, Reem Bensimhon, Lev Waisberg, Arthur Teplitzki - Show less +1 more

21 Jan 2011

TL;DR: In this paper, a method and a device are disclosed including plug-in software components that are integrated with document processing software suites, which provide a set of integrated interfaces for collaborative document processing in conjunction with multiple remote file, data, and application service providers.

...read moreread less

Abstract: A method and a device are disclosed including plug-in software components that are integrated with document processing software suites. The plug-in software components provide a set of integrated interfaces for collaborative document processing in conjunction with multiple remote file, data, and application service providers. The set of interfaces enable coauthoring a document, document merging, displaying context-sensitive metadata on a software dashboard, caching, symmetric distributed document merge with the multiple service providers, and integrated search and insertion of multimedia data in documents, among others. The documents typically include, but are not limited to formatted text documents, spreadsheet documents, and slide presentation documents.

...read moreread less

39 citations

Book Chapter•DOI•

The IUPR dataset of camera-captured document images

[...]

Syed Saqib Bukhari¹, Faisal Shafait², Thomas M. Breuel¹•Institutions (2)

Kaiserslautern University of Technology¹, German Research Centre for Artificial Intelligence²

22 Sep 2011

TL;DR: A new dataset (the IUPR dataset) of camera-captured document images that contains images from different varieties of technical and non-technical books with more challenging problems, like different types of layouts, large variety of curl, wide range of perspective distortions, and high to low resolution.

...read moreread less

Abstract: Major challenges in camera-base document analysis are dealing with uneven shadows, high degree of curl and perspective distortions. In CBDAR 2007, we introduced the first dataset (DFKI-I) of camera-captured document images in conjunction with a page dewarping contest. One of the main limitations of this dataset is that it contains images only from technical books with simple layouts and moderate curl/skew. Moreover, it does not contain information about camera's specifications and settings, imaging environment, and document contents. This kind of information would be more helpful for understanding the results of the experimental evaluation of camera-based document image processing (binarization, page segmentation, dewarping, etc.). In this paper, we introduce a new dataset (the IUPR dataset) of camera-captured document images. As compared to the previous dataset, the new dataset contains images from different varieties of technical and non-technical books with more challenging problems, like different types of layouts, large variety of curl, wide range of perspective distortions, and high to low resolutions. Additionally, the document images in the new dataset are provided with detailed information about thickness of books, imaging environment and camera's viewing angle and its internal settings. The new dataset will help research community to develop robust camera-captured document processing algorithms in order to solve the challenging problems in the dataset and to compare different methods on a common ground.

...read moreread less

Book Chapter•DOI•

Review on OCR for Handwritten Indian Scripts Character Recognition

[...]

Munish Kumar, Manish Kumar Jindal¹, Rajendra Kumar Sharma²•Institutions (2)

Panjab University, Chandigarh¹, Thapar University²

23 Sep 2011

TL;DR: A survey on OCR of Devanagari, Bangla, Tamil, Oriya and Gurmukhi handwritten scripts is presented, finding no complete OCR system is available for recognition of handwritten text in any Indian script, in general.

...read moreread less

Abstract: Natural language processing and pattern recognition have been successfully applied to Optical Character Recognition (OCR). Character recognition is an important area in pattern recognition. Character recognition can be printed or handwritten. Handwritten character recognition can be offline or online. Many researchers have been done work on handwritten character recognition from the last few years. As compared to non-Indian scripts, the research on OCR of handwritten Indian scripts has not achieved that perfection. There are large numbers of systems available for handwritten character recognition for non-Indian scripts. But there is no complete OCR system is available for recognition of handwritten text in any Indian script, in general. Few attempts have been carried out on the recognition of Devanagari, Bangla, Tamil, Oriya and Gurmukhi handwritten scripts. In this paper, we presented a survey on OCR of these most popular Indian scripts.

...read moreread less

Journal Article•DOI•

Data-intensive document clustering on graphics processing unit (GPU) clusters

[...]

Yongpeng Zhang¹, Frank Mueller¹, Xiaohui Cui², Thomas E. Potok²•Institutions (2)

North Carolina State University¹, Oak Ridge National Laboratory²

01 Feb 2011-Journal of Parallel and Distributed Computing

TL;DR: The benefits of exploiting the computational power of graphics processing units to study two fundamental problems in document mining, namely to calculate the term frequency-inverse document frequency (TF-IDF) and cluster a large set of documents are assessed.

...read moreread less

Book Chapter•DOI•

Machine Learning for Document Structure Recognition

[...]

Gerhard Paaß¹, Iuliu Konya¹•Institutions (1)

Fraunhofer Society¹

01 Jan 2011

TL;DR: This chapter describes approaches for document structure recognition detecting the hierarchy of physical components in images of documents and transforms this into a hierarchy of logical components, such as titles, authors, and sections, which improves readability and is useful for indexing and retrieving information contained in documents.

...read moreread less

Abstract: The backbone of the information age is digital information which may be searched, accessed, and transferred instantaneously. Therefore the digitization of paper documents is extremely interesting. This chapter describes approaches for document structure recognition detecting the hierarchy of physical components in images of documents, such as pages, paragraphs, and figures, and transforms this into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving information contained in documents. First we present a rule-based system segmenting the document image and estimating the logical role of these zones. It is extensively used for processing newspaper collections showing world-class performance. In the second part we introduce several machine learning approaches exploring large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but show a better performance than simpler alternatives and might be used in future.

...read moreread less

Handwritten Text Recognition for Historical Documents

[...]

Verónica Romero¹, Nicolás Serrano¹, Alejandro Héctor Toselli¹, Joan Andreu Sánchez¹, Enrique Vidal¹ - Show less +1 more•Institutions (1)

Polytechnic University of Valencia¹

01 Sep 2011

TL;DR: The state-of-the-art Handwritten Text Recognition techniques are applied for the automatic transcription of historical documents waiting to be transcribed into a textual electronic format (such as ASCII or PDF).

...read moreread less

Abstract: The amount of digitized legacy documents has been rising dramatically over the last years due mainly to the increasing number of on-line digital libraries publishing this kind of documents. The vast majority of them remain waiting to be transcribed into a textual electronic format (such as ASCII or PDF) that would provide historians and other researchers new ways of indexing, consulting and querying them. In this work, the state-of-the-art Handwritten Text Recognition techniques are applied for the automatic transcription of these historical documents. We report results for several ancient documents.

...read moreread less

Journal Article•DOI•

Document image analysis: issues, comparison of methods and remaining problems

[...]

Tanzila Saba¹, Ghazali Sulong¹, Amjad Rehman¹•Institutions (1)

Universiti Teknologi Malaysia¹

01 Feb 2011-Artificial Intelligence Review

TL;DR: The state of art of document image analysis is surveyed, recent trends are analyzed and challenges for future research in this field are identified.

...read moreread less

Abstract: Image analysis is an interesting research area with a large variety of challenging applications. Researchers have worked from decades on this topic, as witnessed by the scientific literature. However, document image analysis is the special case in image analysis as their spatial properties are different from natural images. Therefore, the main focus of this paper is to describe image denoising issues in general and document image issues in particular. Since the field of document processing is relatively new, it is also dynamic, so current methods have room for improvement and innovations are still being made. Several algorithms proposed in the literature are described. Critical discussions are reported about the current status of the field and open problems are highlighted. It is also demonstrated that, there are rarely definitive techniques for all cases of a certain problem. We surveyed the state of art, analyzed recent trends and tried to identify challenges for future research in this field.

...read moreread less

Proceedings Article•DOI•

A Semantic-based Document Processing Framework: A Security Perspective

[...]

Flora Amato¹, Valentina Casola¹, Nicola Mazzocca¹, Sara Romano¹•Institutions (1)

University of Naples Federico II¹

30 Jun 2011

TL;DR: This paper proposed a general framework for data transformation and implemented such model trough an architecture based on semantic analysis and presented a case study for the formalization and protection of e-health medical records.

...read moreread less

Abstract: The coexistence of different formats and physical supports to store data is one of the main open issues in document management systems, in particular, the presence of unstructured data represents a huge limitation for the elaboration and analysis of many documents and processes. At this aim we are exploiting the adoption of different techniques to analyze texts and automatically extract relevant information, concepts or complex relations, in this paper we proposed a general framework for data transformation and implemented such model trough an architecture based on semantic analysis. The analysis that can be performed on data has many different applications, in this paper we illustrate an interesting perspective related on how to enforce a fine grained access control on sensitive data that are in capsulated in unstructured, monolithic files. We also presented a case study for the formalization and protection of e-health medical records.

...read moreread less

Proceedings Article•DOI•

Scientific challenges underlying production document processing

[...]

Eric Saund

24 Jan 2011

TL;DR: The challenge therefore extends beyond the science behind document image recognition and into user interface and user experience design.

...read moreread less

Abstract: The Field of Document Recognition is bipolar. On one end lies the excellent work of academic institutions engaging in original research on scientifically interesting topics. On the other end lies the document recognition industry which services needs for high-volume data capture for transaction and back-office applications. These realms seldom meet, yet the need is great to address technical hurdles for practical problems using modern approaches from the Document Recognition, Computer Vision, and Machine Learning disciplines. We reflect on three categories of problems we have encountered which are both scientifically challenging and of high practical value. These are Doctype Classification, Functional Role Labeling, and Document Sets. Doctype Classification asks, "What is the type of page I am looking at?" Functional Role Labeling asks, "What is the status of text and graphical elements in a model of document structure?" Document Sets asks, "How are pages and their contents related to one another?" Each of these has ad hoc engineering approaches that provide 40-80% solutions, and each of them begs for a deeply grounded formulation both to provide understanding and to attain the remaining 20-60% of practical value. The practical need is not purely technical but also depends on the user experience in application setup and configuration, and in collection and groundtruthing of sample documents. The challenge therefore extends beyond the science behind document image recognition and into user interface and user experience design.

...read moreread less

Patent•

Print document processing system, cache apparatus, data processing apparatus, non-transitory computer readable medium storing program, and print document processing method

[...]

Michio Hayakawa¹•Institutions (1)

Fuji Xerox¹

03 May 2011

TL;DR: In this article, the first response providing unit provides a CREATING response if a state stored in the state memory in association with the image data indicates that image data is not in the cache memory and is currently being created by any other data processing apparatus.

...read moreread less

Abstract: A print document processing system includes a cache apparatus and plural data processing apparatuses. The cache apparatus includes a cache memory, a state memory, and a first response providing unit. Each data processing apparatus includes an image data creating unit, a query unit, and a controller. The cache memory stores image data created by each data processing apparatus. Upon receipt of a query issued by the query unit on image data of a document element, the first response providing unit provides a CREATING response if a state stored in the state memory in association with the image data indicates that the image data is not in the cache memory and is currently being created by any other data processing apparatus. Upon receipt of the CREATING response, the controller performs control to use the image data after created or cause the image data creating unit to create the image data.

...read moreread less

Optical Character Recognition using MATLAB

[...]

Krishna K Kumari, Preeti Mohanty, Gurudas C Nayak

01 Jan 2011

TL;DR: This paper attempts to develop an intelligent OCR system to store the documents in electronic form using Matlab.

...read moreread less

Abstract: Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. OCR systems require calibration to read a specific font; early versions needed to be programmed with images of each character, and worked on one font at a time. "Intelligent “systems with a high degree of recognition accuracy for most fonts are now needed. Hence this paper attempts to develop an intelligent OCR system to store the documents in electronic form using Matlab

...read moreread less

Patent•

Distributed document processing

[...]

Avikam Baltsan, Ori Sarid, Aryeh Elimelech, Aharon Boker, Zvi Segal, Gideon Miller - Show less +2 more

10 Feb 2011

TL;DR: In this paper, a system for document processing including decomposing an image of a document into at least one data entry region sub-image, providing the data entry regions subimage to a data entry clerk available for processing the data-entry region subimage, receiving from the dataentry clerk a data value associated with the subimage and validating the value.

...read moreread less

Abstract: A system for document processing including decomposing an image of a document into at least one data entry region sub-image, providing the data entry region sub-image to a data entry clerk available for processing the data entry region sub-image, receiving from the data entry clerk a data entry value associated with the data entry region sub-image, and validating the data entry value.

...read moreread less

Book Chapter•DOI•

An efficient feature extraction method for handwritten character recognition

[...]

Manju Rani¹, Yogesh Kumar Meena¹•Institutions (1)

Malaviya National Institute of Technology, Jaipur¹

19 Dec 2011

TL;DR: A new feature extraction method for handwritten characters named Cross-corner is introduced, which uses the results of some promising feature extraction methods to find the best method for this application.

...read moreread less

Abstract: Handwritten character recognition in a particular language is one of the favourite topics for research from two last decades. Image processing and pattern recognition plays a lead role in handwritten character recognition. It is not a easy task to build a program to achieve hundred percent accuracy for handwritten characters because even humans too make mistakes to recognize characters. There are three main steps of handwritten character recognition- Data collection and preprocessing, feature extraction and classification. Data collection includes creating a raw file of handwritten character images. Preprocessing steps are applied to find a normalized binary image of handwritten character which is easy to process in next step. Feature extraction is the process of gathering data of different samples so that on the basis of this data we can classify samples with different features. Feature extraction from preprocessed handwritten character plays the most important role in character recognition. Thus feature extraction stage in handwritten character recognition system has a large scope for researchers. In this paper, we also introduce a new feature extraction method for handwritten characters named Cross-corner. We use the results of some promising feature extraction methods to find the best method for this application.

...read moreread less

Proceedings Article•DOI•

Suppression of non-text components in handwritten document images

[...]

Ram Sarkar¹, Sanjay Moulik¹, Nibaran Das¹, Subhadip Basu¹, Mita Nasipuri¹, Mahantapas Kundu¹ - Show less +2 more•Institutions (1)

Jadavpur University¹

22 Dec 2011

TL;DR: A modified RLSA, called Spiral Run Length Smearing Algorithm (SRLSA), is applied to suppress the non-text components from text ones in handwritten document images using a Support Vector Machine (SVM) classifier.

...read moreread less

Abstract: Document layout analysis is a pre-processing step to convert handwritten/printed documents into electronic form through Optical Character Recognition (OCR) system. Handwritten documents are usually unstructured i.e. they do not have a specific layout and most documents may contain some non-text regions e.g. graphs, tables, diagrams etc. Therefore, such documents cannot be directly given as input to the OCR system without suppressing the non-text regions in the documents. The traditional Run Length Smoothing Algorithm (RLSA) does not produce good results for handwritten document pages, since the text components in it have lesser pixel density than those in printed text. In present work, a modified RLSA, called Spiral Run Length Smearing Algorithm (SRLSA), is applied to suppress the non-text components from text ones in handwritten document images. The components in the document pages are then classified into text/non-text groups using a Support Vector Machine (SVM) classifier. The method shows a success rate of 83.3% on a dataset of 3000 components.

...read moreread less

Patent•

Print document processing system, cache apparatus, computer readable medium storing program, and print document processing method

[...]

Michio Hayakawa¹•Institutions (1)

Fuji Xerox¹

18 May 2011

TL;DR: In this article, an area reservation unit reserves the storage area by selecting a second document element, which has a lower cache priority than the first document element and which is currently being used by any of the data processing apparatuses, as a document element to be evicted from the cache memory.

...read moreread less

Abstract: A print document processing system includes a cache apparatus and plural data processing apparatuses. Each data processing apparatus includes an image data creating unit and a reservation request transmission unit. The cache apparatus includes a cache memory, a memory, and an area reservation unit. The image data creating unit creates image data for each document element of print document data. The memory stores use condition information and cache priority of the document element for each document element. Upon receipt of a reservation request for reserving a storage area for image data of a first document element from a data processing apparatus, the area reservation unit reserves the storage area by selecting a second document element, which has a lower cache priority than the first document element and which is currently being used by any of the data processing apparatuses, as a document element to be evicted from the cache memory.

...read moreread less

Patent•

Cloud computing system, document processing method, and storage medium

[...]

Yoshinobu Hamada¹•Institutions (1)

Canon Inc.¹

21 Apr 2011

TL;DR: In this article, a reception processing unit provided in a document processing system receives a document data processing request from a user device, and a division processing unit divides document data corresponding to the processing request and generates divided document data.

...read moreread less

Abstract: A reception processing unit provided in a document processing system receives a document data processing request from a user device. A division processing unit divides document data corresponding to the processing request and generates divided document data. A document processing unit performs document processing for the divided document data, and a coupling processing unit combines the document-processed divided document data. A resource management unit increases or decreases the number of the division processing units, the document processing units, and the coupling processing units in response to the processing status of each thereof.

...read moreread less

Patent•

Cloud computing system, document processing method, and storage medium

[...]

Yuki Ito¹•Institutions (1)

Canon Inc.¹

17 May 2011

TL;DR: In this paper, a scan system includes a request reception unit that stores a message corresponding to a document processing job in a queue in response to the receipt of an execution request for the job from an image forming apparatus.

...read moreread less

Abstract: A scan system includes a request reception unit that stores a message corresponding to a document processing job in a queue in response to the receipt of an execution request for the document processing job from an image forming apparatus; and a back-end processing unit that makes an acquisition request for the message with respect to the queue at a regular interval, acquires the message, executes a document processing job corresponding to the acquired message, and stores the execution result of the job. Before the image forming apparatus transmits an execution request for a document processing job to the request reception unit, the back-end processing unit acquires information indicating a location at which the execution result of the acquired job is stored, and the request reception unit causes the image forming apparatus to display information indicating a location at which the execution result of the acquired job is stored.

...read moreread less

Proceedings Article•DOI•

Exploiting Collection Level for Improving Assisted Handwritten Word Transcription of Historical Documents

[...]

Laurent Guichard¹, Joseph Chazalon¹, Bertrand Coüasnon¹•Institutions (1)

Intelligence and National Security Alliance¹

18 Sep 2011

TL;DR: A new architecture is proposed that allows the exploitation of handwritten word redundancies over pages by considering documents from a higher point of view, namely the collection level.

...read moreread less

Abstract: Transcription of handwritten words in historical documents is still a difficult task. When processing huge amount of pages, document-centered approaches are limited by the trade-off between automatic recognition errors and the tedious aspect of human user annotation work. In this article, we investigate the use of inter page dependencies to overcome those limitations. For this, we propose a new architecture that allows the exploitation of handwritten word redundancies over pages by considering documents from a higher point of view, namely the collection level. The experiments we conducted on handwritten word transcription show promising results in terms of recognition error and human user work reductions.

...read moreread less

Patent•

Efficient document processing system and method

[...]

Diane Larlus¹, Florent Perronnin¹•Institutions (1)

Xerox¹

20 Dec 2011

TL;DR: In this paper, a document processing system and method are disclosed, where local scores are incrementally computed for document samples, based on local features extracted from the respective sample, i.e., on fewer than all document samples.

...read moreread less

Abstract: A document processing system and method are disclosed. In the method local scores are incrementally computed for document samples, based on local features extracted from the respective sample. A global score is estimated for the document based on the local scores currently computed, i.e., on fewer than all document samples. A confidence in a decision for the estimated global score is computed. The computed confidence is based on the local scores currently computed and, optionally, the number of samples used in computing the estimated global score. A classification decision, such as a categorization or retrieval decision for the document is output, based on the estimated score when the computed confidence in the decision reaches a threshold value.

...read moreread less

Patent•

Systems and methods for automatically extracting data from eletronic documents using multiple character recognition engines

[...]

Girish Welling, Vartika Singh, Gopal Krishna, Tushar Mahata, Nirupam Sarkar, Depankar Neogi, Steven K. Ladd - Show less +3 more

14 Jan 2011

TL;DR: In this paper, a document analysis system that receives and processes jobs from a plurality of users to extract data from the electronic documents, is provided, in which each job may contain multiple electronic documents.

...read moreread less

Abstract: In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents, a method of automatically extracting data from each received electronic document using a plurality of character recognition engines is provided. The method includes: automatically processing each received electronic document page using each of a plurality of recognition engines to extract data; comparing quality of data extracted from each of the recognition engines to assign a confidence score to the extracted data; and selecting extracted data having highest confidence score as the correct extracted data.

...read moreread less

Patent•

System and Method to Provide Collaborative Document Processing Services Via Interframe Communication

[...]

Micah Lemonik, John Day-Richter

02 Sep 2011

TL;DR: In this article, a document is displayed within a first frame embedded within a second frame on a first device, wherein the second frame is in communication with a server by interframe communication.

...read moreread less

Abstract: A document is displayed within a first frame embedded within a second frame on a first device, wherein the second frame is in communication with a server. A first change to the document is received from a user of the first device, and the first change is transmitted to the server by interframe communication. A plurality of transformed changes to the document, including a transformed version of the first change and a version of a second change made by a user of a second device are received by interframe communication. The first frame may be an IFrame, for example.

...read moreread less