TL;DR: A new method for image segmentation and layout analysis that takes full advantage of color information is proposed, implemented in the DIA system WISDOM++ and tested on a corpus of multi-format documents concerning historic film censorships.
Abstract: Processing censorship cards of the 20th century in order to support annotation and retrieval poses a number of challenges for many DIA systems. Problems due to the low layout quality and standard of such material can be reduced by exploiting information conveyed by color. In this paper, taking into account lessons learned in the context of the IST project Collate, we propose a new method for image segmentation and layout analysis that takes full advantage of color information. The method has been implemented in the DIA system WISDOM++ and tested on a corpus of multi-format documents concerning historic film censorships.
Many institutions that collect and preserve cultural heritage, such as historical documents, have shown great interest in digitizing their resources and in mechanisms for providing online access to the digitized material.
This paper presents layout analysis issues and problems addressed in the EU funded project COLLATE, whose main goal is to provide film archivists adequate access to historic film-related documents and their associated metadata [5].
Finally, conclusions are drawn in Section 4.
2. The approach
A naïve approach to color document image processing would be to separate different colors and to process [...]
Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition (ICDAR'05).
Images are segmented again and the spatial merging is applied on intersecting blocks.
At each step, the dissimilarity between two clusters of colors (inter-cluster dissimilarity) is evaluated on the basis of two measures: a) the Euclidean distance between two colors taken from distinct clusters (nearest neighbor based dissimilarity); b) the Euclidean distance between the centroids of the two clusters (centroid-based dissimilarity).
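The two inter-cluster measures described above can be sketched as follows. This is a minimal illustration, not the WISDOM++ implementation: the function names are hypothetical, colors are assumed to be 3-component tuples, and the nearest-neighbor measure is assumed to be the minimum distance over all cross-cluster color pairs.

```python
import math
from itertools import product


def nearest_neighbor_dissimilarity(cluster_a, cluster_b):
    """Smallest Euclidean distance between a color of cluster_a and a color
    of cluster_b (assumed reading of the nearest-neighbor measure)."""
    return min(math.dist(a, b) for a, b in product(cluster_a, cluster_b))


def centroid(cluster):
    """Component-wise mean of the colors in a cluster."""
    n = len(cluster)
    return tuple(sum(color[i] for color in cluster) / n for i in range(3))


def centroid_dissimilarity(cluster_a, cluster_b):
    """Euclidean distance between the centroids of the two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


# Toy clusters that lie on a single color axis, so the distances are easy to check.
dark = [(0, 0, 0), (2, 0, 0)]
light = [(5, 0, 0), (9, 0, 0)]
nn = nearest_neighbor_dissimilarity(dark, light)
cd = centroid_dissimilarity(dark, light)
```

The nearest-neighbor measure reacts to the closest pair of colors, while the centroid measure summarizes each cluster as a whole, so the two can disagree for elongated clusters.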
A first step towards the reconstruction of layout structure consists of classifying the blocks according to their content type: text, horizontal line, vertical line, picture (i.e. halftone images) and graphics (e.g. line drawings).
3. Application
In this section the authors empirically evaluate the proposed approach in terms of the capability to isolate interesting blocks of different color for subsequent logical labeling.
Fig. 4 shows a document image of the NFA class, which is the most complex class to analyze because of its overall low quality.
The document contains manual annotations (no_prec_doc, top right-hand corner), blue stamps (register_office and dispatch_officer, bottom page), red stamps (rubber_stamp, top left-hand corner) and revenue stamps (stamp, in the middle of the page).
The color-based layout analysis is able to isolate them, while the b/w layout analysis returns a single layout block for the whole central part of the document image and two spurious blocks extracted from the bottom of the image.
Indeed, for the FAA class, 205 components have been labeled in the color setting against 140 in the b/w setting; for the NFA class, the figures are 64 against 12.
4. Conclusions
A new color-based layout analysis method has been proposed in order to meet the challenges of processing censorship cards of European film archives from the 1920s and 1930s.
A comparison of the method with the original b/w version has been provided.
Results show that the color-based approach isolates interesting blocks better than the previous version and provides a more accurate basis for document understanding.
TL;DR: A novel framework for segmenting documents with complex layouts is proposed that combines clustering with conditional random field (CRF) based modeling; it has been extensively tested on multi-colored document images with text overlapping graphics/image.
Abstract: In this paper, we propose a novel framework for segmentation of documents with complex layouts. The document segmentation is performed by combination of clustering and conditional random fields (CRF) based modeling. The bottom-up approach for segmentation assigns each pixel to a cluster plane based on color intensity. A CRF based discriminative model is learned to extract the local neighborhood information in different cluster/color planes. The final category assignment is done by a top-level CRF based on the semantic correlation learned across clusters. The proposed framework has been extensively tested on multi-colored document images with text overlapping graphics/image.
12 citations
Cites methods from "A color-based layout analysis to pr..."
...Layout analysis using color information have been proposed in [9]–[11] to handle color document images with complex layouts such as forms, text overlaid on image, posters etc....
TL;DR: This survey summarizes the color image segmentation techniques currently available, which are based on monochrome segmentation approaches operating in different color spaces; some novel approaches, such as fuzzy and physics-based methods, are also investigated.
Abstract: Image segmentation is very essential and critical to image processing and pattern recognition. This survey provides a summary of color image segmentation techniques available now. Basically, color segmentation approaches are based on monochrome segmentation approaches operating in different color spaces. Therefore, we first discuss the major segmentation approaches for segmenting monochrome images: histogram thresholding, characteristic feature clustering, edge detection, region-based methods, fuzzy techniques, neural networks, etc.; then review some major color representation methods and their advantages/disadvantages; finally summarize the color image segmentation techniques using different color representations. The usage of color models for image segmentation is also discussed. Some novel approaches such as fuzzy method and physics-based method are investigated as well.
1,682 citations
"A color-based layout analysis to pr..." refers methods in this paper
...We used CIELab since it is considered "visually uniform" because adjacent color samples represent equal intervals of visual perception [4]....
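To make the "visually uniform" property concrete, the sketch below converts colors to CIELab and measures their Euclidean distance there. It is an illustration under stated assumptions: the input is taken to be 8-bit sRGB and the reference white is D65, neither of which is stated in the excerpt, and the function names are hypothetical.

```python
import math

# D65 reference white (assumption: the excerpt does not name the illuminant).
XN, YN, ZN = 0.95047, 1.0, 1.08883


def _srgb_to_linear(channel_8bit):
    """Undo the sRGB transfer curve for one 8-bit channel."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """Piecewise cube-root function used in the CIELab definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(rgb):
    """Convert an (R, G, B) triple in 0..255 (assumed sRGB) to (L, a, b)."""
    r, g, b_lin = (_srgb_to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b_lin
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b_lin
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b_lin
    fx, fy, fz = _f(x / XN), _f(y / YN), _f(z / ZN)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e(rgb1, rgb2):
    """Euclidean distance between two colors in CIELab space."""
    return math.dist(rgb_to_lab(rgb1), rgb_to_lab(rgb2))


white_lab = rgb_to_lab((255, 255, 255))
black_lab = rgb_to_lab((0, 0, 0))
```

Because distances in this space roughly track perceived color difference, a single Euclidean threshold behaves more consistently across the gamut than it would in raw RGB.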
TL;DR: The requirements and components of a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined; several critical functions have been investigated and their technical approaches are discussed.
Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.
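The run-length smoothing idea mentioned in the abstract can be sketched as below. This is a simplified illustration, not the paper's code: images are assumed to be lists of rows with 1 for black and 0 for white, white runs touching the image border are left untouched (an assumption about boundary handling), and the horizontal and vertical results are combined with a logical AND. All names and thresholds are hypothetical.

```python
def smooth_run(sequence, threshold):
    """Turn white runs (0s) of length <= threshold into black (1s) when the
    run is bounded by black pixels on both sides."""
    out = list(sequence)
    n = len(sequence)
    i = 0
    while i < n:
        if sequence[i] == 0:
            j = i
            while j < n and sequence[j] == 0:
                j += 1
            # Fill only interior runs that are short enough.
            if i > 0 and j < n and (j - i) <= threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smooth rows and columns independently, then AND the two results."""
    horizontal = [smooth_run(row, h_threshold) for row in image]
    smoothed_columns = [smooth_run(list(col), v_threshold) for col in zip(*image)]
    vertical = [list(row) for row in zip(*smoothed_columns)]
    return [[h & v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


row_result = smooth_run([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], 2)
```

In `row_result` the short gap between the first two black pixels is closed, while the four-pixel gap and the trailing white pixel are kept, which is how nearby characters fuse into text blocks while wider gutters survive.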
TL;DR: A new method for filling a color table is presented that produces pictures of similar quality as existing methods, but requires less memory and execution time.
Abstract: A new method for filling a color table is presented that produces pictures of similar quality as existing methods, but requires less memory and execution time. All colors of an image are inserted in an octree, and this octree is reduced from the leaves to the root in such a way that every pixel has a well defined maximum error. The algorithm is described in PASCAL notation.
347 citations
"A color-based layout analysis to pr..." refers methods in this paper
...The quantization process follows the method proposed by [7] whose basic idea is to build a octree containing a maximum of K different leaves (a leaf corresponds to a color)....
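A simplified sketch of octree quantization with at most K leaves is given below. It departs from the cited method in one labeled way: the full tree is built first and then reduced, whereas the original reduces while inserting. The class and function names are hypothetical, and colors are assumed to be 8-bit RGB triples.

```python
class OctreeNode:
    def __init__(self):
        self.children = [None] * 8
        self.pixel_count = 0
        self.rgb_sum = [0, 0, 0]

    def is_leaf(self):
        return all(child is None for child in self.children)


def _child_index(color, level):
    """Pick one of 8 children from the bit of R, G and B at this level."""
    shift = 7 - level
    r, g, b = color
    return (((r >> shift) & 1) << 2) | (((g >> shift) & 1) << 1) | ((b >> shift) & 1)


def insert(root, color, max_depth=8):
    """Descend to depth max_depth and accumulate the color in that leaf."""
    node = root
    for level in range(max_depth):
        idx = _child_index(color, level)
        if node.children[idx] is None:
            node.children[idx] = OctreeNode()
        node = node.children[idx]
    node.pixel_count += 1
    for i in range(3):
        node.rgb_sum[i] += color[i]


def _walk(node, depth=0):
    yield node, depth
    for child in node.children:
        if child is not None:
            yield from _walk(child, depth + 1)


def leaves(root):
    return [n for n, _ in _walk(root) if n.is_leaf() and n.pixel_count > 0]


def reduce_once(root):
    """Fold the children of the deepest internal node into that node."""
    internal = [(depth, n) for n, depth in _walk(root) if not n.is_leaf()]
    _, node = max(internal, key=lambda item: item[0])
    for child in node.children:
        if child is not None:
            node.pixel_count += child.pixel_count
            for i in range(3):
                node.rgb_sum[i] += child.rgb_sum[i]
    node.children = [None] * 8


def quantize(pixels, k):
    """Return a palette of at most k colors (each leaf is one color)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    root = OctreeNode()
    for pixel in pixels:
        insert(root, pixel)
    while len(leaves(root)) > k:
        reduce_once(root)
    return [tuple(s // n.pixel_count for s in n.rgb_sum) for n in leaves(root)]


pixels = [(255, 0, 0)] * 3 + [(250, 0, 0)] + [(0, 0, 255)] * 2
palette = quantize(pixels, 2)
```

With K = 2 the two nearby reds collapse into one averaged leaf while blue keeps its own, since reduction always starts at the deepest level, where colors differ only in their low-order bits.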
TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.
Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
129 citations
"A color-based layout analysis to pr..." refers methods in this paper
...WISDOM++ was originally developed to fully support the transformation of multi-page printed documents into XML format....
[...]
...We applied WISDOM++ with both the layout analysis methods to 108 document images in all belonging to 3 distinct classes, one for each archive (see Table 1)....
[...]
...WISDOM++ decomposes the document page in a hybrid way, since it combines the image segmentation and a bottom-up layout analysis method to assemble basic blocks into larger frames....
[...]
...WISDOM++, originally developed to process black-and-white (binary) images, has been extended to take full advantage of color information in image segmentation and layout analysis steps....
[...]
...In this framework, we applied the DIA system WISDOM++ [1] to digitized documents available in three national film archives, namely Deutsches Filminstitut, Filmarchiv Austria and Národní Filmový Archiv (Czech Republic)....
Q1. What are the contributions in "A color-based layout analysis to process censorship cards of film archives" ?
In this paper, taking into account lessons learned in the context of the IST project Collate, the authors propose a new method for image segmentation and layout analysis that takes full advantage of color information.
Q2. What have the authors stated for future works in "A color-based layout analysis to process censorship cards of film archives" ?
For future work, the authors plan to evaluate the proposed approach in automatic/manual labeling.