
Showing papers on "Document layout analysis published in 2015"


Journal ArticleDOI
TL;DR: This paper focuses on the fuzzy logic extraction approach to text summarization and the semantic approach to text summarization using Latent Semantic Analysis.
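As a rough illustration of the LSA component of such a summarizer (a minimal sketch, not the paper's pipeline; it assumes scikit-learn and sentences given as plain strings), sentences can be projected onto a few latent topics and ranked by their loadings:

```python
# Minimal LSA-style extractive ranking sketch (illustrative, not the paper's method).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_summarize(sentences, n_topics=2, n_keep=2):
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(sentences)                 # sentence-term matrix
    svd = TruncatedSVD(n_components=max(1, min(n_topics, X.shape[0] - 1)))
    loadings = svd.fit_transform(X)                    # sentence weights on latent topics
    scores = np.linalg.norm(loadings, axis=1)          # overall topical salience
    keep = np.argsort(scores)[::-1][:n_keep]
    return [sentences[i] for i in sorted(keep)]        # keep original sentence order
```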

79 citations


Journal ArticleDOI
TL;DR: Experiments on the ICDAR 2013 dataset show that the results obtained are very encouraging and prove the effectiveness and superiority of the proposed method.
Abstract: Table detection is a challenging problem and plays an important role in document layout analysis. In this paper, we propose an effective method to identify the table region from document images. First, the regions of interest (ROIs) are recognized as the table candidates. In each ROI, we locate text components and extract text blocks. After that, we check all text blocks to determine if they are arranged horizontally or vertically and compare the height of each text block with the average height. If the text blocks satisfy a series of rules, the ROI is regarded as a table. Experiments on the ICDAR 2013 dataset show that the results obtained are very encouraging. This proves the effectiveness and superiority of our proposed method.
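The kind of rule check described above for a candidate ROI could look roughly like the sketch below; the bounding-box fields and thresholds are hypothetical, not the paper's actual rules:

```python
# Illustrative table-candidate check over text blocks in one ROI (hypothetical rules).
def looks_like_table(text_blocks, align_tol=5, height_ratio=1.5):
    """text_blocks: list of dicts with 'x', 'y', 'w', 'h' bounding-box fields."""
    if len(text_blocks) < 4:
        return False
    avg_h = sum(b["h"] for b in text_blocks) / len(text_blocks)
    # Reject ROIs with blocks far taller than the average (headings, figures, ...).
    if any(b["h"] > height_ratio * avg_h for b in text_blocks):
        return False
    # Quantize coordinates so blocks sharing a row/column collapse to the same key.
    rows = {round(b["y"] / align_tol) for b in text_blocks}
    cols = {round(b["x"] / align_tol) for b in text_blocks}
    # A table repeats both horizontal and vertical alignments across its cells.
    return len(rows) < len(text_blocks) and len(cols) < len(text_blocks)
```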

55 citations


Proceedings ArticleDOI
23 Aug 2015
TL;DR: This paper presents a learning-based approach for text and non-text separation in document images by extracting a powerful set of features based on size, shape, stroke width and position of each connected component.
Abstract: Document image segmentation is crucial to OCR and other digitization processes. In this paper, we present a learning-based approach for text and non-text separation in document images. The training features are extracted at the level of connected components, a mid-level between the slow, noise-sensitive pixel level and the segmentation-dependent zone level. Given all types, shapes and sizes of connected components, we extract a powerful set of features based on the size, shape, stroke width and position of each connected component. AdaBoost with decision trees is used for labeling connected components. Finally, the classification of connected components into text and non-text is corrected based on classification probabilities and size as well as stroke-width analysis of the nearest neighbors of a connected component. The performance of our approach has been evaluated on two standard datasets: UW-III and the ICDAR 2009 competition for document layout analysis. Our results demonstrate that the proposed approach achieves competitive performance for segmenting text and non-text in document images of variable content and degradation.
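The labeling stage maps naturally onto a boosted ensemble of shallow decision trees; a minimal scikit-learn sketch follows (feature extraction is assumed to happen elsewhere, and the hyperparameters are illustrative):

```python
# Sketch of the AdaBoost-with-decision-trees labeling stage (illustrative settings).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_text_nontext_classifier(features, labels):
    """features: (n_components, n_features) array; labels: 1 = text, 0 = non-text."""
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=3),  # shallow trees as weak learners
        n_estimators=200,                               # 'estimator' is the scikit-learn >= 1.2 name
    )
    clf.fit(features, labels)
    return clf
```

The classifier's predict_proba output is then the kind of probability that the neighbour-based correction step mentioned in the abstract would consume.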

41 citations


Journal ArticleDOI
TL;DR: The proposed method addresses the difficulties of a locally used language in the process of document clustering through document representation based on tagging, and improves clustering results by using knowledge technology (ontology).
Abstract: Text documents are very significant in contemporary organizations; moreover, their constant accumulation enlarges the scope of document storage. Standard text mining and information retrieval techniques for text documents usually rely on word matching. An alternative way of information retrieval is clustering. In this paper we suggest complementing the traditional clustering method with document representation based on tagging, and improving clustering results by using knowledge technology (ontology). The proposed method addresses the difficulties of a locally used language in the process of document clustering.

29 citations


Proceedings ArticleDOI
19 Mar 2015
TL;DR: The objective of this paper is the recognition of text from images, for better understanding by the reader, using a particular sequence of different processing modules.
Abstract: Text recognition in images is a research area which attempts to develop a computer system with the ability to automatically read text from images. These days there is a huge demand for storing the information available in paper documents on a computer storage disk and then reusing this information later through a searching process. One simple way to store information from these paper documents in a computer system is to first scan the documents and then store them as images. But to reuse this information it is very difficult to read the individual contents and to search the contents of these documents line-by-line and word-by-word. The challenges involved are the font characteristics of the characters in paper documents and the quality of the images. Due to these challenges, the computer is unable to recognize the characters while reading them. Thus there is a need for character recognition mechanisms to perform Document Image Analysis (DIA), which transforms documents in paper format into electronic format. In this paper we discuss a method for text recognition from images. The objective of this paper is the recognition of text from images, for better understanding by the reader, using a particular sequence of different processing modules.

27 citations


Proceedings ArticleDOI
08 Feb 2015
TL;DR: A new dataset and a ground-truthing methodology for layout analysis of historical documents with complex layouts are proposed and developed, targeting the simplicity and efficiency of the layout ground-truthing process on historical document images.
Abstract: In this paper, we propose a new dataset and a ground-truthing methodology for layout analysis of historical documents with complex layouts. The dataset is based on a generic model for ground-truth presentation of the complex layout structure of historical documents. For the purpose of extracting uniformly the document contents, our model defines five types of regions of interest: page, text block, text line, decoration, and comment. Unconstrained polygons are used to outline the regions. A performance metric is proposed in order to evaluate various page segmentation methods based on this model. We have analysed four state-of-the-art ground-truthing tools: TRUVIZ, GEDI, WebGT, and Aletheia. From this analysis, we conceptualized and developed Divadia, a new tool that overcomes some of the drawbacks of these tools, targeting the simplicity and the efficiency of the layout ground truthing process on historical document images. With Divadia, we have created a new public dataset. This dataset contains 120 pages from three historical document image collections of different styles and is made freely available to the scientific community for historical document layout analysis research.

24 citations


Journal ArticleDOI
TL;DR: An efficient method for the classification of text and non-text components in document images using a combination of whitespace analysis with multi-layer homogeneous regions, which is called a recursive filter.
Abstract: The separation of text and non-text elements plays an important role in document layout analysis. A number of approaches have been proposed, but the quality of the separation result is still limited due to the complexity of the document layout. In this paper, we present an efficient method for the classification of text and non-text components in document images. It is a combination of whitespace analysis with multi-layer homogeneous regions, which is called a recursive filter. Firstly, the input binary document is analyzed by connected component analysis and whitespace extraction. Secondly, a heuristic filter is applied to identify non-text components. After that, using a statistical method, we implement the recursive filter on multi-layer homogeneous regions to identify all text and non-text elements of the binary image. Finally, all regions are reshaped and denoised to obtain the text document and the non-text document. Experimental results on the ICDAR2009 page segmentation competition dataset and other datasets prove the effectiveness and superiority of the proposed method.
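A much simplified version of such a heuristic first-pass filter on connected components might look like this OpenCV-based sketch; the thresholds are illustrative, not the paper's values:

```python
# Illustrative heuristic split of connected components into text / non-text masks.
import cv2
import numpy as np

def heuristic_nontext_filter(binary_img, max_rel_h=0.1, min_density=0.08):
    """binary_img: uint8 image with foreground = 255. Returns (text_mask, nontext_mask)."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary_img, connectivity=8)
    page_h = binary_img.shape[0]
    text_mask = np.zeros_like(binary_img)
    nontext_mask = np.zeros_like(binary_img)
    for i in range(1, n):                         # label 0 is the background
        x, y, w, h, area = stats[i]
        density = area / float(w * h)
        # Very tall or very sparse components are unlikely to be characters.
        if h > max_rel_h * page_h or density < min_density:
            nontext_mask[labels == i] = 255
        else:
            text_mask[labels == i] = 255
    return text_mask, nontext_mask
```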

22 citations


Proceedings ArticleDOI
18 May 2015
TL;DR: Different techniques for recognizing types of partly very similar identity documents using state-of-the-art visual recognition approaches including feature representations based on recent achievements with convolutional neural networks are developed and evaluated.
Abstract: In this paper, we tackle the task of recognizing types of partly very similar identity documents using state-of-the-art visual recognition approaches. Given a scanned document, the goal is to identify the country of issue, the type of document, and its version. Whereas recognizing the individual parts of a document with known standardized layout can be done reliably, identifying the type of a document and therefore also its layout is a challenging problem due to the large variety of documents. In our paper, we develop and evaluate different techniques for this application including feature representations based on recent achievements with convolutional neural networks. On a dataset with 74 different classes and using only one training image per class, our best approach achieves a mean class-wise accuracy of 97.7%.
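One way to realize this kind of one-training-image-per-class setup is to pair a pretrained CNN feature extractor with nearest-neighbour matching; the sketch below assumes PyTorch/torchvision and a generic ImageNet backbone, not the paper's exact network or classifier:

```python
# Generic sketch: CNN descriptors + one reference embedding per document class.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()                 # keep the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    f = backbone(x).squeeze(0)
    return f / f.norm()                           # cosine-normalised descriptor

def classify(query_path, reference_embeddings):
    """reference_embeddings: dict class_name -> embedding of the single training image."""
    q = embed(query_path)
    return max(reference_embeddings, key=lambda c: float(q @ reference_embeddings[c]))
```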

21 citations


Journal ArticleDOI
TL;DR: This paper proposes an approach that first predicts the global image structure, and then uses the global structure for fine-grained pixel-level 3D layout extraction, and shows that employing the 3D structure prior information yields accurate 3D scene layout segmentation.
Abstract: Extracting the pixel-level 3D layout from a single image is important for different applications, such as object localization and image and video categorization. Traditionally, the 3D layout is derived by solving a pixel-level classification problem. However, the image-level 3D structure can be very beneficial for extracting the pixel-level 3D layout, since it implies how pixels in the image are organized. In this paper, we propose an approach that first predicts the global image structure and then uses the global structure for fine-grained pixel-level 3D layout extraction. In particular, image features are extracted based on multiple layout templates. We then learn a discriminative model for classifying the global layout at the image level. Using latent variables, we implicitly model the sublevel semantics of the image, which enriches the expressiveness of our model. After the image-level structure is obtained, it is used as prior knowledge to infer the pixel-wise 3D layout. Experiments show that the results of our model outperform the state-of-the-art methods by 11.7% for 3D structure classification. Moreover, we show that employing the 3D structure prior information yields accurate 3D scene layout segmentation.

20 citations


Patent
Sumit Gulwani, Vu Le
03 Mar 2015
TL;DR: In this paper, the authors present techniques for controlling automated programming for extracting data from an input document using examples of the data to extract from the input document, which can include highlighted regions on the input documents.
Abstract: Various technologies described herein pertain to controlling automated programming for extracting data from an input document. Examples indicative of the data to extract from the input document can be received. The examples can include highlighted regions on the input document. Moreover, the input document can be a semi-structured document (e.g. a text file, a log file, a word processor document, a semi-structured spreadsheet, a webpage, a fixed-layout document, an image file, etc.). Further, an extraction program for extracting the data from the input document can be synthesized based on the examples. The extraction program can be synthesized in a domain specific language (DSL) for a type of the input document. Moreover, the extraction program can be executed on the input document to extract an instance of an output data schema.

18 citations


Posted ContentDOI
TL;DR: A novel nonparametric and unsupervised method to compensate for undesirable document image distortions aiming to optimally improve OCR accuracy and text detection rate is presented.
Abstract: Digital camera and mobile document image acquisition are new trends arising in the world of Optical Character Recognition and text detection. In some cases, such a process introduces many distortions and produces poorly scanned text, text-photo images and natural images, leading to unreliable OCR digitization. In this paper, we present a novel nonparametric and unsupervised method to compensate for undesirable document image distortions, aiming to optimally improve OCR accuracy. Our approach relies on a very efficient stack of document image enhancing techniques to recover deformation of the entire document image. First, we propose a local brightness and contrast adjustment method to effectively handle lighting variations and the irregular distribution of image illumination. Second, we use an optimized greyscale conversion algorithm to transform our document image to greyscale. Third, we sharpen the useful information in the resulting greyscale image using an unsharp masking method. Finally, an optimal global binarization approach is used to prepare the final document image for OCR recognition. The proposed approach can significantly improve the text detection rate and optical character recognition accuracy. To demonstrate the efficiency of our approach, an exhaustive experimentation on a standard dataset is presented.
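The four-stage chain can be approximated with standard OpenCV operations; the sketch below uses CLAHE, the standard grayscale conversion, Gaussian-based unsharp masking and Otsu thresholding as stand-ins for the paper's own algorithms at each stage:

```python
# Condensed sketch of a brightness/contrast -> grayscale -> sharpen -> binarize chain.
import cv2

def enhance_for_ocr(bgr_img):
    # 1. Local brightness/contrast adjustment (here: CLAHE on the L channel).
    lab = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    bgr = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
    # 2. Grayscale conversion.
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    # 3. Unsharp masking: add back the difference from a blurred copy.
    blurred = cv2.GaussianBlur(gray, (0, 0), sigmaX=3)
    sharp = cv2.addWeighted(gray, 1.5, blurred, -0.5, 0)
    # 4. Global binarization (here: Otsu's threshold).
    _, binary = cv2.threshold(sharp, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```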

Proceedings ArticleDOI
23 Aug 2015
TL;DR: A computational model is presented to automatically predict document image quality towards facilitating OCR accuracy without references, and the experimental results show that the proposed method outperforms traditional document image quality assessment approaches.
Abstract: Optical character recognition (OCR) accuracy of document images is an important factor for the success of many document processing and analysis tasks, especially for unconstrained captured document images. Although several document image OCR capability assessment methods have been proposed, they mostly model the problem based on empirically defined rules of image degradation, which makes the existing approaches infeasible for predicting OCR scores. In this paper, a computational model is presented to automatically predict document image quality, towards facilitating OCR accuracy without references. Unlike conventional methods that use heuristically designed features, in our work the raw features are learned from training images and a generative quality model is built based on latent Dirichlet allocation, which is used to assess the document's OCR capability. We present evaluation results on a public dataset which has been captured using digital cameras with different levels of blur degradation. The experimental results show that the proposed method outperforms traditional document image quality assessment approaches.

Patent
Milan Sesum, Ivan Vujic
09 Mar 2015
TL;DR: In this article, text, paths, and images are extracted from the binarized image document and stored in a data store, and retrieved in order to create a flow document that may provide better adaption to a variety of reading experiences and provide editable documents.
Abstract: One or more components of an image document may be detected and extracted in order to create a flow document from the image document. Components of an image document may include text, one or more paths, and one or more images. The text may be detected using optical character recognition (OCR) and the image document may be binarized. The detected text may be extracted from the binarized image document to enable detection of the paths, which may then be extracted from the binarized image document to enable detection of the images. In some examples, the images, similar to the text and paths, may be extracted from the binarized image document. The extracted text, paths, and/or images may be stored in a data store, and may be retrieved in order to create a flow document that may provide better adaptation to a variety of reading experiences and provide editable documents.

Proceedings ArticleDOI
23 Aug 2015
TL;DR: A tag cloud evolving according to the reading is presented as a first application of the reading-life log, which contains basic features that can be used for many different kinds of applications.
Abstract: Instead of analyzing the document images directly, analyzing the reading of a document can offer new perspectives for extracting information about both the reader and the document. Analyzing how people read texts can help to understand the cognitive process of reading and might lead to new approaches and new solutions for pattern recognition and document image analysis. It can also lead to the creation of smart documents that can measure reading information, provide feedback and adapt themselves depending on the behavior of the readers. As a step towards document reading analysis, the authors propose in this paper a solution for extracting reading information and creating a “reading-life log”. This reading-life log contains basic features that can be used for many different kinds of applications. A tag cloud evolving according to the reading is presented as a first application of the reading-life log.

Patent
02 Jan 2015
TL;DR: In this paper, the authors compare document images to identify a first document image of a reference document that corresponds with a second document image from a related document, and transform the second image based on the layout of the first document.
Abstract: Systems and methods for enhancing and comparing documents. An example method comprises: comparing document images to identify a first document image of a reference document that corresponds with a second document image of a related document; transforming the second document image based on a layout of the first document image; and performing character recognition of the second document image.

Proceedings ArticleDOI
01 Nov 2015
TL;DR: A hybrid method consisting of the alternative bottom-up and top-down approaches is implemented to find the table region candidates by analyzing text lines and spare lines for detecting tables in document images.
Abstract: In this paper, we present a hybrid method consisting of three main stages for detecting tables in document images. Based on table structure, our system separates tables into two main categories: ruling-line tables and non-ruling-line tables. In the first stage, the text and non-text elements in the document are classified by a heuristic filter. Then, whitespace analysis is used to group the text elements into text lines, while ruling-line table candidates are identified from the non-text elements. In the second stage, based on the text lines and the text and non-text elements, a hybrid method consisting of alternative bottom-up and top-down approaches is applied to find the table region candidates. In the final stage, these candidates are examined to obtain the table regions by analyzing text lines and spare lines. Experimental results on the document database from the ICDAR2013 table competition show that the proposed method works better than previous ones.

Proceedings ArticleDOI
08 Jan 2015
TL;DR: This paper presents a hybrid method of page segmentation based on the combination of connected component analysis and classification on multilevel homogeneous regions, and achieves higher accuracy than other methods.
Abstract: This paper presents a hybrid method of page segmentation based on the combination of connected component analysis and classification on multilevel homogeneous regions. This suggests an iterative method in which connected component analysis is used to classify the non-text elements at each level of homogeneous region, and the multilevel homogeneity structure is used to ensure that this classification can identify all non-text elements. The result of this iterative method is two documents: a text document and a non-text document. On the text document, adaptive mathematical morphology in each homogeneous text region gives us the corresponding text region. On the non-text document, a more detailed classification of the non-text components is made to obtain separators, tables, images, etc. For evaluation, we test our method on datasets from the ICDAR2009 page segmentation competition. According to the results, our proposed method achieves higher accuracy than other methods. This proves the effectiveness and superiority of our proposed method.

Proceedings ArticleDOI
Hervé Déjean
23 Aug 2015
TL;DR: The method first analyzes the layout of the page, building several concurrent layout structures that are used to correct or tag some elements missed by the tagging step and the final set of structured data is extracted.
Abstract: We present a method for extracting structured elements of information, called structured data (sdata), from ocr'ed pages. The method first analyzes the layout of the page, building several concurrent layout structures. Then a tagging step is performed in order to tag textual elements based on their content. Combining the layout structures and the tagged elements, layout models for representing the structured data are inferred for the current page. These models are used to correct or tag some elements missed by the tagging step. The final set of structured data is extracted. An evaluation is presented.

Journal ArticleDOI
TL;DR: This paper offers a review of state-of-the-art document image processing methods and their classification, identifying new trends for automatic document processing and understanding, and presents a comparative survey based on important aspects of a marketable system that is dependent on document image processing techniques.
Abstract: This paper offers a review of the state-of-the-art document image processing methods and their classification, identifying new trends for automatic document processing and understanding. Document image processing (DIP) is an important problem related to most of the challenges coming from the image processing field, with applications to digital document summarization, readers for the visually impaired, etc. Difficulties in the processing of documents can arise from lighting conditions, page curl, page rotation in 3D, and page layout segmentation. Document image processing is usually performed in the context of higher-level applications that require an undistorted document image, such as optical character recognition and document restoration/preservation. Typically, assumptions are made to constrain the processing problem in the context of a particular application. In this survey, we categorize document image processing methods on the basis of the technique, provide detailed descriptions of representative methods in each category, and examine their pros and cons. It is important to note here that the DIP field is broad; thus we try to provide a top-down/horizontal survey rather than a bottom-up one. At the same time, we target the area of document readers for the blind, and use this application to guide us in a top-down survey of DIP. Moreover, we present a comparative survey based on important aspects of a marketable system that is dependent on document image processing techniques.

Journal ArticleDOI
TL;DR: The results of the proposed method for paragraph structure recognition are comparable to the referenced methods which offer segmentation only.
Abstract: The paper presents a complete solution for recognition of textual and graphic structures in various types of documents acquired from the Internet. In the proposed approach, the document structure recognition problem is divided into sub-problems. The first one is localizing logical structure elements within the document. The second one is recognizing segmented logical structure elements. The input to the method is an image of document page, the output is the XML file containing all graphic and textual elements included in the document, preserving the reading order of document blocks. This file contains information about the identity and position of all logical elements in the document image. The paper describes all details of the proposed method and shows the results of the experiments validating its effectiveness. The results of the proposed method for paragraph structure recognition are comparable to the referenced methods which offer segmentation only.

Patent
28 Sep 2015
TL;DR: In this article, the layout intent associated with explicitly formatted document elements in a document is inferred and an intent-based document is then created using the inferred layout intent for some or all of the explicitly formatted documents in the document.
Abstract: Technologies are described herein for inferring the layout intent associated with explicitly formatted document elements in a document. The layout type of a document having explicitly formatted document elements is determined. Once the layout type for the document has been determined, the layout intent of explicitly formatted document elements in the document may be determined based, at least in part, on the determined layout type of the document. Heuristic algorithms and/or machine learning classifiers may determine the layout intent of the explicitly formatted document elements in the document. An intent-based document is then created using the inferred layout intent for some or all of the explicitly formatted document elements in the document. The intent-based document may then be provided to an intent-based rendering or authoring application for rendering based upon the inferred layout intent.

Proceedings ArticleDOI
08 Sep 2015
TL;DR: A new layout descriptor is presented that achieves 100% accuracy and retrieval in a document retrieval scheme on a database of 960 document images and it reduces the size of the index of the database by a factor 400.
Abstract: Security applications related to document authentication require an exact match between an authentic copy and the original of a document. This implies that the document analysis algorithms that are used to compare two documents (original and copy) should provide the same output. This kind of algorithm includes the computation of layout descriptors from the segmentation result, as the layout of a document is a part of its semantic content. To this end, this paper presents a new layout descriptor that significantly improves the state of the art. The basis of this descriptor is the use of a Delaunay triangulation of the centroids of the document regions. This triangulation is seen as a graph, and the adjacency matrix of the graph forms the descriptor. While most layout descriptors have a stability of 0% with regard to an exact match, our descriptor has a stability of 74%, which can be brought up to 100% with the use of an appropriate matching algorithm. It also achieves 100% accuracy and retrieval in a document retrieval scheme on a database of 960 document images. Furthermore, this descriptor is extremely efficient, as it performs a search in constant time with respect to the size of the document database and reduces the size of the index of the database by a factor of 400.
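The core of such a descriptor is straightforward to sketch with SciPy; the matching and region-ordering details from the paper are omitted here:

```python
# Sketch: Delaunay triangulation of region centroids encoded as an adjacency matrix.
import numpy as np
from scipy.spatial import Delaunay

def layout_descriptor(centroids):
    """centroids: (n, 2) array of region centroid coordinates, n >= 3."""
    tri = Delaunay(np.asarray(centroids, dtype=float))
    n = len(centroids)
    adj = np.zeros((n, n), dtype=np.uint8)
    for simplex in tri.simplices:                 # each simplex is a triangle (i, j, k)
        for i in range(3):
            a, b = simplex[i], simplex[(i + 1) % 3]
            adj[a, b] = adj[b, a] = 1
    return adj
```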

Proceedings ArticleDOI
10 Dec 2015
TL;DR: This work combines image and text based classifiers to categorize documents as relevant or irrelevant to cis-regulatory modules in the context of gene-networks, demonstrating the significance of incorporating image data, and specifically OCR-based features, into the document categorization process.
Abstract: Images form a rich information source, which remains underutilized in biomedical document classification. We present here work that uses both image- and text-based features in order to identify articles of interest, in this case, pertaining to cis-regulatory modules in the context of gene-networks. Extending on our new idea, which we have recently introduced, of using OCR-based features to identify DNA contents in images, we combine image and text based classifiers to categorize documents as relevant or irrelevant to cis-regulatory modules. Using a set of hundreds of articles, marked by experts as relevant or irrelevant to cis-regulatory modules, we train/test image and text based classifiers, as well as classifiers integrating both. Our results indicate that the latter show the best performance with Recall, F-measure and Utility measures all above 0.9, demonstrating the significance of incorporating image data, and specifically OCR-based features, into the document categorization process. Moreover, the use of character distribution properties to represent images is directly relevant to other biomedical images containing text (e.g. RNA, proteins). Diagrams and other images containing text are also prevalent outside the biomedical domain, hence the work stands to be applicable and beneficial in other application areas.

Journal ArticleDOI
TL;DR: A novel noise reduction method by applying a machine learning technique to classify and reduce noise in document images and an enhance labeling method of semi-supervised cluster-and-label approach that can significantly improve the accuracy of labeling examples and the performance of classification.
Abstract: We propose a novel noise reduction method for document images. Semi-supervised learning is applied to separate noise from character components. The proposed method is suitable for non-Latin scripts, i.e., Thai document images. We propose an enhanced labeling method for the semi-supervised cluster-and-label approach. The performance of the proposed methods is significantly better than that of the comparison methods. Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research works have applied image processing techniques. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. The characteristics of non-Latin-script documents, such as Thai, are considerably more complicated than those of Latin-script documents and include many levels of character alignment, no word or sentence separator, and variability in character size. When applying an image processing technique to a Thai document, we usually remove characters that are relatively close to noise. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely, feature-selected sub-cluster labeling. Feature-selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification. We compared the performance of noise reduction and character preservation between the proposed method and two related noise reduction approaches, i.e., a two-phased stroke-like pattern noise (SPN) removal and a commercial noise reduction software called ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, with F-measures of 86.01 for characters and 97.82 for noise, respectively.
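The underlying cluster-and-label idea can be sketched generically as below; this is not the paper's feature-selected sub-cluster labeling, and the cluster count and fallback label are illustrative:

```python
# Generic cluster-and-label sketch: unlabeled components inherit the majority label
# of the labeled examples that fall in their cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_label(X_labeled, y_labeled, X_unlabeled, n_clusters=20):
    X_all = np.vstack([X_labeled, X_unlabeled])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    labeled_clusters = clusters[: len(X_labeled)]
    unlabeled_clusters = clusters[len(X_labeled):]
    y_unlabeled = np.empty(len(X_unlabeled), dtype=y_labeled.dtype)
    for c in np.unique(clusters):
        members = labeled_clusters == c
        if members.any():
            # Majority vote among the labeled members of this cluster.
            values, counts = np.unique(y_labeled[members], return_counts=True)
            label = values[np.argmax(counts)]
        else:
            label = 0  # fall back to "noise" when a cluster has no labeled member
        y_unlabeled[unlabeled_clusters == c] = label
    return y_unlabeled
```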

Proceedings ArticleDOI
26 Feb 2015
TL;DR: A comprehensive survey has been conducted on some state-of-the-art document image binarization techniques, and it has been found that the adaptive contrast method is the best-performing method.
Abstract: Document image binarization is performed to segment foreground text from the background in badly degraded documents. In this paper, a comprehensive survey has been conducted on some state-of-the-art document image binarization techniques. After describing these document image binarization techniques, their performance has been compared with the help of various evaluation performance metrics which are widely used for document image analysis and recognition. On the basis of this comparison, it has been found that the adaptive contrast method is the best-performing method. Accordingly, the partial results that we have obtained for the adaptive contrast method have been stated, and the mathematical model and block diagram of the adaptive contrast method have been described in detail.
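For reference, one of the local-thresholding baselines that such surveys typically compare, Sauvola's method, is available off the shelf in scikit-image; this is a generic baseline, not the adaptive contrast method discussed above:

```python
# Sauvola local thresholding as a generic binarization baseline (scikit-image).
import numpy as np
from skimage import io, filters

def sauvola_binarize(path, window_size=25, k=0.2):
    gray = io.imread(path, as_gray=True)
    thresh = filters.threshold_sauvola(gray, window_size=window_size, k=k)
    # Pixels brighter than the local threshold become white background; text stays dark.
    return (gray > thresh).astype(np.uint8) * 255
```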

Proceedings ArticleDOI
01 Sep 2015
TL;DR: A completely automatic and scalable framework to perform query-by-example word-spotting on medieval manuscripts that does not require any human intervention to produce a large amount of annotated training data and provides Computer Vision researchers and Cultural Heritage practitioners with a compact and efficient system for document analysis.
Abstract: We present a completely automatic and scalable framework to perform query-by-example word-spotting on medieval manuscripts. Our system does not require any human intervention to produce a large amount of annotated training data, and it provides Computer Vision researchers and Cultural Heritage practitioners with a compact and efficient system for document analysis. We have executed the pipeline both in a single-manuscript and a cross-manuscript setup, and we have tested it on a heterogeneous set of medieval manuscripts, that includes a variety of writing styles, languages, image resolutions, levels of conservation, noise and amount of illumination and ornamentation. We also present a precision/recall based analysis to quantitatively assess the quality of the proposed algorithm.

Proceedings ArticleDOI
01 Oct 2015
TL;DR: A weighted document frequency (WDF) that takes into account how important a feature is within each document is proposed and is expected to reduce noise to some extent in text classification.
Abstract: In past research, Document Frequency (DF) has been validated to be a simple yet quite effective measure for feature selection in text classification. The calculation is based on how many documents in a collection contain a feature, which can be a word, a phrase, an n-gram, or a specially derived attribute. The counting process takes a binary strategy: if a feature appears in a document, its DF is increased by one. This traditional DF metric is concerned only with whether a feature appears in a document; it does not consider how important the feature is in that document. Document frequency counted this way is therefore very likely to introduce noise. Therefore, a weighted document frequency (WDF) is proposed and is expected to reduce such noise to some extent. Extensive experiments on two text classification datasets demonstrate the effectiveness of the proposed measure.
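The weighting idea can be sketched as follows; here each document contributes its relative term frequency instead of a bare 0/1 count, which may differ from the paper's exact weighting scheme:

```python
# Plain DF versus a weighted DF where each document contributes term prominence.
from collections import Counter, defaultdict

def weighted_document_frequency(tokenized_docs):
    """tokenized_docs: list of token lists. Returns (term -> DF, term -> WDF)."""
    df, wdf = defaultdict(int), defaultdict(float)
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        for term, c in counts.items():
            df[term] += 1                 # classic binary count
            wdf[term] += c / total        # weight by within-document prominence
    return df, wdf
```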

Proceedings ArticleDOI
23 Aug 2015
TL;DR: A very generic and robust method which can cope with a wide variety of document types and writing systems is proposed, which uses derivatives in the Hough space to identify directions with sudden changes in their projection profiles.
Abstract: One of the basic challenges in page layout analysis of scanned document images is the estimation of the document skew. Precise skew correction is particularly important when the document is to be passed to an optical character recognition system. In this paper, we propose a very generic and robust method which can cope with a wide variety of document types and writing systems. It uses derivatives in the Hough space to identify directions with sudden changes in their projection profiles. We show that this criterion is useful to identify the horizontal and vertical direction with respect to the document. We test our method on the DISEC'13 data set for document skew detection. Our results are comparable to the best systems in the literature.
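For contrast, a simple projection-profile baseline for skew estimation, scoring candidate angles by how abruptly the horizontal projection changes, can be sketched as follows; this is not the paper's Hough-derivative criterion:

```python
# Projection-profile skew estimation baseline (illustrative, brute-force angle sweep).
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, angle_range=5.0, step=0.1):
    """binary_img: 2-D array with text pixels > 0. Returns the estimated angle in degrees."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-angle_range, angle_range + step, step):
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)            # horizontal projection profile
        score = np.var(np.diff(profile))         # sharp line transitions -> high variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```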

Proceedings ArticleDOI
01 Dec 2015
TL;DR: The proposed stepped-line layout is a new technique for improving the efficiency of eye movements while reading Japanese text, in which the reader's eyes tend to fixate on every phrase, without any increase in cognitive load.
Abstract: We propose a new electronic text format with a stepped-line layout to optimize viewing position and to improve the efficiency of reading Japanese text. Generally, the reader's eyes try to fixate on every phrase while reading Japanese text. To date, no method has been proposed to optimize the fixation position while reading. In the case of spaced text such as English, the space characters provide the boundary information for eye movements; in the case of Japanese text, however, reading speed decreases when spaces are inserted between phrases. With the new stepped-line text format proposed in this report, a text line is segmented and stepped down between phrases; moreover, line breaks are present between phrases. To evaluate the effect of the stepped-line layout on reading efficiency, we measured reading speeds and eye movements for both the new layout and a conventional straight-line layout. The reading speed for the new stepped-line layout is approximately 13% faster than that for the straight-line layout, whereas the number of fixations in the stepped-line layout is approximately 11% less than that in the straight-line layout. This is primarily achieved by a reduction in the number of regressions and an increase in the forward saccade length. Moreover, 91% of participants did not experience illegibility or incongruousness when reading the stepped-line layout, suggesting that the stepped-line layout is a new technique for improving the efficiency of eye movements while reading without any increase in cognitive load.

Proceedings ArticleDOI
08 Feb 2015
TL;DR: The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within a document, and leverages results from the matching process to provide feedback to the identification process and further improve the algorithm accuracy.
Abstract: Cross-references, such as footnotes, endnotes, figure/table captions, and references, are a common and useful type of page element that further explains the corresponding entities in the target document. In this paper, we focus on cross-reference identification in a PDF document and present a robust method as a case study of identifying footnotes and figure references. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within a document. A number of novel features within a PDF document, i.e., page layout, font information, and lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable in one document but vary across different kinds of documents, so that the identification process adapts to document types. In addition, this method leverages results from the matching process to provide feedback to the identification process and further improve the algorithm's accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in a PDF document.