Proceedings Article
Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents
Siyuan Chen, Song Mao, George R. Thoma
- Vol. 1, pp 118-122
TLDR
This paper presents an unsupervised method where layout style information is explicitly used in both training and recognition phases, and is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
Abstract:
Logical entity recognition in heterogeneous collections of document page images remains a challenging problem, since the performance of traditional supervised methods degrades dramatically when there are many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both the training and recognition phases. We represent the layout style, local features, and logical labels of the physical regions of a document compactly by an ordered labeled X-Y tree. The style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in the training set are grouped into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in the closest-matched layout style cluster of the training set. Experimental results show that our algorithm is robust to both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
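The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Node` class, the unit-cost top-down tree edit distance, the threshold-based leader clustering, and the position-based label transfer are all stand-ins, since the abstract does not specify the actual distance, clustering algorithm, or matching procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Node of an ordered labeled tree (hypothetical stand-in for an X-Y tree).

    `kind` is e.g. 'H' or 'V' for a cut direction, or 'zone' for a leaf region.
    `logical` is the logical label, known only for training pages.
    """
    kind: str
    children: List["Node"] = field(default_factory=list)
    logical: Optional[str] = None


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Illustrative top-down ordered tree edit distance with unit costs.

    Relabeling a node costs 1; inserting or deleting a subtree costs its size.
    """
    root_cost = 0 if a.kind == b.kind else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of aligning the first i children of a with the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree of a
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree of b
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return root_cost + d[m][n]


def cluster_styles(pages: List[Node], threshold: int) -> List[List[Node]]:
    """Assumed leader clustering: the first page of a cluster represents it."""
    clusters: List[List[Node]] = []
    for page in pages:
        for cluster in clusters:
            if tree_distance(page, cluster[0]) <= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def leaves(t: Node) -> List[Node]:
    """Leaf zones of t in left-to-right order."""
    if not t.children:
        return [t]
    return [leaf for c in t.children for leaf in leaves(c)]


def recognize(page: Node, clusters: List[List[Node]]) -> Tuple[int, List[Optional[str]]]:
    """Pick the closest style cluster, then the closest tree inside it,
    and copy that tree's leaf labels by position (a simplification)."""
    style = min(range(len(clusters)),
                key=lambda k: tree_distance(page, clusters[k][0]))
    best = min(clusters[style], key=lambda t: tree_distance(page, t))
    train_labels = [leaf.logical for leaf in leaves(best)]
    n = len(leaves(page))
    return style, (train_labels + [None] * n)[:n]


def zone(label: Optional[str] = None) -> Node:
    return Node("zone", logical=label)


# Two toy training pages with different layout styles.
one_col = Node("H", [zone("title"), zone("author"), zone("abstract")])
two_col = Node("H", [zone("title"), Node("V", [zone("author"), zone("abstract")])])
clusters = cluster_styles([one_col, two_col], threshold=1)

# An unlabeled input page: style and logical labels come out together.
style, labels = recognize(Node("H", [zone(), zone(), zone()]), clusters)
```

Copying labels by leaf position would not cope with the zone over-segmentation mentioned in the abstract; a fuller version would presumably transfer labels through the node alignment found while computing the tree distance.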
Citations
Proceedings Article
Metadata Extraction from PDF Papers for Digital Library Ingest
TL;DR: A package that is designed to extract basic metadata from PDF documents is described, based on a suitable combination of several techniques that include PDF parsing, low level document image processing, and layout analysis.
Proceedings Article
Structured document classification by matching local salient features
TL;DR: This paper presents a novel approach for structured document classification by matching the salient feature points between the query image and the reference images, which is robust to diverse training data size, image formats and qualities.
Book Chapter
Machine Learning for Document Structure Recognition
Gerhard Paaß, Iuliu Konya
TL;DR: This chapter describes approaches for document structure recognition detecting the hierarchy of physical components in images of documents and transforms this into a hierarchy of logical components, such as titles, authors, and sections, which improves readability and is useful for indexing and retrieving information contained in documents.
Proceedings Article
Scientific challenges underlying production document processing
TL;DR: The challenge therefore extends beyond the science behind document image recognition and into user interface and user experience design.
Patent
System and method for identifying pictures in documents
TL;DR: In this paper, a system and method to identify pictures in documents is presented, where an image representing a page of a document is received, and the image is analyzed to identify text objects in the page.
References
Journal Article
LIBSVM: A library for support vector machines
Chih-Chung Chang, Chih-Jen Lin
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Book
Finding Groups in Data: An Introduction to Cluster Analysis
Journal Article
Finding Groups in Data: An Introduction to Cluster Analysis
TL;DR: This book make understandable the cluster analysis is based notion of starsmodern treatment, which efficiently finds accurate clusters in data and discusses various types of study the user set explicitly but also proposes another.
A Practical Guide to Support Vector Classification
TL;DR: A simple procedure is proposed, which usually gives reasonable results and is suitable for beginners who are not familiar with SVM.
Journal Article
Syntactic segmentation and labeling of digitized pages from technical journals
TL;DR: It is shown that families of technical documents that share the same layout conventions can be readily analyzed and backtracking for error recovery and branch and bound for maximum-area labeling are implemented with Unix Shell programs.