Proceedings Article

Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents

Siyuan Chen, +2 more
- Vol. 1, pp 118-122
TLDR
This paper presents an unsupervised method that uses layout style information explicitly in both the training and recognition phases. The method is robust to balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
Abstract
Logical entity recognition in heterogeneous collections of document page images remains a challenging problem, since the performance of traditional supervised methods degrades dramatically when a collection contains many distinct layout styles. In this paper we present an unsupervised method in which layout style information is used explicitly in both the training and recognition phases. We represent the layout style, local features, and logical labels of the physical regions of a document compactly by an ordered labeled X-Y tree. The style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in the training set are grouped into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in the closest-matched layout style cluster of the training set. Experimental results show that our algorithm is robust to balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
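
The pipeline described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The abstract does not say which tree distance or clustering algorithm is used, so this sketch substitutes a simple top-down ordered tree edit distance (in the style of Selkow) and a greedy threshold-based "leader" clustering. The names `Node`, `tree_distance`, `cluster_styles`, `transfer_labels`, and `recognize`, the cut codes `"H"`, `"V"`, `"Z"`, and the threshold value are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Ordered labeled X-Y tree node (illustrative encoding, not the paper's)."""
    cut: str                            # "H"/"V" for a cut, "Z" for a leaf zone
    children: Tuple["Node", ...] = ()
    label: Optional[str] = None         # logical label, known for training pages


def zone(label: Optional[str] = None) -> Node:
    return Node("Z", (), label)


def size(t: Node) -> int:
    return 1 + sum(size(c) for c in t.children)


def leaf_count(t: Node) -> int:
    return 1 if not t.children else sum(leaf_count(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down edit distance (assumed stand-in for the paper's tree distance).

    Root relabel costs 1; inserting or deleting a child subtree costs its size.
    Logical labels are ignored because input pages do not have them.
    """
    xs, ys = a.children, b.children
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                         # delete
                d[i][j - 1] + size(ys[j - 1]),                         # insert
                d[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),  # match
            )
    return (0 if a.cut == b.cut else 1) + d[-1][-1]


def cluster_styles(trees: List[Node], threshold: int) -> List[List[Node]]:
    """Greedy leader clustering; the first member of a cluster represents it."""
    clusters: List[List[Node]] = []
    for t in trees:
        for c in clusters:
            if tree_distance(t, c[0]) <= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters


def transfer_labels(page: Node, ref: Node) -> List[Optional[str]]:
    """Copy leaf labels from ref wherever the two trees align by position."""
    if not page.children:
        return [ref.label if not ref.children else None]
    out: List[Optional[str]] = []
    for i, child in enumerate(page.children):
        if i < len(ref.children):
            out.extend(transfer_labels(child, ref.children[i]))
        else:
            out.extend([None] * leaf_count(child))  # no counterpart in ref
    return out


def recognize(page: Node, clusters: List[List[Node]]):
    """Pick the closest style cluster, then label from its closest tree."""
    style = min(range(len(clusters)),
                key=lambda k: tree_distance(page, clusters[k][0]))
    best = min(clusters[style], key=lambda t: tree_distance(page, t))
    return style, transfer_labels(page, best)


# Toy training set: two one-column pages and one two-column page.
one_col = Node("H", (zone("title"), zone("author"), zone("body")))
one_col_b = Node("H", (zone("title"), zone("author"), zone("body"), zone("footer")))
two_col = Node("H", (zone("title"), Node("V", (zone("body"), zone("body")))))
clusters = cluster_styles([one_col, two_col, one_col_b], threshold=1)

# Unlabeled input: a two-column page with one extra trailing zone.
query = Node("H", (zone(), Node("V", (zone(), zone())), zone()))
result = recognize(query, clusters)  # -> (1, ['title', 'body', 'body', None])
```

In this toy run the query is assigned to the two-column style and inherits labels from the matching training tree, while the extra zone, which has no counterpart, is left unlabeled. A full system would need a distance that also accounts for local zone features, which this sketch omits.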
