Proceedings Article

Segmenting printed text and handwritten annotation by Spectral Partitioning

TL;DR: This paper addresses the problem of segmenting handwritten annotations on scientific research papers by geometrically segmenting complex cases of handwritten annotations, including marks, cuts and special symbols, along with the regular text.
Abstract: This paper addresses the problem of segmenting handwritten annotations on scientific research papers. The motivation of this work is to geometrically segment the complex cases of handwritten annotations, including marks, cuts and special symbols, along with the regular text. Our work particularly focuses on documents that have multi-oriented handwritten annotations [1] rather than annotations in a controlled scenario [2]. Spectral Partitioning is adopted as the segmentation scheme to separate the printed text and annotations. A new feature, Envelope Straightness, is developed and included in our feature set. This leads to an improvement in accuracy over the state-of-the-art features. The experiments are performed on two datasets: 40 documents authored by two writers from the IAM dataset, comprising only printed and handwritten text, and a self-created dataset of 40 scientific papers from various proceedings annotated by a reader, comprising varied types of annotations. In the framework of spectral partitioning, our feature set achieves a recall of 98.39% for printed text and a precision of 85.40% for handwritten annotations on our dataset. For the IAM dataset, our feature set achieves a recall of 81.89% for printed text and a precision of 69.67% for handwritten annotations. The results achieved on both datasets are better than those obtained using [3] [1].
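The abstract reports recall for one class and precision for the other, so it may help to see how the two measures differ. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)); the function name `precision_recall` and the toy labels are invented for illustration and are not taken from the paper.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Return (precision, recall) for the class named `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (class never predicted / never present).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy component labels, invented purely for illustration (not data from the paper).
truth = ["printed", "printed", "printed", "handwritten", "handwritten"]
predicted = ["printed", "printed", "handwritten", "handwritten", "handwritten"]

printed_precision, printed_recall = precision_recall(truth, predicted, "printed")
hand_precision, hand_recall = precision_recall(truth, predicted, "handwritten")
```

In this toy case one printed component is mislabelled as handwritten, which lowers the recall of the printed class and the precision of the handwritten class while leaving the other two values at 1.0.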
Citations
Proceedings Article
01 Sep 2019
TL;DR: This work uses a Graph Autoencoder to perform the intended field label to field value association in a given form image, which is the first attempt to perform label-value associations in a handwritten form image using a machine learning approach.
Abstract: We propose a graph-based deep network for predicting the associations between field labels and field values in heterogeneous handwritten form images. We consider forms in which the field label comprises printed text and the field value can be handwritten text. Inspired by the relationship-predicting capability of graphical models, we use a Graph Autoencoder to perform the intended field label to field value association in a given form image. To the best of our knowledge, it is the first attempt to perform label-value association in a handwritten form image using a machine learning approach. We have prepared our handwritten form image dataset comprising 300 images from 30 different templates, with 10 images per template. Our framework is evaluated with different network parameters and has shown promising results.

1 citation


Cites methods from "Segmenting printed text and handwri..."

  • ...The features used to distinguish between them have been adopted from [11] as: field component patch size, foreground density, average stroke width, horizontal and vertical density difference, maximum horizontal and vertical runlength and standard deviation of horizontal and vertical projection of a patch....
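A few of the features named in this excerpt can be sketched for a binary image patch as follows. This is only an illustration under assumptions: the excerpt does not give the exact definitions used in [11], so the plain definitions below (ink fraction, longest ink run, standard deviation of row and column ink counts) and the names `patch_features` and `max_run_length` are hypothetical. Average stroke width and the density-difference features are left out because their definitions cannot be recovered from the excerpt.

```python
import numpy as np


def max_run_length(lines):
    """Longest run of consecutive foreground (non-zero) pixels in any line."""
    best = 0
    for line in lines:
        run = 0
        for pixel in line:
            run = run + 1 if pixel else 0
            best = max(best, run)
    return best


def patch_features(patch):
    """Simple features of a binary patch (1 = ink, 0 = background).

    Assumed definitions, not necessarily those of the cited paper.
    """
    patch = np.asarray(patch, dtype=np.uint8)
    row_profile = patch.sum(axis=1)  # ink pixels per row
    col_profile = patch.sum(axis=0)  # ink pixels per column
    return {
        "height": patch.shape[0],
        "width": patch.shape[1],
        "foreground_density": float(patch.mean()),
        "max_horizontal_run": max_run_length(patch),
        "max_vertical_run": max_run_length(patch.T),
        "row_profile_std": float(row_profile.std()),
        "col_profile_std": float(col_profile.std()),
    }


# Tiny hand-made patch, for illustration only.
patch = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
features = patch_features(patch)
```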


References
Journal Article
TL;DR: In this article, the authors present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches, and discuss the advantages and disadvantages of these algorithms.
Abstract: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
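For orientation, the graph Laplacians usually meant in this context can be written out as below. The notation is an assumption chosen for this note (W for a symmetric, non-negative similarity matrix over n points, D for its diagonal degree matrix) and is not quoted from the abstract.

```latex
% W = (w_{ij}): symmetric, non-negative similarity matrix; D: diagonal degree matrix
\begin{align}
  d_i &= \sum_{j=1}^{n} w_{ij}, \qquad D = \operatorname{diag}(d_1, \dots, d_n) \\
  L &= D - W
      && \text{(unnormalized Laplacian)} \\
  L_{\mathrm{sym}} &= D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}
      && \text{(symmetric normalized Laplacian)} \\
  L_{\mathrm{rw}} &= D^{-1} L = I - D^{-1} W
      && \text{(random-walk normalized Laplacian)} \\
  f^{\top} L f &= \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \, (f_i - f_j)^2
      && \text{for every } f \in \mathbb{R}^{n}
\end{align}
```

The last identity shows that L is positive semi-definite and that vectors f which change little between strongly similar points give a small quadratic form, which is why eigenvectors with small eigenvalues carry cluster structure.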

9,141 citations

Proceedings Article
03 Jan 2001
TL;DR: A simple spectral clustering algorithm that can be implemented using a few lines of Matlab is presented, and tools from matrix perturbation theory are used to analyze the algorithm, and give conditions under which it can be expected to do well.
Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.
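The algorithm this abstract refers to is commonly summarised in five steps: build a Gaussian affinity matrix with zero diagonal, normalise it as D^{-1/2} A D^{-1/2}, take the k eigenvectors with the largest eigenvalues, rescale each row to unit length, and run k-means on the rows. The following is a minimal NumPy sketch of those steps, not the authors' Matlab code. The function name, the `sigma` default, the toy points, and the small deterministic k-means (farthest-first initialisation) are all choices made for this illustration.

```python
import numpy as np


def njw_spectral_clustering(points, k, sigma=1.0, n_iter=50):
    """Cluster `points` into k groups; a sketch of the five steps above."""
    points = np.asarray(points, dtype=float)

    # Step 1: Gaussian affinity matrix with zero diagonal.
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)

    # Step 2: normalised affinity D^{-1/2} A D^{-1/2}.
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = affinity * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

    # Step 3: eigh returns eigenvalues in ascending order, so the last
    # k columns are the eigenvectors with the k largest eigenvalues.
    _, eigvecs = np.linalg.eigh(normalized)
    embedding = eigvecs[:, -k:]

    # Step 4: rescale every row of the embedding to unit length.
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

    # Step 5: plain k-means on the rows (farthest-first initial centres,
    # an illustrative choice that keeps the example deterministic).
    centers = [embedding[0]]
    for _ in range(1, k):
        dists = np.min(
            [((embedding - c) ** 2).sum(axis=1) for c in centers], axis=0
        )
        centers.append(embedding[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist_to_centers = (
            (embedding[:, None, :] - centers[None, :, :]) ** 2
        ).sum(axis=2)
        labels = dist_to_centers.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = embedding[labels == j].mean(axis=0)
    return labels


# Two small, well-separated groups of 2-D points (toy data).
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = njw_spectral_clustering(points, k=2)
```

On this toy input the first three points share one label and the last three share the other. The sketch does not guard against isolated points whose affinities underflow to zero, which a practical implementation would need to handle.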

9,043 citations


"Segmenting printed text and handwri..." refers methods in this paper

  • ...The Spectral approach relies on the Eigen structure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity [12]....
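For the two-class case (such as printed versus handwritten), the idea in this excerpt can be illustrated by splitting on the sign of the eigenvector that belongs to the second-smallest eigenvalue of the graph Laplacian. This is a generic sketch using the unnormalized Laplacian D - W. The excerpt does not say which Laplacian or splitting rule the paper uses, and the function name and similarity values below are made up for illustration.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split points into two groups from a symmetric similarity matrix."""
    similarity = np.asarray(similarity, dtype=float)
    degree = similarity.sum(axis=1)
    laplacian = np.diag(degree) - similarity  # unnormalized Laplacian D - W
    # eigh returns eigenvalues in ascending order; column 1 holds the
    # eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    second_vector = eigvecs[:, 1]
    # The sign of an eigenvector is arbitrary, so only the grouping matters,
    # not which group is labelled True.
    return second_vector >= 0


# Toy similarity matrix: points 0,1 are alike, points 2,3 are alike,
# and similarity across the two pairs is low.
similarity = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.9, 0.0],
]
side = spectral_bipartition(similarity)
```

Here points 0 and 1 land on one side and points 2 and 3 on the other: high within-group similarity and low cross-group similarity, as the excerpt describes.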


Proceedings Article
10 Sep 2001
TL;DR: An algorithm that is based on the theory of hidden Markov models (HMMs) to distinguish between machine-printed and handwritten materials is presented, which has been shown to be promising in the authors' experiments.
Abstract: In this paper, we address the problem of separating handwritten annotations from machine-printed text within a document. We present an algorithm that is based on the theory of hidden Markov models (HMMs) to distinguish between machine-printed and handwritten materials. No OCR results are required prior to or during the process, and the classification is performed at the word level. Handwritten annotations are not limited to marginal areas, as the approach can deal with document images having handwritten annotations overlaid on machine-printed text and it has been shown to be promising in our experiments. Experimental results show that the proposed method can achieve 72.19% recall for fully extracted handwritten words and 90.37% for partially extracted words. The precision of extracting handwritten words has reached 92.86%.

107 citations


"Segmenting printed text and handwri..." refers background in this paper

  • ...Only a few systems are able to handle multi-oriented handwritten annotations [1] [3] [7] in a real environment....


Proceedings Article
14 Aug 1995
TL;DR: In this work a classification system is presented which reads a raster image of a character and outputs two confidence values, one for the "machine-written" and one for the "hand-written" character class.
Abstract: In applications of character recognition where machine-printed and hand-written characters are involved, it is important to know if the character image, or the whole word, is machine- or hand-written. This is due to the accuracy difference between the algorithms and systems oriented to machine- or handwritten characters. Obviously, this type of knowledge leads to the increase of the overall system quality. In this work a classification system is presented which reads a raster image of a character and outputs two confidence values, one for "machine-written" and one for "hand-written" character classes, respectively. The proposed system features a preprocessing step, which transforms a general uncentered character image into a normalized form, then the feature extraction phase extracts relevant information from the image, and at the end, a standard classifier based on a feedforward neural network creates the final response. At the end, some results on a proprietary image database are reported.

65 citations


"Segmenting printed text and handwri..." refers background or methods in this paper

  • ...Most of these systems extract annotations in a controlled scenario [4] [5] [6] [2]....


  • ...It commenced with the contribution of Kuhnke et al. [4] on printed and hand-written character segmentation, using directional and symmetrical features fed into a neural network....


Proceedings Article
20 Sep 1999
TL;DR: This paper presents a classification scheme for both Bangla and Devnagari characters based on the structural and statistical features of the machine-printed and hand-written text lines and has an accuracy of about 98.3%.
Abstract: There are many types of documents where machine-printed and hand-written texts appear intermixed. Since the optical character recognition (OCR) methodologies for machine-printed and hand-written texts are different, it is necessary to separate these two types of text before feeding them to the respective OCR systems. In this paper, we present such a scheme for both Bangla and Devnagari characters. The scheme is based on the structural and statistical features of the machine-printed and hand-written text lines. The classification scheme has an accuracy of about 98.3%.

48 citations


"Segmenting printed text and handwri..." refers background or methods in this paper

  • ...Most of these systems extract annotations in a controlled scenario [4] [5] [6] [2]....


  • ...Pal and Chaudhuri [5] segmented handwritten and printed text lines of Bangla and Devnagari using a tree-based classification approach....
