# Preprocessing of Low-Quality Handwritten Documents Using Markov Random Fields

TL;DR: This paper presents a statistical approach to preprocessing degraded handwritten forms that covers both binarization and form line removal, with the MRF model modified to drop the preprinted ruling lines from the image.

Abstract: This paper presents a statistical approach to the preprocessing of degraded handwritten forms including the steps of binarization and form line removal. The degraded image is modeled by a Markov random field (MRF) where the hidden-layer prior probability is learned from a training set of high-quality binarized images and the observation probability density is learned on-the-fly from the gray-level histogram of the input image. We have modified the MRF model to drop the preprinted ruling lines from the image. We use the patch-based topology of the MRF and belief propagation (BP) for efficiency in processing. To further improve the processing speed, we prune unlikely solutions from the search space while solving the MRF. Experimental results show higher accuracy on two data sets of degraded handwritten images than previously used methods.

## Summary

### 1 INTRODUCTION

- The goal of this paper is the preprocessing of degraded handwritten document images such as carbon forms for subsequent recognition and retrieval.
- This is largely due to the extremely low image quality.
- People tend to write lightly at the turns of strokes.
- Therefore, binarizing the carbon copy images of handwritten documents is very challenging.
- The authors can learn the observation model on the fly from the local histogram of the test image.

### 2.1 Locally Adaptive Methods for Binarization

- By assuming that the background changes slowly, the problem of varying illumination is solved by adaptive binarization algorithms such as Niblack [15] and Sauvola [18].
- The idea is to determine the threshold locally, using histogram analysis, statistical measures (mean, variance, etc.), or the intensity of the extracted background.
- The resulting blurring affects handwriting recognition accuracy.
- Approaches of heuristic analysis of local connectivity, such as Kamel/Zhao [11], Yang/Yan [21], and Milewski/Govindaraju [14], solve the problem to some extent by searching for stroke locations and targeting only nonstroke areas.
- In all of these approaches, the spatial constraints applied to the images are determined by a heuristic.
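
The locally adaptive thresholds named above can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated in the paper: it uses the textbook forms of the Niblack threshold (`T = m + k*s`) and the Sauvola threshold (`T = m * (1 + k*(s/R - 1))`), where `m` and `s` are the mean and standard deviation of a local window. The function names, the window size, and the constants `k = -0.2`, `k = 0.5`, and `R = 128` are assumptions (commonly quoted defaults), and the image is a plain list of lists with dark ink on a bright background.

```python
from statistics import mean, pstdev


def local_stats(img, r, c, half):
    """Mean and standard deviation of the window centred on (r, c), clipped at the borders."""
    rows, cols = len(img), len(img[0])
    window = [img[i][j]
              for i in range(max(0, r - half), min(rows, r + half + 1))
              for j in range(max(0, c - half), min(cols, c + half + 1))]
    return mean(window), pstdev(window)


def niblack_threshold(m, s, k=-0.2):
    # Niblack: threshold sits k standard deviations away from the local mean.
    return m + k * s


def sauvola_threshold(m, s, k=0.5, R=128.0):
    # Sauvola: the threshold is lowered in flat (low-variance) regions.
    return m * (1.0 + k * (s / R - 1.0))


def binarize(img, threshold_fn, half=1):
    """Return a 0/1 image where 1 marks pixels darker than their local threshold (ink)."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            m, s = local_stats(img, r, c, half)
            row.append(1 if img[r][c] < threshold_fn(m, s) else 0)
        out.append(row)
    return out


# A single dark pixel on a bright background.
img = [[200, 200, 200],
       [200, 50, 200],
       [200, 200, 200]]
niblack_result = binarize(img, niblack_threshold)
sauvola_result = binarize(img, sauvola_threshold)
```

Both thresholds depend only on window statistics, which is why these methods carry no explicit model of stroke shape; that gap is what the heuristic connectivity approaches and the MRF prior try to fill.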

### 2.2 The Markov Random Field for Binarization

- In recent years, inspired by the success of applying the MRF to image restoration [4], [5], [6], attempts have been made to apply MRF to the preprocessing of degraded document images [7], [8], [20].
- Wolf and Doermann [20] defined the prior model on a 4 × 4 clique, which is appropriate for textual images in low-resolution video.
- Gupta et al. [7], [8] studied the restoration and binarization of blurred images of license plate digits.
- They adopted the factorized style of MRF using the product of compatibility functions [4], [5], [6], which are defined as mixtures of multivariate normal distributions computed over samples of the training set.
- The authors describe an MRF adapted for handling handwritten documents that overcomes the computational challenges caused by high-resolution data and low accuracy rates of current handwriting recognizers.

### 2.3 Ruling Line Removal

- The process of removing preprinted ruling lines while preserving the overlapping textual matter is referred to as image in-painting (Fig. 1) and is performed by inferring the removed overlapping portion of images from spatial constraints.
- Previously reported work on line removal in document images uses heuristics [1], [14], [23].
- Bai and Huo [1] remove the underline in machine-printed documents by estimating its width.
- Yoo et al. [23] describe a sophisticated method that classifies the missing parts of strokes into different categories such as horizontal, vertical, and diagonal and connects them with runs (of black pixels) in the corresponding directions.
- It relies on many heuristic rules and is not accurate when strokes are lightly touching the ruling line.

### 3 MARKOV RANDOM FIELD MODEL FOR

- The authors use an MRF model (Fig. 2) with the same topology as the one described in [5].
- Each binarized patch conditionally depends on its four neighboring binarized patches in both the horizontal and vertical directions, and each observed patch conditionally depends on its corresponding binarized patch.
- An edge in the graph represents the conditional dependence of two vertices.
- It is impossible to compute either (3) or (4) directly for large graphs because the computation grows exponentially as the number of vertices increases.

### 4.1 Belief Propagation

- An iteration only involves local computation between the neighboring vertices.
- The formulas for the BP algorithm for MAP estimation are similar to (8) and (9) except that $\sum_{x_j} x_j$ and $\sum_{x_k}$ are replaced with $\arg\max_{x_j}$ and $\max_{x_k}$, respectively: $\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \phi(x_j, y_j) \prod_k M_j^k$ (10).
- The pairwise compatibility functions $\psi$ and $\phi$ are usually heuristically defined as functions with the distance between two patches as the variable.
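
The max-product form of BP described above can be illustrated on a toy graph. The following is a minimal sketch under stated assumptions, not the authors' implementation: states are two abstract labels rather than codebook patches, `phi[j][x]` stands for the local evidence at node `j`, `psi[x][x']` is one symmetric pairwise compatibility table shared by all edges, and the names `max_product_bp` and `brute_force_map` are hypothetical. Messages are normalized each iteration for numerical stability, which does not change the arg max.

```python
import itertools


def max_product_bp(phi, psi, edges, n_iters=5):
    """Per-node MAP labels from max-product belief propagation on a pairwise graph."""
    n, n_states = len(phi), len(phi[0])
    neighbors = {j: [] for j in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # msgs[(k, j)][x_j] is the message sent from node k to node j.
    msgs = {(k, j): [1.0] * n_states for j in neighbors for k in neighbors[j]}
    for _ in range(n_iters):
        new_msgs = {}
        for (k, j) in msgs:
            out = []
            for xj in range(n_states):
                best = 0.0
                for xk in range(n_states):
                    val = psi[xj][xk] * phi[k][xk]
                    for l in neighbors[k]:
                        if l != j:  # exclude the message that came from j itself
                            val *= msgs[(l, k)][xk]
                    best = max(best, val)
                out.append(best)
            total = sum(out)
            new_msgs[(k, j)] = [v / total for v in out]
        msgs = new_msgs
    estimate = []
    for j in range(n):
        belief = list(phi[j])
        for k in neighbors[j]:
            belief = [b * m for b, m in zip(belief, msgs[(k, j)])]
        estimate.append(max(range(n_states), key=belief.__getitem__))
    return estimate


def brute_force_map(phi, psi, edges):
    """Exhaustive MAP search, feasible only for tiny graphs; used here as a check."""
    n, n_states = len(phi), len(phi[0])

    def score(x):
        s = 1.0
        for j in range(n):
            s *= phi[j][x[j]]
        for a, b in edges:
            s *= psi[x[a]][x[b]]
        return s

    return list(max(itertools.product(range(n_states), repeat=n), key=score))


# Four nodes in a chain; node 1 locally prefers label 1 but its neighbours disagree.
phi = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.8, 0.2]]
psi = [[0.8, 0.2], [0.2, 0.8]]
edges = [(0, 1), (1, 2), (2, 3)]
bp_labels = max_product_bp(phi, psi, edges)
exact_labels = brute_force_map(phi, psi, edges)
```

On this chain both functions agree, and node 1 is relabeled to match its neighbors even though its own evidence favors the other label. The chain is a tree, so BP is exact here; on the loopy grid used for images it is an approximation.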

### 4.3 Learning the Observation Model $\Pr(y_j \mid x_j)$

- The observation model on the pixel level can be estimated from the distribution of gray-scale densities of pixels [20].
- The authors' algorithm is described as follows: 1. Background extraction.
- The authors mark the background pixels in the original image using the binarized image and estimate the mean $\mu_{b0}$ and variance $\sigma^2_{b0}$ of density $p_b$ from the extracted background pixels.
- EM algorithm for estimating the 2-GMM.
- The p.d.f. estimation algorithm using the EM algorithm has an advantage over the algorithms using Niblack thresholding because it avoids the problem of sharply cutting the histogram and has a smoother estimation at the intersection of two Gaussian distributions.
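
The EM step for a two-component Gaussian mixture (2-GMM) over gray levels can be sketched as follows. This is a generic one-dimensional EM loop, not the authors' exact procedure: the initial means and variances are passed in by the caller (in the paper they come from the background extraction step), the function name `em_two_gaussians` is hypothetical, and the synthetic gray levels in the usage lines are invented for illustration.

```python
import math
import random


def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(samples, mu_init, var_init, weight_init=(0.5, 0.5), n_iters=50):
    """Fit a 2-component 1-D Gaussian mixture; returns (means, variances, weights)."""
    mu, var, w = list(mu_init), list(var_init), list(weight_init)
    for _ in range(n_iters):
        # E-step: posterior responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w[c] * gaussian_pdf(x, mu[c], var[c]) for c in range(2)]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and variance from the responsibilities.
        for c in range(2):
            nc = sum(r[c] for r in resp)
            mu[c] = sum(r[c] * x for r, x in zip(resp, samples)) / nc
            var[c] = max(sum(r[c] * (x - mu[c]) ** 2 for r, x in zip(resp, samples)) / nc, 1e-6)
            w[c] = nc / len(samples)
    return mu, var, w


# Synthetic gray levels: dark strokes around 60, bright background around 190.
random.seed(0)
samples = ([random.gauss(60, 10) for _ in range(300)]
           + [random.gauss(190, 15) for _ in range(700)])
means, variances, weights = em_two_gaussians(samples, mu_init=(80, 160), var_init=(400, 400))
```

Because each pixel contributes to both components in proportion to its responsibility, the fitted densities overlap smoothly near the crossover point instead of being cut at a hard threshold, which is the advantage the summary attributes to the EM estimate.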

### 4.4 Ruling Line Removal

- First, the ruling lines are located by template matching; this is relatively straightforward to implement because of the fixed form layout and is true for most types of forms in other applications as well.
- The authors replace (24) with (29) for the compound tasks of binarization and line removal.

### 4.5 Pruning the Search Space of MRF Inference

- To this point, MRF-based preprocessing has been presented as a self-contained general-purpose algorithm.
- From the above analysis, the authors have the following two-step strategy to accelerate the algorithm: 1. Find a global threshold $\mathrm{thr}_{\mathrm{prune}}$ such that 90 percent of the pixels in the test image are below $\mathrm{thr}_{\mathrm{prune}}$.
- If $\mathrm{PRUNE}_j(l)$ is true, $C_l$ is pruned from the search space for solving $x_j$.
- For the patches that contain pixels to in-paint, $\mathrm{Pr}_{\min}$ should be smaller than the prior probability of any state in the codebook, i.e., $\mathrm{Pr}_{\min} < \min_l \Pr(C_l)$, so that no state is pruned in the first iteration of BP.
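
Two pieces of this strategy can be made concrete. The sketch below is illustrative only: `percentile_threshold` finds a gray level with a given percentage of pixels at or below it, and `surviving_states` applies a hypothetical rule that keeps a state when its probability is at least `pr_min`. The actual predicate PRUNE_j(l) is not reproduced in this summary, so that rule, both function names, and the example priors are assumptions.

```python
import math


def percentile_threshold(pixels, percent=90):
    """Smallest gray level t such that at least `percent` percent of pixels are <= t."""
    ordered = sorted(pixels)
    index = max(0, math.ceil(percent * len(ordered) / 100) - 1)
    return ordered[index]


def surviving_states(state_probs, pr_min):
    """Hypothetical pruning rule: keep the indices of states with probability >= pr_min."""
    return [l for l, p in enumerate(state_probs) if p >= pr_min]


# Gray levels 1..100: ninety of the hundred values are at or below 90.
thr_prune = percentile_threshold(list(range(1, 101)), percent=90)

# Invented prior probabilities for a four-state codebook.
priors = [0.5, 0.3, 0.15, 0.05]
kept_when_below_min = surviving_states(priors, pr_min=0.01)  # pr_min < min prior
kept_when_aggressive = surviving_states(priors, pr_min=0.2)
```

With `pr_min` below the smallest prior, every state survives, which matches the requirement for patches that need in-painting. Raising `pr_min` trades candidate states for speed.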

### 5.1 Test Data Sets

- The authors' test data include the PCR carbon forms and handwriting images from the IAM database 3.0 [13]: 1. PCR forms.
- In New York state, all patients who enter the Emergency Medical System (EMS) are tracked through their prehospital care to the emergency room using the PCRs.
- The PCR is used to gather vital patient information.
- D. Medical lexicons of words are large (more than 4,000 entries).
- The IAM database contains high-quality images of unconstrained handwritten English text, which were scanned as gray-scale images at 300 dpi.

### 5.2 Display of Preprocessing Results

- First, the authors applied their algorithm to the input image shown in Fig.
- By aligning the input image with a template form image, rough estimates of the positions of lines and unwanted machine-printed blocks are obtained.
- The authors' test images and the images for training the prior model are from different writers.
- After the first iteration, the message has not yet been passed between neighbors.
- After four iterations, nearly all of the strokes are restored, although a few tiny artifacts are still visible.

### 5.3 Results of Acceleration: Speed versus Accuracy

- The authors have tested the effect of different values of the parameter $\mathrm{Pr}_{\min}$ on the speed and accuracy of their algorithm using the PCR carbon form image in Fig. 7 and the IAM handwriting image in Fig. 10.
- To compare the results obtained by their algorithm with different values of $\mathrm{Pr}_{\min}$, the authors took the output images for $\mathrm{Pr}_{\min} = 0$ (which indicates no speedup) as reference images and counted the pixels in the output images for the other values of $\mathrm{Pr}_{\min}$ that differ from the reference images.
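
The comparison against the unpruned reference output amounts to a pixel-wise difference count. A minimal sketch, assuming binary images stored as equally sized lists of lists; the function names and the tiny example images are hypothetical:

```python
def count_differing_pixels(output, reference):
    """Number of pixel positions where two equally sized images disagree."""
    same_shape = len(output) == len(reference) and all(
        len(a) == len(b) for a, b in zip(output, reference))
    if not same_shape:
        raise ValueError("images must have the same shape")
    return sum(o != r
               for row_o, row_r in zip(output, reference)
               for o, r in zip(row_o, row_r))


def difference_rate(output, reference):
    """Fraction of pixels that differ from the reference image."""
    total = sum(len(row) for row in reference)
    return count_differing_pixels(output, reference) / total


reference = [[0, 1, 1],
             [0, 0, 1]]   # stands in for the output with no pruning
pruned = [[0, 1, 0],
          [0, 0, 1]]      # stands in for the output with pruning enabled
n_diff = count_differing_pixels(pruned, reference)
rate = difference_rate(pruned, reference)
```

Reporting the rate alongside the run time for each pruning setting gives the speed versus accuracy trade-off this section describes.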

### 5.4 Comparison to Other Preprocessing Methods

- In Fig. 11, the authors compare their approach with the preprocessing algorithm of Milewski and Govindaraju [14], the Niblack algorithm [15], and the Otsu algorithm [16].
- The Niblack and Otsu algorithms are for binarization only.
- From the result of the MRF-based algorithm, the text “67 yo , pt found” is clear and the text “MFG X ray” is obscured but some letters are still legible.
- Set #1 contains 1,203 word images that are not affected by overlapping form lines, i.e., no intersection of stroke and line.
- The word recognition rates of the original images among all three methods are very close.

### 6 CONCLUSIONS

- The authors have presented a novel method for binarizing degraded document images containing handwriting and removing preprinted form lines.
- In their MRF model, the authors reduce the large search space of the prior model to a class of 114 representatives by VQ and learn the observation model directly from the input image.
- The authors' work is the first attempt at applying a stochastic method to the preprocessing of degraded high-resolution handwritten documents.
- The authors' model is targeted toward document images and therefore may not handle large variations in illumination, complex backgrounds, and blurring that are common in video and scene text processing.




##### Frequently Asked Questions (2)

###### Q2. What future work is proposed in "Preprocessing of Low-Quality Handwritten Documents Using Markov Random Fields"?

The authors will investigate approaches to generalize their model to video and scene text processing in their future work.