Showing papers in "International Journal on Document Analysis and Recognition in 2018"
TL;DR: A learning-based method for handwritten text line segmentation in document images using a variant of deep fully convolutional networks (FCNs) with dilated convolutions that outperforms the most popular variants of FCN, based on deconvolution or unpooling layers, on a public dataset.
Abstract: We present a learning-based method for handwritten text line segmentation in document images. Our approach relies on a variant of deep fully convolutional networks (FCNs) with dilated convolutions. Dilated convolutions make it possible to preserve the input resolution throughout the network and to produce a pixel-level labeling. The FCN is trained to predict an X-height labeling as the text line representation, which has many advantages for text recognition. We show that our approach outperforms the most popular variants of FCN, based on deconvolution or unpooling layers, on a public dataset. We also provide results investigating various settings, and we conclude with a comparison of our model with recent approaches defined as part of the cBAD international competition (https://scriptnet.iit.demokritos.gr/competitions/5/), leading us to a 91.3% F-measure.
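To make the dilated-convolution idea concrete, here is a minimal sketch (assuming PyTorch) of an FCN that widens its receptive field through increasing dilation rates instead of pooling, so the output label map keeps the input resolution. Channel widths, dilation rates, and the two-class output are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a dilated fully convolutional network that keeps the input
# resolution and emits a per-pixel text/background label map.
import torch
import torch.nn as nn

class DilatedFCN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        layers, in_ch = [], 1
        # Increasing dilation widens the receptive field without any pooling,
        # so the spatial resolution is never reduced.
        for out_ch, dilation in [(32, 1), (32, 2), (64, 4), (64, 8)]:
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(in_ch, n_classes, 1)  # 1x1 conv -> pixel-level logits

    def forward(self, x):
        return self.classifier(self.features(x))

# A grayscale page maps to a same-sized label map (e.g. an x-height mask).
logits = DilatedFCN()(torch.randn(1, 1, 256, 256))   # -> (1, 2, 256, 256)
```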
69 citations
TL;DR: In this paper, a fixed-sized representation is learned from variable-sized signatures by modifying the network architecture with spatial pyramid pooling, achieving results comparable with the state of the art on handwritten signature verification.
Abstract: Methods for learning feature representations for offline handwritten signature verification have been successfully proposed in recent literature, using deep convolutional neural networks to learn representations from signature pixels. Such methods reported large performance improvements compared to handcrafted feature extractors. However, they also introduced an important constraint: the inputs to the neural networks must have a fixed size, while signatures vary significantly in size between different users. In this paper, we address this issue by modifying the network architecture with spatial pyramid pooling, so that a fixed-sized representation is learned from variable-sized signatures. We also investigate the impact of the resolution of the images used for training and the impact of adapting (fine-tuning) the representations to new operating conditions (different acquisition protocols, such as writing instruments and scan resolution). On the GPDS dataset, we achieve results comparable with the state of the art, while removing the constraint of having a maximum size for the signatures to be processed. We also show that using higher resolutions (300 or 600 dpi) can improve performance when skilled forgeries from a subset of users are available for feature learning, but lower resolutions (around 100 dpi) can be used if only genuine signatures are used. Lastly, we show that fine-tuning can improve performance when the operating conditions change.
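The key architectural change described above can be illustrated with a small sketch (assuming PyTorch): a spatial pyramid pooling layer that maps convolutional feature maps of any spatial size to a fixed-length vector. The pyramid levels and channel count below are assumptions for illustration, not the paper's settings.

```python
# Spatial pyramid pooling: variable-sized feature maps are pooled into a
# fixed-length vector, so signatures of different sizes yield representations
# of the same dimensionality.
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(k) for k in levels)

    def forward(self, feature_map):                     # (N, C, H, W), any H and W
        pooled = [p(feature_map).flatten(1) for p in self.pools]
        return torch.cat(pooled, dim=1)                 # (N, C * (1 + 4 + 16))

spp = SpatialPyramidPooling()
small = spp(torch.randn(1, 64, 20, 50))
large = spp(torch.randn(1, 64, 37, 90))
assert small.shape == large.shape                       # fixed-size output either way
```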
62 citations
TL;DR: By taking a probabilistic perspective on training CNNs, this work derives two different loss functions for binary and real-valued word string embeddings and proposes two different CNN architectures, specifically designed for word spotting.
Abstract: Word spotting has become a field of strong research interest in document image analysis over the last years. Recently, AttributeSVMs were proposed, which predict a binary attribute representation (Almazan et al. in IEEE Trans Pattern Anal Mach Intell 36(12):2552–2566, 2014). At the time, this influential method defined the state of the art in segmentation-based word spotting. In this work, we present an approach for learning attribute representations with convolutional neural networks (CNNs). By taking a probabilistic perspective on training CNNs, we derive two different loss functions for binary and real-valued word string embeddings. In addition, we propose two different CNN architectures, specifically designed for word spotting. These architectures can be trained in an end-to-end fashion. In a number of experiments, we investigate the influence of different word string embeddings and optimization strategies. We show that our attribute CNNs achieve state-of-the-art results for segmentation-based word spotting on a large variety of data sets.
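For the binary word string embedding case, the probabilistic view amounts to treating each attribute as an independent Bernoulli variable, which leads to a per-attribute cross-entropy loss. The sketch below (assuming PyTorch) illustrates only that training objective; the backbone, attribute dimensionality, and batch are placeholder assumptions.

```python
# Binary-attribute training objective: a CNN head predicts one logit per
# attribute (e.g. a PHOC bit) and is trained with per-attribute binary
# cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_attributes = 604                                  # e.g. the size of a PHOC embedding
head = nn.Linear(512, n_attributes)                 # sits on top of some CNN backbone

features = torch.randn(8, 512)                      # pooled CNN features for 8 word images
targets = torch.randint(0, 2, (8, n_attributes)).float()  # binary word string embeddings

logits = head(features)
loss = F.binary_cross_entropy_with_logits(logits, targets)  # Bernoulli per attribute
loss.backward()
```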
47 citations
TL;DR: This work proposes a novel technique called weighted average pooling for reducing the parameters in the fully connected layer without loss in accuracy in state-of-the-art CNNs and implements a cascaded model in a single CNN by adding an intermediate output to complete recognition as early as possible, which reduces average inference time significantly.
Abstract: Deep convolutional neural network-based methods have brought a great breakthrough in image classification, providing an end-to-end solution to the handwritten Chinese character recognition (HCCR) problem by learning discriminative features automatically. Nevertheless, state-of-the-art CNNs incur a huge computational cost and require the storage of a large number of parameters, especially in the fully connected layers, which makes it difficult to deploy such networks on hardware devices with limited computation capacity. To solve the storage problem, we propose a novel technique called weighted average pooling for reducing the parameters in the fully connected layer without loss in accuracy. In addition, we implement a cascaded model in a single CNN by adding an intermediate output so that recognition can complete as early as possible, which reduces average inference time significantly. Experiments are performed on the ICDAR-2013 offline HCCR dataset. Our proposed approach needs only 6.9 ms on average to classify a character image, achieves a state-of-the-art accuracy of 97.1%, and requires only 3.3 MB of storage.
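One plausible reading of weighted average pooling, sketched below (assuming PyTorch), is a learned spatial weight map that collapses each feature map to a single value per channel, so the classifier only needs C-dimensional inputs instead of a flattened C×H×W vector. The exact weighting scheme, feature map size, and class count are assumptions, not the paper's specification.

```python
# Weighted average pooling as a low-parameter substitute for a large fully
# connected layer: each spatial position gets a learned weight, and the
# feature map is collapsed to one value per channel.
import torch
import torch.nn as nn

class WeightedAveragePooling(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(1, 1, height, width))  # only H*W parameters

    def forward(self, x):                              # x: (N, C, H, W)
        w = torch.softmax(self.weights.flatten(2), dim=2).view_as(self.weights)
        return (x * w).sum(dim=(2, 3))                 # (N, C) feature vector

pool = WeightedAveragePooling(8, 8)
vec = pool(torch.randn(4, 256, 8, 8))                  # 256-dim descriptor per character image
classifier = nn.Linear(256, 3755)                      # e.g. the 3,755 classes of the ICDAR-2013 task
logits = classifier(vec)
```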
46 citations
TL;DR: The technical challenges of performing text/non-text separation are summarized, and offline document images are categorized into different classes according to the nature of the challenges one faces, in an attempt to provide insight into various techniques presented in the literature.
Abstract: Separation of text and non-text is an essential processing step for any document analysis system. Therefore, it is important to have a clear understanding of the state of the art of text/non-text separation in order to facilitate the development of efficient document processing systems. This paper first summarizes the technical challenges of performing text/non-text separation. It then categorizes offline document images into different classes according to the nature of the challenges one faces, in an attempt to provide insight into various techniques presented in the literature. The pros and cons of various techniques are explained wherever possible. Along with evaluation protocols and benchmark databases, this paper also presents a performance comparison of different methods. Finally, this article highlights future research challenges and directions in this domain.
46 citations
TL;DR: An effective segmentation-free approach using a hybrid neural network hidden Markov model (NN-HMM) for offline handwritten Chinese text recognition (HCTR) and a deep convolutional neural network with automatically learned discriminative features demonstrates its superiority in the HMM framework.
Abstract: This paper proposes an effective segmentation-free approach using a hybrid neural network hidden Markov model (NN-HMM) for offline handwritten Chinese text recognition (HCTR). In the general Bayesian framework, the handwritten Chinese text line is sequentially modeled by HMMs, each representing one character class, while an NN-based classifier is adopted to calculate the posterior probabilities of all HMM states. The key issues in feature extraction, character modeling, and language modeling are comprehensively investigated to show the effectiveness of the NN-HMM framework for offline HCTR. First, a conventional deep neural network (DNN) architecture is studied with a well-designed feature extractor. As for the training procedure, label refinement using forced alignment and sequence training yield significant gains on top of the frame-level cross-entropy criterion. Second, a deep convolutional neural network (DCNN) with automatically learned discriminative features demonstrates its superiority to the DNN in the HMM framework. Moreover, to address the challenging problem of distinguishing highly confusable classes due to the large vocabulary of Chinese characters, the NN-based classifier outputs 19,900 HMM states as classification units via high-resolution modeling within each character. On the ICDAR 2013 competition task of the CASIA-HWDB database, DNN-HMM yields a promising character error rate (CER) of 5.24% by making a good trade-off between computational complexity and recognition accuracy. To the best of our knowledge, DCNN-HMM achieves the best published CER of 3.53%.
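The hybrid NN-HMM decoding step that the abstract relies on is the standard one: per-frame state posteriors from the network are divided by state priors to obtain scaled likelihoods for the HMM decoder. A minimal sketch (assuming NumPy) is shown below; the frame count and the random posteriors are placeholders, and only the 19,900-state output size comes from the abstract.

```python
# Standard hybrid NN-HMM step: convert per-frame state posteriors into scaled
# likelihoods that an HMM (Viterbi) decoder can consume.
import numpy as np

n_frames, n_states = 120, 19900                 # 19,900 tied HMM states as in the paper
posteriors = np.random.dirichlet(np.ones(n_states), size=n_frames)  # stand-in for NN output
state_priors = posteriors.mean(axis=0)          # usually estimated from forced alignments

# p(x_t | s) is proportional to p(s | x_t) / p(s); work in log space for decoding.
log_scaled_likelihoods = np.log(posteriors + 1e-12) - np.log(state_priors + 1e-12)
print(log_scaled_likelihoods.shape)             # (120, 19900), fed to Viterbi decoding
```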
33 citations
TL;DR: Compared with five other classical algorithms, the images binarized using the proposed algorithm achieved the highest F-measure and peak signal-to-noise ratio and obtained the highest correct rate of recognition.
Abstract: Because of different types of document degradation such as uneven illumination, image contrast variation, blur caused by humidity, and bleed-through, degraded document image binarization is still an enormous challenge. This paper presents a new binarization method for degraded document images. The proposed algorithm focuses on the differences in grayscale contrast between different areas of the image. A quadtree is used to divide the image into areas adaptively. In addition, various contrast enhancements are selected to adjust the local grayscale contrast in areas with different contrasts. Finally, the local threshold is taken as the mean of the foreground and background gray values, which are determined by the frequency of the gray values. The proposed algorithm was tested on the datasets from the Document Image Binarization Contest series (DIBCO 2009, H-DIBCO 2010, DIBCO 2011, and H-DIBCO 2012). Compared with five other classical algorithms, the images binarized using the proposed algorithm achieved the highest F-measure and peak signal-to-noise ratio and obtained the highest correct rate of recognition.
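A hedged sketch of the quadtree step (in Python with NumPy) is given below: a block is split into four quadrants whenever its grayscale contrast is high, producing leaf blocks that can each receive their own contrast enhancement and threshold. The contrast criterion, threshold value, and minimum block size are placeholder assumptions, not the paper's exact rules.

```python
# Adaptive quadtree division of a grayscale page into blocks of roughly
# homogeneous contrast; each leaf block would later be binarized locally.
import numpy as np

def quadtree_blocks(img, y=0, x=0, min_size=32, contrast_thresh=60):
    h, w = img.shape
    contrast = int(img.max()) - int(img.min())
    if contrast <= contrast_thresh or h <= min_size or w <= min_size:
        return [(y, x, h, w, contrast)]            # leaf block: handle locally later
    hh, hw = h // 2, w // 2
    blocks = []
    for dy, dx in [(0, 0), (0, hw), (hh, 0), (hh, hw)]:
        sub = img[dy:dy + (hh if dy == 0 else h - hh), dx:dx + (hw if dx == 0 else w - hw)]
        blocks += quadtree_blocks(sub, y + dy, x + dx, min_size, contrast_thresh)
    return blocks

page = (np.random.rand(256, 256) * 255).astype(np.uint8)
leaves = quadtree_blocks(page)
print(len(leaves), "blocks with locally adapted contrast")
```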
21 citations
TL;DR: This paper provides a statistical Arabic language model and post-processing techniques based on hybridizing the error model approach with the context approach and shows that the proposed hybrid system outperforms the rule-based system.
Abstract: Optical character recognition (OCR) is the process of recognizing characters automatically from scanned documents for editing, indexing, searching, and reducing the storage space. The text produced by OCR usually does not match the text in the original document. In order to minimize the number of incorrect words in the obtained text, OCR post-processing approaches can be used. Correcting OCR errors is more complicated for the Arabic language because of its complexity: letters are connected, different letters may have the same shape, and the same letter may have different forms. This paper provides a statistical Arabic language model and post-processing techniques based on hybridizing the error model approach with the context approach. The proposed model is language independent and not constrained by string length. To the best of our knowledge, this is the first end-to-end OCR post-processing model applied to the Arabic language. In order to train the proposed model, we build an Arabic OCR context database containing 9,000 images of Arabic text. The evaluation of the OCR post-processing results is automated using our novel alignment technique, called fast automatic hashing text alignment. Our experimental results show that the rule-based system improves the word error rate from 24.02% to 20.26% using a training set of 1,000 images. After this training, we apply the rule-based system to 500 test images, and the word error rate improves from 14.95% to 14.53%. The proposed hybrid OCR post-processing system, using the same 1,000 training images, improves the word error rate from 24.02% to 18.96%. After training the hybrid system, we used 500 images for testing, and the word error rate improved from 14.95% to 14.42%. These results show that the proposed hybrid system outperforms the rule-based system.
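To illustrate the general shape of hybridizing an error model with a context model, the sketch below scores correction candidates for an OCR token by combining a character-similarity score with a corpus-frequency score. The toy lexicon, the similarity measure, the unigram context proxy, and the mixing weight are all illustrative assumptions, not the paper's actual Arabic components.

```python
# Generic error-model + context-model scoring of OCR correction candidates.
from difflib import SequenceMatcher

lexicon_counts = {"document": 120, "documents": 80, "documen": 1}   # toy unigram counts
total = sum(lexicon_counts.values())

def error_score(ocr_token, candidate):
    return SequenceMatcher(None, ocr_token, candidate).ratio()      # higher = closer

def context_score(candidate):
    return lexicon_counts.get(candidate, 0) / total                 # unigram proxy for context

def correct(ocr_token, alpha=0.7):
    scored = [(alpha * error_score(ocr_token, c) + (1 - alpha) * context_score(c), c)
              for c in lexicon_counts]
    return max(scored)[1]

print(correct("documeot"))    # -> "document"
```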
19 citations
TL;DR: KERTAS is a new dataset of historical documents that can help researchers, historians and paleographers to automatically date Arabic manuscripts more accurately and efficiently.
Abstract: The age of a historical manuscript can be an invaluable source of information for paleographers and historians. The process of automatic manuscript age detection has inherent complexities, which are compounded by the lack of suitable datasets for algorithm testing; existing datasets rarely provide reliable writing dates and author identities as metadata. This paper presents KERTAS, a dataset of historical handwritten Arabic manuscripts designed specifically to test state-of-the-art authorship and age detection algorithms. The Qatar National Library has been the main source of manuscripts for this dataset, while the remaining manuscripts are open source. The dataset consists of over 2,000 images taken from various handwritten Arabic manuscripts spanning fourteen centuries. In addition, a sparse representation-based approach for dating historical Arabic manuscripts is proposed. KERTAS can help researchers, historians, and paleographers date Arabic manuscripts automatically, more accurately and efficiently.
18 citations
TL;DR: The initial studies confirm that the proposed hybrid CNN architecture based on scattering feature maps could perform better than the equivalent self-learning architecture of CNN on handwritten character recognition problems.
Abstract: Convolutional neural network (CNN)-based deep learning architectures are the state of the art in image-based pattern recognition applications. The receptive filter fields in convolutional layers are learned automatically from training data during classifier learning. There are a number of well-defined, well-studied, and proven filters in the literature that can extract informative content from input patterns. This paper focuses on utilizing scattering transform-based wavelet filters as the first-layer convolutional filters in a CNN architecture. The scattering networks are generated by a series of scattering transform operations, and the scattering coefficients generated in the first few layers are effective in capturing the dominant energy contained in the input data patterns. The present work replaces the first-layer convolutional feature maps in the CNN architecture with scattering feature maps, which is equivalent to using scattering wavelet filters as the first-layer receptive fields. The proposed hybrid CNN architecture is evaluated on Malayalam handwritten character recognition, a challenging multi-class classification problem. The initial studies confirm that the proposed hybrid CNN architecture based on scattering feature maps could perform better than the equivalent self-learning CNN architecture on handwritten character recognition problems.
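The architectural idea of replacing learned first-layer filters with predefined wavelet-like filters can be sketched as below (assuming PyTorch): a frozen Gabor filter bank serves as the first-layer receptive fields, and only the layers above it are trained. A true scattering network also applies a modulus nonlinearity and cascades filters across scales, which this sketch omits; the filter parameters and class count are assumptions.

```python
# Fixed wavelet-like first-layer filters: the first conv layer is frozen, only
# the layers on top of it are trained.
import math
import torch
import torch.nn as nn

def gabor_bank(n_orientations=8, size=7, sigma=2.0, lam=4.0):
    ys, xs = torch.meshgrid(torch.arange(size) - size // 2,
                            torch.arange(size) - size // 2, indexing="ij")
    filters = []
    for k in range(n_orientations):
        theta = math.pi * k / n_orientations
        xr = xs * math.cos(theta) + ys * math.sin(theta)
        yr = -xs * math.sin(theta) + ys * math.cos(theta)
        g = torch.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * torch.cos(2 * math.pi * xr / lam)
        filters.append(g)
    return torch.stack(filters).unsqueeze(1).float()          # (8, 1, 7, 7)

first_layer = nn.Conv2d(1, 8, 7, padding=3, bias=False)
first_layer.weight.data = gabor_bank()
first_layer.weight.requires_grad = False                      # fixed, not learned

trainable_top = nn.Sequential(nn.ReLU(), nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 44))
features = first_layer(torch.randn(2, 1, 32, 32))
logits = trainable_top(features)                              # e.g. 44 character classes
```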
16 citations
TL;DR: This work proposes a new neural model which directly predicts object coordinates and is more powerful than the state of the art in applications where training data are not as abundant as in the classical configuration of natural images and Imagenet/Pascal-VOC tasks.
Abstract: The current trend in object detection and localization is to learn predictions with high-capacity deep neural networks trained on very large amounts of annotated data, using substantial processing power. In this work, we specifically target the detection of text in document images and propose a new neural model which directly predicts object coordinates. The particularity of our contribution lies in the local computation of predictions with a new form of local parameter sharing that keeps the overall number of trainable parameters low. Key components of the model are spatial 2D-LSTM recurrent layers which convey contextual information between the regions of the image. We show that this model is more powerful than the state of the art in applications where training data are not as abundant as in the classical configuration of natural images and ImageNet/Pascal-VOC tasks. The proposed model also facilitates the detection of many objects in a single image and can deal with inputs of variable sizes without resizing. To enhance the localization precision of the coordinate regressor, we limit the amount of information produced by the local model components and propose two different regression strategies: (i) separately predict the lower-left and upper-right corners of each object bounding box, followed by combinatorial pairing; (ii) only predict the left side of the objects and estimate the right position jointly with text recognition. These strategies lead to good full-page text recognition results in heterogeneous documents. Experiments have been performed on a document analysis task, the localization of text lines in the Maurdor dataset.
TL;DR: A state-of-the-art binarization algorithm is modified and it is seen that for the chosen algorithm, machine learning-based parameter tuning improves the execution performance more than heterogeneous computing, when comparing absolute execution times.
Abstract: In the context of historical document analysis, image binarization is a first important step, which separates foreground from background, despite common image degradations, such as faded ink, stain ...
TL;DR: Deep learning has achieved great success in various applications of pattern recognition and artificial intelligence, including character and text recognition, image segmentation, object detection and recognition, face recognition, traffic sign recognition, speech recognition, machine translation, and more.
Abstract: Deep learning is a new field of machine learning research concerned with designing models and learning algorithms for deep neural networks. Due to its ability to learn from big data and its superior representation and prediction performance, deep learning has achieved great success in various applications of pattern recognition and artificial intelligence, including character and text recognition, image segmentation, object detection and recognition, face recognition, traffic sign recognition, speech recognition, and machine translation, to name a few. Intensive attention has been drawn to the exploration of new deep learning models and algorithms, and to the extension to more application areas. The combination of deep learning with traditional methods in pattern recognition and artificial intelligence has also demonstrated benefits.
TL;DR: An augmented incremental recognition method for online handwritten mathematical expressions (MEs) that not only maintains recognition rate even compared with the batch recognition method but also reduces the waiting time to a very small level.
Abstract: This paper presents an augmented incremental recognition method for online handwritten mathematical expressions (MEs). If an ME is recognized only after all strokes are written (batch recognition), the waiting time increases significantly as the ME becomes longer. On the other hand, a purely incremental recognition method recognizes an ME whenever a new stroke is input; it shortens the waiting time but degrades the recognition rate due to the limited context. Thus, we propose an augmented incremental recognition method that maintains the advantages of the two methods while reducing their weaknesses. The proposed method has two main features: one is to process the latest stroke, and the other is to find erroneous segmentations and recognitions among the recent strokes and correct them. In the first process, segmentation and recognition by the Cocke–Younger–Kasami (CYK) algorithm are executed only for the latest stroke. In the second process, all previous segmentations are updated if they change significantly after the latest stroke is input, and then all symbols related to the updated segmentations are updated along with their recognition scores. These changes are reflected in the CYK table. In addition, the waiting time is further reduced by employing multi-threaded processing. Experiments on our dataset and the CROHME datasets show the effectiveness of the augmented incremental recognition method, which maintains the recognition rate even compared with the batch recognition method while reducing the waiting time to a very small level.
TL;DR: Experimental results showed that the proposed graph partitioning-based character segmentation method achieved high correct segmentation rate and outperformed existing methods for the Lanna Dhamma alphabet.
Abstract: Character segmentation is an important task in optical character recognition (OCR). The quality of any OCR system is highly dependent on its character segmentation algorithm. Despite the various character segmentation methods proposed to date, existing methods cannot satisfactorily segment characters belonging to some complex writing styles, such as the Lanna Dhamma characters. In this paper, a new character segmentation method named graph partitioning-based character segmentation is proposed to address this problem. The proposed method can deal with multi-level writing styles as well as touching and broken characters, and it can be considered a generalization of existing approaches to multi-level writing styles. The proposed method consists of three phases. In the first phase, a newly devised over-segmentation technique based on the morphological skeleton is used to obtain redundant fragments of a word image. The fragments are then used to form a segmentation hypothesis graph. In the last phase, the hypothesis graph is partitioned into subgraphs, each corresponding to a segmented character, using a partitioning algorithm developed specifically for character segmentation. Experimental results on handwritten Lanna Dhamma character datasets show that the proposed method achieves a high correct segmentation rate and outperforms existing methods for the Lanna Dhamma alphabet.
TL;DR: The proposed method, called ECDP for “Ensemble-based classification of document patches,” segments the physical layout of the document, classifies image patches as containing text or graphics, assembles homogeneous document regions, and passes the text to an optical character recognition engine to convert into natural language.
Abstract: Raster-image PDF files originating from scanning or photographing paper documents are inaccessible to both text search engines and the screen readers that people with visual impairments use. We here focus on the relatively less-researched problem of converting raster-image files with Arabic script into machine-accessible documents. Our method, called ECDP for "Ensemble-based classification of document patches," segments the physical layout of the document, classifies image patches as containing text or graphics, assembles homogeneous document regions, and passes the text to an optical character recognition engine to convert into natural language. Classification is based on the majority voting of an ensemble of support vector machines. When tested on the dataset BCE-Arabic [Saad et al. in: ACM 9th annual international conference on pervasive technologies related to assistive environments (PETRA'16), Corfu, 2016], ECDP yielded an average patch classification accuracy of 97.3% and an average F1 score of 95.26% for text patches, and it efficiently extracted text zones in both paragraphs and text-embedded graphics, even when the text is rotated by 90° or is in English. ECDP outperforms a classical layout analysis method (RLSA) and a state-of-the-art commercial product (RDI-CleverPage) on this dataset and maintains a relatively high level of performance on document images drawn from two other datasets (Hesham et al. in Pattern Anal Appl 20:1275–1287, 2017; Proprietary Dataset of 109 Arabic Documents. http://www.rdi-eg.com). The results suggest that the proposed method has the potential to generalize well to the analysis of documents with a broad range of content.
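The majority-voting step can be illustrated with a short sketch (assuming scikit-learn): several SVMs vote on whether a patch is text or graphics. In the paper the ensemble members differ in their features or training data; here, as a stand-in, they differ only in the regularization parameter, and the patch features are random placeholders.

```python
# Majority voting of an ensemble of SVMs for text/non-text patch classification.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

X = np.random.rand(200, 64)                  # placeholder patch feature vectors
y = np.random.randint(0, 2, 200)             # 1 = text patch, 0 = graphics patch

ensemble = VotingClassifier(
    estimators=[(f"svm{i}", SVC(kernel="linear", C=c)) for i, c in enumerate([0.1, 1.0, 10.0])],
    voting="hard",                           # majority vote over the member SVMs
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))               # patch-level text/non-text decisions
```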
TL;DR: This study proves that the application of an efficient sparse coding-based denoising process followed by the magnification process can achieve good restoration results even if the input image is highly noisy.
Abstract: The resolution enhancement of textual images poses a significant challenge, particularly in the presence of noise. The inherent difficulties are twofold: first, reconstructing an upscaled version of the input low-resolution image without amplifying the effect of noise; second, achieving improved visual image quality and better OCR accuracy. Classically, the issue is addressed by applying a denoising step either before or after the magnification process. Starting with denoising is the more promising order, since it avoids magnifying artifacts. However, the state of the art underlines the limitations of denoising approaches when faced with the low spatial resolution of textual images. Recently, sparse coding has attracted increasing interest due to its effectiveness in different reconstruction tasks. This study shows that applying an efficient sparse coding-based denoising process followed by magnification can achieve good restoration results even if the input image is highly noisy. The main specificities of the proposed sparse coding-based framework are: (1) cascading denoising and magnification of each image patch, (2) the use of sparsity stemming from the non-local self-similarity present in textual images, and (3) the use of dual dictionary learning, involving both online and offline dictionaries that are selected adaptively for each local region of the input degraded image to recover its corresponding noise-free high-resolution version. Extensive experiments on synthetic and real low-resolution noisy textual images are carried out to validate visually and quantitatively the effectiveness of the proposed system. Promising results, in terms of image visual quality as well as character recognition rates, are achieved when compared with state-of-the-art approaches.
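A hedged sketch of the sparse-coding denoising stage (assuming scikit-learn) is given below: noisy image patches are approximated by a few atoms of a learned dictionary, and the sparse reconstructions serve as denoised patches before any magnification. The patch size, dictionary size, and sparsity level are placeholders, and the paper's adaptive dual online/offline dictionary selection is not reproduced.

```python
# Patch-wise sparse-coding denoising: approximate each noisy patch with a few
# atoms of a learned dictionary.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.feature_extraction.image import extract_patches_2d

clean = np.kron(np.random.randint(0, 2, (8, 8)), np.ones((8, 8)))   # toy "text-like" image
noisy = clean + 0.3 * np.random.randn(*clean.shape)

patches = extract_patches_2d(noisy, (8, 8)).reshape(-1, 64)
mean = patches.mean(axis=1, keepdims=True)
dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0).fit(patches - mean)

codes = sparse_encode(patches - mean, dico.components_, algorithm="omp", n_nonzero_coefs=3)
denoised_patches = codes @ dico.components_ + mean        # sparse reconstruction per patch
# The denoised patches would then be re-assembled and passed to the magnification stage.
```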
TL;DR: In FSLL, two local ML methods, SLEM and LLE, are fused by rewriting their cost functions without the need for any projection space, which provides more structural information in the high-dimensional space that can be exploited to extract the embedded low-dimensional data.
Abstract: In this paper, a new local manifold learning (ML) method is proposed. Our proposed method, named FSLL, is based on the fusion of locally linear embedding (LLE) and a new Stochastic Laplacian Eigenmaps (SLEM). SLEM is similar to the standard Laplacian Eigenmaps technique, but the coefficients between each data point and its neighbors are calculated by a stochastic process. The coefficients of SLEM form a probability mass function, and their entropy is set to a certain value; the entropy value is an estimate of the locality around each data point. Two criteria based on the mutual neighborhood concept are presented to determine the entropy value. In LLE, each data point is linearly reconstructed from its neighbors, and the embedded data manifold is then extracted by preserving these linear reconstruction coefficients. LLE and SLEM extract and learn the embedded data manifold from two different kinds of local structure information. In FSLL, the two local ML methods, SLEM and LLE, are fused by rewriting their cost functions without the need for any projection space. Fusing these two techniques provides more structural information in the high-dimensional space that can be exploited to extract the embedded low-dimensional data. In addition, a feature vector is presented that combines an HMAX feature vector with a PCA-based feature vector. The proposed method is evaluated on the Persian handwritten digit databases IFHCDB and IPHD, in both image and feature spaces. The results demonstrate the performance of FSLL and SLEM; the recognition rates improve by about 4% in most dimensionalities. An out-of-sample extension method for test data is also proposed to accompany the proposed methods.
TL;DR: A new approach to segmentation-free word spotting that is based on the combination of three different contributions to generate a set of word-independent text box proposals and an indexing scheme for fast retrieval based on character n-grams is proposed.
Abstract: In this article, we propose a new approach to segmentation-free word spotting that is based on the combination of three different contributions. Firstly, inspired by the success of bounding box proposal algorithms in object recognition, we propose a scheme to generate a set of word-independent text box proposals: a set of atomic bounding boxes is generated by simple connected component analysis and combined using a set of spatial constraints to produce the final set of text box proposals. Secondly, an attribute representation based on the Pyramidal Histogram of Characters (PHOC) is encoded in an integral image and used to efficiently evaluate text box proposals for retrieval. Thirdly, we also propose an indexing scheme for fast retrieval based on character n-grams; to generate the index, a similar attribute space based on a Pyramidal Histogram of Character N-grams (PHON) is used. All attribute models are learned using linear SVMs over the Fisher Vector representation of the word images along with the PHOC or PHON labels of the corresponding words. We show the performance of the proposed approach in both query-by-string and query-by-example tasks on standard single- and multi-writer data sets, reporting state-of-the-art results.
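Since both the retrieval attributes and the index rest on pyramidal histograms, a short sketch of building a PHOC vector for a word string may help: at each pyramid level the word is split into equal regions, and a character's bit is set in every region covering at least half of its span. The levels and alphabet below are common illustrative choices, not necessarily those of the paper; PHON vectors are built analogously from n-grams.

```python
# Building a Pyramidal Histogram of Characters (PHOC) attribute vector for a word.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
LEVELS = (2, 3, 4, 5)

def phoc(word):
    word = word.lower()
    vec = []
    for level in LEVELS:
        hist = np.zeros((level, len(ALPHABET)))
        for i, ch in enumerate(word):
            if ch not in ALPHABET:
                continue
            c_start, c_end = i / len(word), (i + 1) / len(word)
            for r in range(level):                      # a character may fall in several regions
                r_start, r_end = r / level, (r + 1) / level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if overlap / (c_end - c_start) >= 0.5:
                    hist[r, ALPHABET.index(ch)] = 1
        vec.append(hist.ravel())
    return np.concatenate(vec)

print(phoc("spotting").shape)    # (2+3+4+5) * 36 = 504 attributes
```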
TL;DR: This paper presents a method that combines a logical description of the contents of the documents, with the result of an automatic analysis on the physical properties of the collection, and shows that this combined strategy can locate 97.2% of handwritten fields.
Abstract: This paper deals with the location of handwritten fields in old pre-printed registers. The images present the difficulties of old and damaged documents, and we also have to face the difficulty of extracting the text due to the strong interaction between handwritten and printed writing. In addition, in many collections, the structure of the forms varies according to the origin of the documents. This work is applied to a database of Mexican marriage records, which was published for a competition at the workshop HIP 2013 and is publicly available. In this paper, we show the interest and the limitations of the empirical method that was submitted for the competition. We then present a method that combines a logical description of the contents of the documents with the result of an automatic analysis of the physical properties of the collection. The particularity of this analysis is that it does not require any ground truth. We show that this combined strategy can locate 97.2% of handwritten fields. The proposed approach is generalizable and could be applied to other databases.