Showing papers in "International Journal on Document Analysis and Recognition in 2015"
TL;DR: The overall workflow architecture of CERMINE is outlined, details about the implementation of individual steps are provided, and an evaluation of the extraction workflow on a large dataset shows good performance for most metadata types.
Abstract: CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, enables effortless improvement and replacement of independent parts of the algorithm, and facilitates future expansion of the architecture. Most steps are implemented with supervised and unsupervised machine learning techniques, which simplifies adapting the system to new document layouts and styles. An evaluation of the extraction workflow on a large dataset showed good performance for most metadata types, with an average F-score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementation of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.
164 citations
TL;DR: A recurrent neural network is trained to transcribe undiacritized Arabic text with fully diacritized sentences using a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions.
Abstract: This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. This approach differs from previous approaches in that no lexical, morphological, or syntactical analysis is performed on the data before it is processed by the network. Nonetheless, when the network output is post-processed with our error correction techniques, it achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09% and 5.82%, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25%, the word error rate by 20%, and the last-letter diacritization error rate by 33% over the best published results.
100 citations
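As an illustration of the sequence transcription setup, the sketch below frames diacritization as character-level sequence labeling with a stacked bidirectional LSTM in Keras. The vocabulary sizes, the maximum sequence length, and the random training data are assumptions for illustration only; the paper's deeper network and error-correction post-processing are not reproduced.

```python
# A minimal sketch: one diacritic class is predicted per undiacritized input character.
import numpy as np
from tensorflow.keras import layers, models

NUM_CHARS = 40        # assumed size of the undiacritized character inventory (0 = padding)
NUM_DIACRITICS = 16   # assumed number of diacritic classes (including "no diacritic")
MAX_LEN = 100         # assumed maximum sentence length in characters

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(NUM_CHARS, 64, mask_zero=True),
    # Bidirectional layers exploit long-range context in both input directions.
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(NUM_DIACRITICS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy stand-ins for (undiacritized characters, diacritic labels) training pairs.
x = np.random.randint(1, NUM_CHARS, size=(32, MAX_LEN))
y = np.random.randint(0, NUM_DIACRITICS, size=(32, MAX_LEN))
model.fit(x, y, epochs=1, verbose=0)
```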
TL;DR: This paper presents a scene text extraction technique that automatically detects and segments texts from scene images using a support vector regression model that is trained using bags-of-words representation.
Abstract: This paper presents a scene text extraction technique that automatically detects and segments text from scene images. Three text-specific features are designed over image edges, with which a set of candidate text boundaries is first detected. For each detected candidate text boundary, one or more candidate characters are then extracted by using a local threshold that is estimated from the surrounding image pixels. The real characters and words are finally identified by a support vector regression model that is trained using a bag-of-words representation. The proposed technique has been evaluated on the latest ICDAR 2013 Robust Reading Competition dataset. Experiments show that it obtains superior F-measures of 78.19% and 75.24% (at the atom level) for the scene text detection and segmentation tasks, respectively.
95 citations
TL;DR: This work developed a new type of recognizer based on deep convolutional neural networks for Hangul recognition and proposed several novel techniques to improve the performance and training speed of the networks.
Abstract: In spite of advances in recognition technology, handwritten Hangul recognition (HHR) remains largely unsolved due to the presence of many confusing characters and excessive cursiveness in Hangul handwriting. Even the best existing recognizers do not achieve satisfactory performance for practical applications and perform much worse than those developed for Chinese or alphanumeric characters. To improve the performance of HHR, we developed a new type of recognizer based on deep neural networks (DNNs). DNNs have recently shown excellent performance in many pattern recognition and machine learning problems, but had not previously been applied to HHR. We built our Hangul recognizers on deep convolutional neural networks and proposed several novel techniques to improve the performance and training speed of the networks. We systematically evaluated the performance of our recognizers on two public Hangul image databases, SERI95a and PE92. Using our framework, we achieved recognition rates of 95.96% on SERI95a and 92.92% on PE92. Compared with the previous best records of 93.71% on SERI95a and 87.70% on PE92, our results yield improvements of 2.25 and 5.22 percentage points, respectively. These improvements correspond to error reduction rates of 35.71% on SERI95a and 42.44% on PE92, relative to the previous lowest error rates. Such improvement fills a significant portion of the large gap between practical requirements and the actual performance of Hangul recognizers.
91 citations
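The sketch below shows the kind of convolutional classifier the abstract describes, assuming 64x64 grayscale character images and a hypothetical class count of 2,350 Hangul syllables; the paper's specific architecture and training-speed techniques are not reproduced.

```python
# A minimal convolutional classifier for handwritten Hangul (illustrative only).
from tensorflow.keras import layers, models

NUM_CLASSES = 2350  # assumed number of syllable classes in the target database

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```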
TL;DR: This paper presents a floor plan database, named CVC-FP, that is annotated with the architectural objects and their structural relations, together with a groundtruthing tool, the SGT tool, that allows this sort of information to be specified in a natural manner.
Abstract: Recent results on structured learning methods have shown the impact of structural information in a wide range of pattern recognition tasks. In the field of document image analysis, there is long experience with structural methods for the analysis and information extraction of multiple types of documents. Yet, the lack of conveniently annotated, freely accessible databases has held back progress in some areas, such as technical drawing understanding. In this paper, we present a floor plan database, named CVC-FP, that is annotated with the architectural objects and their structural relations. To construct this database, we implemented a groundtruthing tool, the SGT tool, that allows this sort of information to be specified in a natural manner. The tool has been made for general-purpose groundtruthing: it allows users to define their own object classes and properties, supports multiple labeling options, enables cooperative work, and provides user and version control. Finally, we have collected some of the recent work on floor plan interpretation and present a quantitative benchmark for this database. Both the CVC-FP database and the SGT tool are freely released to the research community to ease comparisons between methods and boost reproducible research.
52 citations
TL;DR: A review of improvements to the Bag-of-Visual-Words framework and their application to the keyword spotting task is presented, and both the baseline and improved systems are compared with the methods presented at the Handwritten Keyword Spotting Competition 2014.
Abstract: The Bag-of-Visual-Words (BoVW) framework has gained popularity in the document image analysis community, specifically as a representation of handwritten words for recognition or spotting purposes. Although the BoVW method has been greatly improved in the computer vision field, most approaches in the document image analysis domain still rely on the basic implementation of the BoVW method, disregarding such latest refinements. In this paper, we present a review of those improvements and their application to the keyword spotting task. We thoroughly evaluate their impact against a baseline system on the well-known George Washington dataset and compare the obtained results against nine state-of-the-art keyword spotting methods. In addition, we also compare both the baseline and improved systems with the methods presented at the Handwritten Keyword Spotting Competition 2014.
49 citations
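For readers unfamiliar with the basic BoVW pipeline that the reviewed refinements build on, the sketch below encodes word images as histograms over a k-means codebook of local descriptors. SIFT stands in for the descriptors, and the codebook size is an arbitrary assumption; none of the improvements reviewed in the paper (e.g. richer encodings or spatial information) are included.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image):
    """Extract SIFT descriptors from a grayscale word image."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_codebook(training_images, k=64):
    """Cluster all training descriptors into k visual words."""
    all_desc = np.vstack([sift_descriptors(img) for img in training_images])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

def bovw_histogram(image, codebook):
    """Represent a word image as a normalized histogram of visual words."""
    desc = sift_descriptors(image)
    if len(desc) == 0:
        return np.zeros(codebook.n_clusters)
    words = codebook.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# Spotting then reduces to ranking word images by the distance between their histograms.
```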
TL;DR: This study proposes a knowledge-driven system that interacts with bottom-up and top-down information to progressively understand the content of a document, and models domain knowledge of both comic books and image processing for information consistency analysis.
Abstract: Document analysis is an active field of research aiming at a complete understanding of the semantics of a given document. One example of the document understanding process is enabling a computer to identify the key elements of a comic book story and arrange them according to a predefined domain knowledge. In this study, we propose a knowledge-driven system that can interact with bottom-up and top-down information to progressively understand the content of a document. We model domain knowledge of both comic books and image processing for information consistency analysis. In addition, different image processing methods are improved or developed to extract panels, balloons, tails, texts, comic characters and their semantic relations in an unsupervised way.
44 citations
TL;DR: This article discusses restoration of camera-captured distorted document images, proposing an algorithm that estimates and rectifies document warping from the 2D image alone, based on line segmentation, without the assistance of 3D data or a 3D model.
Abstract: This article discusses restoration of camera-captured distorted document images. Without the assistance of 3D data or a 3D model, our algorithm estimates and rectifies document warping from the 2D image alone, based on line segmentation. The warping shape of each text line is acquired by estimating the baseline shape and the characters' slant angles after line segmentation. To obtain a smooth recovery result, thin-plate splines are exploited, with key points determined from the result of the warping estimation. This process can effectively model document warping and successfully restore warped document images to a flat state. A comparison of OCR recognition rates between original camera-captured images and restored images shows the effectiveness of the proposed algorithm. We also present an evaluation on the DFKI dewarping contest dataset alongside some related algorithms. Besides the desirable restoration results, the processing speed of the whole procedure is satisfactory as well. In conclusion, the method can be applied in OCR pipelines to achieve better understanding of camera-captured document images.
38 citations
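The thin-plate-spline step mentioned in the abstract can be sketched as follows, assuming the key-point correspondences (warped positions and their desired flat positions) have already been estimated by some line-segmentation stage, which is not shown. The function names and parameters are illustrative, not the authors' implementation.

```python
import numpy as np
import cv2
from scipy.interpolate import RBFInterpolator

def tps_dewarp(image, flat_pts, warped_pts):
    """Resample a warped document image onto a flat grid.

    flat_pts:   (N, 2) array of (x, y) target positions on the flat page
    warped_pts: (N, 2) array of the corresponding (x, y) positions in the input image
    """
    h, w = image.shape[:2]
    # Fit a thin-plate-spline mapping from flat coordinates to warped coordinates.
    tps = RBFInterpolator(flat_pts, warped_pts, kernel="thin_plate_spline")
    # Evaluate the mapping at every output pixel to get sampling coordinates.
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    src = tps(grid).reshape(h, w, 2).astype(np.float32)
    return cv2.remap(image, src[..., 0], src[..., 1], interpolation=cv2.INTER_LINEAR)
```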
TL;DR: A segmentation-based method is proposed for developing Nastalique OCR, deriving principles and techniques for pre-processing and recognition, and the work is extensible to other languages using Nastalique.
Abstract: Much work on Arabic optical character recognition (OCR) has focused on the Naskh writing style. The Nastalique style, used for most languages written in Arabic script across Southern Asia, is much more challenging to process due to its compactness, cursiveness, higher context sensitivity and diagonality. This makes Nastalique writing more complex, with multiple letters horizontally overlapping each other. For these reasons, existing methods used for Naskh do not work for Nastalique, and therefore most work on Nastalique has used non-segmentation methods. This paper presents a new approach to segmentation-based analysis of the Nastalique style. The paper explains the complexity of Nastalique, why Naskh-based techniques cannot work for Nastalique, and proposes a segmentation-based method for developing Nastalique OCR, deriving principles and techniques for pre-processing and recognition. The OCR is developed for the Urdu language. The system is optimized using 79,093 instances of 5249 main bodies derived from a corpus of 18 million words, giving a recognition accuracy of 97.11%. The system is then tested on document images of books with 87.44% main-body recognition accuracy. The work is extensible to other languages using Nastalique.
30 citations
TL;DR: A method is presented that locates tables and their cells in camera-captured document images; it is based on the well-known recursive X-Y cut, modified so that it can also deal with curved seams caused by geometric distortions.
Abstract: In this paper, we present a method that locates tables and their cells in camera-captured document images. In order to deal with this problem in the presence of geometric and photometric distortions, we develop new junction detection and labeling methods, where junction detection means finding candidates for the corners of cells, and junction labeling means inferring their connectivity. We consider junctions as the intersections of curves, and so we first develop a multiple-curve detection algorithm. After the junction detection, we encode the connectivity information (including false detections) between the junctions into 12 labels, and design a cost function reflecting pairwise relationships as well as local observations. The cost function is minimized via the belief propagation algorithm, and we can locate tables and their cells from the inferred labels. Also, in order to handle multiple tables on a single page, we propose a table area detection method. This method is based on the well-known recursive X-Y cut; however, we modify it so that it can also deal with curved seams caused by the geometric distortions. For the evaluation of our method, we build a dataset that includes a variety of camera-captured table images and make the set publicly available. Experimental results on the set show that our method successfully locates tables and their cells in camera-captured images.
28 citations
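As background for the table area detection step, the sketch below implements the classical recursive X-Y cut on a binarized page by splitting at the widest empty row or column gap; the paper's modification for curved seams in camera-captured images is not reproduced, and the gap and size thresholds are arbitrary assumptions.

```python
import numpy as np

def xy_cut(binary, rect=None, min_gap=10, min_size=20):
    """Recursively split a binary page (foreground = 1) into blocks.

    Returns a list of (top, bottom, left, right) rectangles.
    """
    h, w = binary.shape
    if rect is None:
        rect = (0, h, 0, w)
    top, bottom, left, right = rect
    region = binary[top:bottom, left:right]

    for axis in (0, 1):  # 0: cut between rows, 1: cut between columns
        profile = region.sum(axis=1 - axis)          # projection profile
        gaps = np.flatnonzero(profile == 0)          # empty rows/columns
        if len(gaps) == 0:
            continue
        runs = np.split(gaps, np.where(np.diff(gaps) != 1)[0] + 1)
        runs = [r for r in runs
                if r[0] > 0 and r[-1] < len(profile) - 1 and len(r) >= min_gap]
        if not runs:
            continue
        cut = int(np.median(max(runs, key=len)))     # cut through the widest gap
        if axis == 0:
            a, b = (top, top + cut, left, right), (top + cut, bottom, left, right)
        else:
            a, b = (top, bottom, left, left + cut), (top, bottom, left + cut, right)
        return xy_cut(binary, a, min_gap, min_size) + xy_cut(binary, b, min_gap, min_size)

    return [rect] if (bottom - top >= min_size and right - left >= min_size) else []
```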
TL;DR: Significant improvements in visual quality and character recognition rates are achieved using the proposed approach, confirmed by a detailed comparative study with state-of-the-art upscaling approaches.
Abstract: Resolution enhancement has become a valuable research topic due to the rapidly growing need for high-quality images in various applications. Various resolution enhancement approaches have been successfully applied to natural images. Nevertheless, their direct application to textual images is not efficient enough due to the specific characteristics that distinguish these images from natural images. The use of insufficient resolution introduces a substantial loss of detail, which can make a text unreadable by humans and unrecognizable by OCR systems. To address these issues, a sparse coding-based approach is proposed to enhance the resolution of a textual image. Three major contributions are presented in this paper: (1) multiple coupled dictionaries are learned from a clustered database and selected adaptively for better reconstruction; (2) an automatic process is developed to collect the training database, which contains writing patterns extracted from high-quality character images; (3) a new local feature descriptor well suited to writing specificities is proposed for the clustering of the training database. The performance of these propositions is evaluated qualitatively and quantitatively on various types of low-resolution textual images. Significant improvements in visual quality and character recognition rates are achieved using the proposed approach, as confirmed by a detailed comparative study with state-of-the-art upscaling approaches.
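The coupled-dictionary idea at the core of this kind of approach can be sketched with a single dictionary pair, assuming paired low/high-resolution training patches are already available as row vectors; the multiple adaptively selected dictionaries, the automatic database collection and the text-specific descriptor described in the paper are not reproduced.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def learn_coupled_dictionaries(lr_patches, hr_patches, n_atoms=128):
    """lr_patches: (N, d_lr), hr_patches: (N, d_hr) row-wise patch matrices."""
    dico = DictionaryLearning(n_components=n_atoms, transform_algorithm="omp",
                              transform_n_nonzero_coefs=3, max_iter=20)
    codes = dico.fit_transform(lr_patches)            # sparse codes of the LR patches
    d_lr = dico.components_
    # Solve for an HR dictionary so that the same codes reconstruct the HR patches.
    d_hr, *_ = np.linalg.lstsq(codes, hr_patches, rcond=None)
    return d_lr, d_hr

def upscale_patches(lr_patches, d_lr, d_hr):
    """Encode LR patches over d_lr and reconstruct their HR counterparts with d_hr."""
    codes = sparse_encode(lr_patches, d_lr, algorithm="omp", n_nonzero_coefs=3)
    return codes @ d_hr
```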
TL;DR: The proposed technique is well suited for table processing (i.e. extracting repeated patterns from the table) and it outperforms the state-of-the-art method by more than 3%.
Abstract: In this paper, we present a technique for extracting document information content (i.e. text fields) via graph mining. Real-world users first provide a set of key text fields from the document image which they consider important. These fields are used to initialise a graph whose nodes are labelled with the field names in addition to other features such as size, type and number of words, and whose edges are attributed with the relative positioning between them. Such an attributed relational graph is then used to mine similar graphs from document images, which update the initial graph iteratively each time they are extracted, producing a graph model. Graph models can therefore be employed in the absence of users. We have validated the proposed technique and evaluated its scientific impact on a real-world industrial problem, achieving 86.64% precision and 90.80% recall when considering all zones, viz. header, body and footer. More specifically, the proposed technique is well suited for table processing (i.e. extracting repeated patterns from the table), where it outperforms the state-of-the-art method by more than 3%.
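The attributed relational graph built from user-provided key fields can be illustrated as below; the field names, positions and features are hypothetical, and the subsequent graph mining and iterative model update are not shown.

```python
import networkx as nx

# Hypothetical key fields: name -> (x, y, width, height, number_of_words)
fields = {
    "invoice_no": (400, 40, 120, 20, 1),
    "date":       (400, 70, 90, 20, 1),
    "total":      (420, 700, 80, 20, 1),
}

g = nx.Graph()
for name, (x, y, w, h, n_words) in fields.items():
    # Nodes carry the field label plus simple layout features.
    g.add_node(name, x=x, y=y, width=w, height=h, n_words=n_words)

names = list(fields)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        # Edges carry the relative positioning between the two fields.
        dx = g.nodes[b]["x"] - g.nodes[a]["x"]
        dy = g.nodes[b]["y"] - g.nodes[a]["y"]
        g.add_edge(a, b, dx=dx, dy=dy)

print(g.nodes(data=True))
print(g.edges(data=True))
```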
TL;DR: A novel approach to skew detection and correction of a typed document, based on minimizing the area of the axis-parallel bounding box, is presented, and experiments show that the algorithm outperforms existing state-of-the-art skew detection algorithms.
Abstract: Skew detection and correction of a scanned document is a preprocessing step for optical character recognition systems. We present a novel approach to skew detection and correction of a typed document by minimizing the area of the axis-parallel bounding box. An advantage of our approach over existing methods is that our algorithm is script and content independent. Moreover, our algorithm is not subject to skew angle limitations. The performance of our algorithm was evaluated using images of different scripts with varying skew angles, with and without graphical content. Our experiments show that our algorithm outperforms existing state-of-the-art skew detection algorithms.
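The core idea of the abstract — pick the rotation angle that minimizes the area of the axis-parallel bounding box of the foreground — can be sketched with a coarse grid search; the paper's actual optimization strategy and preprocessing are not reproduced, and the angle range and step are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def bounding_box_area(binary):
    """Area of the axis-parallel bounding box of the foreground pixels."""
    ys, xs = np.nonzero(binary)
    if len(xs) == 0:
        return 0
    return (xs.max() - xs.min() + 1) * (ys.max() - ys.min() + 1)

def estimate_skew(binary, angles=np.arange(-45.0, 45.5, 0.5)):
    """binary: 2-D uint8 array with foreground > 0; returns the best angle in degrees."""
    areas = [bounding_box_area(rotate(binary, a, reshape=True, order=0) > 0)
             for a in angles]
    return float(angles[int(np.argmin(areas))])

# The deskewed image is then obtained by rotating the input by the estimated angle.
```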
TL;DR: Evaluation across different metrics on established natural image text recognition benchmarks shows that the simple and fast image binarization method, combined with an off-the-shelf OCR engine, achieves state-of-the-art performance for end-to-end text understanding in natural images and outperforms more elaborate recent methods.
Abstract: While modern off-the-shelf OCR engines show particularly high accuracy on scanned text, text detection and recognition in natural images remain a challenging problem. Here, we demonstrate that OCR engines can still perform well on this harder task as long as an appropriate image binarization is applied to the input photographs. We propose a new binarization algorithm that is particularly suitable for scene text and systematically evaluate its performance along with 12 existing binarization methods. While most existing binarization techniques are designed specifically either for text detection or for recognition of localized text, our method shows very similar results for both large images and localized text regions. Therefore, it can be applied to large images directly, with no need for re-binarization of localized text regions. We also propose a real-time variant of this method based on linear-time bilateral filtering. Evaluation across different metrics on established natural image text recognition benchmarks (ICDAR 2003 and ICDAR 2011) shows that our simple and fast image binarization method, combined with an off-the-shelf OCR engine, achieves state-of-the-art performance for end-to-end text understanding in natural images and outperforms more elaborate recent methods.
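The binarize-then-OCR pipeline can be sketched as below, with Otsu thresholding after an edge-preserving bilateral filter standing in for the paper's binarization algorithm; pytesseract and an installed Tesseract engine are assumed.

```python
import cv2
import pytesseract

def read_scene_text(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Smooth while preserving edges, then binarize globally.
    smoothed = cv2.bilateralFilter(gray, 9, 75, 75)
    _, binary = cv2.threshold(smoothed, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Feed the binarized image to an off-the-shelf OCR engine.
    return pytesseract.image_to_string(binary)

# Example: print(read_scene_text("scene.jpg"))
```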
TL;DR: A system developed for discriminating fake notes from genuine ones is described and applied to Indian banknotes, and the ability of the embedded security features to reveal fake currency is thoroughly analysed.
Abstract: Automatic authentication of paper money is becoming an increasingly urgent problem because of new and improved counterfeits. In this paper, we describe a system developed for discriminating fake notes from genuine ones and apply it to Indian banknotes. Image processing and pattern recognition techniques are used to design the overall approach. The ability of the embedded security features to reveal fake currency is thoroughly analysed. Real samples are used in the experiments, which show that a high-precision machine can be developed for the authentication of paper money. The system performance is reported for both accuracy and processing speed. The analysis of security features to prevent counterfeiting highlights some of the issues that should be considered in the design of currency notes in the future.
TL;DR: Experimental results on eight public datasets demonstrate that the proposed method outperforms the state-of-the-art algorithms.
Abstract: This article proposes offline language-free writer identification based on speeded-up robust features (SURFs), which goes through training, enrollment, and identification stages. In all stages, an isotropic box filter is first used to segment the handwritten text image into word regions (WRs). Then, the SURF descriptors (SUDs) of the WRs and the corresponding scales and orientations (SOs) are extracted. In the training stage, an SUD codebank is constructed by clustering the SUDs of training samples. In the enrollment stage, the SUDs of the input handwriting are used to form an SUD signature (SUDS) by looking up the SUD codebank, and the SOs are utilized to generate a scale and orientation histogram $H_{\mathrm{SO}}$. In the identification stage, the SUDS and $H_{\mathrm{SO}}$ of the input handwriting are extracted and matched against the enrolled ones for identification. Experimental results on eight public datasets demonstrate that the proposed method outperforms state-of-the-art algorithms.
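The scale-and-orientation histogram part of the signature can be sketched as follows, with SIFT keypoints standing in for SURF (which requires the non-free opencv-contrib module); the SUD codebank, the SUDS signature and the matching stage are omitted, and the bin counts are assumptions.

```python
import cv2
import numpy as np

def scale_orientation_histogram(gray, scale_bins=8, angle_bins=12):
    """Joint histogram over keypoint scales and orientations of a handwriting image."""
    sift = cv2.SIFT_create()
    keypoints = sift.detect(gray, None)
    if not keypoints:
        return np.zeros(scale_bins * angle_bins)
    scales = np.array([kp.size for kp in keypoints])
    angles = np.array([kp.angle for kp in keypoints])   # degrees in [0, 360)
    hist, _, _ = np.histogram2d(scales, angles,
                                bins=[scale_bins, angle_bins],
                                range=[[0, scales.max() + 1e-6], [0, 360]])
    hist = hist.ravel()
    return hist / hist.sum()

# Enrolled and query histograms can then be compared with, e.g., a chi-square distance.
```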
TL;DR: It is demonstrated that the colour information within the images, if efficiently exploited, is sufficient to identify text regions from the surrounding noise and improve the overall performance of text detection and recognition in the wild.
Abstract: This paper presents an approach for text detection and recognition in scene images. The main contribution of this paper is to demonstrate that the colour information within the images, if efficiently exploited, is sufficient to identify text regions from the surrounding noise. In the same way, the colour information present in character and word images can be used to achieve significant performance improvements in the recognition of characters and words. The proposed pipeline makes use of the colour information and low-level image processing operations to enhance text information, which improves the overall performance of text detection and recognition in the wild. The proposed method offers two main advantages. First, it enhances the text regions to a level of clarity where a simple off-the-shelf feature representation and classification method achieves state-of-the-art recognition performance. Second, the proposed framework is computationally fast compared to other text detection and recognition techniques that offer good accuracy at the cost of significantly higher latency. We performed extensive experimentation to evaluate our method on challenging benchmark datasets (Chars74K, ICDAR03, ICDAR11 and SVT), and the results show a considerable performance improvement.
TL;DR: A character image restoration method is proposed for unconstrained handwritten Chinese character recognition; some state-of-the-art classifiers are extended based on the estimated features, and the extended classifiers are shown to outperform the original state-of-the-art classifiers.
Abstract: Despite the success of methods on constrained handwriting databases, recognition of unconstrained handwritten Chinese characters remains a big challenge. One difficulty in recognizing unconstrained handwriting is that some strokes are connected and others are omitted. In this paper, a character image restoration method is proposed for unconstrained handwritten Chinese character recognition. In this method, the observed character image is modeled as the combination of the ideal character image with two types of noise images: the omitted-stroke noise image and the added-stroke noise image. To preserve the original gradient features, restoration is performed on the gradient features. The estimated features are then used to discriminate similar characters. To show the effectiveness of the proposed method, we extend some state-of-the-art classifiers based on the estimated features. Experimental results show that the extended classifiers outperform the original state-of-the-art classifiers. This demonstrates that the estimated features are useful for further improving the recognition rate.
TL;DR: This approach significantly improves the state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation, and uses the output from OOV name detection as a novel feature for discriminative reranking, which significantly reduces the false alarm rate of OOV name search on OCR output.
Abstract: Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output has errors and presents unique challenges to late-stage processing. This paper addresses two of these challenges: (1) translating the output from Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. This approach significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output from OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating a variety of features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.
TL;DR: Experimental comparisons with state-of-the-art methods show that the proposed ring radius transform, which generates a radius map from the Canny edges of each input image to obtain its medial axis, is generic and outperforms existing methods in terms of skeleton quality, visual topology preservation and recognition rate.
Abstract: Thinning that preserves the visual topology of characters in video is challenging in the fields of document analysis and video text analysis due to low resolution and complex backgrounds. This paper proposes to explore the ring radius transform (RRT) to generate a radius map from the Canny edges of each input image in order to obtain its medial axis. Each radius value in the radius map is the distance to the nearest edge pixel on the contours. From the radius map, the method proposes a novel idea for identifying the medial axis (the middle pixels between two stroke edges) for characters of arbitrary orientation. Iterative maximal growing is then proposed to connect missing medial axis pixels at junctions and intersections. Next, we compute histograms of the colour information of the medial axes and apply clustering to eliminate false medial axis segments. The method finally restores the shape of the character through the radius values of the medial axis pixels for the purpose of recognition with the open-source Google OCR engine (Tesseract). The method has been tested on video, natural scene and handwritten characters from ICDAR 2013, SVT, arbitrarily oriented data from MSRA-TD500, multi-script character data and MPEG7 object data to evaluate its performance at both the thinning and recognition levels. Experimental comparisons with state-of-the-art methods show that the proposed method is generic and outperforms existing methods in terms of skeleton quality, visual topology preservation and recognition rate. The method is also robust in handling characters of arbitrary orientation.
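The radius-map idea can be approximated with a distance transform of the Canny edge map: each pixel's value is its distance to the nearest edge pixel, and local maxima between two stroke edges approximate the medial axis. The sketch below uses OpenCV for this approximation only; the iterative maximal growing, colour-based filtering and shape restoration steps of the paper are not reproduced.

```python
import cv2
import numpy as np

def radius_map(gray):
    """Distance from every pixel to the nearest Canny edge pixel."""
    edges = cv2.Canny(gray, 50, 150)
    return cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)

def rough_medial_axis(gray, min_radius=1.0):
    """Keep pixels whose radius value is a local maximum in a 3x3 neighbourhood."""
    radii = radius_map(gray)
    local_max = cv2.dilate(radii, np.ones((3, 3), np.uint8))
    return ((radii >= local_max - 1e-6) & (radii > min_radius)).astype(np.uint8) * 255
```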
TL;DR: This paper concerns the automatic generation of Farsi/Arabic handwritten fonts, extracting the properties of the writer's script style from basic subwords and compiling them into an OpenType font file structure to generate a computer font.
Abstract: Interest in personalized handwritten fonts has increased in recent years. This paper concerns the automatic generation of Farsi/Arabic handwritten fonts. To reach this target, we need to extract the properties of the writer's script style. The "glyphs" (simple characters or ligatures) of the writer's script are extracted from the basic subwords. The basic subwords are acquired from a writer using tabular sheets. A learning method is used in the extraction phase. After glyph extraction, four important steps are performed automatically: (a) adjusting glyph joints and baselines, (b) computing metric data, (c) locating dots, and (d) computing kerning pairs. Finally, the gathered information is compiled into an OpenType® font file structure to generate a computer font, which can be used in any computer application. The results seem visually acceptable.
TL;DR: Experiments conducted on a word-segmented version of the publicly available RIMES database show that the proposed approach can improve recognition accuracy compared to systems based on static dictionaries only.
Abstract: Handwriting recognition systems usually rely on static dictionaries and language models. Full coverage of these dictionaries is generally not achieved when dealing with unrestricted document corpora due to the presence of Out-Of-Vocabulary (OOV) words. We propose an approach which uses the World Wide Web as a corpus to improve dictionary coverage. We exploit the very large and freely available Wikipedia corpus in order to obtain dynamic dictionaries on the fly. We rely on recurrent neural network (RNN) recognizers, with and without linguistic resources, to detect words that are not reliably recognized within a word sequence. Such words are labeled as non-anchor words (NAWs) and include OOVs and In-Vocabulary words recognized with low confidence. To recognize a non-anchor word, a dynamic dictionary is built by selecting words from the Web resource based on their string similarity with the NAW image, and their linguistic relevance in the NAW context. Similarity is evaluated by computing the edit distance between the sequence of characters generated by the RNN recognizer exploited as a filler model, and the Wikipedia words. Linguistic relevance is based on an N-gram language model estimated from the Wikipedia corpus. Experiments conducted on a word-segmented version of the publicly available RIMES database show that the proposed approach can improve recognition accuracy compared to systems based on static dictionaries only. The proposed approach shows even better behavior as the proportion of OOVs increases, in terms of both accuracy and dictionary coverage.
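The dynamic-dictionary selection step can be sketched as ranking corpus words by edit distance to the character sequence produced by the filler model; the Wikipedia harvesting and the N-gram linguistic-relevance score described in the paper are not reproduced, and the example words are stand-ins.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dynamic_dictionary(filler_output, corpus_words, max_candidates=50):
    """Rank corpus words by similarity to the filler model's character sequence."""
    ranked = sorted(corpus_words, key=lambda w: edit_distance(filler_output, w))
    return ranked[:max_candidates]

# Stand-in example:
print(dynamic_dictionary("recognitoin", ["recognition", "cognition", "regression"]))
```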
TL;DR: The mean recognition rate is used to evaluate the performance of the proposed algorithm, which shows superior performance when compared with recently published algorithms for AOFR.
Abstract: This paper proposes a new algorithm for Arabic optical font recognition (AOFR) as the first stage of Arabic optical character recognition. The proposed algorithm uses a scale-invariant detector, a gradient-based descriptor, and k-means clustering. The scale-invariant detector is used to find key points that identify the font of an image of printed Arabic text. The work in this paper compares several scale-invariant detectors and selects the best one for AOFR. A gradient-based descriptor similar to the one in the well-known scale-invariant feature transform algorithm is used to describe the detected key points. In addition, k-means clustering is used for font classification. The mean recognition rate is used to evaluate the performance of the proposed algorithm, which shows superior performance when compared with recently published algorithms for AOFR.
TL;DR: The anytime anywhere document analysis methodology applied in the context of computer-aided transcription is introduced and its utility is revealed for documents which are difficult to analyse, as in the case of handwritten texts.
Abstract: This paper introduces the anytime anywhere document analysis methodology applied in the context of computer-aided transcription. Its utility is revealed for documents which are difficult to analyse, as in the case of handwritten texts. A special focus lies on the glyph separation problem, which turns out to be particularly complicated. As automatic methods show fundamental limitations, a number of interactive methods are proposed which are based on the interplay between user and machine. These methods require no assumptions concerning the underlying languages or the appearance of the texts. An evaluation in the context of palaeography, applied to a well-established dataset, illustrates how well handwriting is dealt with, even though the samples exhibit distinct differences in their regularity and shape.