Author
Soumyadeep Dey
Other affiliations: Indian Institute of Technology Kharagpur
Bio: Soumyadeep Dey is an academic researcher at Microsoft who has contributed to research on annotation and computer science. The author has an h-index of 4 and has co-authored 13 publications that have received 43 citations. Previous affiliations of Soumyadeep Dey include the Indian Institute of Technology Kharagpur.
Papers
10 Dec 2013
TL;DR: An effective technique for rubber stamp removal from scanned document images is proposed based on the novel idea of a single feature obtained by projecting the pixel colors of the image foreground along the eigenvector corresponding to the first principal component in HSV color space.
Abstract: Rubber stamps on document pages often overlap and obscure the text very badly, thereby impairing its readability and deteriorating the performance of an optical character recognition system. Removal of rubber stamps from a document image is, therefore, essential for successfully converting a document image into an editable electronic form. We propose here an effective technique for rubber stamp removal from scanned document images. It is based on the novel idea of a single feature obtained by projecting the pixel colors of the image foreground along the eigenvector corresponding to the first principal component in HSV color space. Otsu’s adaptive thresholding is used to segment out the stamp impressions from the text by exploiting the discriminative power of the aforesaid feature. Experimentation and subjective evaluation on a variety of scanned document images demonstrate the strength and effectiveness of the proposed technique.
11 citations
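The feature described in the abstract above (projection of foreground colors onto the first principal component in HSV space, followed by Otsu thresholding) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the power-iteration eigenvector solver, the 64-bin histogram, and the treatment of hue as a linear rather than circular quantity are all assumptions made for this example. The sign of an eigenvector is arbitrary, so the sketch only separates the foreground into two groups and does not decide which group is the stamp.

```python
import colorsys
import math


def first_principal_axis(points, iters=200):
    """Return (mean, unit vector) for the dominant covariance direction.

    Uses plain power iteration (an assumption of this sketch); it can
    stall if the start vector is orthogonal to the dominant direction.
    """
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all points identical: no meaningful direction
            break
        v = [x / norm for x in w]
    return mean, v


def otsu_threshold(values, bins=64):
    """Return the cut that maximizes between-class variance of a histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for x in values:
        hist[min(int((x - lo) / width), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += i * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    return lo + (best_bin + 1) * width  # upper edge of the best bin


def split_foreground(rgb_pixels):
    """Label each foreground pixel (RGB floats in [0, 1]) as one of two groups."""
    hsv = [colorsys.rgb_to_hsv(*p) for p in rgb_pixels]
    mean, axis = first_principal_axis(hsv)
    # Single scalar feature per pixel: projection onto the first principal axis.
    feature = [sum((c[k] - mean[k]) * axis[k] for k in range(3)) for c in hsv]
    t = otsu_threshold(feature)
    return [f >= t for f in feature]
```

Called on a list of foreground pixels, `split_foreground` returns one boolean per pixel. A real pipeline would still need a foreground extraction step beforehand and a rule for deciding which of the two groups is the stamp, neither of which is covered here.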
16 Dec 2012
TL;DR: This paper performs layout analysis to detect words, lines, and paragraphs in the document image and uses the geometric properties of the text blocks to detect and remove the margin noise.
Abstract: In this paper, we propose a technique for removing margin noise (both textual and non-textual noise) from scanned document images. We perform layout analysis to detect words, lines, and paragraphs in the document image. These detected elements are classified into text and non-text components on the basis of their characteristics (size, position, etc.). The geometric properties of the text blocks are then used to detect and remove the margin noise. We evaluate our algorithm on several scanned pages of Bengali literature books.
10 citations
01 Dec 2015
TL;DR: This paper proposes a novel stamp and logo detection technique that can detect logos as well as chromatic and achromatic stamps and performs well at separating them from text.
Abstract: Stamps and logos are generally used for authenticating the source of a document. For automatic document processing, identification and segmentation of stamps and logos are essential. In the past, methods to detect stamps and logos were limited to specific shapes, colors, or training data. However, stamps and logos can be of any shape or color. In this paper, we propose a novel stamp and logo detection technique. Our approach is based on the fact that stamps and logos, in general, are not the primary contents of a document. This fact motivates us to propose an outlier detection technique for the same purpose in a feature space. Based on some geometric features, the detected outliers are classified as stamps and logos. Our method performs well at separating them from text. Moreover, this technique is capable of detecting logos as well as chromatic and achromatic stamps.
10 citations
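The abstract above treats stamps and logos as outliers in a feature space but does not say which detector or which features are used. As a rough illustration of the general idea only, the sketch below flags components with a median/MAD-based modified z-score, a standard robust outlier test that is swapped in here as an assumption. The feature layout (one row per connected component, for example width and height), the function name `mad_outliers`, and the cutoff of 3.5 are all hypothetical.

```python
import statistics


def mad_outliers(features, cutoff=3.5):
    """Flag rows whose modified z-score exceeds `cutoff` on any feature.

    `features` is a list of equal-length numeric rows, one per component.
    The modified z-score is 0.6745 * |x - median| / MAD.
    """
    dims = len(features[0])
    flags = [False] * len(features)
    for k in range(dims):
        column = [row[k] for row in features]
        med = statistics.median(column)
        mad = statistics.median([abs(x - med) for x in column])
        if mad == 0:
            continue  # at least half the rows share one value; skip this feature
        for i, x in enumerate(column):
            if 0.6745 * abs(x - med) / mad > cutoff:
                flags[i] = True
    return flags
```

For a page where most components are text-sized, a much larger component is flagged while ordinary text components are not. Classifying the flagged components as stamps or logos, as the paper describes, would need a further step that this sketch does not attempt.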
TL;DR: A consensus-based clustering approach for document image segmentation is proposed; it is used iteratively with a classifier to label each primitive block, and experiments show that the dependency of classification performance on the training data is significantly reduced.
Abstract: Segmentation of a document image plays an important role in automatic document processing. In this paper, we propose a consensus-based clustering approach for document image segmentation. In this method, the foreground regions of a document image are grouped into a set of primitive blocks, and a set of features is extracted from them. Similarities among the blocks are computed on each feature using a hypothesis test-based similarity measure. Based on the consensus of these similarities, clustering is performed on the primitive blocks. This clustering approach is used iteratively with a classifier to label each primitive block. Experimental results show the effectiveness of the proposed method. It is further shown in the experimental results that the dependency of classification performance on the training data is significantly reduced.
9 citations
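The consensus step described in the abstract above (per-feature similarity from a hypothesis test, then clustering on the agreement of those similarities) could look roughly like the sketch below. None of the specifics come from the paper: the use of Welch's t statistic, the fixed critical value `t_crit`, the unanimous-vote default, and the connected-components grouping via union-find are assumptions chosen to keep the example small. Each block is represented here as a list of features, and each feature as a list of sample measurements.

```python
import math
import statistics


def welch_t(a, b):
    """Absolute Welch t statistic for two samples (each needs >= 2 values)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    diff = abs(statistics.mean(a) - statistics.mean(b))
    if se == 0.0:
        return 0.0 if diff == 0 else math.inf
    return diff / se


def consensus_clusters(blocks, t_crit=2.0, min_votes=None):
    """Group blocks that a sufficient number of features judge to be similar.

    blocks[i][k] is the sample list of feature k for block i.
    Returns one integer cluster label per block.
    """
    n, n_feat = len(blocks), len(blocks[0])
    if min_votes is None:
        min_votes = n_feat  # assumption: require all features to agree
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # A feature "votes" for similarity when the test does not reject.
            votes = sum(welch_t(blocks[i][k], blocks[j][k]) < t_crit
                        for k in range(n_feat))
            if votes >= min_votes:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

The iterative coupling with a classifier that the abstract mentions is not shown; this sketch covers only a single clustering pass.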
01 Nov 2017
TL;DR: A margin noise removal dataset MarNR, consisting of various document images with variation in layout and margin noises, is presented and four metrics of evaluation are defined using confusion matrices obtained experimentally over a labeled test dataset explicitly generated for evaluating the margin noise removal algorithms.
Abstract: Margin noise removal is an important step prior to segmentation and optical character recognition (OCR) of a page. Presence of this noise results in erroneous output by the segmentation algorithms and OCR systems. In this paper, we present a margin noise removal dataset, MarNR. A comparative study of four margin noise removal algorithms is also presented in this paper. For the purpose of evaluation, we have considered seven metrics. The metrics Hamming distance, noise ratio, and page content removal aim to evaluate a margin noise removal algorithm either on the quantity of noise removed or on the original content of the image retrieved. We also consider margin noise removal as a binary classification task, and four metrics of evaluation are defined using confusion matrices obtained experimentally over a labeled test dataset explicitly generated for evaluating the margin noise removal algorithms. The dataset consists of various document images with variation in layout and margin noises. The labeled dataset is also made public for comparative study of different margin noise removal algorithms.
4 citations
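Two of the evaluation ideas named in the abstract above can be illustrated on flattened binary pixel masks. The Hamming distance is simply the number of positions where two masks differ. The abstract does not name its four confusion-matrix metrics, so the accuracy, precision, recall, and F1 shown below are an assumption about what such metrics commonly are, and the function names are hypothetical. The noise ratio and page content removal metrics are not sketched because the abstract does not define them.

```python
def hamming_distance(mask_a, mask_b):
    """Count positions where two equal-length binary masks differ."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    return sum(a != b for a, b in zip(mask_a, mask_b))


def confusion_counts(predicted_noise, true_noise):
    """Return (tp, fp, fn, tn), treating 'noise' as the positive class."""
    tp = fp = fn = tn = 0
    for p, t in zip(predicted_noise, true_noise):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Common confusion-matrix metrics; zero denominators yield 0.0."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with a ground-truth mask `[1, 1, 1, 0, 0, 0, 0, 0]` and a predicted mask `[1, 1, 0, 1, 0, 0, 0, 0]`, the Hamming distance is 2 and the confusion counts are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives.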
Cited by
TL;DR: The paper performs two-tier classification to discriminate between stamps and non-stamps and then to classify stamps by shape, using an approach that is independent of document size and orientation.
Abstract: The paper addresses the problem of detecting and classifying rubber stamp instances in scanned documents. A variety of methods from the fields of image processing and pattern recognition, along with some heuristics, are utilized. The presented method works on typical stamps of different colors and shapes. For color images, a color space transformation is applied in order to find potential color stamps. Monochrome stamps are detected through shape-specific algorithms. Following the feature extraction stage, the identified candidates are classified using a set of shape descriptors. The selected elementary properties form an ensemble of features that is rotation, scale, and translation invariant; hence this approach is independent of document size and orientation. We perform two-tier classification in order to discriminate between stamps and non-stamps and then classify stamps in terms of their shape. The experiments, carried out on a considerable set of real documents gathered from the Internet, showed the high potential of the proposed method.
16 citations
01 Nov 2017
TL;DR: D-StaR can segment both overlapping and non-overlapping stamps, which was always a problem for existing systems in the literature, and outperforms the state-of-the-art methods for stamp segmentation.
Abstract: This paper presents a novel approach, named D-StaR, for stamp segmentation from scanned document images. The presented approach is generic (applicable to stamps of any color, shape, size, and orientation) and based on deep learning. In particular, it uses Fully Convolutional networks for semantic analysis of documents to extract stamps. The presented approach is evaluated on a publicly available stamp dataset. Evaluation results show that the presented approach outperforms the state-of-the-art methods for stamp segmentation and achieves pixel based precision and recall of 87% and 84%, respectively. Deeper analysis of the evaluation reveals that the presented approach can segment both overlapping and non-overlapping stamps, which was always a problem for existing systems in the literature.
14 citations
10 Nov 2017
TL;DR: This work uses fully supervised Deep CNN semantic segmentation to separate content layers from historical document images containing diverse content types, including handwriting, machine print, form lines, and stamps, using CNNs for semantic pixel labeling.
Abstract: Convolutional Neural Networks (CNNs) have produced excellent results in natural scene semantic pixel labeling tasks. We examine the application of this idea to document processing, using fully supervised Deep CNN semantic segmentation to separate content layers from historical document images containing diverse content types, including handwriting, machine print, form lines, and stamps. For efficiency, we employ a downsampling-upsampling network to make dense pixel predictions. CNNs achieve high generalization accuracy on document images with interleaved, overlapping strokes, even when trained on a solitary pixel-labeled document image. We also show a proof-of-concept extension of the semantic segmentation task to handwritten cursive character recognition, enabling a new "segmentation-free" approach to handwriting transcription.
13 citations
01 Sep 2019
TL;DR: Different evaluation metrics were used to gain an insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents.
Abstract: This paper presents an objective comparative evaluation of page analysis and recognition methods for historical documents with text mainly in Bengali language and script. It describes the competition rules, dataset, and evaluation methodology. Results are presented for five methods: three submitted, one re-run, and one open source state-of-the-art system. The focus is on optical character recognition (OCR) performance. Different evaluation metrics were used to gain an insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that deep learning approaches are promising, but there are still significant challenges for historic material of this nature.
12 citations
01 Dec 2015
TL;DR: This paper proposes a novel stamp and logo detection technique that can detect logos as well as chromatic and achromatic stamps and performs well at separating them from text.
Abstract: Stamps and logos are generally used for authenticating the source of a document. For automatic document processing, identification and segmentation of stamps and logos are essential. In the past, methods to detect stamps and logos were limited to specific shapes, colors, or training data. However, stamps and logos can be of any shape or color. In this paper, we propose a novel stamp and logo detection technique. Our approach is based on the fact that stamps and logos, in general, are not the primary contents of a document. This fact motivates us to propose an outlier detection technique for the same purpose in a feature space. Based on some geometric features, the detected outliers are classified as stamps and logos. Our method performs well at separating them from text. Moreover, this technique is capable of detecting logos as well as chromatic and achromatic stamps.
10 citations