scispace - formally typeset
Author

Saikat Roy

Other affiliations: University of Bonn
Bio: Saikat Roy is an academic researcher from Jadavpur University. The author has contributed to research in the topics of convolutional neural networks and deep learning, has an h-index of 4, and has co-authored 6 publications receiving 152 citations. Previous affiliations of Saikat Roy include the University of Bonn.

Papers
Journal ArticleDOI
TL;DR: A novel deep learning technique for the recognition of handwritten Bangla isolated compound characters is presented, and a new benchmark recognition accuracy on the CMATERdb 3.3.1.3 dataset is reported.

113 citations

Proceedings ArticleDOI
29 Jan 2018
TL;DR: The proposed region-based Deep Convolutional Neural Network framework for document structure learning achieves state-of-the-art accuracy of 92.21% on the popular RVL-CDIP document image dataset, exceeding the benchmarks set by the existing algorithms.
Abstract: In this article, a region-based Deep Convolutional Neural Network framework is presented for document structure learning. The contribution of this work involves efficient training of region-based classifiers and effective ensembling for document image classification. A primary level of ‘inter-domain’ transfer learning is used by exporting weights from a VGG16 architecture pre-trained on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region-based influence modelling, a secondary level of ‘intra-domain’ transfer learning is used for rapid training of deep learning models for image segments. Finally, stacked-generalization-based ensembling is utilized for combining the predictions of the base deep neural network models. The proposed method achieves state-of-the-art accuracy of 92.21% on the popular RVL-CDIP document image dataset, exceeding the benchmarks set by the existing algorithms.
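The two-level transfer-learning idea in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the tiny `DocCNN` stands in for the VGG16-initialised holistic model, and the "intra-domain" step is just copying its trained weights into a region-level classifier and freezing the shared convolutional features for rapid training.

```python
import torch
import torch.nn as nn

# Stand-in for the holistic document classifier ("inter-domain" transfer
# would initialise its features from VGG16/ImageNet; omitted here).
class DocCNN(nn.Module):
    def __init__(self, num_classes=16):  # RVL-CDIP has 16 classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

holistic = DocCNN()  # assumed already trained on whole document images
region = DocCNN()    # e.g. a classifier for the header region

# "Intra-domain" transfer: start the region model from the holistic weights,
# then freeze the shared convolutional features for rapid training.
region.load_state_dict(holistic.state_dict())
for p in region.features.parameters():
    p.requires_grad = False
```

Only the region model's linear classifier would then be fine-tuned on image segments, which is what makes the secondary training stage fast.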

60 citations

Posted Content
TL;DR: In this paper, a region-based deep convolutional neural network framework is proposed for document structure learning, which involves efficient training of region based classifiers and effective ensembling for document image classification.
Abstract: In this work, a region-based Deep Convolutional Neural Network framework is proposed for document structure learning. The contribution of this work involves efficient training of region-based classifiers and effective ensembling for document image classification. A primary level of `inter-domain' transfer learning is used by exporting weights from a VGG16 architecture pre-trained on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region-based influence modelling, a secondary level of `intra-domain' transfer learning is used for rapid training of deep learning models for image segments. Finally, stacked-generalization-based ensembling is utilized for combining the predictions of the base deep neural network models. The proposed method achieves state-of-the-art accuracy of 92.21% on the popular RVL-CDIP document image dataset, exceeding benchmarks set by existing algorithms.

24 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Results of the experiments show that the proposed strategy, involving a considerably smaller network architecture, can produce document classification accuracies comparable to the state-of-the-art architectures, making it more suitable for use in comparatively low-configuration mobile devices.
Abstract: This article presents our recent study of a lightweight Deep Convolutional Neural Network (DCNN) architecture for document image classification. Here, we concentrate on training a committee of generalized, compact and powerful base DCNNs. A support vector machine (SVM) is used to combine the outputs of the individual DCNNs. The main novelty of the present study is the introduction of supervised layerwise training of the DCNN architecture in document classification tasks for better initialization of the weights of individual DCNNs. Each DCNN of the committee is trained on a specific part of the document or on the whole document. We also use the principle of generalized stacking for combining the normalized outputs of all the members of the DCNN committee. The proposed document classification strategy has been tested on the well-known Tobacco3482 document image dataset. Results of our experiments show that the proposed strategy, involving a considerably smaller network architecture, can produce document classification accuracies comparable to the state-of-the-art architectures, making it more suitable for use in comparatively low-configuration mobile devices.
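The committee-plus-SVM combination described above is an instance of stacked generalization. A toy sketch, with the base DCNNs replaced by simulated softmax outputs (all numbers here are made up for illustration): the normalized class probabilities of each committee member are concatenated into one feature vector per document and an SVM meta-classifier is trained on top.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_classes, n_members = 200, 3, 4
y = rng.integers(0, n_classes, n_samples)

def member_probs():
    """Simulate one committee member's softmax outputs, weakly
    correlated with the true labels (a stand-in for a trained DCNN)."""
    logits = rng.normal(size=(n_samples, n_classes))
    logits[np.arange(n_samples), y] += 2.0  # better than chance
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Stacked generalization: concatenate all members' normalized outputs
# and train an SVM meta-classifier on them.
stacked = np.hstack([member_probs() for _ in range(n_members)])
svm = SVC(kernel="rbf").fit(stacked[:150], y[:150])
acc = svm.score(stacked[150:], y[150:])
```

In the paper's setting the meta-features would come from DCNNs trained on different document regions, so the SVM learns which member to trust for which class.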

14 citations

Proceedings ArticleDOI
01 Jan 2017
TL;DR: A scalable supervised prediction model based on convolutional regression framework that is particularly suitable for short time series data is proposed and various schemes to model social influence for health behavior change are proposed.
Abstract: Understanding the propagation of human health behaviors, such as smoking and obesity, and identifying the factors that control such phenomena has become an important area of research in recent years, mainly because in industrialized countries a substantial proportion of mortality and reduced quality of life is due to particular behavior patterns, and these behavior patterns are modifiable. Predicting which individuals will become overweight or obese in the future, as overweight and obesity propagate over a dynamic human interaction network, is an important problem in this area. However, the problem has received limited attention from the network analysis and machine learning perspective to date. In this work, we propose a scalable supervised prediction model based on a convolutional regression framework that is particularly suitable for short time series data. We propose various schemes to model social influence for health behavior change. Further, we study the contribution of the primary factors of overweight and obesity, such as unhealthy diet, recent weight gain and inactivity, to the prediction task. A thorough experiment shows the superiority of the proposed method over the state-of-the-art.
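A convolutional regressor for short time series can be sketched as follows. This is an illustrative PyTorch stand-in, not the paper's actual model: the series length, channel counts and single regression target are all assumptions, but the shape of the idea — 1-D convolutions over a short per-individual history feeding a regression head — is the one the abstract describes.

```python
import torch
import torch.nn as nn

# 1-D convolutional regression over a short behavioral time series
# (e.g. a few yearly weight/BMI observations per individual).
model = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # pool over time: robust to short series
    nn.Flatten(),
    nn.Linear(8, 1),          # regression output, e.g. future BMI
)

# A batch of 4 individuals, 1 channel, 6 time steps each.
pred = model(torch.zeros(4, 1, 6))
```

Because the convolutions share weights across time steps and the pooling collapses whatever length remains, the same network handles the very short histories that defeat most sequence models.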

5 citations


Cited by
Proceedings ArticleDOI
23 Aug 2020
TL;DR: The LayoutLM is proposed to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents.
Abstract: Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at https://aka.ms/layoutlm.
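The way LayoutLM injects layout into a text model can be sketched conceptually: each token embedding is summed with embeddings of its bounding-box coordinates (x0, y0, x1, y1). The sketch below is a pure-PyTorch illustration with made-up dimensions, not the released LayoutLM code; coordinates are assumed to be normalized to a 0–1000 grid as in the paper.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Token embedding + 2-D position embeddings from the word's
    bounding box, summed as in LayoutLM's input representation."""
    def __init__(self, vocab=1000, coord_buckets=1001, dim=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.x = nn.Embedding(coord_buckets, dim)  # shared for x0 and x1
        self.y = nn.Embedding(coord_buckets, dim)  # shared for y0 and y1

    def forward(self, ids, bbox):  # bbox: (B, T, 4), values in [0, 1000]
        return (self.tok(ids)
                + self.x(bbox[..., 0]) + self.y(bbox[..., 1])
                + self.x(bbox[..., 2]) + self.y(bbox[..., 3]))

emb = LayoutEmbedding()
ids = torch.randint(0, 1000, (2, 5))
bbox = torch.randint(0, 1001, (2, 5, 4))
out = emb(ids, bbox)
```

The summed embeddings then feed a standard Transformer encoder, which is what lets pre-training see text and layout jointly.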

388 citations

Journal Article
TL;DR: This paper surveys current topics in document image understanding from a technical point of view, covering the methods and approaches proposed for the recognition of various kinds of documents.
Abstract: The subject of document image understanding is to extract and classify individual data meaningfully from paper-based documents. To date, many methods and approaches have been proposed for the recognition of various kinds of documents, for various technical problems in extending OCR, and for the requirements of practical usage. Although the technical research issues of the early stage can be seen as complementary attacks on the problems of traditional OCR, which depends on character recognition techniques, the application ranges and related issues are being widely investigated and should be established progressively. This paper addresses current topics in document image understanding from a technical point of view, as a survey.
Keywords: document model, top-down, bottom-up, layout structure, logical structure, document types, layout recognition

222 citations

Book ChapterDOI
09 Sep 2019
TL;DR: A saliency-based fully-convolutional neural network performing multi-scale reasoning on visual cues followed by a fully-connected conditional random field (CRF) for localizing tables and charts in digital/digitized documents is proposed.
Abstract: Within the realm of information extraction from documents, detection of tables and charts is particularly needed as they contain a visual summary of the most valuable information contained in a document. For a complete automation of the visual information extraction process from tables and charts, it is necessary to develop techniques that localize them and identify precisely their boundaries. In this paper we aim at solving the table/chart detection task through an approach that combines deep convolutional neural networks, graphical models and saliency concepts. In particular, we propose a saliency-based fully-convolutional neural network performing multi-scale reasoning on visual cues followed by a fully-connected conditional random field (CRF) for localizing tables and charts in digital/digitized documents. Performance analysis, carried out on an extended version of the ICDAR 2013 (with annotated charts as well as tables) dataset, shows that our approach yields promising results, outperforming existing models.

100 citations

Journal ArticleDOI
TL;DR: In the present work, a non-explicit feature based approach, more specifically a multi-column multi-scale convolutional neural network (MMCNN) based architecture, has been proposed for this purpose, and a deep quad-tree based staggered prediction model has been proposed for faster character recognition.

88 citations

Book ChapterDOI
16 Sep 2019
TL;DR: A multimodal neural network is designed that learns from word embeddings, computed on text extracted by OCR, and from the image, boosting pure-image accuracy by 3% on Tobacco3482 and on RVL-CDIP augmented by the new QS-OCR text dataset, even without clean text information.
Abstract: Classification of document images is a critical step for accelerating archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, the fine-grained classification that is required in real-world settings cannot be achieved by visual analysis alone. Often, the relevant information is in the actual text content of the document, although this text is not available in digital form. In this work, we introduce a novel pipeline based on off-the-shelf architectures to deal with document classification by taking into account both text and visual information. We design a multimodal neural network that is able to learn both from the image and from word embeddings, computed on noisy text extracted by OCR. We show that this approach allows us to improve single-modality classification accuracy by several points on the small Tobacco3482 and large RVL-CDIP datasets, even without clean text information. We release a post-OCR text classification dataset (https://github.com/Quicksign/ocrized-text-dataset) that complements the Tobacco3482 and RVL-CDIP ones, to encourage researchers to look into multi-modal text/image classification.
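The fusion idea in this abstract — an image branch and a text branch joined before the classifier — can be sketched as follows. This is an illustrative PyTorch sketch with invented dimensions, not the authors' pipeline: the averaged word embeddings stand in for whatever text encoder is used on the noisy OCR output, and the small CNN stands in for the image backbone.

```python
import torch
import torch.nn as nn

class MultimodalDoc(nn.Module):
    """Late fusion of an image branch and an OCR-text branch
    (mean of word embeddings) before a shared classifier head."""
    def __init__(self, vocab=1000, emb=64, img_feat=128, num_classes=10):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, emb)  # default mode: mean
        self.image = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 16, img_feat), nn.ReLU(),
        )
        self.head = nn.Linear(emb + img_feat, num_classes)

    def forward(self, image, token_ids):
        # Concatenate visual and textual features, then classify.
        fused = torch.cat([self.image(image), self.embed(token_ids)], dim=1)
        return self.head(fused)

model = MultimodalDoc()
image = torch.zeros(2, 1, 32, 32)            # batch of 2 page crops
tokens = torch.randint(0, 1000, (2, 20))     # 20 OCR token ids per page
logits = model(image, tokens)
```

Because the text branch only sees token ids, garbled OCR degrades it gracefully rather than breaking the pipeline, which is consistent with the paper's "even without clean text" finding.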

77 citations