scispace - formally typeset
Author

Sudip Das

Bio: Sudip Das is an academic researcher from the Indian Statistical Institute. The author has contributed to research in the topics of computer science and pedestrian detection, has an h-index of 3, and has co-authored 10 publications receiving 22 citations.

Papers
Proceedings Article
01 Jan 2019
TL;DR: This work proposes a novel deep learning framework, called ClueNet, to detect occluded pedestrians and estimate their entire pose in an unsupervised manner; experimental results on the CityPersons and MS COCO datasets show superior performance over existing methods.
Abstract: Pose estimation of a pedestrian helps to gather information about the current activity or the instant behaviour of the subject. Such information is useful for autonomous vehicles, augmented reality, video surveillance, etc. Although a large volume of pedestrian detection studies is available in the literature, detection under significant occlusion still remains a challenging task. In this work, we take a step further and propose a novel deep learning framework, called ClueNet, to detect occluded pedestrians as well as estimate their entire pose in an unsupervised manner. ClueNet is a two-stage framework where the first stage generates visual clues for the second stage to accurately estimate the pose of occluded pedestrians. The first stage employs a multi-task network to segment the visible parts and predict a bounding box enclosing the visible and occluded regions of each pedestrian. The second stage uses these predictions from the first stage for pose estimation. Here we propose a novel training strategy, called Mask and Predict, to train ClueNet to estimate the pose even for occluded regions. Additionally, we make use of various other training strategies to further improve our results. The proposed work is the first of its kind, and the experimental results on the CityPersons and MS COCO datasets show the superior performance of our approach over existing methods.
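The Mask and Predict idea described above can be sketched outside any deep-learning framework: hide a random fraction of a pedestrian's annotated keypoints and treat the hidden ones as prediction targets, simulating occlusion during training. A minimal illustrative sketch — the function name, mask ratio, and keypoint format are hypothetical, not the authors' implementation:

```python
import random

def mask_and_predict_targets(keypoints, mask_ratio=0.3, rng=None):
    """Randomly hide a fraction of visible keypoints; the hidden ones
    become prediction targets, mimicking occlusion during training."""
    rng = rng or random.Random(0)
    n_mask = max(1, int(len(keypoints) * mask_ratio))
    masked_idx = set(rng.sample(range(len(keypoints)), n_mask))
    visible = [kp for i, kp in enumerate(keypoints) if i not in masked_idx]
    targets = [kp for i, kp in enumerate(keypoints) if i in masked_idx]
    return visible, targets
```

In a real pipeline the network would receive only `visible` as input and be penalized on its predictions for `targets`, so it learns to infer poses for parts it cannot see.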

17 citations

Posted Content
TL;DR: In this paper, the authors proposed an end-to-end multimodal fusion model for pedestrian detection using RGB and thermal images, which consists of two distinct deformable ResNeXt-50 encoders for feature extraction from the two modalities.
Abstract: Pedestrian detection is the most critical module of an autonomous driving system. Although a camera is commonly used for this purpose, its quality degrades severely in low-light night-time driving scenarios. On the other hand, the quality of a thermal camera image remains unaffected in similar conditions. This paper proposes an end-to-end multimodal fusion model for pedestrian detection using RGB and thermal images. Its novel spatio-contextual deep network architecture is capable of exploiting the multimodal input efficiently. It consists of two distinct deformable ResNeXt-50 encoders for feature extraction from the two modalities. Fusion of these two encoded features takes place inside a multimodal feature embedding module (MuFEm) consisting of several groups, each a pair of a Graph Attention Network and a feature fusion unit. The output of the last feature fusion unit of MuFEm is subsequently passed to two CRFs for spatial refinement. Further enhancement of the features is achieved by applying channel-wise attention and extracting contextual information with the help of four RNNs traversing in four different directions. Finally, these feature maps are used by a single-stage decoder to generate the bounding box of each pedestrian and the score map. We have performed extensive experiments with the proposed framework on three publicly available multimodal pedestrian detection benchmark datasets, namely KAIST, CVC-14, and UTokyo. The results on each of them improved the respective state-of-the-art performance. A short video giving an overview of this work along with its qualitative results can be seen at this https URL.
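The modality-fusion step can be illustrated at a toy scale: combine an RGB feature vector and a thermal feature vector with softmax attention weights. This is a hedged sketch of attention-weighted fusion in general, not the paper's MuFEm or its Graph Attention Network; the function name and the magnitude-based scoring rule are assumptions:

```python
import numpy as np

def attention_fuse(rgb_feat, thermal_feat):
    """Fuse two modality feature vectors with softmax attention weights
    derived from each modality's mean activation (illustrative only)."""
    stacked = np.stack([rgb_feat, thermal_feat])     # (2, d)
    scores = stacked.mean(axis=1)                    # one scalar score per modality
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over modalities
    return weights @ stacked                         # weighted sum -> (d,)
```

A learned fusion module would compute the scores from trainable parameters rather than raw activations, but the weighted-sum structure is the same.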

10 citations

Proceedings ArticleDOI
01 Oct 2020
TL;DR: A novel multi-task framework for end-to-end training towards entire-pose estimation of pedestrians, even under any kind of occlusion, which outperforms state-of-the-art results for pose estimation, instance segmentation and pedestrian detection in cases of heavy occlusion.
Abstract: Pose estimation in the wild is a challenging problem, particularly in situations of (i) occlusions of varying degrees, and (ii) crowded outdoor scenes. Most of the existing studies of pose estimation did not report performance in such situations. Moreover, pose annotations for occluded parts of human figures are not provided in any of the relevant standard datasets, which in turn makes it difficult to study entire-pose estimation for occluded humans. Well-known pedestrian detection datasets such as CityPersons contain samples of outdoor scenes but do not include pose annotations. Here we propose a novel multi-task framework for end-to-end training towards the entire pose estimation of pedestrians, even in situations of any kind of occlusion. To tackle this problem, we make use of a pose estimation dataset, MS-COCO, and employ unsupervised adversarial instance-level domain adaptation for estimating the entire pose of occluded pedestrians. The experimental studies show that the proposed framework outperforms the state-of-the-art results for pose estimation, instance segmentation and pedestrian detection in cases of heavy occlusions (HO) and reasonable + heavy occlusions (R+HO) on the two benchmark datasets.

7 citations

Proceedings ArticleDOI
01 Oct 2020
TL;DR: In this paper, a deep architecture consisting of a novel Feature Representation Block (FRB) capable of efficient abstraction of the input information is proposed for automatic detection of scene texts in the wild.
Abstract: Automatic detection of scene texts in the wild is a challenging problem, particularly due to the difficulties in handling (i) occlusions of varying percentages, (ii) widely different scales and orientations, and (iii) severe degradation in image quality. Here, we propose a deep architecture consisting of a novel Feature Representation Block (FRB) capable of efficient abstraction of the input information. We also employ curriculum learning, ordering image samples by difficulty and gradually increasing pixel-wise blurring. Our framework is capable of detecting texts of different scales and orientations, and can tackle blurring from multiple possible sources, non-uniform illumination, as well as partial occlusions of varying percentages. Text detection performance of the proposed framework on various benchmark datasets including ICDAR 2015, ICDAR 2017 MLT, MSRA-TD500 and COCO-Text significantly improves the respective state-of-the-art results.
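The curriculum of gradually increasing pixel-wise blurring can be sketched as a simple schedule that ramps the Gaussian-blur strength over training epochs, so the detector sees clean samples first and progressively harder ones later. A hypothetical sketch — the linear ramp and `max_sigma` value are assumptions, not taken from the paper:

```python
def blur_sigma(epoch, total_epochs, max_sigma=3.0):
    """Curriculum schedule: no blur at the start of training,
    linearly ramping to max_sigma by the final epoch."""
    return max_sigma * min(1.0, epoch / max(1, total_epochs - 1))
```

The returned sigma would then parameterize a Gaussian blur applied to each training image for that epoch.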

6 citations

Proceedings ArticleDOI
10 Jan 2021
TL;DR: A novel deep generative architecture, termed CardioGAN, based on a generative adversarial network and powered by an effective attention mechanism, is designed to learn the intricate inter-dependencies among the various parts of real samples, leading to the generation of more realistic electrocardiogram signals.
Abstract: The electrocardiogram (ECG) signal is studied to obtain crucial information about the condition of a patient's heart. Machine learning based automated medical diagnostic systems that may help to evaluate the condition of the heart from this signal need to be trained using large volumes of labelled samples, which may increase the chance of compromising patients' privacy. To address this issue, generation of synthetic electrocardiogram signals by learning only from the general distributions of the available real training samples has been attempted in the literature. However, these studies did not pay the necessary attention to the specific vital details of these signals, such as the P wave, the QRS complex, and the T wave. This shortcoming often results in the generation of unrealistic synthetic signals, such as a signal that does not contain one or more of the above components. In the present study, a novel deep generative architecture, termed CardioGAN, based on a generative adversarial network and powered by an effective attention mechanism, has been designed. It is capable of learning the intricate inter-dependencies among the various parts of real samples, leading to the generation of more realistic electrocardiogram signals, and it helps in reducing the risk of breaching the privacy of patients. Extensive experimentation establishes that the proposed method achieves better performance in generating synthetic electrocardiogram signals than the existing methods.
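The role of attention here can be illustrated with plain scaled dot-product self-attention, which lets every position in a sequence mix in information from every other position, e.g. relating the P wave, QRS complex, and T wave of one beat. A generic sketch of the mechanism, not CardioGAN's actual architecture:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x.
    Each output row is a convex combination of all input rows, so distant
    parts of a signal can influence one another."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (n, n) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                               # weighted mix per position
```

In a generator, learned query/key/value projections would replace the raw `x` on each side, but the softmax-weighted mixing is the core of the mechanism.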

4 citations


Cited by
Journal ArticleDOI
TL;DR: An extensive review of single-spectral pedestrian detection, covering handcrafted-features based methods and deep-features based approaches, is presented; it is found that handcrafted features with large degrees of freedom in shape and space perform better.
Abstract: Pedestrian detection is an important but challenging problem in computer vision, especially in human-centric tasks. Over the past decade, significant improvement has been witnessed with the help of handcrafted features and deep features. Here we present a comprehensive survey on recent advances in pedestrian detection. First, we provide a detailed review of single-spectral pedestrian detection that includes handcrafted features based methods and deep features based approaches. For handcrafted features based methods, we present an extensive review of approaches and find that handcrafted features with large degrees of freedom in shape and space perform better. In the case of deep features based approaches, we split them into pure CNN based methods and those employing both handcrafted and CNN based features. We give the statistical analysis and tendency of these methods, where feature-enhanced, part-aware, and post-processing methods have attracted the most attention. In addition to single-spectral pedestrian detection, we also review multi-spectral pedestrian detection, which provides features that are more robust to illumination variation. Furthermore, we introduce some related datasets and evaluation metrics, and a deep experimental analysis. We conclude this survey by emphasizing open problems that need to be addressed and highlighting various future directions. Researchers can track an up-to-date list at \url{https://github.com/JialeCao001/PedSurvey}.

46 citations

Proceedings Article
01 Jan 2005
TL;DR: A two-step detection/tracking method is proposed: detection by a support vector machine with size-normalized pedestrian candidates, and tracking by a combination of Kalman filter prediction and mean-shift tracking, to handle the nonrigid nature of human appearance on the road.
Abstract: This paper presents a method for pedestrian detection and tracking using a single night-vision video camera installed on the vehicle. To deal with the nonrigid nature of human appearance on the road, a two-step detection/tracking method is proposed. The detection phase is performed by a support vector machine (SVM) with size-normalized pedestrian candidates and the tracking phase is a combination of Kalman filter prediction and mean shift tracking. The detection phase is further strengthened by information obtained by a road-detection module that provides key information for pedestrian validation. Experimental comparisons (e.g., grayscale SVM recognition versus binary SVM recognition and entire-body detection versus upper-body detection) have been carried out to illustrate the feasibility of our approach.
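The Kalman-filter prediction step used in the tracking phase can be shown concretely for a constant-velocity motion model. A minimal sketch — the state layout, `dt`, and process-noise scale `q` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def kalman_predict(state, P, dt=1.0, q=1e-2):
    """One prediction step of a constant-velocity Kalman filter for a
    tracked pedestrian: state = [x, y, vx, vy], P = state covariance."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    Q = q * np.eye(4)                   # simple isotropic process noise
    return F @ state, F @ P @ F.T + Q   # predicted state and covariance
```

The predicted position would seed the mean-shift search window, and a subsequent measurement-update step would correct the state with the detector's output.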

19 citations

Journal ArticleDOI
26 Nov 2020-Symmetry
TL;DR: This survey introduces the history and progress of scene text detection, classifies traditional methods and deep learning-based methods in detail, and points out the corresponding key issues and techniques.
Abstract: Scene text detection is attracting more and more attention and has become an important topic in machine vision research. With the development of mobile IoT (Internet of things) and deep learning technology, text detection research has made significant progress. This survey aims to summarize and analyze the main challenges and significant progress in scene text detection research. In this paper, we first introduce the history and progress of scene text detection and classify the traditional methods and deep learning-based methods in detail, pointing out the corresponding key issues and techniques. Then, we introduce commonly used benchmark datasets and evaluation protocols and identify state-of-the-art algorithms by comparison. Finally, we summarize and predict potential future research directions.

11 citations
