Author

Shubham Choudhary

Bio: Shubham Choudhary is an academic researcher from the Indian Institute of Technology Delhi. The author has contributed to research in topics: Visualization & Deep learning. The author has an h-index of 1 and has co-authored 2 publications receiving 4 citations.

Papers
Proceedings ArticleDOI
01 Sep 2019
TL;DR: The convolutional neural network is found to perform better when trained with the assistance of fixation information than when trained without eye fixations.
Abstract: This paper is concerned with the development of techniques for the recognition of ornamental characters motivated by the perceptual processes involved in humans. To understand the perceptual process, we have performed the eye-tracking experiment to recognize the special set of characters, with artistic variations in character structure and form. The novelty of this paper is the use of human visual fixations to supervise the intermediate layers of the convolutional neural network. From the results obtained, we found that the network performs better when trained with the assistance of fixation information compared to the network trained without eye fixations.
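The paper's central idea, supervising an intermediate CNN layer with human fixation maps, could be sketched as an auxiliary loss term. The normalization scheme, the MSE comparison, and the weighting factor `lam` below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def fixation_supervision_loss(activation, fixation_map, eps=1e-8):
    """Auxiliary loss pulling an intermediate activation map toward a
    human fixation map. Both inputs are (H, W) arrays; each is
    normalized to sum to 1 before comparison (an assumption)."""
    act = activation / (activation.sum() + eps)
    fix = fixation_map / (fixation_map.sum() + eps)
    return float(np.mean((act - fix) ** 2))  # MSE between the two distributions

def total_loss(classification_loss, activation, fixation_map, lam=0.1):
    """Combined objective: task loss plus a weighted fixation term.
    lam is a hypothetical trade-off weight."""
    return classification_loss + lam * fixation_supervision_loss(activation, fixation_map)
```

When the activation map already matches the fixation map, the auxiliary term vanishes and only the classification loss remains.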

4 citations

Journal ArticleDOI
TL;DR: In this article, a gradient-based method for visualizing the reasoning behind a model's decision through visualization maps is proposed and shown to outperform other class activation mapping (CAM) methods.
Abstract: Deep neural networks (DNNs) currently constitute the best-performing artificial vision systems. However, humans are still better at recognizing many characters, especially distorted, ornamental, or calligraphic characters, than even highly sophisticated recognition models. Understanding the mechanism of character recognition by humans may offer cues for building better recognition models, yet the appropriate methodological approach to using these cues has not been much explored. This paper therefore tries to understand the process of character recognition by humans and DNNs by generating visual explanations for their respective decisions. We have used eye-tracking to assay the spatial distribution of information hotspots for humans via fixation maps. We have proposed a gradient-based method for visualizing the reasoning behind the model's decision through visualization maps and have shown that our method is better than other class activation mapping (CAM) methods. Qualitative comparison between visualization maps and fixation maps reveals that both the model and humans focus on similar regions in a character in the case of correctly classified characters, whereas when the focused regions differ between humans and the model, the characters are typically misclassified by the latter. Hence, we propose to use the fixation maps as a supervisory input to train the model, which ultimately results in improved recognition performance and better generalization. As the proposed model gives some insight into the reasoning behind its decisions, it can find applications in fields such as surveillance and medicine, where explainability helps to determine system fidelity.
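The paper's specific gradient-based visualization method is not reproduced in the abstract; as a rough illustration of this family of techniques, a generic class-activation-style map weights each convolutional channel by its spatially averaged gradient:

```python
import numpy as np

def gradient_weighted_map(feature_maps, gradients):
    """Generic gradient-weighted visualization map (a CAM-style sketch,
    not the paper's exact method).

    feature_maps: (K, H, W) activations of a convolutional layer.
    gradients:    (K, H, W) gradients of the class score w.r.t. them.
    Each channel is weighted by its spatially averaged gradient, the
    weighted maps are summed, and negative evidence is clipped.
    """
    weights = gradients.mean(axis=(1, 2))              # (K,) channel importances
    cam = np.tensordot(weights, feature_maps, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    if cam.max() > 0:
        cam /= cam.max()                               # normalize to [0, 1]
    return cam
```

The resulting map can then be compared region-by-region against a human fixation map, as the paper's qualitative analysis does.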
Journal ArticleDOI
TL;DR: In this paper , the authors studied functional brain networks, formed using EEG responses, corresponding to four types of visual stimuli with varying contrast relationships: positive faces, chimeric faces, photo-negated faces and only eyes.
Abstract: How humans recognise faces and objects effortlessly has become a great point of interest. One approach to understanding the underlying process is to study facial features, in particular the ordinal contrast relations around the eye region, which play a crucial role in face recognition and perception. Recently, graph-theoretic approaches to electroencephalogram (EEG) analysis have been found effective in understanding the underlying processes of the human brain while it performs various tasks. We have explored this approach in face recognition and perception to assess the importance of contrast features around the eye region. We studied functional brain networks, formed using EEG responses, corresponding to four types of visual stimuli with varying contrast relationships: positive faces, chimeric faces (photo-negated faces preserving the polarity of contrast relationships around the eyes), photo-negated faces, and only eyes. We observed the variations in the brain networks for each type of stimulus by finding the distribution of graph distances across the brain networks of all subjects. Moreover, our statistical analysis shows that positive and chimeric faces are equally easy to recognise, in contrast to the difficult recognition of negative faces and only eyes.
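The graph-theoretic comparison the abstract describes could be illustrated roughly as follows. A correlation-thresholded adjacency matrix and a Frobenius edge-difference distance are common choices in this literature and are used here as assumptions, not as the paper's exact pipeline:

```python
import numpy as np

def functional_network(eeg, threshold=0.5):
    """Build a functional brain network from EEG.

    eeg: (channels, samples) array. Edges connect channel pairs whose
    absolute Pearson correlation exceeds the threshold (an assumed
    construction). Returns a binary (channels, channels) adjacency matrix.
    """
    corr = np.corrcoef(eeg)
    adj = (np.abs(corr) > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def graph_distance(adj_a, adj_b):
    """Frobenius-norm distance between two adjacency matrices:
    larger values mean the two networks differ in more edges."""
    return float(np.linalg.norm(adj_a - adj_b))
```

Computing this distance across all subject pairs, per stimulus type, yields the distributions the paper compares statistically.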
Posted Content
TL;DR: In this paper, the congruence of the information-gathering strategies of humans and deep neural networks is examined in a character recognition task, and the visual fixation maps obtained from an eye-tracking experiment are used as a supervisory input to align the model's focus on relevant character regions.
Abstract: Human observers engage in selective information uptake when classifying visual patterns. The same is true of deep neural networks, which currently constitute the best-performing artificial vision systems. Our goal is to examine the congruence, or lack thereof, in the information-gathering strategies of the two systems. We have operationalized our investigation as a character recognition task. We have used eye-tracking to assay the spatial distribution of information hotspots for humans via fixation maps, and an activation mapping technique to obtain analogous distributions for deep networks through visualization maps. Qualitative comparison between visualization maps and fixation maps reveals an interesting correlate of congruence: in the case of correctly classified characters, the deep learning model attends to regions similar to those humans fixated, whereas when the focused regions differ between humans and deep nets, the characters are typically misclassified by the latter. Hence, we propose to use the visual fixation maps obtained from the eye-tracking experiment as a supervisory input to align the model's focus on relevant character regions. We find that such supervision improves the model's performance significantly and does not require any additional parameters. This approach has the potential to find applications in diverse domains such as medical analysis and surveillance, in which explainability helps to determine system fidelity.

Cited by
Posted Content
TL;DR: In this paper, an efficient video summarization framework is proposed that gives a gist of the entire video in a few key-frames or video skims and relies on the cognitive judgments of human beings.
Abstract: This paper proposes an efficient video summarization framework that gives a gist of the entire video in a few key-frames or video skims. Existing video summarization frameworks are based on algorithms that utilize low-level computer vision feature extraction or high-level domain-specific extraction. However, humans, the ultimate users of the summarized video, remain the most neglected aspect. Therefore, this paper considers humans' role in summarization and introduces human visual attention-based summarization techniques. To understand human attention behavior, we have designed and performed experiments with human participants using electroencephalogram (EEG) and eye-tracking technology. The EEG and eye-tracking data obtained from the experiments are processed simultaneously and used to segment frames containing useful information from a considerable video volume. Thus, the frame segmentation primarily relies on the cognitive judgments of human beings. Using our approach, a video is summarized by 96.5% while maintaining high precision and high recall. The comparison with state-of-the-art techniques demonstrates that the proposed approach yields ceiling-level performance with reduced computational cost in summarizing videos.
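Once the EEG and eye-tracking signals have been fused into a single per-frame attention score, the selection step could be sketched as keeping only the highest-scoring frames. The fused score, the top-fraction selection rule, and the `keep_ratio` value below are illustrative assumptions; only the ~96.5% reduction figure comes from the paper:

```python
import numpy as np

def summarize(frame_scores, keep_ratio=0.035):
    """Select key-frames from a per-frame attention signal.

    frame_scores: 1-D array of combined (EEG + eye-tracking) attention
    scores, one per frame (a hypothetical fused signal). The top
    keep_ratio fraction of frames is kept; keep_ratio=0.035 mirrors
    a ~96.5% reduction of the video volume.
    Returns the sorted indices of the selected frames.
    """
    n_keep = max(1, int(round(len(frame_scores) * keep_ratio)))
    idx = np.argsort(frame_scores)[-n_keep:]  # highest-attention frames
    return np.sort(idx)                       # restore temporal order
```

Restoring temporal order after selection matters so the key-frames still read as a coherent skim of the video.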

1 citation

Proceedings ArticleDOI
10 Jan 2021
TL;DR: This paper proposes a collaborative human and machine attention module that considers both visual and network attention and can be integrated with any convolutional neural network (CNN) model.
Abstract: Deep learning models that include attention mechanisms are shown to enhance the performance and efficiency of various computer vision tasks such as pattern recognition, object detection, and face recognition. Although the visual attention mechanism is the source of inspiration for these models, recent attention models treat 'attention' as a pure machine vision optimization problem, and visual attention remains the most neglected aspect. Therefore, this paper presents a collaborative human and machine attention module which considers both visual and network attention. The proposed module is inspired by the dorsal ('where') pathway of visual processing and can be integrated with any convolutional neural network (CNN) model. First, the module computes a spatial attention map from the input feature maps, which is then combined with the visual attention maps. The visual attention maps are created using eye fixations obtained from an eye-tracking experiment with human participants. The visual attention map covers the highly salient and discriminating image regions, as humans tend to focus on such regions, whereas the other relevant image regions are processed by the spatial attention map. The combination of these two maps produces a finer refinement of the feature maps, resulting in improved performance. The comparative analysis reveals that our model not only shows significant improvement over the baseline model but also outperforms the other models. We hope that our findings using a collaborative human-machine attention module will be helpful in other computer vision tasks as well.

1 citation

Journal ArticleDOI
TL;DR: In this article, the results of a free-viewing gaze fixation study conducted on 3904 freehand sketches distributed across 160 object categories are analyzed, showing that fixation sequences exhibit marked consistency within a sketch, across sketches of a category, and even across suitably grouped sets of categories.
Abstract: The study of eye gaze fixations on photographic images is an active research area. In contrast, the image subcategory of freehand sketches has not received as much attention for such studies. In this paper, we analyze the results of a free-viewing gaze fixation study conducted on 3904 freehand sketches distributed across 160 object categories. Our analysis shows that fixation sequences exhibit marked consistency within a sketch, across sketches of a category and even across suitably grouped sets of categories. This multi-level consistency is remarkable given the variability in depiction and extreme image content sparsity that characterizes hand-drawn object sketches. In our paper, we show that the multi-level consistency in the fixation data can be exploited to (a) predict a test sketch's category given only its fixation sequence and (b) build a computational model which predicts part-labels underlying fixations on objects. We hope that our findings motivate the community to deem sketch-like representations worthy of gaze-based studies vis-a-vis photographic images.
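The paper's first application, predicting a sketch's category from its fixation sequence alone, could be illustrated with a nearest-prototype classifier over fixed-length fixation sequences. Both the prototype representation and the distance are assumptions standing in for the paper's actual model:

```python
import numpy as np

def predict_category(fixations, category_prototypes):
    """Predict a sketch's category from its fixation sequence alone.

    fixations: (T, 2) array of gaze (x, y) points.
    category_prototypes: dict mapping category name -> (T, 2) mean
    fixation sequence for that category (a hypothetical representation).
    Uses nearest-prototype matching under mean per-fixation Euclidean
    distance, an illustrative stand-in for the paper's predictor.
    """
    def dist(a, b):
        return float(np.linalg.norm(a - b, axis=1).mean())
    return min(category_prototypes, key=lambda c: dist(fixations, category_prototypes[c]))
```

Such a classifier only works because of the multi-level consistency the paper reports: fixation sequences within a category cluster tightly enough for their mean to be discriminative.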