Proceedings ArticleDOI

Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-free Localization

01 Mar 2020-pp 983-991
TL;DR: This approach, Ablation-based Class Activation Mapping (Ablation-CAM), uses ablation analysis to determine the importance of individual feature map units w.r.t. a class and produces a coarse localization map highlighting the regions of the image that are important for predicting the concept.
Abstract: In response to recent criticism of gradient-based visualization techniques, we propose a new methodology to generate visual explanations for deep Convolutional Neural Network (CNN)-based models. Our approach, Ablation-based Class Activation Mapping (Ablation-CAM), uses ablation analysis to determine the importance (weights) of individual feature map units w.r.t. a class. This is then used to produce a coarse localization map highlighting the regions of the image that are important for predicting the concept. Our objective and subjective evaluations show that this gradient-free approach works better than the state-of-the-art Grad-CAM technique. Moreover, further experiments show that Ablation-CAM is class discriminative and can also be used to evaluate trust in a model.
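A minimal sketch of the ablation-importance idea described in the abstract, assuming a PyTorch model whose last convolutional activations are already available (for example via a forward hook) and a `head` module that maps those activations to class logits; function and argument names are illustrative, not the authors' code, and the relative-drop weighting is one common formulation of the ablation weight.

```python
import torch
import torch.nn.functional as F

def ablation_cam(head, activations, class_idx):
    """Ablation-CAM sketch: each feature map's weight is the relative drop in
    the class score when that map is zeroed out, so no gradients are needed.

    head:        module mapping conv activations (1, K, H, W) to class logits (1, C)
    activations: precomputed conv activations for one image, shape (1, K, H, W)
    class_idx:   index of the target class
    """
    with torch.no_grad():
        base = head(activations)[0, class_idx]              # y_c with all maps intact
        k = activations.shape[1]
        weights = activations.new_zeros(k)
        for i in range(k):                                   # ablate one feature map at a time
            ablated = activations.clone()
            ablated[:, i] = 0.0
            score = head(ablated)[0, class_idx]              # y_c with map i removed
            weights[i] = (base - score) / (base + 1e-8)      # relative drop = importance
        cam = F.relu((weights.view(1, -1, 1, 1) * activations).sum(dim=1))
        cam = cam / (cam.max() + 1e-8)                       # normalize for visualization
    return cam[0]                                            # (H, W) coarse localization map
```

For a torchvision ResNet, for instance, `activations` could be captured from `layer4` with a forward hook and `head` assembled from the network's average pooling and fully connected layer.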


Citations
Journal ArticleDOI
TL;DR: Zhang et al. propose a new CAM method, based on granular computing theory, that divides the highlighted areas into commonality and specificity saliency maps for multi-granularity visualization.
Abstract: The interpretability of convolutional neural networks (CNNs) is attracting increasing attention. Class activation maps (CAM) intuitively explain the classification mechanisms of CNNs by highlighting important areas. However, as coarse-grained explanations, classical CAM methods are incapable of explaining the classification mechanism in detail. Inspired by granular computing theory, we propose a new CAM method that divides the highlighted areas into commonality saliency maps and specificity saliency maps for multi-granularity visualization. This method consists of three components. First, the universe is simplified to contain only a category pair. Then, neighborhood rough sets are used to divide the universe into three disjoint regions containing the commonality and specificity of the category pair, using adaptive thresholds at the optimal granularity. Finally, these three regions are used to generate multi-granularity saliency maps. This method effectively visualizes the multi-granularity classification mechanism of the CNN and further explains misclassification. We compare this method with five representative CAM methods using two newly proposed fine-grained evaluation metrics and subjective observations. First, experiments demonstrate that the multi-granularity visualization method provides a more extensive and detailed explanation. Second, the adaptive thresholds can be adapted to different situations to obtain a reliable visualization explanation. Finally, in explaining an adversarial attack, it visualizes the details that caused the misclassification.

3 citations
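A heavily simplified sketch of the commonality/specificity split described in that abstract, assuming two CAMs for a category pair are already normalized to [0, 1]. The paper derives the partition with neighborhood rough sets and adaptive thresholds at the optimal granularity; a single fixed threshold stands in for that step here, so the function is illustrative only.

```python
import torch

def commonality_specificity(cam_a, cam_b, thresh=0.5):
    """Split two class activation maps into commonality and per-class
    specificity regions for a category pair (fixed-threshold illustration
    of the idea, not the paper's rough-set procedure).

    cam_a, cam_b: normalized CAMs in [0, 1] with the same spatial shape
    """
    hot_a, hot_b = cam_a >= thresh, cam_b >= thresh
    common = (hot_a & hot_b).float() * torch.minimum(cam_a, cam_b)   # shared evidence
    spec_a = (hot_a & ~hot_b).float() * cam_a                        # evidence only for class A
    spec_b = (hot_b & ~hot_a).float() * cam_b                        # evidence only for class B
    return common, spec_a, spec_b
```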

Journal ArticleDOI
TL;DR: SAR-BagNet is a novel interpretable recognition framework for SAR images that provides a clear heatmap accurately reflecting the impact of each part of a SAR image on the final network decision.
Abstract: Convolutional neural networks (CNNs) have been widely used in SAR image recognition and have achieved high recognition accuracy on some public datasets. However, due to the opacity of their decision-making mechanism, the reliability and credibility of CNNs are currently insufficient, which hinders their application in important fields such as SAR image recognition. In recent years, various interpretable network structures have been proposed to discern the relationship between a CNN's decision and image regions. Unfortunately, most interpretable networks are designed for optical images, perform poorly on SAR images, and cannot accurately explain the relationship between image parts and classification decisions. To address these problems, we present SAR-BagNet, a novel interpretable recognition framework for SAR images. SAR-BagNet can provide a clear heatmap that accurately reflects the impact of each part of a SAR image on the final network decision. In addition to its good interpretability, SAR-BagNet also has high recognition accuracy, achieving 98.25% test accuracy.

3 citations
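The abstract does not spell out SAR-BagNet's architecture, but BagNet-style models in general keep the receptive field small and spatially average per-location class logits into the image prediction, so those per-location logits double as the heatmap. The toy module below illustrates that pattern under assumed layer sizes; it is not the authors' network.

```python
import torch
import torch.nn as nn

class TinyBagNet(nn.Module):
    """Toy BagNet-style classifier: small-receptive-field features yield
    per-location class logits whose spatial average is the image-level
    prediction; the per-location logits for a class serve as its heatmap.
    (Illustrative only; SAR-BagNet's backbone and training differ.)"""

    def __init__(self, num_classes=10):
        super().__init__()
        self.local_features = nn.Sequential(                 # receptive field stays small
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.local_classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):                                     # x: (B, 1, H, W) SAR image
        evidence = self.local_classifier(self.local_features(x))  # (B, C, H, W)
        logits = evidence.mean(dim=(2, 3))                         # image-level logits
        return logits, evidence                                    # evidence = class heatmaps
```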

15 Oct 2022
TL;DR: DProtoNet decouples the inference and interpretation modules of a prototype-based network, avoiding the use of prototype activation to explain the network's decisions, in order to simultaneously improve the accuracy and interpretability of the neural network.
Abstract: The interpretability of neural networks has recently received extensive attention. Previous prototype-based explainable networks involved prototype activation in both reasoning and interpretation processes, requiring specific explainable structures for the prototype and thus making the network less accurate as it gains interpretability. Therefore, the decoupling prototypical network (DProtoNet) was proposed to avoid this problem. This new model contains encoder, inference, and interpretation modules. In the encoder module, unrestricted feature masks were presented to generate expressive features and prototypes. In the inference module, a multi-image prototype learning method was introduced to update prototypes so that the network can learn generalized prototypes. Finally, in the interpretation module, a multiple dynamic masks (MDM) decoder was suggested to explain the neural network; it generates heatmaps using the consistent activation of the original image and the masked image at the detection nodes of the network. It decouples the inference and interpretation modules of a prototype-based network by avoiding the use of prototype activation to explain the network's decisions, in order to simultaneously improve the accuracy and interpretability of the neural network. Multiple public general and medical datasets were tested, and the results confirmed that our method could achieve a 5% improvement in accuracy and state-of-the-art interpretability compared with previous methods.

3 citations
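The abstract only sketches how the multiple dynamic masks (MDM) decoder turns activation consistency into heatmaps, so the snippet below is a loose illustration of the underlying idea rather than the paper's method: masks are supplied externally (RISE-style) and weighted by how well the class evidence survives masking. Function and argument names are illustrative.

```python
import torch

def mask_consistency_heatmap(model, image, masks, class_idx):
    """Loose sketch of explaining a prediction via consistency between the
    original and masked inputs: masks whose retained regions preserve the
    class evidence are weighted more (illustrative stand-in for the MDM
    decoder; the masks here are fixed, not generated dynamically).

    model:  callable mapping an image batch (B, C, H, W) to class scores (B, num_classes)
    image:  input image, shape (1, C, H, W)
    masks:  soft masks in [0, 1], shape (N, 1, H, W)
    """
    with torch.no_grad():
        base = model(image)[0, class_idx]                      # score on the original image
        scores = model(image * masks)[:, class_idx]            # score under each mask
        consistency = (scores / (base + 1e-8)).clamp(min=0)    # how well the evidence survives
        heatmap = (consistency.view(-1, 1, 1, 1) * masks).sum(dim=0)[0]
        heatmap = heatmap / (heatmap.max() + 1e-8)
    return heatmap                                              # (H, W)
```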

Journal ArticleDOI
TL;DR: A large-scale feature map together with multi-scale feedback is added to improve the recognition of small objects, and the anchor boxes are optimized by clustering the ground-truth object boxes of UAVT-3.
Abstract: Military object detection from Unmanned Aerial Vehicle (UAV) reconnaissance images faces challenges including a lack of image data, poor image quality, and small objects. In this work, we simulate UAV low-altitude reconnaissance and construct the UAV reconnaissance image tank database UAVT-3. Then, we improve YOLOv5 and propose UAVT-YOLOv5 for object detection in UAV images. First, data augmentation of blurred images is introduced to improve accuracy on fog and motion-blurred images. Second, a large-scale feature map together with multi-scale feedback is added to improve the recognition of small objects. Third, we optimize the loss function by increasing the loss penalty for small objects and for classes with fewer samples. Finally, the anchor boxes are optimized by clustering the ground-truth object boxes of UAVT-3. The feature visualization technique Class Activation Mapping (CAM) is introduced to explore the mechanisms of the proposed model. Experimental results on UAVT-3 show that the mAP reaches 99.2%, an increase of 2.1% over YOLOv5, the detection speed is 40 frames per second, and data augmentation of blurred images yields mAP increases of 20.4% and 26.6% for fog and motion-blurred image detection, respectively. The class activation maps show that the discriminative region of the tanks is the turret for UAVT-YOLOv5.

3 citations
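Clustering the ground-truth boxes to obtain dataset-specific anchors is a standard YOLO recipe; a sketch using k-means with a 1 - IoU distance is shown below. The routine and its hyper-parameters are illustrative and not necessarily the authors' exact procedure for UAVT-3.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box sizes into k anchors using 1 - IoU as the
    distance, the usual recipe for tuning YOLO anchors to a dataset.

    wh: (N, 2) array of ground-truth box widths and heights
    """
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]        # initialize from real boxes
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = wh[:, 0:1] * wh[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        iou = inter / (union + 1e-8)                           # (N, k) IoU to each anchor
        assign = iou.argmax(axis=1)                            # nearest anchor = highest IoU
        for j in range(k):                                     # move anchors to cluster medians
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]           # sort anchors by area
```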

Proceedings ArticleDOI
Lei Zhu, Qian Chen, Lujia Jin, Yunfei You, Yanye Lu 
16 Jul 2022
TL;DR: This paper elaborates a plug-and-play mechanism called BagCAMs that better projects a well-trained classifier onto the localization task without refining or re-training the baseline structure, and it substantially improves the performance of baseline WSOL methods.
Abstract: , Abstract. Classification activation map (CAM), utilizing the classification structure to generate pixel-wise localization maps, is a crucial mechanism for weakly supervised object localization (WSOL). However, CAM directly uses the classifier trained on image-level features to locate objects, making it prefers to discern global discriminative factors rather than regional object cues. Thus only the discriminative locations are activated when feeding pixel-level features into this classifier. To solve this issue, this paper elaborates a plug-and-play mechanism called BagCAMs to better project a well-trained classifier for the localization task without refining or re-training the baseline structure. Our BagCAMs adopts a proposed regional localizer generation (RLG) strategy to define a set of regional localizers and then derive them from a well-trained classifier. These regional localizers can be viewed as the base learner that only dis-cerns region-wise object factors for localization tasks, and their results can be effectively weighted by our BagCAMs to form the final localization map. Experiments indicate that adopting our proposed BagCAMs can improve the performance of baseline WSOL methods to a great extent and obtains state-of-the-art performance on three WSOL benchmarks. Code are released at https://github.com/zh460045050/BagCAMs .

3 citations
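The regional localizer generation (RLG) strategy is only described at a high level above, so the sketch below gives one plausible, simplified reading: treat the pixel-level gradient at each spatial position as a separate localizer and average the resulting maps. It is not the paper's derivation; the released code at the link above contains the actual method.

```python
import torch
import torch.nn.functional as F

def bagged_regional_cam(head, activations, class_idx):
    """Loose illustration of 'bagging regional localizers': the gradient of
    the class score at every spatial position acts as one localizer's channel
    weighting, each localizer yields its own map, and the maps are averaged.

    head:        module mapping conv activations (1, K, H, W) to class logits (1, C)
    activations: conv activations for one image, shape (1, K, H, W)
    """
    activations = activations.detach().requires_grad_(True)
    score = head(activations)[0, class_idx]
    grads = torch.autograd.grad(score, activations)[0][0]      # (K, H, W) pixel-level gradients
    feats = activations.detach()[0]                             # (K, H, W)
    k, h, w = feats.shape
    # localizer at position p produces map_p(q) = sum_k grads[k, p] * feats[k, q]
    maps = torch.einsum('kp,kq->pq', grads.reshape(k, -1), feats.reshape(k, -1))
    cam = F.relu(maps).mean(dim=0).reshape(h, w)                # average ("bag") the localizers
    return cam / (cam.max() + 1e-8)
```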

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: A residual learning framework is proposed to ease the training of networks substantially deeper than those used previously; the resulting residual nets won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations
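The residual reformulation described above is easy to see in code: a block learns a residual function F(x) and adds it back to its input through an identity shortcut. Below is a minimal PyTorch sketch of the basic block, without downsampling or the full deep stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: the stacked layers learn a residual F(x) and the
    block outputs F(x) + x, the reformulation that makes very deep networks
    easier to optimize (a minimal sketch, not the full ResNet)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))      # first conv of the residual branch
        out = self.bn2(self.conv2(out))            # second conv
        return F.relu(out + x)                     # identity shortcut: F(x) + x
```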

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network with 60 million parameters, consisting of five convolutional layers, some followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax, achieved state-of-the-art results on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations
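The dropout regularization mentioned in the abstract is applied in the network's fully connected layers; a sketch of such a classifier head follows. The flattened feature size and layer widths match the standard AlexNet configuration, while the five convolutional layers and local response normalization are omitted.

```python
import torch.nn as nn

# Classifier head with dropout in the two hidden fully connected layers,
# as described in the abstract; expects a flattened (B, 256 * 6 * 6) input.
classifier_head = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),          # final 1000-way layer; softmax is applied in the loss
)
```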

Proceedings Article
01 Jan 2015
TL;DR: The authors investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting and show that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations
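The design principle above is simply to stack small 3x3 convolutions and let depth do the work. A sketch of one such stage follows, with illustrative channel counts.

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, n_convs):
    """One VGG-style stage: a stack of 3x3 convolutions followed by 2x2 max
    pooling; depth is increased simply by stacking more small-filter layers
    (illustrative sketch of the design principle, not the full network)."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# e.g. the first two stages of a 16-layer style configuration
features = nn.Sequential(vgg_stage(3, 64, 2), vgg_stage(64, 128, 2))
```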

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieved the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations
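A sketch of the Inception module underlying GoogLeNet: parallel 1x1, 3x3, and 5x5 convolutions plus a pooled branch, with 1x1 reductions keeping the computational budget in check; the branch widths below are illustrative.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module sketch: parallel 1x1, 3x3, and 5x5 convolutions plus a
    pooling branch, with 1x1 reductions before the larger filters; the branch
    outputs are concatenated along the channel axis."""

    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)                                  # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1))          # 1x1 then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2))          # 1x1 then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1))                   # pool then 1x1

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```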

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

28,225 citations
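The skip architecture described above fuses per-class scores from a deep, coarse layer with scores from a shallow, fine layer and upsamples to the input resolution for dense prediction. A minimal sketch with illustrative channel counts and class number:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionHead(nn.Module):
    """FCN-style skip head sketch: 1x1 convolutions score each layer per class,
    the coarse scores are upsampled and summed with the finer scores, and the
    result is upsampled to the input size (channel sizes are illustrative)."""

    def __init__(self, deep_ch=512, shallow_ch=256, num_classes=21):
        super().__init__()
        self.score_deep = nn.Conv2d(deep_ch, num_classes, 1)        # semantics from the deep layer
        self.score_shallow = nn.Conv2d(shallow_ch, num_classes, 1)  # detail from the shallow layer

    def forward(self, deep_feat, shallow_feat, out_size):
        coarse = self.score_deep(deep_feat)
        coarse = F.interpolate(coarse, size=shallow_feat.shape[-2:],
                               mode='bilinear', align_corners=False)  # upsample coarse scores
        fused = coarse + self.score_shallow(shallow_feat)             # combine semantics + detail
        return F.interpolate(fused, size=out_size,
                             mode='bilinear', align_corners=False)    # dense per-pixel scores
```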