
Showing papers on "Contextual image classification published in 2021"


Posted Content
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
TL;DR: Liu et al. propose a new vision Transformer, called Swin Transformer, whose representation is computed with shifted windows to address differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.
Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at this https URL.
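
As a rough illustration of the shifted-window idea described above, the sketch below (PyTorch; the helper name, window size, and shapes are illustrative assumptions, not the authors' code) partitions a feature map into non-overlapping windows and cyclically shifts it so the next layer's windows bridge the previous window boundaries.

    import torch

    def window_partition(x, ws):
        # x: (B, H, W, C) feature map -> (num_windows*B, ws*ws, C) tokens
        B, H, W, C = x.shape
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

    B, H, W, C, ws = 2, 14, 14, 96, 7
    x = torch.randn(B, H, W, C)

    # Regular windows: self-attention is computed within each ws x ws window.
    windows = window_partition(x, ws)                        # (8, 49, 96)

    # Shifted windows: cyclically shift the map by ws//2 before partitioning,
    # so the next layer's windows straddle the previous window boundaries.
    shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
    shifted_windows = window_partition(shifted, ws)
    print(windows.shape, shifted_windows.shape)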

3,518 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: BoTNet as mentioned in this paper incorporates self-attention for image classification, object detection, and instance segmentation, achieving a strong 84.7% top-1 accuracy on the ImageNet benchmark.
Abstract: We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt [67] evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in "compute" time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
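
A minimal sketch of the core substitution described above, assuming PyTorch and illustrative layer sizes: the 3x3 spatial convolution of a ResNet bottleneck is replaced by multi-head self-attention over the flattened feature map (the paper's block also uses relative position encodings, omitted here for brevity).

    import torch
    import torch.nn as nn

    class BoTBlock(nn.Module):
        def __init__(self, dim=512, heads=4):
            super().__init__()
            self.reduce = nn.Conv2d(dim, dim // 4, 1)        # 1x1 conv, as in ResNet
            self.mhsa = nn.MultiheadAttention(dim // 4, heads, batch_first=True)
            self.expand = nn.Conv2d(dim // 4, dim, 1)        # 1x1 conv back up

        def forward(self, x):                                # x: (B, C, H, W)
            y = self.reduce(x)
            B, C, H, W = y.shape
            seq = y.flatten(2).transpose(1, 2)               # (B, H*W, C): global tokens
            attn, _ = self.mhsa(seq, seq, seq)               # self-attention replaces 3x3 conv
            y = attn.transpose(1, 2).reshape(B, C, H, W)
            return x + self.expand(y)                        # residual connection

    x = torch.randn(2, 512, 14, 14)
    print(BoTBlock()(x).shape)                               # torch.Size([2, 512, 14, 14])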

675 citations


Proceedings Article
03 May 2021
TL;DR: The Vision Transformer (ViT) as discussed by the authors applies a pure transformer directly to sequences of image patches and performs very well on image classification tasks, attaining excellent results on benchmarks such as ImageNet, CIFAR-100, and VTAB compared to state-of-the-art convolutional networks.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
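
The sketch below illustrates the patchification pipeline the abstract describes, assuming PyTorch; the ViT-Base-like dimensions and the single encoder layer are illustrative stand-ins for the authors' full model.

    import torch
    import torch.nn as nn

    patch, dim = 16, 768
    to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # linear projection per patch
    cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # learnable class token
    pos_embed = nn.Parameter(torch.zeros(1, 14 * 14 + 1, dim))       # position embeddings

    img = torch.randn(1, 3, 224, 224)
    tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768): one token per patch
    tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed
    encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
    out = encoder(tokens)                                 # ViT stacks many such layers
    print(out.shape)                                      # torch.Size([1, 197, 768])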

573 citations


Journal ArticleDOI
TL;DR: A new minibatch GCN is developed that is capable of inferring out-of-sample data without retraining networks while improving classification performance, and three fusion strategies (additive, elementwise multiplicative, and concatenation fusion) are explored to measure the obtained performance gain.
Abstract: Convolutional neural networks (CNNs) have been attracting increasing attention in hyperspectral (HS) image classification due to their ability to capture spatial–spectral feature representations. Nevertheless, their ability in modeling relations between the samples remains limited. Beyond the limitations of grid sampling, graph convolutional networks (GCNs) have been recently proposed and successfully applied in irregular (or nongrid) data representation and analysis. In this article, we thoroughly investigate CNNs and GCNs (qualitatively and quantitatively) in terms of HS image classification. Due to the construction of the adjacency matrix on all the data, traditional GCNs usually suffer from a huge computational cost, particularly in large-scale remote sensing (RS) problems. To this end, we develop a new minibatch GCN (called miniGCN hereinafter), which allows large-scale GCNs to be trained in a minibatch fashion. More significantly, our miniGCN is capable of inferring out-of-sample data without retraining networks while improving classification performance. Furthermore, as CNNs and GCNs can extract different types of HS features, an intuitive solution to break the performance bottleneck of a single model is to fuse them. Since miniGCNs can perform batchwise network training (enabling the combination of CNNs and GCNs), we explore three fusion strategies: additive fusion, elementwise multiplicative fusion, and concatenation fusion, to measure the obtained performance gain. Extensive experiments, conducted on three HS data sets, demonstrate the advantages of miniGCNs over GCNs and the superiority of the tested fusion strategies with regard to the single CNN or GCN models. The codes of this work will be available at https://github.com/danfenghong/IEEE_TGRS_GCN for the sake of reproducibility.
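
A minimal sketch of the minibatch idea, assuming PyTorch; the kNN graph construction, layer sizes, and helper names are illustrative assumptions rather than the paper's exact recipe. The normalized adjacency is built per batch, so training never touches a full-dataset adjacency matrix, and unseen samples can be classified at test time.

    import torch
    import torch.nn as nn

    def batch_adjacency(x, k=10):
        d = torch.cdist(x, x)                         # pairwise distances within the batch
        idx = d.topk(k + 1, largest=False).indices    # kNN graph (includes self-loop)
        a = torch.zeros(x.size(0), x.size(0)).scatter_(1, idx, 1.0)
        a = (a + a.t()).clamp(max=1)                  # symmetrize
        d_inv = a.sum(1).pow(-0.5)
        return d_inv[:, None] * a * d_inv[None, :]    # D^-1/2 A D^-1/2 normalization

    class GCNLayer(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            self.lin = nn.Linear(d_in, d_out)
        def forward(self, x, a):
            return torch.relu(a @ self.lin(x))        # propagate over the batch graph

    x = torch.randn(64, 200)                          # a minibatch of spectral vectors
    h = GCNLayer(200, 16)(x, batch_adjacency(x))
    print(h.shape)                                    # torch.Size([64, 16])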

560 citations


Posted Content
Chun-Fu Chen, Quanfu Fan, Rameswar Panda
TL;DR: Chen et al. propose a dual-branch transformer that combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features, achieving promising results on image classification compared to convolutional neural networks.
Abstract: The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity, and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that our approach performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2% with a small to moderate increase in FLOPs and model parameters. Our source codes and models are available at this https URL.
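
A minimal sketch of the cross-attention fusion idea, assuming PyTorch and illustrative dimensions: the CLS token of one branch acts as the single query against the other branch's patch tokens, so the exchange costs O(N) rather than O(N^2) (the paper additionally projects between the two branch widths).

    import torch
    import torch.nn as nn

    dim, heads = 192, 3
    cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    small_cls = torch.randn(1, 1, dim)        # CLS token from the small-patch branch
    large_tokens = torch.randn(1, 197, dim)   # tokens from the large-patch branch

    # One query (CLS) against N keys/values: linear, not quadratic, in N.
    fused_cls, _ = cross_attn(small_cls, large_tokens, large_tokens)
    print(fused_cls.shape)                    # torch.Size([1, 1, 192])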

310 citations


Book ChapterDOI
27 Sep 2021
TL;DR: This paper is the first to exploit a Transformer within a 3D CNN for MRI brain tumor segmentation, proposing a novel network named TransBTS based on the encoder-decoder structure.
Abstract: Transformer, which can benefit from global (long-range) information modeling using self-attention mechanisms, has been successful in natural language processing and 2D image classification recently. However, both local and global features are crucial for dense prediction tasks, especially for 3D medical image segmentation. In this paper, we for the first time exploit Transformer in 3D CNN for MRI Brain Tumor Segmentation and propose a novel network named TransBTS based on the encoder-decoder structure. To capture the local 3D context information, the encoder first utilizes 3D CNN to extract the volumetric spatial feature maps. Meanwhile, the feature maps are reformed elaborately for tokens that are fed into Transformer for global feature modeling. The decoder leverages the features embedded by Transformer and performs progressive upsampling to predict the detailed segmentation map. Extensive experimental results on both BraTS 2019 and 2020 datasets show that TransBTS achieves comparable or higher results than previous state-of-the-art 3D methods for brain tumor segmentation on 3D MRI scans. The source code is available at https://github.com/Wenxuan-1119/TransBTS.

306 citations


Journal Article
TL;DR: This work proposes an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data, and shows that this method leads to image representations that considerably outperform strong baselines in most settings.
Abstract: Learning visual representations of medical images is core to medical image understanding but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
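
A minimal sketch of a bidirectional contrastive objective in the spirit described above, assuming PyTorch; the temperature and embedding sizes are illustrative. Paired image and text embeddings are treated as positives and all other pairings in the batch as negatives, symmetrically in both directions.

    import torch
    import torch.nn.functional as F

    def bidirectional_contrastive_loss(img_emb, txt_emb, tau=0.1):
        img = F.normalize(img_emb, dim=1)
        txt = F.normalize(txt_emb, dim=1)
        logits = img @ txt.t() / tau              # (B, B) similarity matrix
        targets = torch.arange(img.size(0))       # the diagonal holds the true pairs
        loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
        return 0.5 * (loss_i2t + loss_t2i)

    loss = bidirectional_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
    print(loss.item())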

266 citations


Journal ArticleDOI
TL;DR: This work proposes an approach for semi-supervised semantic segmentation that learns from limited pixel-wise annotated samples while exploiting additional annotation-free images, and achieves significant improvement over existing methods, especially when trained with very few labeled samples.
Abstract: The ability to understand visual information from limited labeled data is an important aspect of machine learning. While image-level classification has been extensively studied in a semi-supervised setting, dense pixel-level classification with limited data has only drawn attention recently. In this work, we propose an approach for semi-supervised semantic segmentation that learns from limited pixel-wise annotated samples while exploiting additional annotation-free images. The proposed approach relies on adversarial training with a feature matching loss to learn from unlabeled images. It uses two network branches that link semi-supervised classification with semi-supervised segmentation including self-training. The dual-branch approach reduces both the low-level and the high-level artifacts typical when training with few labels. The approach attains significant improvement over existing methods, especially when trained with very few labeled samples. On several standard benchmarks—PASCAL VOC 2012, PASCAL-Context, and Cityscapes—the approach achieves new state-of-the-art in semi-supervised learning.

255 citations


Proceedings ArticleDOI
27 Apr 2021
TL;DR: A novel and efficient structure named Short-Term Dense Concatenate network (STDC network) is proposed to remove structural redundancy: the dimension of feature maps is gradually reduced and their aggregation is used for image representation, forming the basic module of the STDC network.
Abstract: BiSeNet [28], [27] has proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use their aggregation for image representation, which forms the basic module of the STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in a single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images. Code is available at https://github.com/MichaelFan01/STDC-Seg.

245 citations


Journal ArticleDOI
TL;DR: The experimental results in this paper show that traditional machine learning performs better on small-sample datasets, while deep learning frameworks achieve higher recognition accuracy on large-sample datasets.

227 citations


Journal ArticleDOI
TL;DR: In this article, a general multimodal deep learning (MDL) framework is proposed for geoscience and remote sensing (RS) applications, which is not only limited to pixel-wise classification tasks but also applicable to spatial information modeling with CNNs.
Abstract: Classification and identification of the materials lying over or beneath the earth's surface have long been a fundamental but challenging research topic in geoscience and remote sensing (RS), and have garnered growing attention owing to the recent advancements of deep learning techniques. Although deep networks have been successfully applied in single-modality-dominated classification tasks, their performance inevitably meets a bottleneck in complex scenes that need to be finely classified, due to the limitation of information diversity. In this work, we provide a baseline solution to the aforementioned difficulty by developing a general multimodal deep learning (MDL) framework. In particular, we also investigate a special case of multi-modality learning (MML), namely cross-modality learning (CML), which exists widely in RS image classification applications. By focusing on "what," "where," and "how" to fuse, we show different fusion strategies as well as how to train deep networks and build the network architecture. Specifically, five fusion architectures are introduced and developed, further being unified in our MDL framework. More significantly, our framework is not only limited to pixel-wise classification tasks but also applicable to spatial information modeling with convolutional neural networks (CNNs). To validate the effectiveness and superiority of the MDL framework, extensive experiments related to the settings of MML and CML are conducted on two different multimodal RS data sets. Furthermore, the codes and data sets will be available at https://github.com/danfenghong/IEEE_TGRS_MDL-RS , contributing to the RS community.
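
A minimal sketch of three of the named fusion strategies, assuming PyTorch and illustrative shapes for two modality-specific CNN streams; the paper's framework wraps such operators in a full encoder-decoder.

    import torch
    import torch.nn as nn

    hs = torch.randn(4, 64, 32, 32)     # features from the hyperspectral stream
    x2 = torch.randn(4, 64, 32, 32)     # features from the second modality's stream

    additive = hs + x2                                   # additive fusion
    multiplicative = hs * x2                             # elementwise multiplicative fusion
    concat = torch.cat([hs, x2], dim=1)                  # concatenation fusion
    concat = nn.Conv2d(128, 64, 1)(concat)               # 1x1 conv back to a shared width

    print(additive.shape, multiplicative.shape, concat.shape)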

Journal ArticleDOI
TL;DR: A novel attention-based deep learning model using an attention module with VGG-16 captures the spatial relationship between the ROIs in CXR images, and its performance indicates that it is suitable for CXR image classification in COVID-19 diagnosis.
Abstract: Computer-aided diagnosis (CAD) methods such as chest X-ray (CXR)-based methods are among the cheapest alternative options to diagnose the early stage of COVID-19 disease compared to other alternatives such as polymerase chain reaction (PCR), computed tomography (CT) scans, and so on. To this end, a few works have been proposed to diagnose COVID-19 using CXR-based methods. However, they have limited performance as they ignore the spatial relationships between the regions of interest (ROIs) in CXR images, which could identify the likely regions of COVID-19's effect in the human lungs. In this paper, we propose a novel attention-based deep learning model using the attention module with VGG-16. By using the attention module, we capture the spatial relationship between the ROIs in CXR images. In the meantime, by using an appropriate convolution layer (4th pooling layer) of the VGG-16 model in addition to the attention module, we design a novel deep learning model to perform fine-tuning in the classification process. To evaluate the performance of our method, we conduct extensive experiments using three COVID-19 CXR image datasets. The experiments and analysis demonstrate the stable and promising performance of our proposed method compared to the state-of-the-art methods. The promising classification performance of our proposed method indicates that it is suitable for CXR image classification in COVID-19 diagnosis.

Journal ArticleDOI
TL;DR: This paper proposes a remote-sensing scene-classification method based on vision transformers, which do not rely on convolution layers as in standard convolutional neural networks (CNNs).
Abstract: In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These types of networks, which are now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as in standard convolutional neural networks (CNNs). Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relations between pixels in images. In the first step, the images under analysis are divided into patches, then converted to a sequence by flattening and embedding. To retain positional information, a position embedding is added to these patches. Then, the resulting sequence is fed to several multihead attention layers for generating the final representation. At the classification stage, the first token of the sequence is fed to a softmax classification layer. To boost the classification performance, we explore several data augmentation strategies to generate additional data for training. Moreover, we show experimentally that we can compress the network by pruning half of the layers while keeping competitive classification accuracies. Experimental results conducted on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, the Vision Transformer obtains an average classification accuracy of 98.49%, 95.86%, 95.56% and 93.83% on the Merced, AID, Optimal31 and NWPU datasets, respectively. The compressed version, obtained by removing half of the multihead attention layers, yields 97.90%, 94.27%, 95.30% and 93.05%, respectively.

Journal ArticleDOI
TL;DR: A comprehensive survey of applications of CNNs in medical image understanding is presented in this article, where a discussion on CNN and its various award-winning frameworks is given, along with a critical discussion of some of the challenges.
Abstract: Imaging techniques are used to capture anomalies of the human body. The captured images must be understood for diagnosis, prognosis and treatment planning of the anomalies. Medical image understanding is generally performed by skilled medical professionals. However, the scarce availability of human experts and the fatigue and rough estimate procedures involved with them limit the effectiveness of image understanding performed by skilled medical professionals. Convolutional neural networks (CNNs) are effective tools for image understanding. They have outperformed human experts in many image understanding tasks. This article aims to provide a comprehensive survey of applications of CNNs in medical image understanding. The underlying objective is to motivate medical image understanding researchers to extensively apply CNNs in their research and diagnosis. A brief introduction to CNNs has been presented. A discussion on CNN and its various award-winning frameworks has also been presented. The major medical image understanding tasks, namely image classification, segmentation, localization and detection, have been introduced. Applications of CNN in medical image understanding of ailments of the brain, breast, lung and other organs have been surveyed critically and comprehensively. A critical discussion on some of the challenges is also presented.

Proceedings ArticleDOI
26 Mar 2021
TL;DR: In this article, the authors propose a hybrid network structure composed of a supervised contrastive loss to learn image representations and a cross-entropy loss for learning classifiers, where the learning is progressively transited from feature learning to the classifier learning to embody the idea that better features make better classifiers.
Abstract: Learning discriminative image representations plays a vital role in long-tailed image classification because it can ease the classifier learning in imbalanced cases. Given the promising performance contrastive learning has shown recently in representation learning, in this work, we explore effective supervised contrastive learning strategies and tailor them to learn better image representations from imbalanced data in order to boost the classification accuracy thereon. Specifically, we propose a novel hybrid network structure being composed of a supervised contrastive loss to learn image representations and a cross-entropy loss to learn classifiers, where the learning is progressively transited from feature learning to the classifier learning to embody the idea that better features make better classifiers. We explore two variants of contrastive loss for feature learning, which vary in the forms but share a common idea of pulling the samples from the same class together in the normalized embedding space and pushing the samples from different classes apart. One of them is the recently proposed supervised contrastive (SC) loss, which is designed on top of the state-of-the-art unsupervised contrastive loss by incorporating positive samples from the same class. The other is a prototypical supervised contrastive (PSC) learning strategy which addresses the intensive memory consumption in standard SC loss and thus shows more promise under limited memory budget. Extensive experiments on three long-tailed classification datasets demonstrate the advantage of the proposed contrastive learning based hybrid networks in long-tailed classification.
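
A minimal sketch of the hybrid objective described above, assuming PyTorch; the linear schedule, sizes, and the single-view supervised contrastive term are illustrative simplifications of the paper's setup. The contrastive weight decays so that learning transitions from features toward the classifier.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive(feats, labels, tau=0.1):
        f = F.normalize(feats, dim=1)
        sim = f @ f.t() / tau
        mask = (labels[:, None] == labels[None, :]).float()
        mask.fill_diagonal_(0)                       # positives: same class, excluding self
        logits = sim - torch.eye(len(f)) * 1e9       # mask out self-similarity
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        return -(mask * log_prob).sum(1).div(mask.sum(1).clamp(min=1)).mean()

    def hybrid_loss(feats, logits, labels, epoch, total_epochs):
        alpha = 1.0 - epoch / total_epochs           # contrastive weight decays over training
        return alpha * supervised_contrastive(feats, labels) + \
               (1 - alpha) * F.cross_entropy(logits, labels)

    loss = hybrid_loss(torch.randn(16, 128), torch.randn(16, 10),
                       torch.randint(0, 10, (16,)), epoch=5, total_epochs=90)
    print(loss.item())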

Journal ArticleDOI
TL;DR: The different computational models of SVM and the key processes for SVM system development are reviewed, and a survey of their applications to image classification is provided.
Abstract: Life for any living being would be impossible without the ability to differentiate between various things, objects, smells, tastes, colors, etc. Humans have a good ability to classify objects easily, such as different human faces and images. In the age of machines, we want machines to do such work as a human would; this is part of machine learning. This paper discusses some important techniques for image classification: the techniques through which a machine can learn the image classification task and perform it efficiently. The best-known technique for teaching a machine is the support vector machine (SVM), which has evolved as an efficient paradigm for classification. SVM has a strong mathematical model for classification and regression, and this powerful mathematical foundation gives a new direction for further research in the vast field of classification and regression. Over the past few decades, various improvements to SVM have appeared, such as twin SVM, Lagrangian SVM, quantum support vector machine, and least-squares support vector machine, which are discussed further in the paper and have led to new approaches for better classification accuracy. To improve the accuracy as well as the performance of SVM, one must be aware of how a kernel function should be selected and what the different approaches for parameter selection are. This paper reviews the different computational models of SVM and the key processes for SVM system development, and furthermore provides a survey of their applications to image classification.
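
A minimal sketch of SVM-based image classification with scikit-learn, illustrating the kernel and parameter-selection choices the survey emphasizes; the digits dataset and the small parameter grid are illustrative stand-ins.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)              # 8x8 images flattened to 64 features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # Kernel and regularization parameters are selected by cross-validation.
    grid = GridSearchCV(
        make_pipeline(StandardScaler(), SVC()),
        {"svc__kernel": ["rbf", "linear"], "svc__C": [1, 10], "svc__gamma": ["scale"]},
        cv=3,
    )
    grid.fit(X_tr, y_tr)
    print(grid.best_params_, grid.score(X_te, y_te))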

Proceedings ArticleDOI
Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, Jian Sun
01 Jun 2021
TL;DR: In this paper, a unified distribution alignment strategy for long-tail visual recognition is proposed, where an adaptive calibration function is developed to adjust the classification scores for each data point and a generalized re-weight method is introduced to balance the class prior.
Abstract: Despite the recent success of deep neural networks, it remains challenging to effectively model the long-tail class distribution in visual recognition tasks. To address this problem, we first investigate the performance bottleneck of the two-stage learning framework via ablative study. Motivated by our discovery, we propose a unified distribution alignment strategy for long-tail visual recognition. Specifically, we develop an adaptive calibration function that enables us to adjust the classification scores for each data point. We then introduce a generalized re-weight method in the two-stage learning to balance the class prior, which provides a flexible and unified solution to diverse scenarios in visual recognition tasks. We validate our method by extensive experiments on four tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.

Posted Content
TL;DR: MLP-Mixer as discussed by the authors is an architecture based exclusively on multi-layer perceptrons (MLPs), containing two types of layers: one with MLPs applied independently to image patches (i.e., "mixing" the per-location features), and one with MLPs applied across patches (i.e., "mixing" spatial information); it attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
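
A minimal sketch of one Mixer block, assuming PyTorch and illustrative hidden sizes: a token-mixing MLP applied across patches and a channel-mixing MLP applied per patch, each behind layer normalization with a residual connection.

    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        def __init__(self, n_tokens=196, dim=512, token_hidden=256, chan_hidden=2048):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.token_mlp = nn.Sequential(          # mixes spatial information across patches
                nn.Linear(n_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, n_tokens))
            self.norm2 = nn.LayerNorm(dim)
            self.chan_mlp = nn.Sequential(           # mixes per-location features
                nn.Linear(dim, chan_hidden), nn.GELU(), nn.Linear(chan_hidden, dim))

        def forward(self, x):                        # x: (B, tokens, channels)
            x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            return x + self.chan_mlp(self.norm2(x))

    x = torch.randn(2, 196, 512)
    print(MixerBlock()(x).shape)                     # torch.Size([2, 196, 512])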

Posted Content
TL;DR: In this paper, a multi-instance contrastive learning (MICLe) method was proposed to construct more informative positive pairs for self-supervised learning in medical image classification and achieved an improvement of 6.7% in top-1 accuracy and 1.1% in mean AUC on dermatology and chest X-ray classification.
Abstract: Self-supervised pretraining followed by supervised fine-tuning has seen success in image recognition, especially when labeled examples are scarce, but has received limited attention in medical image analysis. This paper studies the effectiveness of self-supervised learning as a pretraining strategy for medical image classification. We conduct experiments on two distinct tasks: dermatology skin condition classification from digital camera images and multi-label chest X-ray classification, and demonstrate that self-supervised learning on ImageNet, followed by additional self-supervised learning on unlabeled domain-specific medical images significantly improves the accuracy of medical image classifiers. We introduce a novel Multi-Instance Contrastive Learning (MICLe) method that uses multiple images of the underlying pathology per patient case, when available, to construct more informative positive pairs for self-supervised learning. Combining our contributions, we achieve an improvement of 6.7% in top-1 accuracy and an improvement of 1.1% in mean AUC on dermatology and chest X-ray classification respectively, outperforming strong supervised baselines pretrained on ImageNet. In addition, we show that big self-supervised models are robust to distribution shift and can learn efficiently with a small number of labeled medical images.

Posted Content
TL;DR: Transformer networks as mentioned in this paper enable modeling long dependencies between input sequence elements and support parallel processing of sequences as compared to recurrent networks, e.g., long short-term memory (LSTM).
Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences as compared to recurrent networks, e.g., long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future works.

Journal ArticleDOI
TL;DR: A deep learning method for image classification, using the ResNet architecture to detect colorectal cancer, demonstrates reliable and reproducible outcomes for biomedical image analysis.

Journal ArticleDOI
11 Jun 2021-IRBM
TL;DR: The proposed hybrid model provides more effective classification and, combined with threshold-based segmentation, improved detection; the overall accuracy of the hybrid CNN-SVM reaches 98.4959%.
Abstract: Objective In this research paper, brain MRI images are classified by leveraging the strengths of CNNs on a public dataset to distinguish benign from malignant tumors. Materials and Methods Deep learning (DL) methods have become more popular for image classification due to their good performance in recent years. Convolutional neural networks (CNNs), with several methods, can extract features without handcrafted models and eventually show better classification accuracy. The proposed hybrid model combines a CNN and a support vector machine (SVM) for classification, with threshold-based segmentation for detection. Result The findings of previous studies are based on different models with their accuracies: rough extreme learning machine (RELM) 94.233%, deep CNN (DCNN) 95%, deep neural network (DNN) with discrete wavelet autoencoder (DWA) 96%, k-nearest neighbors (kNN) 96.6%, and CNN 97.5%. The overall accuracy of the hybrid CNN-SVM is 98.4959%. Conclusion Brain cancer is among the most dangerous diseases in today's world, with a high death rate; detecting and classifying brain tumors, which arise from abnormal cell growth and vary in shape, orientation, and location, is a challenging task in medical imaging. Magnetic resonance imaging (MRI) is a typical medical imaging method for brain tumor analysis. Conventional machine learning (ML) techniques categorize brain cancer based on handcrafted properties chosen by a radiology specialist, which can lead to failures in execution and decrease the effectiveness of an algorithm. In summary, the proposed hybrid model provides more effective and improved techniques for classification.
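
A minimal sketch of the CNN + SVM hybrid classification step, assuming PyTorch and scikit-learn; the ResNet-18 backbone, input shapes, and dummy labels are illustrative stand-ins for the paper's architecture and MRI data.

    import torch
    from torchvision import models
    from sklearn.svm import SVC

    backbone = models.resnet18(weights=None)
    backbone.fc = torch.nn.Identity()                   # drop the classification head
    backbone.eval()

    with torch.no_grad():
        mri_batch = torch.randn(20, 3, 224, 224)        # stand-in for MRI slices
        feats = backbone(mri_batch).numpy()             # (20, 512) deep features

    labels = [0] * 10 + [1] * 10                        # benign / malignant (dummy labels)
    clf = SVC(kernel="rbf").fit(feats, labels)          # SVM replaces the softmax classifier
    print(clf.predict(feats[:3]))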

Journal ArticleDOI
TL;DR: This article studies whether image classification performance drops with each kind of degradation, whether this drop can be avoided by including degraded images in training, and whether existing computer vision algorithms that attempt to remove such degradations can help improve image classification performance.
Abstract: Just like many other topics in computer vision, image classification has achieved significant progress recently by using deep learning neural networks, especially the Convolutional Neural Networks (CNNs). Most of the existing works focused on classifying very clear natural images, evidenced by the widely used image databases, such as Caltech-256, PASCAL VOCs, and ImageNet. However, in many real applications, the acquired images may contain certain degradations that lead to various kinds of blurring, noise, and distortions. One important and interesting problem is the effect of such degradations on the performance of CNN-based image classification and whether degradation removal helps CNN-based image classification. More specifically, we wonder whether image classification performance drops with each kind of degradation, whether this drop can be avoided by including degraded images into training, and whether existing computer vision algorithms that attempt to remove such degradations can help improve the image classification performance. In this article, we empirically study those problems for nine kinds of degraded images: hazy images, motion-blurred images, fish-eye images, underwater images, low-resolution images, salt-and-peppered images, images with white Gaussian noise, Gaussian-blurred images, and out-of-focus images. We expect this article can draw more interest from the community to study the classification of degraded images.

Journal ArticleDOI
TL;DR: A novel few-shot learning method named multi-scale metric learning (MSML) is proposed to extract multi-scale features and learn the multi-scale relations between samples for the classification of few-shot learning.
Abstract: Few-shot learning in image classification is developed to learn a model that aims to identify unseen classes with only a few training samples for each class. Fewer training samples and new classification tasks make many traditional classification models no longer applicable. In this paper, a novel few-shot learning method named multi-scale metric learning (MSML) is proposed to extract multi-scale features and learn the multi-scale relations between samples for the classification of few-shot learning. In the proposed method, a feature pyramid structure is introduced for multi-scale feature embedding, which aims to combine high-level strong semantic features with low-level but abundant visual features. Then a multi-scale relation generation network (MRGN) is developed for hierarchical metric learning, in which high-level features correspond to deeper metric learning while low-level features correspond to lighter metric learning. Moreover, a novel loss function named intra-class and inter-class relation loss (IIRL) is proposed to optimize the proposed deep network, which aims to strengthen the correlation between homogeneous groups of samples and weaken the correlation between heterogeneous groups of samples. Experimental results on miniImageNet and tieredImageNet demonstrate that the proposed method achieves superior performance on the few-shot learning problem.

Journal Article
TL;DR: Different architectures based on PyConv are presented for four main tasks in visual recognition: image classification, video action classification/recognition, object detection, and semantic image segmentation/parsing, showing significant improvements over the baselines on all these core tasks.
Abstract: This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of details in the scene. On top of these improved recognition capabilities, PyConv is also efficient and, with our formulation, it does not increase the computational cost and parameters compared to standard convolution. Moreover, it is very flexible and extensible, providing a large space of potential network architectures for different applications. PyConv has the potential to impact nearly every computer vision task and, in this work, we present different architectures based on PyConv for four main tasks on visual recognition: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing. Our approach shows significant improvements over all these core tasks in comparison with the baselines. For instance, on image recognition, our 50-layers network outperforms in terms of recognition performance on the ImageNet dataset its counterpart baseline ResNet with 152 layers, while having 2.39 times less parameters, 2.52 times lower computational complexity and more than 3 times less layers. On image segmentation, our novel framework sets a new state-of-the-art on the challenging ADE20K benchmark for scene parsing. We will make the code and models publicly available.
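
A minimal sketch of a pyramidal convolution layer, assuming PyTorch; the kernel sizes and group counts are illustrative. Larger kernels use more groups so that parameters and compute stay close to a standard convolution, and the per-level outputs are concatenated.

    import torch
    import torch.nn as nn

    class PyConv(nn.Module):
        def __init__(self, c_in=64, c_out=64, kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
            super().__init__()
            c_br = c_out // len(kernels)                 # output channels per pyramid level
            self.branches = nn.ModuleList(
                nn.Conv2d(c_in, c_br, k, padding=k // 2, groups=g)
                for k, g in zip(kernels, groups))

        def forward(self, x):
            # Each branch sees the same input at a different filter scale.
            return torch.cat([b(x) for b in self.branches], dim=1)

    x = torch.randn(2, 64, 56, 56)
    print(PyConv()(x).shape)                             # torch.Size([2, 64, 56, 56])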

Journal ArticleDOI
21 Jul 2021-PeerJ
TL;DR: A multi-scale meta-relational network is designed in this paper to address small-sample learning through the meta-learning idea of "how to learn by using previous experience."
Abstract: Small sample learning aims to learn information about object categories from a single or a few training samples, a learning style that is challenging for deep learning methods based on large amounts of data. Deep learning can address small sample learning through the meta-learning idea of "how to learn by using previous experience." This paper therefore takes image classification as the research object and studies how meta-learning can learn quickly from a small number of sample images. The main contents are as follows: after considering the effect of dataset distribution differences on the generalization performance of metric learning, and the advantages of optimizing the initial representation, this paper incorporates the model-agnostic meta-learning algorithm and designs a multi-scale meta-relational network. First, the idea of Meta-SGD is adopted, and the inner learning rate is learned jointly as a learning vector and model parameter. Second, in the meta-training process, the model-agnostic meta-learning algorithm is used to find the optimal parameters of the model, and the inner gradient iteration is canceled during meta-validation and meta-testing. The experimental results show that the multi-scale meta-relational network gives the learned metric stronger generalization ability, which further improves classification accuracy on the benchmark set and avoids the fine-tuning required by the model-agnostic meta-learning algorithm.

Journal ArticleDOI
TL;DR: This work presents initial experiments using image texture feature descriptors and feed-forward and convolutional neural networks on newly created databases of COVID-19 images, setting a baseline for the future development of a system capable of automatically detecting COVID-19 disease based on its manifestation in chest X-rays and computerized tomography images of the lungs.

Posted Content
TL;DR: PoseFormer as discussed by the authors is a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved, and achieves state-of-the-art performance on two popular and standard benchmark datasets.
Abstract: Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures still remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure to comprehensively model the human joint relations within each frame as well as the temporal correlations across frames, then output an accurate 3D human pose of the center frame. We quantitatively and qualitatively evaluate our method on two popular and standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at this https URL.

Journal ArticleDOI
TL;DR: The proposed approach retains the complete structural information extracted from the local binary patterns and also extracts additional information using the magnitude, thereby achieving extra discriminative power.
Abstract: This paper presents a content-based image retrieval technique that focuses on the extraction and reduction of multiple features. To obtain a multi-level decomposition of the image by extracting approximation and detail coefficients, a discrete wavelet transform is first applied to the RGB channels. Both approximation and detail coefficients are then passed to the dominant rotated local binary pattern, a texture descriptor that is computationally efficient and rotation invariant. For a local neighborhood patch, a rotation-invariant image is obtained by measuring the descriptor relative to a reference. The proposed approach retains the complete structural information extracted from the local binary patterns and also extracts additional information using the magnitude, thereby achieving extra discriminative power. A GLCM description is then computed from the dominant rotated local binary pattern image to extract statistical characteristics for texture image classification. The proposed technique is applied to the Corel dataset with the help of a particle swarm optimization-based feature selector to minimize the number of features used during the classification process. Three classifiers, i.e., support vector machine, K-nearest neighbor, and decision tree, are trained and tested. The comparison is in terms of the accuracy, precision, recall, and F-measure performance metrics for classification. Experimental results show that the proposed approach achieves better accuracy, precision, recall, and F-measure values for most of the Corel dataset classes.
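
A minimal sketch of the feature pipeline described above, using PyWavelets and scikit-image; the plain rotation-invariant LBP stands in for the paper's dominant rotated LBP, and the rescaling step is an illustrative assumption.

    import numpy as np
    import pywt
    from skimage.feature import local_binary_pattern, graycomatrix, graycoprops

    img = np.random.randint(0, 256, (128, 128)).astype(np.uint8)   # stand-in image channel

    # Wavelet decomposition: approximation and detail coefficients.
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    cA = np.uint8(255 * (cA - cA.min()) / (np.ptp(cA) + 1e-9))     # rescale for LBP/GLCM

    # Rotation-invariant LBP on the approximation band.
    lbp = local_binary_pattern(cA, P=8, R=1, method="ror").astype(np.uint8)

    # GLCM statistics on the LBP image as texture features for classification.
    glcm = graycomatrix(lbp, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    features = [graycoprops(glcm, p)[0, 0] for p in ("contrast", "homogeneity", "energy")]
    print(features)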

Journal ArticleDOI
30 Mar 2021-Cancers
TL;DR: A novel transfer learning approach is proposed to overcome previous drawbacks by first training the deep learning model on large unlabeled medical image datasets and then transferring the knowledge to train the deep learning model on a small amount of labeled medical images.
Abstract: Deep learning requires a large amount of data to perform well. However, the field of medical image analysis suffers from a lack of sufficient data for training deep learning models. Moreover, medical images require manual labeling, usually provided by human annotators coming from various backgrounds. More importantly, the annotation process is time-consuming, expensive, and prone to errors. Transfer learning was introduced to reduce the need for the annotation process by transferring the deep learning models with knowledge from a previous task and then by fine-tuning them on a relatively small dataset of the current task. Most of the methods of medical image classification employ transfer learning from pretrained models, e.g., ImageNet, which has been proven to be ineffective. This is due to the mismatch in learned features between the natural image, e.g., ImageNet, and medical images. Additionally, it results in the utilization of deeply elaborated models. In this paper, we propose a novel transfer learning approach to overcome the previous drawbacks by first training the deep learning model on large unlabeled medical image datasets and then transferring the knowledge to train the deep learning model on the small amount of labeled medical images. Additionally, we propose a new deep convolutional neural network (DCNN) model that combines recent advancements in the field. We conducted several experiments on two challenging medical imaging scenarios dealing with skin and breast cancer classification tasks. According to the reported results, it has been empirically proven that the proposed approach can significantly improve the performance of both classification scenarios. In terms of skin cancer, the proposed model achieved an F1-score value of 89.09% when trained from scratch and 98.53% with the proposed approach. Secondly, it achieved an accuracy value of 85.29% and 97.51%, respectively, when trained from scratch and using the proposed approach in the case of the breast cancer scenario. Finally, we concluded that our method can possibly be applied to many medical imaging problems in which a substantial amount of unlabeled image data is available and the labeled image data is limited. Moreover, it can be utilized to improve the performance of medical imaging tasks in the same domain. To do so, we used the pretrained skin cancer model to train on feet skin images to classify them into two classes: either normal or abnormal (diabetic foot ulcer (DFU)). It achieved an F1-score value of 86.0% when trained from scratch, 96.25% using transfer learning, and 99.25% using double-transfer learning.
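
A minimal sketch of the second transfer step described above, assuming PyTorch/torchvision; the DenseNet backbone, checkpoint path, class count, and hyperparameters are illustrative placeholders, not the paper's exact configuration.

    import torch
    from torchvision import models

    model = models.densenet121(weights=None)
    # Load in-domain pretrained weights (hypothetical checkpoint path):
    # model.load_state_dict(torch.load("skin_cancer_pretrained.pt"))

    # Replace the head for the small labeled task: normal vs. abnormal (DFU).
    model.classifier = torch.nn.Linear(model.classifier.in_features, 2)

    # Fine-tune with a small learning rate so pretrained features are preserved.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))  # dummy labeled batch
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(loss.item())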