
Showing papers on "Feature extraction published in 2021"


Proceedings ArticleDOI
23 Aug 2021
TL;DR: Liang et al. propose SwinIR, a strong baseline model for image restoration based on the Swin Transformer, consisting of three parts: shallow feature extraction, deep feature extraction, and high-quality image reconstruction.
Abstract: Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model SwinIR for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45 dB, while the total number of parameters can be reduced by up to 67%.
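To make the three-stage layout concrete (shallow convolutional feature extraction, a stack of residual transformer blocks, image reconstruction), here is a minimal PyTorch sketch. It substitutes plain nn.TransformerEncoderLayer blocks for the window-based Swin layers, so it only illustrates the residual wiring described in the abstract, not the actual SwinIR implementation; all layer names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualTransformerBlock(nn.Module):
    """Stand-in for an RSTB: a few transformer layers plus a residual connection.
    (Plain global self-attention here, not windowed Swin attention.)"""
    def __init__(self, dim=60, depth=2, heads=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
             for _ in range(depth)])
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)   # conv at the end of each block

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        for layer in self.layers:
            tokens = layer(tokens)
        out = tokens.transpose(1, 2).reshape(b, c, h, w)
        return x + self.conv(out)                       # residual connection

class TinySwinIRLike(nn.Module):
    def __init__(self, dim=60, num_blocks=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)        # shallow feature extraction
        self.deep = nn.Sequential(*[ResidualTransformerBlock(dim) for _ in range(num_blocks)])
        self.reconstruct = nn.Conv2d(dim, 3, 3, padding=1)    # HQ image reconstruction

    def forward(self, lq):
        shallow = self.shallow(lq)
        deep = self.deep(shallow)
        return self.reconstruct(shallow + deep)               # global residual

x = torch.randn(1, 3, 48, 48)
print(TinySwinIRLike()(x).shape)   # torch.Size([1, 3, 48, 48])
```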

1,064 citations


Journal ArticleDOI
TL;DR: An extensive review of deep learning-based self-supervised general visual feature learning methods for images and videos, which, as a subset of unsupervised learning, learn general image and video features from large-scale unlabeled data without using any human-annotated labels.
Abstract: Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. The paper concludes with a set of promising future directions for self-supervised visual feature learning.

876 citations


Journal ArticleDOI
TL;DR: GOT-10k, a large tracking database that offers unprecedentedly wide coverage of common moving objects in the wild, is the first video trajectory dataset to use the semantic hierarchy of WordNet to guide class population, ensuring comprehensive and relatively unbiased coverage of diverse moving objects.
Abstract: We introduce here a large tracking database that offers an unprecedentedly wide coverage of common moving objects in the wild, called GOT-10k. Specifically, GOT-10k is built upon the backbone of the WordNet structure [1] and it populates the majority of over 560 classes of moving objects and 87 motion patterns, magnitudes wider than the most recent similar-scale counterparts [19], [20], [23], [26]. By releasing this large high-diversity database, we aim to provide a unified training and evaluation platform for the development of class-agnostic, generic-purposed short-term trackers. The features of GOT-10k and the contributions of this article are summarized in the following. (1) GOT-10k offers over 10,000 video segments with more than 1.5 million manually labeled bounding boxes, enabling unified training and stable evaluation of deep trackers. (2) GOT-10k is by far the first video trajectory dataset that uses the semantic hierarchy of WordNet to guide class population, which ensures a comprehensive and relatively unbiased coverage of diverse moving objects. (3) For the first time, GOT-10k introduces the one-shot protocol for tracker evaluation, where the training and test classes are zero-overlapped. The protocol avoids biased evaluation results towards familiar objects and it promotes generalization in tracker development. (4) GOT-10k offers additional labels such as motion classes and object visible ratios, facilitating the development of motion-aware and occlusion-aware trackers. (5) We conduct extensive tracking experiments with 39 typical tracking algorithms and their variants on GOT-10k and analyze their results in this paper. (6) Finally, we develop a comprehensive platform for the tracking community that offers full-featured evaluation toolkits, an online evaluation server, and a responsive leaderboard. The annotations of GOT-10k’s test data are kept private to avoid tuning parameters on it.
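The one-shot protocol in point (3) requires the training and test object classes to be disjoint. A tiny, hypothetical sketch of how one could verify such a zero-overlap split for a custom dataset (the class names are made up):

```python
# Hypothetical check that a train/test split follows a GOT-10k-style one-shot
# protocol, i.e. the two sets of object classes are zero-overlapped.
train_classes = {"bicycle", "zebra", "drone", "sailboat"}   # made-up example classes
test_classes = {"hovercraft", "armadillo", "kayak"}

overlap = train_classes & test_classes
assert not overlap, f"one-shot protocol violated, shared classes: {sorted(overlap)}"
print("zero-overlap split: OK")
```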

852 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Zhang et al. present a detailed study of improving visual representations for vision-language (VL) tasks and develop an improved object detection model to provide object-centric representations of images.
Abstract: This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [20], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.

543 citations


Journal ArticleDOI
TL;DR: A hybrid model using deep and classical machine learning for face mask detection is presented; the SVM classifier achieved 99.64% testing accuracy on RMFD.

540 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Chen et al. propose TransT, an attention-based feature fusion network that combines the template and search-region features solely using attention, achieving very promising results on six challenging datasets, especially the large-scale LaSOT, TrackingNet, and GOT-10k benchmarks.
Abstract: Correlation plays a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion manner to consider the similarity between the template and the search region. However, the correlation operation itself is a local linear matching process, leading to the loss of semantic information and making it easy to fall into a local optimum, which may be the bottleneck of designing high-accuracy tracking algorithms. Is there any better feature fusion method than correlation? To address this issue, inspired by Transformer, this work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. Finally, we present a Transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head. Experiments show that our TransT achieves very promising results on six challenging datasets, especially on the large-scale LaSOT, TrackingNet, and GOT-10k benchmarks. Our tracker runs at approximately 50 fps on GPU. Code and models are available at https://github.com/chenxin-dlut/TransT.
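A rough sketch of the fusion idea described above, using nn.MultiheadAttention for one self-attention ("ego-context augment") step followed by one cross-attention ("cross-feature augment") step; dimensions and module names are illustrative and this is not the released TransT code.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative template/search fusion with self- and cross-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, search, template):           # (B, Ns, C), (B, Nt, C)
        # ego-context augment: search tokens attend to themselves
        s, _ = self.self_attn(search, search, search)
        search = self.norm1(search + s)
        # cross-feature augment: search tokens attend to template tokens
        c, _ = self.cross_attn(search, template, template)
        return self.norm2(search + c)

fusion = AttentionFusion()
search = torch.randn(2, 32 * 32, 256)      # flattened search-region features
template = torch.randn(2, 16 * 16, 256)    # flattened template features
print(fusion(search, template).shape)      # torch.Size([2, 1024, 256])
```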

528 citations


Journal ArticleDOI
TL;DR: Zhang et al. propose a residual dense block (RDB) to extract abundant local features via densely connected convolutional layers; the RDB further allows direct connections from the state of the preceding RDB to all layers of the current RDB, leading to a contiguous memory mechanism.
Abstract: Recently, deep convolutional neural network (CNN) has achieved great success for image restoration (IR) and provided hierarchical features at the same time. However, most deep CNN based IR models do not make full use of the hierarchical features from the original low-quality images, thereby resulting in relatively low performance. In this work, we propose a novel and efficient residual dense network (RDN) to address this problem in IR, by making a better tradeoff between efficiency and effectiveness in exploiting the hierarchical features from all the convolutional layers. Specifically, we propose the residual dense block (RDB) to extract abundant local features via densely connected convolutional layers. RDB further allows direct connections from the state of the preceding RDB to all the layers of the current RDB, leading to a contiguous memory mechanism. To adaptively learn more effective features from preceding and current local features and stabilize the training of the wider network, we propose local feature fusion in RDB. After fully obtaining dense local features, we use global feature fusion to jointly and adaptively learn global hierarchical features in a holistic way. We demonstrate the effectiveness of RDN with several representative IR applications: single image super-resolution, Gaussian image denoising, image compression artifact reduction, and image deblurring. Experiments on benchmark and real-world datasets show that our RDN achieves favorable performance against state-of-the-art methods for each IR task quantitatively and visually.
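A compact PyTorch sketch of a residual dense block in the spirit described above: densely connected convolutions, 1×1 local feature fusion, and a local residual connection. The layer count and growth rate are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Densely connected convs + 1x1 local feature fusion + local residual."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
        # local feature fusion: squeeze all concatenated features back to `channels`
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))      # local residual learning

block = ResidualDenseBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```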

519 citations


Journal ArticleDOI
TL;DR: Results showed the deep approaches to be quite efficient when compared to the local texture descriptors in the detection of COVID-19 based on chest X-ray images.
Abstract: COVID-19 is a novel virus that causes infection in both the upper respiratory tract and the lungs. The numbers of cases and deaths have increased on a daily basis on the scale of a global pandemic. Chest X-ray images have proven useful for monitoring various lung diseases and have recently been used to monitor the COVID-19 disease. In this paper, deep-learning-based approaches, namely deep feature extraction, fine-tuning of pretrained convolutional neural networks (CNN), and end-to-end training of a developed CNN model, have been used in order to classify COVID-19 and normal (healthy) chest X-ray images. For deep feature extraction, pretrained deep CNN models (ResNet18, ResNet50, ResNet101, VGG16, and VGG19) were used. For classification of the deep features, the Support Vector Machines (SVM) classifier was used with various kernel functions, namely Linear, Quadratic, Cubic, and Gaussian. The aforementioned pretrained deep CNN models were also used for the fine-tuning procedure. A new CNN model is proposed in this study with end-to-end training. A dataset containing 180 COVID-19 and 200 normal (healthy) chest X-ray images was used in the study's experimentation. Classification accuracy was used as the performance measurement of the study. The experimental works reveal that deep learning shows potential in the detection of COVID-19 based on chest X-ray images. The deep features extracted from the ResNet50 model and SVM classifier with the Linear kernel function produced a 94.7% accuracy score, which was the highest among all the obtained results. The achievement of the fine-tuned ResNet50 model was found to be 92.6%, whilst end-to-end training of the developed CNN model produced a 91.6% result. Various local texture descriptors and SVM classifications were also used for performance comparison with alternative deep approaches; the results showed the deep approaches to be quite efficient when compared to the local texture descriptors in the detection of COVID-19 based on chest X-ray images.
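The deep-feature pipeline described here (a pretrained CNN as a fixed feature extractor followed by a linear-kernel SVM) can be sketched with torchvision and scikit-learn as below. The image and label arrays are random placeholders, and the weights argument assumes torchvision >= 0.13; this is an illustration of the pipeline, not the paper's code.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# Pretrained ResNet50 with the classification head removed -> 2048-d deep features.
# (weights="IMAGENET1K_V1" assumes torchvision >= 0.13; older versions use pretrained=True.)
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()
backbone.eval()

def extract_features(images):                 # images: (N, 3, 224, 224) tensor
    with torch.no_grad():
        return backbone(images).numpy()       # (N, 2048)

# Placeholder data standing in for preprocessed chest X-ray batches and labels.
images = torch.randn(8, 3, 224, 224)
labels = [0, 1, 0, 1, 1, 0, 1, 0]

features = extract_features(images)
clf = SVC(kernel="linear").fit(features, labels)   # linear-kernel SVM classifier
print(clf.predict(features[:2]))
```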

460 citations


Journal ArticleDOI
TL;DR: An Attention-based Bidirectional CNN-RNN Deep Model (ABCDM) is proposed for sentiment polarity detection, achieving state-of-the-art results on both long-review and short-tweet polarity classification.

385 citations


Journal ArticleDOI
TL;DR: A comprehensive review of the recent developments on deep face recognition can be found in this paper, covering broad topics on algorithm designs, databases, protocols, and application scenes, as well as the technical challenges and several promising directions.

353 citations


Journal ArticleDOI
TL;DR: A convolutional autoencoder deep learning framework that supports unsupervised image feature learning for lung nodules from unlabeled data, requiring only a small amount of labeled data for efficient feature learning.
Abstract: At present, computed tomography (CT) is widely used to assist disease diagnosis. In particular, computer aided diagnosis (CAD) based on artificial intelligence (AI) has recently exhibited its importance in intelligent healthcare. However, it is a great challenge to establish an adequate labeled dataset for CT analysis assistance, due to privacy and security issues. Therefore, this paper proposes a convolutional autoencoder deep learning framework to support unsupervised image feature learning for lung nodules from unlabeled data, which only needs a small amount of labeled data for efficient feature learning. Comprehensive experiments show that the proposed scheme is superior to other approaches, which effectively solves the intrinsic labor-intensive problem of manual image labeling. Moreover, it verifies that the proposed convolutional autoencoder approach can be extended for similarity measurement of lung nodule images. In particular, the features extracted through unsupervised learning are also applicable in other related scenarios.
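A minimal convolutional autoencoder of the kind described, trained to reconstruct unlabeled CT patches so that the encoder output can later serve as an unsupervised feature vector; patch size and channel widths are arbitrary.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # 1x64x64 -> 32x16x16
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(                  # 32x16x16 -> 1x64x64
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
patches = torch.rand(4, 1, 64, 64)                      # unlabeled nodule patches (placeholder)
loss = nn.functional.mse_loss(model(patches), patches)  # reconstruction objective
loss.backward()
features = model.encoder(patches).flatten(1)            # unsupervised features for downstream use
print(features.shape)                                   # torch.Size([4, 8192])
```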

Proceedings ArticleDOI
17 Mar 2021
TL;DR: The authors revisit feature pyramid networks (FPN) for one-stage detectors and point out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion.
Abstract: This paper revisits feature pyramid networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion. From the perspective of optimization, we introduce an alternative way to address the problem instead of adopting the complex feature pyramids - utilizing only one-level feature for detection. Based on the simple and efficient solution, we present You Only Look One-level Feature (YOLOF). In our method, two key components, Dilated Encoder and Uniform Matching, are proposed and bring considerable improvements. Extensive experiments on the COCO benchmark prove the effectiveness of the proposed model. Our YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet while being 2.5× faster. Without transformer layers, YOLOF can match the performance of DETR in a single-level feature manner with 7× fewer training epochs. Code is available at https://github.com/megvii-model/YOLOF.
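The Dilated Encoder idea (enlarging the receptive field of a single C5 feature map with stacked dilated residual blocks) can be sketched roughly as follows; the channel sizes and dilation rates are illustrative rather than the released YOLOF configuration.

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Residual bottleneck whose 3x3 conv uses a given dilation rate."""
    def __init__(self, channels=512, mid=128, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return x + self.block(x)

class DilatedEncoderSketch(nn.Module):
    """Project C5, then stack dilated residual blocks to grow the receptive field."""
    def __init__(self, in_channels=2048, channels=512, dilations=(2, 4, 6, 8)):
        super().__init__()
        self.project = nn.Sequential(
            nn.Conv2d(in_channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.blocks = nn.Sequential(*[DilatedBottleneck(channels, dilation=d) for d in dilations])

    def forward(self, c5):
        return self.blocks(self.project(c5))

c5 = torch.randn(1, 2048, 25, 25)            # single-level backbone feature
print(DilatedEncoderSketch()(c5).shape)      # torch.Size([1, 512, 25, 25])
```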

Journal ArticleDOI
TL;DR: The weighted double-margin contrastive loss is proposed to address the imbalanced-sample problem in change detection, i.e., unchanged samples are much more abundant than changed samples, which is one of the main reasons for pseudochanges.
Abstract: Change detection is a basic task of remote sensing image processing. The research objective is to identify the change information of interest and filter out the irrelevant change information as interference factors. Recently, the rise of deep learning has provided new tools for change detection, which have yielded impressive results. However, the available methods focus mainly on the difference information between multitemporal remote sensing images and lack robustness to pseudochange information. To overcome the lack of robustness of current methods to pseudochanges, in this article, we propose a new method, namely, dual attentive fully convolutional Siamese networks, for change detection in high-resolution images. Through the dual attention mechanism, long-range dependencies are captured to obtain more discriminant feature representations and enhance the recognition performance of the model. Moreover, imbalanced samples are a serious problem in change detection, i.e., unchanged samples are much more abundant than changed samples, which is one of the main reasons for pseudochanges. We propose the weighted double-margin contrastive loss to address this problem by punishing attention to unchanged feature pairs and increasing attention to changed feature pairs. The experimental results of our method on the change detection dataset and the building change detection dataset demonstrate that, compared with other baseline methods, the proposed method realizes maximum improvements of 2.9% and 4.2%, respectively, in the F1 score. Our PyTorch implementation is available at https://github.com/lehaifeng/DASNet.
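One plausible reading of the weighted double-margin contrastive loss is sketched below: unchanged pairs are penalized only beyond a small margin, changed pairs only below a large margin, and per-class weights counter the imbalance. The margins and weights are placeholder values, not the ones used in DASNet.

```python
import torch

def weighted_double_margin_contrastive(dist, label, m1=0.3, m2=2.2,
                                        w_unchanged=0.1, w_changed=1.0):
    """Illustrative double-margin contrastive loss on per-pixel feature distances.

    dist:  (N,) Euclidean distances between bitemporal feature pairs
    label: (N,) 0 = unchanged pair, 1 = changed pair
    Unchanged pairs are only penalised when their distance exceeds m1,
    changed pairs only when their distance falls below m2; the two class
    weights down-weight the (far more numerous) unchanged pairs.
    """
    unchanged = (1 - label) * torch.clamp(dist - m1, min=0.0) ** 2
    changed = label * torch.clamp(m2 - dist, min=0.0) ** 2
    return (w_unchanged * unchanged + w_changed * changed).mean()

dist = torch.rand(10) * 3
label = torch.randint(0, 2, (10,)).float()
print(weighted_double_margin_contrastive(dist, label))
```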

Journal ArticleDOI
TL;DR: Compared to other state-of-the-art segmentation networks, this model yields better segmentation performance, increasing the accuracy of the predictions while reducing the standard deviation, which demonstrates the efficiency of the approach to generate precise and reliable automatic segmentations of medical images.
Abstract: Even though convolutional neural networks (CNNs) are driving progress in medical image segmentation, standard models still have some drawbacks. First, the use of multi-scale approaches, i.e., encoder-decoder architectures, leads to a redundant use of information, where similar low-level features are extracted multiple times at multiple scales. Second, long-range feature dependencies are not efficiently modeled, resulting in non-optimal discriminative feature representations associated with each semantic class. In this paper we attempt to overcome these limitations with the proposed architecture, by capturing richer contextual dependencies based on the use of guided self-attention mechanisms. This approach is able to integrate local features with their corresponding global dependencies, as well as highlight interdependent channel maps in an adaptive manner. Further, the additional loss between different modules guides the attention mechanisms to neglect irrelevant information and focus on more discriminant regions of the image by emphasizing relevant feature associations. We evaluate the proposed model in the context of semantic segmentation on three different datasets: abdominal organs, cardiovascular structures and brain tumors. A series of ablation experiments support the importance of these attention modules in the proposed architecture. In addition, compared to other state-of-the-art segmentation networks our model yields better segmentation performance, increasing the accuracy of the predictions while reducing the standard deviation. This demonstrates the efficiency of our approach to generate precise and reliable automatic segmentations of medical images. Our code is made publicly available at: https://github.com/sinAshish/Multi-Scale-Attention .

Journal ArticleDOI
TL;DR: The objective of this paper is to annotate and localize medical face mask objects in real-life images to improve the object detection process; it is concluded that the Adam optimizer achieved the highest average precision of 81% as a detector.

Journal ArticleDOI
TL;DR: In this study, the EfficientNet deep learning architecture was proposed for plant leaf disease classification, and the performance of this model was compared with other state-of-the-art deep learning models.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed SNUNet-CD method improves greatly on many evaluation criteria and has a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.
Abstract: Change detection is an important task in remote sensing (RS) image analysis. It is widely used in natural disaster monitoring and assessment, land resource planning, and other fields. As a pixel-to-pixel prediction task, change detection is sensitive to the utilization of the original position information. Recent change detection methods always focus on the extraction of deep change semantic features but ignore the importance of shallow-layer information containing high-resolution and fine-grained features; this often leads to uncertainty for the pixels at the edge of the changed target and to missed detection of small targets. In this letter, we propose a densely connected Siamese network for change detection, namely SNUNet-CD (the combination of Siamese network and NestedUNet). SNUNet-CD alleviates the loss of localization information in the deep layers of the neural network through compact information transmission between encoder and decoder, and between decoder and decoder. In addition, an Ensemble Channel Attention Module (ECAM) is proposed for deep supervision. Through ECAM, the most representative features of different semantic levels can be refined and used for the final classification. Experimental results show that our method improves greatly on many evaluation criteria and has a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.
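One rough reading of ECAM, sketched below, is a channel-attention gate applied to the concatenated decoder outputs of the different semantic levels before the final classifier; the module below is a generic SE-style gate for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global pooling -> MLP -> per-channel gates."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                               # (B, C, H, W)
        gate = self.mlp(x.mean(dim=(2, 3)))             # (B, C)
        return x * gate[:, :, None, None]

# Concatenate the decoder outputs from several semantic levels and reweight
# their channels before the final classifier (a rough reading of ECAM).
levels = [torch.randn(1, 32, 64, 64) for _ in range(4)]
stacked = torch.cat(levels, dim=1)                      # (1, 128, 64, 64)
attended = ChannelAttention(128)(stacked)
classifier = nn.Conv2d(128, 2, kernel_size=1)           # change / no-change map
print(classifier(attended).shape)                       # torch.Size([1, 2, 64, 64])
```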

Journal ArticleDOI
TL;DR: This work proposes an approach for semi-supervised semantic segmentation that learns from limited pixel-wise annotated samples while exploiting additional annotation-free images, and achieves significant improvement over existing methods, especially when trained with very few labeled samples.
Abstract: The ability to understand visual information from limited labeled data is an important aspect of machine learning. While image-level classification has been extensively studied in a semi-supervised setting, dense pixel-level classification with limited data has only drawn attention recently. In this work, we propose an approach for semi-supervised semantic segmentation that learns from limited pixel-wise annotated samples while exploiting additional annotation-free images. The proposed approach relies on adversarial training with a feature matching loss to learn from unlabeled images. It uses two network branches that link semi-supervised classification with semi-supervised segmentation including self-training. The dual-branch approach reduces both the low-level and the high-level artifacts typical when training with few labels. The approach attains significant improvement over existing methods, especially when trained with very few labeled samples. On several standard benchmarks—PASCAL VOC 2012, PASCAL-Context, and Cityscapes—the approach achieves new state-of-the-art in semi-supervised learning.
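The feature matching loss mentioned above can be illustrated with a generic formulation that matches the mean intermediate discriminator features of labeled ground truth and of segmentation predictions; this is a common form of the loss, not necessarily the paper's exact definition.

```python
import torch
import torch.nn as nn

def feature_matching_loss(disc_features_real, disc_features_fake):
    """Match the mean intermediate discriminator features of real and generated inputs.
    A generic formulation often used in adversarial semi-supervised training."""
    return nn.functional.l1_loss(disc_features_fake.mean(dim=0),
                                 disc_features_real.mean(dim=0).detach())

real_feats = torch.randn(8, 256)   # discriminator features for labeled ground truth
fake_feats = torch.randn(8, 256)   # discriminator features for segmentation predictions
print(feature_matching_loss(real_feats, fake_feats))
```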

Proceedings ArticleDOI
27 Apr 2021
TL;DR: A novel and efficient structure named Short-Term Dense Concatenate network (STDC network) is proposed by removing structural redundancy: the dimensions of feature maps are gradually reduced and their aggregation is used for image representation, which forms the basic module of the STDC network.
Abstract: BiSeNet [28], [27] has proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use their aggregation for image representation, which forms the basic module of the STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in a single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on an NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images. Code is available at https://github.com/MichaelFan01/STDC-Seg.
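A rough sketch of the basic STDC idea: each successive convolution keeps roughly half the previous channel width, and the outputs of all convolutions are concatenated as the block's representation. Channel widths and kernel choices below are illustrative.

```python
import torch
import torch.nn as nn

class STDCModuleSketch(nn.Module):
    """Each successive conv keeps half the previous width; all outputs are concatenated."""
    def __init__(self, in_channels=64, out_channels=256, num_convs=4):
        super().__init__()
        self.convs = nn.ModuleList()
        widths = []
        ch_in = in_channels
        for i in range(num_convs):
            # halve the width each step, keeping the last two equal so widths sum to out_channels
            ch_out = out_channels // (2 ** min(i + 1, num_convs - 1))
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch_in, ch_out, 3 if i > 0 else 1, padding=1 if i > 0 else 0),
                nn.BatchNorm2d(ch_out), nn.ReLU(inplace=True)))
            widths.append(ch_out)
            ch_in = ch_out
        assert sum(widths) == out_channels, widths

    def forward(self, x):
        outs = []
        for conv in self.convs:
            x = conv(x)
            outs.append(x)
        return torch.cat(outs, dim=1)      # aggregation of all scales as the representation

m = STDCModuleSketch()
print(m(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 256, 32, 32])
```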

Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, a multi-attentional deepfake detection network is proposed, which consists of three key components: 1) multiple spatial attention heads that make the network attend to different local parts; 2) a textural feature enhancement block that zooms in on the subtle artifacts in shallow features; 3) aggregation of the low-level textural features and high-level semantic features guided by the attention maps.
Abstract: Face forgery by deepfake is widely spread over the internet and has raised severe societal concerns. Recently, how to detect such forgery contents has become a hot research topic and many deepfake detection methods have been proposed. Most of them model deepfake detection as a vanilla binary classification problem, i.e., first using a backbone network to extract a global feature and then feeding it into a binary classifier (real/fake). But since the difference between the real and fake images in this task is often subtle and local, we argue this vanilla solution is not optimal. In this paper, we instead formulate deepfake detection as a fine-grained classification problem and propose a new multi-attentional deepfake detection network. Specifically, it consists of three key components: 1) multiple spatial attention heads to make the network attend to different local parts; 2) a textural feature enhancement block to zoom in on the subtle artifacts in shallow features; 3) aggregation of the low-level textural features and high-level semantic features guided by the attention maps. Moreover, to address the learning difficulty of this network, we further introduce a new regional independence loss and an attention-guided data augmentation strategy. Through extensive experiments on different datasets, we demonstrate the superiority of our method over the vanilla binary classifier counterparts, and achieve state-of-the-art performance. The models will be released at https://github.com/yoctta/multiple-attention.

Journal ArticleDOI
TL;DR: The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities.
Abstract: Multiple Object Tracking (MOT) plays an important role in solving many fundamental problems in video analysis and computer vision. Most MOT methods employ two steps: Object Detection and Data Association. The first step detects objects of interest in every frame of a video, and the second establishes correspondence between the detected objects in different frames to obtain their tracks. Object detection has made tremendous progress in the last few years due to deep learning. However, data association for tracking still relies on hand crafted constraints such as appearance, motion, spatial proximity, grouping etc. to compute affinities between the objects in different frames. In this paper, we harness the power of deep learning for data association in tracking by jointly modeling object appearances and their affinities between different frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities. DAN also accounts for multiple objects appearing and disappearing between video frames. We exploit the resulting efficient affinity computations to associate objects in the current frame deep into the previous frames for reliable on-line tracking. Our technique is evaluated on popular multiple object tracking challenges MOT15, MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics demonstrates that our approach is among the best performing techniques on the leader board for these challenges. The open source implementation of our work is available at https://github.com/shijieS/SST.git .
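The exhaustive pairing step can be illustrated as follows: given per-object feature vectors from two frames, every cross-frame pair is formed and scored by a small network, yielding an affinity matrix. This is a simplified stand-in for DAN's multi-level feature pairing, with made-up dimensions.

```python
import torch
import torch.nn as nn

# Illustrative affinity step: concatenate every cross-frame feature pair and
# score it with a small MLP, producing an (n x m) affinity matrix.
def pairwise_affinity(feats_t, feats_tau, scorer):
    n, c = feats_t.shape
    m, _ = feats_tau.shape
    pairs = torch.cat([feats_t[:, None, :].expand(n, m, c),
                       feats_tau[None, :, :].expand(n, m, c)], dim=-1)   # (n, m, 2C)
    return scorer(pairs).squeeze(-1)                                     # (n, m) affinities

scorer = nn.Sequential(nn.Linear(2 * 128, 64), nn.ReLU(), nn.Linear(64, 1))
feats_t = torch.randn(5, 128)      # detected objects in frame t
feats_tau = torch.randn(7, 128)    # detected objects in an earlier frame
print(pairwise_affinity(feats_t, feats_tau, scorer).shape)   # torch.Size([5, 7])
```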

Journal ArticleDOI
Minghao Zhu, Licheng Jiao, Fang Liu, Shuyuan Yang, Jianing Wang
TL;DR: Zhu et al. propose an end-to-end residual spectral-spatial attention network (RSSAN) for hyperspectral image classification, which takes raw 3-D cubes as input data without additional feature engineering.
Abstract: In the last five years, deep learning has been introduced to tackle hyperspectral image (HSI) classification and has demonstrated good performance. In particular, convolutional neural network (CNN)-based methods for HSI classification have made great progress. However, due to the high dimensionality of HSI and the equal treatment of all bands, the performance of these methods is hampered by learning features from bands that are useless for classification. Moreover, for patchwise CNN models, the equal treatment of spatial information from the pixel-centered neighborhood also hinders the performance of these methods. In this article, we propose an end-to-end residual spectral–spatial attention network (RSSAN) for HSI classification. The RSSAN takes raw 3-D cubes as input data without additional feature engineering. First, a spectral attention module is designed for spectral band selection from the raw input data by emphasizing bands useful for classification and suppressing useless ones. Then, a spatial attention module is designed for the adaptive selection of spatial information by emphasizing pixels from the same class as the center pixel or those that are useful for classification in the pixel-centered neighborhood, and suppressing those from a different class or that are useless. Second, two attention modules are also used in the following CNN for adaptive feature refinement in spectral–spatial feature learning. Third, a sequential spectral–spatial attention module is embedded into a residual block to avoid overfitting and accelerate the training of the proposed model. Experimental studies demonstrate that the RSSAN achieved superior classification accuracy compared with the state of the art on three HSI data sets: Indian Pines (IN), University of Pavia (UP), and Kennedy Space Center (KSC).
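The spectral attention idea (emphasizing useful bands, suppressing useless ones) can be sketched as a squeeze-and-excitation style gate over the band axis; this simplified module is for illustration and is not the RSSAN implementation.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Reweight hyperspectral bands: global spatial pooling -> gates over the band axis.
    A simplified, SE-like reading of the spectral attention idea."""
    def __init__(self, bands, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bands, bands // reduction), nn.ReLU(inplace=True),
            nn.Linear(bands // reduction, bands), nn.Sigmoid())

    def forward(self, cube):                       # (B, bands, H, W) raw 3-D patch
        gate = self.mlp(cube.mean(dim=(2, 3)))     # (B, bands): emphasis per band
        return cube * gate[:, :, None, None]

patch = torch.randn(2, 200, 9, 9)                   # 9x9 neighborhood with 200 bands
print(SpectralAttention(200)(patch).shape)          # torch.Size([2, 200, 9, 9])
```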

Journal ArticleDOI
TL;DR: The proposed deep fuzzy hashing network (DFHN) method combines the fuzzy logic technique and the DNN to learn more effective binary codes, which can leverage fuzzy rules to model the uncertainties underlying the data.
Abstract: Hashing methods for efficient image retrieval aim at learning hash functions that map similar images to semantically correlated binary codes in the Hamming space with similarity well preserved. Traditional hashing methods usually represent image content by hand-crafted features. Deep hashing methods based on deep neural network (DNN) architectures can generate more effective image features and obtain better retrieval performance. However, the underlying data structure is hardly captured by existing DNN models. Moreover, the similarity (either visual or semantic) between pairwise images is ambiguous, even uncertain, to be measured in existing deep hashing methods. In this article, we propose a novel hashing method termed deep fuzzy hashing network (DFHN) to overcome the shortcomings of existing deep hashing approaches. Our DFHN method combines the fuzzy logic technique and the DNN to learn more effective binary codes, which can leverage fuzzy rules to model the uncertainties underlying the data. Derived from fuzzy logic theory, the generalized Hamming distance is devised in the convolutional layers and fully connected layers of our DFHN to model their outputs, which come from an efficient XOR operation on given inputs and weights. Extensive experiments show that our DFHN method obtains competitive retrieval accuracy with highly efficient training speed compared with several state-of-the-art deep hashing approaches on two large-scale image datasets: CIFAR-10 and NUS-WIDE.
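The abstract's generalized Hamming distance comes from a fuzzy XOR; in its common elementwise form it is g(a, b) = a + b - 2ab, which reduces to the ordinary Hamming mismatch on binary inputs. A small sketch, assuming this standard form:

```python
import torch

def generalized_hamming_distance(x, w):
    """Elementwise fuzzy-XOR form of the generalized Hamming distance, g(a, b) = a + b - 2ab.
    For binary inputs it reduces to the ordinary Hamming mismatch indicator."""
    return x + w - 2 * x * w

# On crisp binary vectors the summed distance equals the classic Hamming distance.
a = torch.tensor([1.0, 0.0, 1.0, 1.0])
b = torch.tensor([1.0, 1.0, 0.0, 1.0])
print(generalized_hamming_distance(a, b).sum())   # tensor(2.)

# On soft (fuzzy) activations it gives a graded mismatch score, which is how the
# abstract describes modeling layer outputs from inputs and weights.
print(generalized_hamming_distance(torch.rand(3), torch.rand(3)))
```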

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a supervised multi-view hash model which can enhance the multiview information through neural networks, and the proposed method utilizes an effective view stability evaluation method to actively explore the relationship among views, which will affect the optimization direction of the entire network.
Abstract: Hashing is an efficient method for nearest neighbor search in large-scale data space by embedding high-dimensional feature descriptors into a similarity preserving Hamming space with a low dimension. However, large-scale high-speed retrieval through binary code has a certain degree of reduction in retrieval accuracy compared to traditional retrieval methods. We have noticed that multi-view methods can well preserve the diverse characteristics of data. Therefore, we try to introduce the multi-view deep neural network into the hash learning field, and design an efficient and innovative retrieval model, which has achieved a significant improvement in retrieval performance. In this paper, we propose a supervised multi-view hash model which can enhance the multi-view information through neural networks. This is a completely new hash learning method that combines multi-view and deep learning methods. The proposed method utilizes an effective view stability evaluation method to actively explore the relationship among views, which will affect the optimization direction of the entire network. We have also designed a variety of multi-data fusion methods in the Hamming space to preserve the advantages of both convolution and multi-view. In order to avoid excessive computing resources on the enhancement procedure during retrieval, we set up a separate structure called memory network which participates in training together. The proposed method is systematically evaluated on the CIFAR-10, NUS-WIDE and MS-COCO datasets, and the results show that our method significantly outperforms the state-of-the-art single-view and multi-view hashing methods.

Journal ArticleDOI
TL;DR: CA-Net employs a joint spatial attention module to make the network focus more on the foreground region, and a novel channel attention module to adaptively recalibrate channel-wise feature responses and highlight the most relevant feature channels.
Abstract: Accurate medical image segmentation is essential for diagnosis and treatment planning of diseases. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they are still challenged by complicated conditions where the segmentation target has large variations of position, shape and scale, and existing CNNs have a poor explainability that limits their application to clinical decisions. In this work, we make extensive use of multiple attentions in a CNN architecture and propose a comprehensive attention-based CNN (CA-Net) for more accurate and explainable medical image segmentation that is aware of the most important spatial positions, channels and scales at the same time. In particular, we first propose a joint spatial attention module to make the network focus more on the foreground region. Then, a novel channel attention module is proposed to adaptively recalibrate channel-wise feature responses and highlight the most relevant feature channels. Also, we propose a scale attention module implicitly emphasizing the most salient feature maps among multiple scales so that the CNN is adaptive to the size of an object. Extensive experiments on skin lesion segmentation from ISIC 2018 and multi-class segmentation of fetal MRI found that our proposed CA-Net significantly improved the average segmentation Dice score from 87.77% to 92.08% for skin lesion, 84.79% to 87.08% for the placenta and 93.20% to 95.88% for the fetal brain respectively compared with U-Net. It reduced the model size to around 15 times smaller with close or even better accuracy compared with state-of-the-art DeepLabv3+. In addition, it has a much higher explainability than existing networks by visualizing the attention weight maps. Our code is available at https://github.com/HiLab-git/CA-Net .

Journal ArticleDOI
28 Apr 2021
TL;DR: In this paper, an attention-based deep learning architecture called AttnSleep was proposed to classify sleep stages using single-channel EEG signals, which leverages a multi-head attention mechanism to capture the temporal dependencies among the extracted features.
Abstract: Automatic sleep stage classification is of great importance to measure sleep quality. In this paper, we propose a novel attention-based deep learning architecture called AttnSleep to classify sleep stages using single-channel EEG signals. This architecture starts with the feature extraction module based on a multi-resolution convolutional neural network (MRCNN) and adaptive feature recalibration (AFR). The MRCNN can extract low- and high-frequency features, and the AFR is able to improve the quality of the extracted features by modeling the inter-dependencies between the features. The second module is the temporal context encoder (TCE) that leverages a multi-head attention mechanism to capture the temporal dependencies among the extracted features. Particularly, the multi-head attention deploys causal convolutions to model the temporal relations in the input features. We evaluate the performance of our proposed AttnSleep model using three public datasets. The results show that our AttnSleep outperforms state-of-the-art techniques in terms of different evaluation metrics. Our source codes, experimental data, and supplementary materials are available at https://github.com/emadeldeen24/AttnSleep .
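The causal convolutions mentioned above can be illustrated with a 1-D convolution padded only on the left, so that the output at time t never depends on future samples; kernel size and channel counts below are arbitrary.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so output at time t never sees t+1, t+2, ...
    (the kind of causal convolution the abstract mentions inside the attention module)."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                                   # (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# e.g. produce causally convolved features for an EEG epoch before attention
feats = torch.randn(4, 64, 80)           # (batch, feature channels, time steps)
print(CausalConv1d(64)(feats).shape)     # torch.Size([4, 64, 80])
```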

Journal ArticleDOI
TL;DR: Li et al. propose Ucolor, an underwater image enhancement network via medium transmission-guided multi-color space embedding, which enriches the diversity of feature representations by incorporating the characteristics of different color spaces into a unified structure.
Abstract: Underwater images suffer from color casts and low contrast due to wavelength- and distance-dependent attenuation and scattering. To solve these two degradation issues, we present an underwater image enhancement network via medium transmission-guided multi-color space embedding, called Ucolor . Concretely, we first propose a multi-color space encoder network, which enriches the diversity of feature representations by incorporating the characteristics of different color spaces into a unified structure. Coupled with an attention mechanism, the most discriminative features extracted from multiple color spaces are adaptively integrated and highlighted. Inspired by underwater imaging physical models, we design a medium transmission (indicating the percentage of the scene radiance reaching the camera)-guided decoder network to enhance the response of network towards quality-degraded regions. As a result, our network can effectively improve the visual quality of underwater images by exploiting multiple color spaces embedding and the advantages of both physical model-based and learning-based methods. Extensive experiments demonstrate that our Ucolor achieves superior performance against state-of-the-art methods in terms of both visual quality and quantitative metrics. The code is publicly available at: https://li-chongyi.github.io/Proj_Ucolor.html .
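One way to read "medium transmission-guided" is to use the transmission map as a spatial weight that amplifies decoder responses in heavily degraded (low-transmission) regions. The sketch below shows only that weighting step, under that assumption; it is not the Ucolor implementation.

```python
import torch
import torch.nn.functional as F

def transmission_guided(features, transmission):
    """Reweight decoder features with the medium transmission map.

    features:     (B, C, H, W) decoder activations
    transmission: (B, 1, h, w) estimated medium transmission in [0, 1]
    Regions with low transmission (strong degradation) get amplified, which is
    one plausible reading of the 'transmission-guided decoder' described above.
    """
    t = F.interpolate(transmission, size=features.shape[-2:], mode="bilinear",
                      align_corners=False)
    return features * (1.0 + (1.0 - t))     # boost responses where transmission is low

feats = torch.randn(1, 64, 128, 128)
t_map = torch.rand(1, 1, 32, 32)                 # coarse transmission estimate (placeholder)
print(transmission_guided(feats, t_map).shape)   # torch.Size([1, 64, 128, 128])
```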

Journal ArticleDOI
TL;DR: In this paper, the authors present the current trends and challenges for the detection of plant leaf disease using deep learning and advanced imaging techniques, and discuss some of the current challenges and problems that need to be resolved.
Abstract: Deep learning is a branch of artificial intelligence. In recent years, with the advantages of automatic learning and feature extraction, it has attracted wide attention from academia and industry. It has been widely used in image and video processing, voice processing, and natural language processing. At the same time, it has also become a research hotspot in the field of agricultural plant protection, such as plant disease recognition and pest range assessment. The application of deep learning in plant disease recognition can avoid the disadvantages caused by the manual selection of disease-spot features, make plant disease feature extraction more objective, and improve research efficiency and the speed of technology transfer. This review presents the research progress of deep learning technology in the field of crop leaf disease identification in recent years. We describe the current trends and challenges for the detection of plant leaf disease using deep learning and advanced imaging techniques, and we hope that this work will be a valuable resource for researchers who study the detection of plant diseases and insect pests. We also discuss some of the current challenges and problems that need to be resolved.

Journal ArticleDOI
TL;DR: This paper reviews the application of machine learning techniques in building load prediction, organized according to the logic of machine learning: performing tasks T, measured by performance P, based on learning from experience E.

Journal ArticleDOI
TL;DR: Experiments demonstrate that the proposed VLSTM model can efficiently cope with imbalance and high-dimensionality issues, and significantly improve the accuracy and reduce the false rate in anomaly detection for IBD according to F1, area under curve (AUC), and false alarm rate (FAR).
Abstract: With the increasing popularity of Industry 4.0, industrial big data (IBD) has become a hotly discussed topic in the digital and intelligent industry field. The security problem in signal processing over large-scale data streams is still a challenging issue in the industrial Internet of Things, especially when dealing with high-dimensional anomaly detection for intelligent industrial applications. In this article, to mitigate the inconsistency between dimensionality reduction and feature retention in imbalanced IBD, we propose a variational long short-term memory (VLSTM) learning model for intelligent anomaly detection based on reconstructed feature representation. An encoder–decoder neural network associated with a variational reparameterization scheme is designed to learn the low-dimensional feature representation from high-dimensional raw data. Three loss functions are defined and quantified to constrain the reconstructed hidden variable into a more explicit and meaningful form. A lightweight estimation network is then fed with the refined feature representation to identify anomalies in IBD. Experiments using a public IBD dataset named UNSW-NB15 demonstrate that the proposed VLSTM model can efficiently cope with imbalance and high-dimensionality issues, and significantly improve the accuracy and reduce the false rate in anomaly detection for IBD according to F1, area under curve (AUC), and false alarm rate (FAR).
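A skeleton of the variational LSTM idea described above: an LSTM encoder, a reparameterized low-dimensional latent, and an LSTM decoder trained with a reconstruction plus KL objective. Layer sizes are arbitrary and this is not the paper's exact VLSTM architecture.

```python
import torch
import torch.nn as nn

class VariationalLSTMSketch(nn.Module):
    """Generic LSTM encoder-decoder with a variational bottleneck (reparameterization trick);
    a skeleton of the idea, not the paper's exact VLSTM architecture."""
    def __init__(self, in_dim=196, hidden=64, latent=16):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.LSTM(latent, hidden, batch_first=True)
        self.out = nn.Linear(hidden, in_dim)

    def forward(self, x):                              # x: (B, T, in_dim)
        h, _ = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon, _ = self.decoder(z)
        return self.out(recon), mu, logvar

model = VariationalLSTMSketch()
x = torch.randn(8, 10, 196)                            # windowed high-dimensional records
recon, mu, logvar = model(x)
recon_loss = nn.functional.mse_loss(recon, x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
print(recon_loss + kl)                                 # training objective sketch
```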