
Showing papers on "Encoder published in 2022"


Proceedings ArticleDOI
01 Jun 2022
TL;DR: Masked autoencoders (MAE) as discussed by the authors are scalable self-supervised learners for computer vision, based on two core designs: an asymmetric encoder-decoder architecture with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
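
The asymmetric design hinges on a simple masking step. The sketch below (a hedged illustration, not the authors' released code) shows how a 75% random mask can be drawn so that only the visible patch tokens reach the encoder, while the restore indices let a decoder reinsert mask tokens later; the tensor shapes and helper names are assumptions.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens (the 'visible' patches).

    patch_tokens: (batch, num_patches, dim) sequence of embedded image patches.
    Returns the visible tokens, a binary mask (1 = masked), and the indices
    needed to restore the original patch order before decoding.
    """
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))

    noise = torch.rand(b, n)                       # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)      # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n)                        # 1 = masked, 0 = visible
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)      # back to original patch order
    return visible, mask, ids_restore

tokens = torch.randn(2, 196, 768)                  # e.g. 14x14 ViT patch tokens
visible, mask, ids_restore = random_masking(tokens)
print(visible.shape)                               # (2, 49, 768): only these reach the encoder
```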

311 citations


Proceedings ArticleDOI
01 Jan 2022
TL;DR: UNETR as discussed by the authors utilizes a transformer encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful U-shaped network design for the encoder and decoder.
Abstract: Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications over the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard.
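
To make "volumetric segmentation as sequence-to-sequence prediction" concrete, here is a minimal sketch of turning a 3D volume into a token sequence for a generic transformer encoder. The 16³ patch size, the 96³ volume, and the use of PyTorch's stock nn.TransformerEncoder are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VolumePatchEmbed(nn.Module):
    """Project a 3D volume into a sequence of patch tokens (assumed sizes)."""
    def __init__(self, in_channels=1, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided 3D convolution is one common way to embed non-overlapping patches.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, volume):                  # (B, C, D, H, W)
        x = self.proj(volume)                   # (B, dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)

embed = VolumePatchEmbed()
tokens = embed(torch.randn(1, 1, 96, 96, 96))   # (1, 216, 768): 6x6x6 patches
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2)
features = encoder(tokens)                      # sequence features for a U-shaped decoder
print(features.shape)
```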

219 citations


Book ChapterDOI
04 Jan 2022
TL;DR: Wang et al. as mentioned in this paper proposed a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR), which reformulated the task of 3D brain tumor semantic segmentation as a sequence-to-sequence prediction problem.
Abstract: Semantic segmentation of brain tumors is a fundamental medical image analysis task involving multiple MRI imaging modalities that can assist clinicians in diagnosing the patient and successively studying the progression of the malignant entity. In recent years, Fully Convolutional Neural Networks (FCNNs) approaches have become the de facto standard for 3D medical image segmentation. The popular “U-shaped” network architecture has achieved state-of-the-art performance benchmarks on different 2D and 3D semantic segmentation tasks and across various imaging modalities. However, due to the limited kernel size of convolution layers in FCNNs, their performance in modeling long-range information is sub-optimal, and this can lead to deficiencies in the segmentation of tumors with variable sizes. On the other hand, transformer models have demonstrated excellent capabilities in capturing such long-range information in multiple domains, including natural language processing and computer vision. Inspired by the success of vision transformers and their variants, we propose a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR). Specifically, the task of 3D brain tumor semantic segmentation is reformulated as a sequence-to-sequence prediction problem wherein multi-modal input data is projected into a 1D sequence of embeddings and used as an input to a hierarchical Swin transformer as the encoder. The Swin transformer encoder extracts features at five different resolutions by utilizing shifted windows for computing self-attention and is connected to an FCNN-based decoder at each resolution via skip connections. We have participated in the BraTS 2021 segmentation challenge, and our proposed model ranks among the top-performing approaches in the validation phase. Code: https://monai.io/research/swin-unetr.

169 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a decoupling network-based IVIF method (DNFusion), which utilizes decoupled maps to design additional constraints that force the network to effectively retain the saliency information of the source images.
Abstract: In general, the goal of existing infrared and visible image fusion (IVIF) methods is to make the fused image contain both the high-contrast regions of the infrared image and the texture details of the visible image. However, this definition causes the fused image to lose information from the visible image in high-contrast areas. To address this problem, this paper proposes a decoupling network-based IVIF method (DNFusion), which utilizes decoupled maps to design additional constraints that force the network to effectively retain the saliency information of the source images. The usual definition of image fusion is satisfied while the salient objects of the source images are effectively preserved. Specifically, an internal feature interaction module facilitates the information exchange within the encoder and improves the utilization of complementary information. In addition, a hybrid loss function composed of weighted fidelity, gradient, and decoupling losses ensures that the generated fusion image effectively preserves the source images’ texture details and luminance information. The qualitative and quantitative comparison of extensive experiments demonstrates that our model can generate a fused image containing salient objects and clear details of the source images, and that the proposed method performs better than other state-of-the-art methods.
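
As an illustration of what a hybrid fusion loss of this kind can look like, the sketch below combines an intensity-fidelity term with a gradient (texture) term. The max-based saliency target, the loss weights, and the omission of the paper's decoupling term are all assumptions of this sketch, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Simple finite-difference image gradients (assumed; not the paper's exact operator)."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def hybrid_fusion_loss(fused, infrared, visible, w_fid=1.0, w_grad=10.0):
    """Fidelity term pulls the fused image toward the brighter source pixel
    (a crude stand-in for saliency); the gradient term preserves visible-image texture."""
    fidelity = F.l1_loss(fused, torch.maximum(infrared, visible))
    fdx, fdy = gradient(fused)
    vdx, vdy = gradient(visible)
    grad = F.l1_loss(fdx, vdx) + F.l1_loss(fdy, vdy)
    return w_fid * fidelity + w_grad * grad

fused = torch.rand(2, 1, 64, 64, requires_grad=True)   # stand-in for a network output
ir, vis = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
loss = hybrid_fusion_loss(fused, ir, vis)
loss.backward()                                          # gradients flow back to the fusion network
```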

121 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: Wang et al. as discussed by the authors proposed an effective and efficient Transformer-based architecture for image restoration, in which they build a hierarchical encoder-decoder network using the Transformer block.
Abstract: In this paper, we present Uformer, an effective and efficient Transformer-based architecture for image restoration, in which we build a hierarchical encoder-decoder network using the Transformer block. In Uformer, there are two core designs. First, we introduce a novel locally-enhanced window (LeWin) Transformer block, which performs non-overlapping window-based self-attention instead of global self-attention. It significantly reduces the computational complexity on high-resolution feature maps while capturing local context. Second, we propose a learnable multi-scale restoration modulator in the form of a multi-scale spatial bias to adjust features in multiple layers of the Uformer decoder. Our modulator demonstrates superior capability for restoring details for various image restoration tasks while introducing marginal extra parameters and computational cost. Powered by these two designs, Uformer enjoys a high capability for capturing both local and global dependencies for image restoration. To evaluate our approach, extensive experiments are conducted on several image restoration tasks, including image denoising, motion deblurring, defocus deblurring and deraining. Without bells and whistles, our Uformer achieves superior or comparable performance compared with the state-of-the-art algorithms. The code and models are available at https://github.com/ZhendongWang6/Uformer.
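
A minimal sketch of window-based self-attention, the idea behind the LeWin block: tokens attend only within non-overlapping windows, so the cost grows linearly with image size rather than quadratically. The window size, channel width, and use of nn.MultiheadAttention are assumptions; the paper's block additionally adds locally-enhanced convolutions, which are omitted here.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows (assumed sizes;
    a sketch of the idea, not the paper's LeWin block)."""
    def __init__(self, dim=32, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # (B, C, H, W)
        b, c, h, w = x.shape
        ws = self.window
        # partition the map into (B * num_windows, ws*ws, C) token groups
        x = x.reshape(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)                            # attention only within each window
        # reverse the partition back to an image-shaped feature map
        x = x.reshape(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x

layer = WindowSelfAttention()
out = layer(torch.randn(1, 32, 64, 64))
print(out.shape)   # torch.Size([1, 32, 64, 64])
```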

118 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed SNUNet-CD (the combination of Siamese network and NestedUNet), which alleviated the loss of localization information in the deep layers of neural network through compact information transmission between encoder and decoder.
Abstract: Change detection is an important task in remote sensing (RS) image analysis. It is widely used in natural disaster monitoring and assessment, land resource planning, and other fields. As a pixel-to-pixel prediction task, change detection is sensitive to the utilization of the original position information. Recent change detection methods always focus on the extraction of deep change semantic features but ignore the importance of shallow-layer information containing high-resolution and fine-grained features; this often leads to uncertainty in the pixels at the edge of the changed target and to missed detection of small targets. In this letter, we propose a densely connected Siamese network for change detection, namely SNUNet-CD (the combination of Siamese network and NestedUNet). SNUNet-CD alleviates the loss of localization information in the deep layers of the neural network through compact information transmission between encoder and decoder, and between decoder and decoder. In addition, an Ensemble Channel Attention Module (ECAM) is proposed for deep supervision. Through ECAM, the most representative features of different semantic levels can be refined and used for the final classification. Experimental results show that our method improves greatly on many evaluation criteria and achieves a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.
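
A hedged sketch in the spirit of ensemble channel attention: decoder outputs from several semantic levels are concatenated and reweighted channel-wise with a squeeze-and-excitation style gate. The channel counts, reduction ratio, and exact gating form are assumptions, not the paper's ECAM implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over concatenated
    multi-level features (a sketch in the spirit of ECAM, with assumed sizes)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))   # per-channel reweighting

# four decoder outputs at the same resolution, concatenated along channels
levels = [torch.randn(1, 32, 64, 64) for _ in range(4)]
fused = torch.cat(levels, dim=1)              # (1, 128, 64, 64)
attended = ChannelAttention(128)(fused)       # reweighted features for final classification
print(attended.shape)
```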

100 citations


Journal ArticleDOI
TL;DR: Li et al. as mentioned in this paper proposed a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, and incorporated BIT in a deep feature differencing-based CD framework.
Abstract: Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions are still struggling to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance via modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve this, we express the bitemporal image into a few tokens and use a transformer encoder to model contexts in the compact token-based space-time. The learned context-rich tokens are then fed back to the pixel space to refine the original features via a transformer decoder. We incorporate BIT in a deep feature differencing-based CD framework. Extensive experiments on three CD datasets demonstrate the effectiveness and efficiency of the proposed method. Notably, our BIT-based model significantly outperforms the purely convolutional baseline with only about a third of the computational cost and model parameters. Based on a naive backbone (ResNet18) without sophisticated structures (e.g., FPN, UNet), our model surpasses several state-of-the-art CD methods, and performs better than four recent attention-based methods in terms of efficiency and accuracy. Our code is available at https://github.com/justchenhao/BIT_CD.
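
The abstract's "express the bitemporal image into a few tokens" can be illustrated with a learned spatial-attention pooling that turns each feature map into a handful of semantic tokens, which a transformer encoder could then relate across time. The token count, dimensions, and pooling form below are assumptions, not the paper's tokenizer.

```python
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Pool a feature map into L semantic tokens with learned spatial attention
    (assumed L=4 and dim=32; a sketch of the tokenization idea only)."""
    def __init__(self, dim=32, num_tokens=4):
        super().__init__()
        self.attn = nn.Conv2d(dim, num_tokens, kernel_size=1)

    def forward(self, feat):                         # (B, C, H, W)
        a = self.attn(feat).flatten(2).softmax(-1)   # (B, L, H*W) spatial weights per token
        v = feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
        return a @ v                                 # (B, L, C) compact semantic tokens

tokenizer = SemanticTokenizer()
t1 = tokenizer(torch.randn(1, 32, 64, 64))           # tokens for the image at time 1
t2 = tokenizer(torch.randn(1, 32, 64, 64))           # tokens for the image at time 2
tokens = torch.cat([t1, t2], dim=1)                  # joint space-time token set for an encoder
print(tokens.shape)                                  # (1, 8, 32)
```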

97 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: TrackFormer as mentioned in this paper is an end-to-end trainable multi-object tracking approach based on an encoder-decoder Transformer architecture, which achieves data association between frames via attention by evolving a set of track predictions through a video sequence.
Abstract: The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new and identity preserving track queries. Both query types benefit from self- and encoder-decoder attention on global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new tracking-by-attention paradigm and while simple in its design is able to achieve state-of-the-art performance on the task of multi-object tracking (MOT17) and segmentation (MOTS20). The code is available at https://github.com/timmeinhardt/TrackFormer

86 citations


Journal ArticleDOI
TL;DR: In this article, a neural network hyperparameter optimization method was used to improve the accuracy and generalization for RUL prediction of zinc-ion batteries, and the validity of the research work done in this paper is verified by a series of comparative experiments.

85 citations


Journal ArticleDOI
TL;DR: A novel planar flow-based variational auto-encoder prediction model (PFVAE) is proposed, which uses a long short-term memory (LSTM) network as the auto-encoder and designs the variational auto-encoder (VAE) as a time series data predictor to overcome noise effects.
Abstract: Prediction based on time series has a wide range of applications. Due to the complex nonlinear and random distribution of time series data, the performance of learned prediction models can be reduced by modeling bias or overfitting. This paper proposes a novel planar flow-based variational auto-encoder prediction model (PFVAE), which uses a long short-term memory (LSTM) network as the auto-encoder and designs the variational auto-encoder (VAE) as a time series data predictor to overcome noise effects. In addition, the internal structure of the VAE is transformed using planar flow, which enables it to learn and fit the nonlinearity of time series data and improve the dynamic adaptability of the network. The prediction experiments verify that the proposed model is superior to other models regarding prediction accuracy and prove that it is effective for predicting time series data.
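
The planar flow mentioned here is the classic transform f(z) = z + u·tanh(wᵀz + b) applied to the VAE latent. Below is a minimal sketch of that transform; the dimensions are assumed, and the log-determinant term needed for the full VAE objective is omitted.

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar-flow step f(z) = z + u * tanh(w^T z + b), used to make a
    Gaussian VAE latent more flexible (dimensions are assumptions)."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                               # (batch, dim)
        pre = z @ self.w + self.b                        # (batch,)
        return z + self.u * torch.tanh(pre).unsqueeze(-1)

z = torch.randn(8, 16)            # latent samples from the VAE encoder
flows = nn.Sequential(*[PlanarFlow(16) for _ in range(4)])
z_k = flows(z)                    # transformed latent fed to the decoder/predictor
print(z_k.shape)                  # torch.Size([8, 16])
```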

76 citations


Journal ArticleDOI
Zhou Deng
TL;DR: Zhang et al. as discussed by the authors proposed a feature adaptive transformer network (FAT-Net) which integrates an extra transformer branch to capture long-range dependencies and global context information, and employed a memory-efficient decoder and a feature adaptation module to enhance the feature fusion between the adjacent-level features by activating the effective channels and restraining the irrelevant background noise.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a Siamese global learning (Siam-GL) framework, which is a novel semantic change detection framework for HSR remote sensing images.
Abstract: Due to the abundant features of high spatial resolution (HSR) remote sensing images, change detection of these images is crucial to understanding land-use and land-cover (LULC) changes. However, previous works mostly focus on traditional binary change detection without considering the semantic information of the change classes. The latest progress of deep learning (DL) shows its advantages in HSR remote sensing image change detection. However, due to the large number of parameter calculations, DL networks always require a large quantity of labeled data. In addition, DL methods for change detection usually follow a patch-based learning framework, which considers only the local area and leads to a sample imbalance problem for semantic change detection. To address the above issues, we first propose a Siamese global learning (Siam-GL) framework, a novel semantic change detection framework for HSR remote sensing images. In Siam-GL, the Siamese architecture with shared parameters is constructed to effectively extract the representative features of bi-temporal HSR remote sensing images. The global hierarchical (G-H) sampling mechanism is designed to address the imbalanced training sample problem with insufficient samples. Furthermore, a binary change mask is added between the encoder and decoder to weaken the influence of the no-change regional background on the change regional foreground, further improving the accuracy of the proposed framework. The experimental results obtained with three diverse HSR datasets of typical Chinese cities demonstrate that the Siam-GL framework outperforms advanced semantic change detection methods both quantitatively and qualitatively. Moreover, to verify the generalization performance of the Siam-GL framework, a larger dataset was used for evaluation, and the results show that the Siam-GL framework has strong generalization performance.

Journal ArticleDOI
01 Jun 2022 - Energy
TL;DR: In this article, an auto-encoder (AE) extreme learning machine (AE-ELM) model is proposed to predict the NOx emission concentration based on the combination of the mutual information (MI) algorithm, the AE, and the extreme learning machine (ELM).

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an SER-enhanced traffic efficiency solution for autonomous vehicles in a 5G-enabled space–air–ground integrated network (SAGIN)-based ITS.
Abstract: Speech emotion recognition (SER) is becoming the main human–computer interaction logic for autonomous vehicles in the next generation of intelligent transportation systems (ITSs). It can improve not only the safety of autonomous vehicles but also the personalized in-vehicle experience. However, current vehicle-mounted SER systems still suffer from two major shortcomings. One is the insufficient service capacity of the vehicle communication network, which is unable to meet the SER needs of autonomous vehicles in next-generation ITSs in terms of the data transmission rate, power consumption, and latency. Second, the accuracy of SER is poor, and it cannot provide sufficient interactivity and personalization between users and vehicles. To address these issues, we propose an SER-enhanced traffic efficiency solution for autonomous vehicles in a 5G-enabled space–air–ground integrated network (SAGIN)-based ITS. First, we convert the vehicle speech information data into spectrograms and input them into an AlexNet network model to obtain the high-level features of the vehicle speech acoustic model. At the same time, we convert the vehicle speech information data into text information and input it into the Bidirectional Encoder Representations from Transformers (BERT) model to obtain the high-level features of the corresponding text model. Finally, these two sets of high-level features are cascaded together to obtain fused features, which are sent to a softmax classifier for emotion matching and classification. Experiments show that the proposed solution can improve not only the SAGIN’s service capabilities, resulting in a large capacity, high bandwidth, ultralow latency, and high reliability, but also the accuracy of vehicle SER as well as the performance, practicality, and user experience of the ITS.

Journal ArticleDOI
TL;DR: In this paper, a transformer and CNN hybrid deep neural network for semantic segmentation of very high-resolution remote sensing imagery is presented, in which the encoder module uses the new universal Swin transformer backbone to extract features and achieve better modeling of long-range spatial dependencies.
Abstract: This paper presents a transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. The model follows an encoder-decoder structure. The encoder module uses the new universal Swin transformer backbone to extract features and achieve better modeling of long-range spatial dependencies. The decoder module draws on some effective blocks and successful strategies of CNN-based models in remote sensing image segmentation. In the middle of the framework, an atrous spatial pyramid pooling block based on depth-wise separable convolution (SASPP) is applied to obtain multi-scale context. A U-shaped decoder is used to gradually restore the size of the feature maps. Three skip connections are built between the encoder and decoder feature maps of the same size to maintain the transmission of local details and enhance the communication of multi-scale features. A squeeze-and-excitation (SE) channel attention block is added before segmentation for feature augmentation. An auxiliary boundary detection branch is combined to provide edge constraints for semantic segmentation. Extensive ablation experiments were conducted on the ISPRS Vaihingen and Potsdam benchmarks to test the effectiveness of multiple components of the network. At the same time, the proposed method is compared with the current state-of-the-art methods on the two benchmarks. The proposed hybrid network achieved the second highest overall accuracy (OA) on both the Potsdam and Vaihingen benchmarks. Code and models are available at https://github.com/zq7734509/mmsegmentation-multi-layer.

Journal ArticleDOI
28 Jun 2022
TL;DR: UCTransNet as discussed by the authors proposes a channel-wise cross-attention mechanism to solve the problem of incompatible feature sets of the encoder and decoder stages, which negatively affects segmentation performance.
Abstract: Most recent semantic segmentation methods adopt a U-Net framework with an encoder-decoder architecture. It is still challenging for U-Net with a simple skip connection scheme to model the global multi-scale context: 1) not every skip connection setting is effective, due to the issue of incompatible feature sets of the encoder and decoder stages, and some skip connections even negatively influence the segmentation performance; 2) the original U-Net is worse than the one without any skip connection on some datasets. Based on our findings, we propose a new segmentation framework, named UCTransNet (with a proposed CTrans module in U-Net), from the channel perspective with an attention mechanism. Specifically, the CTrans (Channel Transformer) module is an alternative to the U-Net skip connections and consists of a sub-module to conduct multi-scale Channel Cross fusion with Transformer (named CCT) and a sub-module of Channel-wise Cross-Attention (named CCA) to guide the fused multi-scale channel-wise information to effectively connect to the decoder features and eliminate ambiguity. Hence, the proposed connection consisting of CCT and CCA is able to replace the original skip connection to bridge the semantic gaps for accurate automatic medical image segmentation. The experimental results suggest that our UCTransNet produces more precise segmentation performance and achieves consistent improvements over the state of the art for semantic segmentation across different datasets and conventional architectures involving transformer or U-shaped frameworks. Code: https://github.com/McGregorWwww/UCTransNet.
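
One simplified reading of channel-wise cross-attention is to let decoder features produce per-channel weights that gate the skip (encoder) features before fusion. The sketch below illustrates that idea with assumed channel sizes; it is not the paper's CCA module.

```python
import torch
import torch.nn as nn

class ChannelWiseCrossAttention(nn.Module):
    """Decoder features generate per-channel weights that gate the skip features
    (a simplified sketch of the channel-wise cross-attention idea, sizes assumed)."""
    def __init__(self, skip_ch, dec_ch):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(skip_ch + dec_ch, skip_ch)

    def forward(self, skip, dec):                            # both (B, C, H, W)
        desc = torch.cat([self.pool(skip), self.pool(dec)], dim=1).flatten(1)
        weights = torch.sigmoid(self.fc(desc))               # (B, skip_ch) channel weights
        return skip * weights[:, :, None, None]              # channel-gated skip features

cca = ChannelWiseCrossAttention(skip_ch=64, dec_ch=128)
out = cca(torch.randn(2, 64, 32, 32), torch.randn(2, 128, 32, 32))
print(out.shape)    # (2, 64, 32, 32): fed to the decoder instead of the raw skip connection
```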

Journal ArticleDOI
01 Aug 2022
TL;DR: Wang et al. as mentioned in this paper proposed a Transformer-based decoder and constructed a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation.
Abstract: Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a dual-scale encoder-decoder architecture with self-attention to enhance the semantic segmentation quality of varying medical images, which can effectively model the non-local dependencies and multi-scale contexts for enhancing the pixel-level intrinsic structural features inside each patch.
Abstract: Automatic medical image segmentation has made great progress owing to the powerful deep representation learning. Inspired by the success of self-attention mechanism in Transformer, considerable efforts are devoted to designing the robust variants of encoder-decoder architecture with Transformer. However, the patch division used in the existing Transformer-based models usually ignores the pixel-level intrinsic structural features inside each patch. In this paper, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which aims to incorporate the hierarchical Swin Transformer into both encoder and decoder of the standard U-shaped architecture. Our DS-TransUNet benefits from the self-attention computation in Swin Transformer and the designed dual-scale encoding, which can effectively model the non-local dependencies and multi-scale contexts for enhancing the semantic segmentation quality of varying medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet adopts a well-established dual-scale encoding mechanism that utilizes dual-scale encoders based on Swin Transformer to extract the coarse and fine-grained feature representations of different semantic scales. Meanwhile, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively perform the multi-scale information fusion through the self-attention mechanism. Furthermore, we introduce the Swin Transformer block into decoder to further explore the long-range contextual information during the up-sampling process. Extensive experiments across four typical tasks for medical image segmentation demonstrate the effectiveness of DS-TransUNet, and our approach significantly outperforms the state-of-the-art methods.

Proceedings ArticleDOI
17 Jul 2022
TL;DR: In this paper, a transformer-based Siamese network architecture (abbreviated as ChangeFormer) is proposed for change detection from a pair of co-registered remote sensing images.
Abstract: This paper presents a transformer-based Siamese network architecture (abbreviated by ChangeFormer) for Change Detection (CD) from a pair of co-registered remote sensing images. Different from recent CD frameworks, which are based on fully convolutional networks (ConvNets), the proposed method unifies a hierarchically structured transformer encoder with a Multi-Layer Perceptron (MLP) decoder in a Siamese network architecture to efficiently render multi-scale long-range details required for accurate CD. Experiments on two CD datasets show that the proposed end-to-end trainable ChangeFormer architecture achieves better CD performance than previous counterparts. Our code and pre-trained models are available at github.com/wgcban/ChangeFormer.
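
The Siamese arrangement itself is easy to sketch: one shared encoder processes both co-registered images and a light head predicts change from the feature difference. The convolutional encoder and absolute-difference head below are placeholders standing in for the paper's hierarchical transformer encoder and MLP decoder.

```python
import torch
import torch.nn as nn

# Placeholder shared encoder; the paper uses a hierarchical transformer instead.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

# Lightweight per-pixel head on the feature difference (stand-in for the MLP decoder).
head = nn.Conv2d(64, 2, kernel_size=1)           # 2 classes: change / no-change

img_t1 = torch.randn(1, 3, 256, 256)
img_t2 = torch.randn(1, 3, 256, 256)
f1, f2 = encoder(img_t1), encoder(img_t2)        # shared weights => Siamese branches
change_logits = head(torch.abs(f1 - f2))         # (1, 2, 64, 64) coarse change map
change_logits = nn.functional.interpolate(
    change_logits, size=img_t1.shape[-2:], mode='bilinear', align_corners=False)
print(change_logits.shape)                        # (1, 2, 256, 256)
```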

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a model based on a Transformer architecture, which has two parts: the encoder part to learn useful representations from the fake news data and the decoder part that predicts the future behaviour based on past observations.
Abstract: Fake news is a real problem in today's world, and it has become more extensive and harder to identify. A major challenge in fake news detection is to detect it in the early phase. Another challenge in fake news detection is the unavailability or the shortage of labelled data for training the detection models. We propose a novel fake news detection framework that can address these challenges. Our proposed framework exploits the information from the news articles and the social contexts to detect fake news. The proposed model is based on a Transformer architecture, which has two parts: the encoder part to learn useful representations from the fake news data and the decoder part that predicts the future behaviour based on past observations. We also incorporate many features from the news content and social contexts into our model to help us classify the news better. In addition, we propose an effective labelling technique to address the label shortage problem. Experimental results on real-world data show that our model can detect fake news with higher accuracy within a few minutes after it propagates (early detection) than the baselines.

Journal ArticleDOI
TL;DR: This study proposes a novel multiclass wind turbine bearing fault diagnosis strategy based on the conditional variational auto-encoder generative adversarial network (CVAE-GAN) model combined with multisource signal fusion, and shows that the proposed strategy can increase wind turbine bearing fault diagnosis accuracy in complex scenarios.
Abstract: Low fault diagnosis accuracy in the case of insufficient and imbalanced samples is a major problem in wind turbine fault diagnosis. The imbalance of samples refers to a large difference in the number of samples of different categories or the lack of a certain fault sample, which requires good learning of the characteristics of a small number of samples. Sample generation with a deep generative model can effectively solve this problem. In this study, we propose a novel multiclass wind turbine bearing fault diagnosis strategy based on the conditional variational auto-encoder generative adversarial network (CVAE-GAN) model combined with multisource signal fusion. This strategy converts multisource 1-D vibration signals into 2-D signals, and the multisource 2-D signals are fused by using the wavelet transform. The CVAE-GAN model was developed by merging the variational autoencoder (VAE) with the generative adversarial network (GAN). The VAE encoder was introduced as the front end of the GAN generator. The sample label was introduced as a model input to improve the model’s training efficiency. Finally, the sample set was used to train the encoder, generator, and discriminator in the CVAE-GAN model to supplement the number of fault samples. In the classifier, the sample set is used to perform experimental analysis under various sample conditions. The results show that the proposed strategy can increase wind turbine bearing fault diagnosis accuracy in complex scenarios.
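
To illustrate how a class label can steer generation, the sketch below shows a minimal conditional VAE in which the label is concatenated to both the encoder input and the latent code. The GAN discriminator is omitted and all layer sizes are assumptions, so this is only the VAE half of a CVAE-GAN-style model.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal conditional VAE: the class label is concatenated to both the
    encoder input and the latent code (sizes assumed; the GAN discriminator
    used by CVAE-GAN is omitted)."""
    def __init__(self, in_dim=784, n_classes=5, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim + n_classes, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x, y_onehot):
        h = self.enc(torch.cat([x, y_onehot], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(torch.cat([z, y_onehot], dim=1)), mu, logvar

model = ConditionalVAE()
x = torch.rand(4, 784)                                  # flattened 28x28 "signal images" (assumed)
y = nn.functional.one_hot(torch.tensor([0, 1, 2, 3]), 5).float()
recon, mu, logvar = model(x, y)                         # label-controlled sample generation
print(recon.shape)                                       # torch.Size([4, 784])
```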

Journal ArticleDOI
TL;DR: A novel encoder-decoder architecture, called contextual ensemble network (CENet), for semantic segmentation, where the contextual cues are aggregated via densely upsampling the convolutional features of the deep layers to the shallow deconvolutional layers.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a novel time-frequency Transformer (TFT) model inspired by the massive success of the vanilla Transformer in sequence processing, in which they designed a fresh tokenizer and encoder module to extract effective abstractions from the time-frequency representation (TFR) of vibration signals.

Journal ArticleDOI
TL;DR: In this paper, an expanded sequence-to-sequence (E-Seq2Seq)-based data-driven SCUC expert system for dynamic multiple-sequence mapping samples is proposed.
Abstract: Under the background of the rapid change of energy technology and the deep integration of artificial intelligence into the power system, it is of great significance to study the intelligent decision-making method of security-constrained unit commitment (SCUC) with high adaptability and high accuracy. Thus, in this article, an expanded sequence-to-sequence (E-Seq2Seq)-based data-driven SCUC expert system for dynamic multiple-sequence mapping samples is proposed. First, dynamic multiple-sequence mapping samples of SCUC are reconstructed by analyzing the input–output sequence characteristics. Then, an E-Seq2Seq approach with a multiple-encoder–decoder architecture and a fully connected extension layer is proposed. On this basis, the simple recurrent unit is introduced as a neuron of the E-Seq2Seq approach to construct deep learning models, and an intelligent data-driven expert system for SCUC is further developed. The proposed approach has been simulated on a typical IEEE 118-bus system and a practical system in Hunan province in China. The results indicate that the proposed approach could possess strong generality, high solution accuracy, and efficiency over traditional methods.

Journal ArticleDOI
TL;DR: A feature construction encoder is proposed to obtain the features layerwise in a top-down manner, where the feature nodes in the higher layer flow to the adjacent low layer by dynamically changing their structure, and the proposed FRNet is comparable to state-of-the-art RGB-D indoor scene parsing methods on two public indoor datasets.
Abstract: Scene parsing has recently demonstrated remarkable performance, and one of the aspects shown to be relevant to this performance is the generation of multilevel feature representations. However, most existing scene parsing methods obtain multilevel feature representations with weak distinctions and large spans. Therefore, despite using complex mechanisms, the effects on the feature representations are minimal. To address this, we leverage the inherent multilevel cross-modal data and back propagation to develop a novel feature reconstruction network (FRNet) for RGB-D indoor scene parsing. Specifically, a feature construction encoder is proposed to obtain the features layerwise in a top-down manner, where the feature nodes in the higher layer flow to the adjacent lower layer by dynamically changing their structure. In addition, we propose a cross-level enriching module in the encoder to selectively refine and weight the features in each layer in the RGB and depth modalities, as well as a cross-modality awareness module to generate the feature nodes containing the modality data. Finally, we integrate the multilevel feature representations simply via dilated convolutions at different rates. Extensive quantitative and qualitative experiments were conducted, and the results demonstrate that the proposed FRNet is comparable to state-of-the-art RGB-D indoor scene parsing methods on two public indoor datasets.

Journal ArticleDOI
TL;DR: UNeXt as discussed by the authors is a convolutional multilayer perceptron (MLP) based network for image segmentation, which has an early convolutional stage and an MLP stage in the latent stage.
Abstract: UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years. However, these networks cannot be effectively adopted for rapid image segmentation in point-of-care applications as they are parameter-heavy, computationally complex and slow to use. To this end, we propose UNeXt, which is a convolutional multilayer perceptron (MLP) based network for image segmentation. We design UNeXt in an effective way with an early convolutional stage and an MLP stage in the latent stage. We propose a tokenized MLP block where we efficiently tokenize and project the convolutional features and use MLPs to model the representation. To further boost the performance, we propose shifting the channels of the inputs while feeding into the MLPs so as to focus on learning local dependencies. Using tokenized MLPs in latent space reduces the number of parameters and computational complexity while resulting in a better representation to help segmentation. The network also consists of skip connections between various levels of the encoder and decoder. We test UNeXt on multiple medical image segmentation datasets and show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance over the state-of-the-art medical image segmentation architectures. Code is available at https://github.com/jeya-maria-jose/UNeXt-pytorch. Keywords: Medical image segmentation, MLP, Point-of-care.
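
A hedged sketch of the tokenized-MLP idea: shift channel groups spatially, flatten the feature map into tokens, and mix them with a small MLP plus a residual connection. The group count, shift offsets, and layer sizes are assumptions rather than the paper's exact block.

```python
import torch
import torch.nn as nn

def shift_channels(x, groups=4):
    """Shift each channel group by a different offset along the width axis,
    a simplified version of the channel shifting described in the abstract."""
    chunks = torch.chunk(x, groups, dim=1)
    return torch.cat([torch.roll(c, shifts=i - groups // 2, dims=3)
                      for i, c in enumerate(chunks)], dim=1)

class TokenizedMLPBlock(nn.Module):
    """Sketch of a tokenized MLP block: shift channels, flatten to tokens,
    mix with an MLP and a residual connection (sizes assumed)."""
    def __init__(self, channels=32, hidden=64):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.GELU(),
                                 nn.Linear(hidden, channels))

    def forward(self, x):                                        # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = shift_channels(x).flatten(2).transpose(1, 2)    # (B, H*W, C)
        tokens = x.flatten(2).transpose(1, 2) + self.mlp(self.norm(tokens))
        return tokens.transpose(1, 2).view(b, c, h, w)

block = TokenizedMLPBlock()
print(block(torch.randn(1, 32, 28, 28)).shape)                   # torch.Size([1, 32, 28, 28])
```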

Journal ArticleDOI
TL;DR: A Transformer-based neural network is designed that combines denoising and prediction tasks into a unified framework for predicting Remaining Useful Life (RUL) of a Li-ion battery.
Abstract: Accurately predicting the Remaining Useful Life (RUL) of a Li-ion battery plays an important role in managing the health and estimating the state of a battery. With the rapid development of electric vehicles, there is an increasing need to develop and improve the techniques for predicting RUL. To predict RUL, we designed a Transformer-based neural network. First, battery capacity data is always full of noise, especially during battery charge/discharge regeneration. To alleviate this problem, we applied a Denoising Auto-Encoder (DAE) to process raw data. Then, to capture temporal information and learn useful features, a reconstructed sequence was fed into a Transformer network. Finally, to bridge denoising and prediction tasks, we combined these two tasks into a unified framework. Results of extensive experiments conducted on two data sets and a comparison with some existing methods show that our proposed method performs better in predicting RUL. Our projects are all open source and are available at https://github.com/XiuzeZhou/RUL.
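
The two-stage idea (denoise, then predict) can be sketched as a small denoising auto-encoder feeding a stock transformer encoder with a regression head. Every layer size, the sequence length, and the mean-pooling head below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DenoiseAndPredict(nn.Module):
    """Denoising auto-encoder followed by a Transformer encoder and a regression
    head (a sketch of the combined denoise + predict framework; sizes assumed)."""
    def __init__(self, seq_len=50, d_model=32):
        super().__init__()
        self.dae = nn.Sequential(nn.Linear(seq_len, 64), nn.ReLU(),
                                 nn.Linear(64, seq_len))           # reconstructs the capacity curve
        self.embed = nn.Linear(1, d_model)                         # one scalar capacity value per step
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(d_model, 1)                           # RUL estimate

    def forward(self, noisy_capacity):                               # (B, seq_len)
        denoised = self.dae(noisy_capacity)
        tokens = self.embed(denoised.unsqueeze(-1))                  # (B, seq_len, d_model)
        feats = self.encoder(tokens)
        return denoised, self.head(feats.mean(dim=1))                # joint denoise + predict outputs

model = DenoiseAndPredict()
noisy = torch.rand(4, 50)                  # normalized capacity measurements (assumed input)
denoised, rul = model(noisy)
print(denoised.shape, rul.shape)           # torch.Size([4, 50]) torch.Size([4, 1])
```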

Journal ArticleDOI
TL;DR: Transformer-based pre-trained language models (PLMs), which combine the power of transformers, transfer learning, and self-supervised learning, have started a new era in modern natural language processing (NLP), as mentioned in this paper.

Journal ArticleDOI
TL;DR: KiU-Net as mentioned in this paper uses an overcomplete convolutional architecture where the input image is projected into a higher dimension such that the receptive field is constrained from increasing in the deep layers of the network.
Abstract: Most methods for medical image segmentation use U-Net or its variants as they have been successful in most of the applications. After a detailed analysis of these "traditional" encoder-decoder based approaches, we observed that they perform poorly in detecting smaller structures and are unable to segment boundary regions precisely. This issue can be attributed to the increase in receptive field size as we go deeper into the encoder. The extra focus on learning high-level features causes U-Net based approaches to learn less information about low-level features which are crucial for detecting small structures. To overcome this issue, we propose using an overcomplete convolutional architecture where we project the input image into a higher dimension such that we constrain the receptive field from increasing in the deep layers of the network. We design a new architecture for image segmentation, KiU-Net, which has two branches: (1) an overcomplete convolutional network, Kite-Net, which learns to capture fine details and accurate edges of the input, and (2) U-Net, which learns high-level features. Furthermore, we also propose KiU-Net 3D, which is a 3D convolutional architecture for volumetric segmentation. We perform a detailed study of KiU-Net by performing experiments on five different datasets covering various image modalities. We achieve good performance with the additional benefit of fewer parameters and faster convergence. We also demonstrate that extensions of KiU-Net based on residual blocks and dense blocks result in further performance improvements. Code: https://github.com/jeya-maria-jose/KiU-Net-pytorch.
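
The contrast between a conventional "undercomplete" encoder stage and an overcomplete (Kite-Net style) stage is easy to see in code: one pools and grows the receptive field quickly, the other upsamples first so the effective receptive field stays small and fine structures are preserved. Channel counts and layer choices below are assumptions.

```python
import torch
import torch.nn as nn

# Standard "undercomplete" encoder stage: pooling grows the receptive field quickly.
unet_stage = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2))                                                     # 128 -> 64

# Overcomplete (Kite-Net style) stage: upsampling before convolution keeps the
# effective receptive field small, which favours fine structures and edges.
kite_stage = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),  # 128 -> 256
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())

x = torch.randn(1, 1, 128, 128)
print(unet_stage(x).shape)   # torch.Size([1, 16, 64, 64])
print(kite_stage(x).shape)   # torch.Size([1, 16, 256, 256])
```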

Book ChapterDOI
TL;DR: In this article, the authors explore the recent StyleGAN3 architecture, compare it to its predecessor, and investigate its unique advantages, as well as drawbacks, and propose an encoding scheme that is trained solely on aligned data yet can still invert unaligned images.
Abstract: StyleGAN is arguably one of the most intriguing and well-studied generative models, demonstrating impressive performance in image generation, inversion, and manipulation. In this work, we explore the recent StyleGAN3 architecture, compare it to its predecessor, and investigate its unique advantages, as well as drawbacks. In particular, we demonstrate that while StyleGAN3 can be trained on unaligned data, one can still use aligned data for training, without hindering the ability to generate unaligned imagery. Next, our analysis of the disentanglement of the different latent spaces of StyleGAN3 indicates that the commonly used W/W+ spaces are more entangled than their StyleGAN2 counterparts, underscoring the benefits of using the StyleSpace for fine-grained editing. Considering image inversion, we observe that existing encoder-based techniques struggle when trained on unaligned data. We therefore propose an encoding scheme that is trained solely on aligned data yet can still invert unaligned images. Finally, we introduce a novel video inversion and editing workflow that leverages the capabilities of a fine-tuned StyleGAN3 generator to reduce texture sticking and expand the field of view of the edited video. Keywords: Generative Adversarial Networks, Image and video editing.