Author

Jiayu Xu

Bio: Jiayu Xu is an academic researcher. The author has contributed to research in the topics of Computer science & Artificial intelligence. The author has an h-index of 1 and has co-authored 1 publication receiving 3 citations.

Papers
Posted Content
TL;DR: Wang et al. as discussed by the authors proposed dual-scale encoder subnetworks based on the Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales.
Abstract: Automatic medical image segmentation has made great progress, benefiting from the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which fail to build long-range dependencies and global context connections due to the limited receptive field of the convolution operation. Inspired by the success of the Transformer in modeling long-range contextual information, some researchers have expended considerable effort in designing robust variants of Transformer-based U-Net. Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which might be the first attempt to concurrently incorporate the advantages of the hierarchical Swin Transformer into both the encoder and decoder of the standard U-shaped architecture to enhance the semantic segmentation quality of varying medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on the Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales. As the core component of DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism. Furthermore, we also introduce the Swin Transformer block into the decoder to further explore long-range contextual information during the up-sampling process. Extensive experiments across four typical tasks for medical image segmentation demonstrate the effectiveness of DS-TransUNet and show that our approach significantly outperforms state-of-the-art methods.
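To make the cross-scale fusion idea concrete, here is a minimal sketch, not the authors' TIF implementation, of how tokens from a fine-grained branch can attend jointly with a summary of a coarse-grained branch via self-attention. The class name `CrossScaleFusion`, the use of a single pooled global token, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of cross-scale token fusion (illustrative, not the paper's code).
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fine_tokens: torch.Tensor, coarse_tokens: torch.Tensor) -> torch.Tensor:
        # Summarize the coarse branch into one global token and prepend it to the
        # fine-scale sequence, so self-attention can mix information across scales.
        global_token = coarse_tokens.mean(dim=1, keepdim=True)   # (B, 1, C)
        mixed = torch.cat([global_token, fine_tokens], dim=1)    # (B, 1+N, C)
        fused, _ = self.attn(self.norm(mixed), self.norm(mixed), self.norm(mixed))
        return fused[:, 1:, :] + fine_tokens                     # residual; drop the global token

fine = torch.randn(2, 196, 96)    # e.g. 14x14 patches, embedding dim 96
coarse = torch.randn(2, 49, 96)   # e.g. 7x7 patches
print(CrossScaleFusion(96)(fine, coarse).shape)   # torch.Size([2, 196, 96])
```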

59 citations

Journal ArticleDOI
TL;DR: A serial attention frame (SAF) containing MAB and SAB is presented to address the effect of complex backgrounds in waste bottle recognition; it exhibited good recognition performance on the collected waste bottle datasets.
Abstract: The multi-label recognition of damaged waste bottles has important significance for environmental protection. However, most previous methods perform poorly, especially on damaged waste bottle classification. In this paper, we propose a serial attention frame (SAF) to overcome this drawback. The proposed network architecture includes three parts: a residual learning block (RB), a mixed attention block (MAB), and a self-attention block (SAB). The RB uses ResNet to pretrain the SAF so as to extract more detailed information. To address the effect of the complex backgrounds in waste bottle recognition, a serial attention mechanism containing the MAB and SAB is presented. The MAB extracts more salient category information via the simultaneous use of spatial attention and channel attention. The SAB exploits the obtained features and its parameters to diversify the features and improve the classification results. The experimental results demonstrate that our proposed model exhibited good recognition performance on the collected waste bottle datasets, with eight labels across three attributes, i.e., the color, whether the bottle was damaged, and whether the wrapper had been removed, as well as on public image classification datasets.
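The mixed attention idea, channel attention and spatial attention applied to the same feature map, can be sketched as below. This is not the authors' MAB code; the class name `MixedAttention`, the reduction ratio, and the 7x7 spatial kernel are illustrative assumptions.

```python
# Minimal sketch of a mixed (channel + spatial) attention block (illustrative).
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dimensions, re-weight channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: squeeze channels, re-weight spatial positions.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)
        spatial_desc = torch.cat([x.mean(dim=1, keepdim=True),
                                  x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_gate(spatial_desc)

x = torch.randn(2, 64, 32, 32)
print(MixedAttention(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```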

6 citations

Journal ArticleDOI
TL;DR: A survey of attention mechanisms acting within CNNs for image classification, covering the main architectures of attention-based CNNs, public and collected datasets, and experimental results in image classification.
Abstract: Deep learning techniques, in particular CNNs, can learn powerful context information and have been widely applied in image recognition. However, deep CNNs may rely on large width and large depth, which increases computational costs. Attention mechanisms fused into CNNs can address this problem. In this paper, we survey how attention mechanisms act within CNNs for image classification. Firstly, the survey traces the development of CNNs for image classification. Then, we illustrate the basics of CNNs and attention mechanisms for image classification. Next, we present the main architectures of CNNs with attention, public and collected datasets, and experimental results in image classification. Finally, we point out potential research directions and challenges for attention-based image classification and summarize the whole paper.
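As a minimal illustration of the kind of fusion this survey covers, a squeeze-and-excitation style channel gate can be wrapped around a stage of an off-the-shelf CNN classifier. This is a generic example, not a specific architecture from the survey; the class name `SEGate`, the insertion point, and the 10-class head are assumptions.

```python
# Illustrative sketch: fusing a channel-attention (SE-style) gate into a CNN classifier.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SEGate(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

model = resnet18(weights=None, num_classes=10)
# Wrap the last residual stage (512 channels) with the channel-attention gate.
model.layer4 = nn.Sequential(model.layer4, SEGate(512))
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 10])
```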

5 citations

Book ChapterDOI
01 Jan 2022
TL;DR: ConTrans as discussed by the authors is a concurrent structure consisting of two parallel encoders, i.e., a Swin Transformer encoder and a CNN encoder, which couples detailed localization information with global contexts to the maximum extent.
Abstract: Over the past few years, convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant architectures in medical image segmentation. Although CNNs can efficiently capture local representations, they have difficulty establishing long-distance dependencies. Comparably, ViTs achieve impressive success owing to their powerful global context modeling capabilities, but they may not generalize well on insufficient datasets due to the lack of the inductive biases inherent to CNNs. To inherit the merits of these two different design paradigms while avoiding their respective limitations, we propose a concurrent structure termed ConTrans, which can couple detailed localization information with global contexts to the maximum extent. ConTrans consists of two parallel encoders, i.e., a Swin Transformer encoder and a CNN encoder. Specifically, the CNN encoder is progressively stacked by the novel Depthwise Attention Block (DAB), with the aim of providing the precise local features we need. Furthermore, a well-designed Spatial-Reduction-Cross-Attention (SRCA) module is embedded in the decoder to form a comprehensive fusion of these two distinct feature representations and eliminate the semantic divergence between them. This allows the model to obtain accurate semantic information and ensures semantic consistency of the up-sampled features in a hierarchical manner. Extensive experiments across four typical tasks show that ConTrans significantly outperforms state-of-the-art methods on ten famous benchmarks.
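The spatial-reduction cross-attention idea can be sketched as follows: decoder queries attend to CNN features whose spatial resolution has been reduced first, which keeps the attention cost manageable. This is not the authors' SRCA code; the class name, the convolutional reduction, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of spatial-reduction cross-attention between transformer tokens
# and a CNN feature map (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

class SpatialReductionCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 2):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, C) transformer tokens; cnn_feat: (B, C, H, W) local CNN features.
        kv = self.sr(cnn_feat).flatten(2).transpose(1, 2)   # spatially reduced keys/values
        out, _ = self.attn(self.norm_q(queries), self.norm_kv(kv), self.norm_kv(kv))
        return queries + out                                  # residual connection

q = torch.randn(2, 196, 96)
f = torch.randn(2, 96, 28, 28)
print(SpatialReductionCrossAttention(96)(q, f).shape)   # torch.Size([2, 196, 96])
```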

4 citations

Journal ArticleDOI
TL;DR: A novel framework with context to locate and classify nuclei in microscopy image data is proposed and experimental results demonstrate that the method outperforms other recent state-of-the-art models in nucleus identification.
Abstract: MOTIVATION: Nucleus identification supports many quantitative analysis studies that rely on nuclei positions or categories. Contextual information in pathology images refers to information near the to-be-recognized cell, which can be very helpful for nucleus subtyping. Current CNN-based methods do not explicitly encode contextual information within the input images and point annotations. RESULTS: In this paper, we propose a novel framework with context to locate and classify nuclei in microscopy image data. Specifically, we first use state-of-the-art network architectures to extract multi-scale feature representations from multi-field-of-view, multi-resolution input images and then conduct feature aggregation on-the-fly with stacked convolutional operations. Then, two auxiliary tasks are added to the model to effectively utilize the contextual information: one for predicting the frequencies of nuclei, and the other for extracting the regional distribution information of the same kind of nuclei. The entire framework is trained in an end-to-end, pixel-to-pixel fashion. We evaluate our method on two histopathological image datasets with different tissue and stain preparations, and experimental results demonstrate that our method outperforms other recent state-of-the-art models in nucleus identification. AVAILABILITY: The source code of our method is freely available at https://github.com/qjxjy123/DonRabbit. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
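The multi-field-of-view aggregation step can be sketched as below: features from a wide, lower-resolution context view are resized and fused with the target-view features by stacked convolutions. This is a sketch under stated assumptions, not the released code linked above; the class name `MultiFOVAggregator`, the tiny backbones, and the single-channel output map are illustrative.

```python
# Minimal sketch of multi-field-of-view feature aggregation (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFOVAggregator(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        def conv(cin):  # small conv block used by both branches and the fusion head
            return nn.Sequential(nn.Conv2d(cin, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.target_branch = conv(3)    # high-resolution field of view
        self.context_branch = conv(3)   # wider, lower-resolution field of view
        self.fuse = nn.Sequential(conv(2 * channels), nn.Conv2d(channels, 1, 1))

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        t = self.target_branch(target)
        c = self.context_branch(context)
        c = F.interpolate(c, size=t.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([t, c], dim=1))   # pixel-wise nucleus map

target = torch.randn(1, 3, 128, 128)    # crop at native resolution
context = torch.randn(1, 3, 128, 128)   # larger crop resized to the same input size
print(MultiFOVAggregator()(target, context).shape)   # torch.Size([1, 1, 128, 128])
```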

Cited by
Journal ArticleDOI
02 Jun 2022
TL;DR: A comprehensive review of state-of-the-art Transformer-based approaches for medical imaging is presented in this paper, organized by the Transformer's key defining properties, which are mostly derived from comparing the Transformer with CNNs, and by the type of architecture, which specifies the manner in which the Transformer and CNN are combined.
Abstract: The Transformer, one of the latest technological advances of deep learning, has gained prevalence in natural language processing and computer vision. Since medical imaging bears some resemblance to computer vision, it is natural to inquire about the status quo of Transformers in medical imaging and ask the question: can Transformer models transform medical imaging? In this paper, we attempt to respond to this inquiry. After a brief introduction to the fundamentals of Transformers, especially in comparison with convolutional neural networks (CNNs), and after highlighting the key defining properties that characterize Transformers, we offer a comprehensive review of the state-of-the-art Transformer-based approaches for medical imaging and exhibit current research progress made in the areas of medical image segmentation, recognition, detection, registration, reconstruction, enhancement, etc. In particular, what distinguishes our review is its organization based on the Transformer's key defining properties, which are mostly derived from comparing the Transformer and the CNN, and on its type of architecture, which specifies the manner in which the Transformer and CNN are combined, all helping readers to best understand the rationale behind the reviewed approaches. We conclude with discussions of future perspectives.

37 citations

Journal ArticleDOI
TL;DR: In this article, the authors proposed an efficient date classification model based on the MobileNetV2 architecture, which achieved 99% accuracy on eight different classes of date fruit and was compared with other existing models such as AlexNet, VGG16, InceptionV3, ResNet, and MobileNetV2.
Abstract: A total of 8.46 million tons of date fruit are produced annually around the world. The date fruit is considered a high-valued confectionery and fruit crop. The hot arid zones of Southwest Asia, North Africa, and the Middle East are the major producers of date fruit. The production of dates in 1961 was 1.8 million tons, which increased to 2.8 million tons in 1985. In 2001, the production of dates was recorded at 5.4 million tons, whereas recently it has reached 8.46 million tons. A common problem found in the industry is the absence of an autonomous system for the classification of date fruit, resulting in reliance solely on manual expertise, which often involves hard work, expense, and bias. Recently, Machine Learning (ML) techniques have been employed in areas such as agriculture and fruit farming and have brought great convenience to human life. An automated system based on ML can carry out the fruit classification and sorting tasks that were previously handled by human experts. In various fields, CNNs (convolutional neural networks) have achieved impressive results in image classification. Considering the success of CNNs and transfer learning in other image classification problems, this research also employs a similar approach and proposes an efficient date classification model. In this research, a dataset of eight different classes of date fruit has been created to train the proposed model. Different preprocessing techniques have been applied in the proposed model, such as image augmentation, decayed learning rate, model checkpointing, and hybrid weight adjustment to increase the accuracy rate. The results show that the proposed model based on the MobileNetV2 architecture has achieved 99% accuracy. The proposed model has also been compared with other existing models such as AlexNet, VGG16, InceptionV3, ResNet, and MobileNetV2. The results prove that the proposed model performs better than all other models in terms of accuracy.
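A minimal transfer-learning sketch in the spirit of this approach is shown below: a MobileNetV2 backbone with a new 8-class head, simple augmentation, and a decayed learning rate. This is an illustrative setup, not the authors' exact pipeline (which also uses model checkpointing and hybrid weight adjustment); the optimizer, schedule, and augmentations are assumptions.

```python
# Illustrative MobileNetV2 transfer-learning setup for an 8-class fruit dataset.
import torch
import torch.nn as nn
from torchvision import models, transforms

# Simple augmentation pipeline (would be applied inside a real DataLoader).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

# Start from ImageNet-pretrained weights and replace the classification head.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.last_channel, 8)   # 8 date-fruit classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # decayed learning rate
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data.
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 8, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
scheduler.step()
print(float(loss))
```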

30 citations

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a multi-feature integration network (Swin-MFINet), which consists of an encoder, a Swin Transformer-based decoder, and Multi-Feature Integration (MFI) modules.
Abstract: Automatic surface defect detection is critical for manufacturing industries, such as the steel, fabric, and marble industries. This study proposes a Swin Transformer-based model called the Multi-Feature Integration Network (Swin-MFINet) for pixel-level surface defect detection. The proposed model consists of an encoder, a Swin Transformer-based decoder, and Multi-Feature Integration (MFI) modules. In the encoder module of the proposed model, a pre-trained Inception network is used to extract key features from small-size datasets. In the decoder section, global semantic features are obtained from the initial features by using the Swin Transformer block, the newest transformer technology of today. In addition, a convolution layer is used in the last step of the decoder, since transformers are limited in acquiring small spatial details such as edges, colors, and textures, which are important for detecting some small defects. In the last module, called MFI, feature maps from different decoder stages are combined, and a channel squeeze-spatial excitation block is applied to reveal important features. Finally, a prediction map is obtained by applying a convolution layer and a sigmoid activation function to the MFI module output, respectively. The performance of the proposed model is analyzed on the MT and MVTec datasets containing surface defect images. The proposed model obtained mIoU scores of 81.37% and 77.07%, respectively, for these two datasets. These results outperform the state of the art for the surface defect detection problem.
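The channel squeeze-spatial excitation step used in the MFI module can be sketched as follows: multi-stage decoder features are resized, concatenated, spatially re-weighted, and passed through a 1x1 convolution and sigmoid to produce the prediction map. This is not the authors' code; the class name `SpatialExcitationFusion` and all channel sizes are illustrative assumptions.

```python
# Minimal sketch of spatial-excitation-based multi-feature fusion (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialExcitationFusion(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        # Channel squeeze-spatial excitation: one per-pixel gate over the fused map.
        self.sse = nn.Sequential(nn.Conv2d(in_channels, 1, kernel_size=1), nn.Sigmoid())
        # Final 1x1 convolution + sigmoid for the pixel-level prediction map.
        self.head = nn.Sequential(nn.Conv2d(in_channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, features):
        size = features[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(f, size=size, mode="bilinear", align_corners=False) for f in features],
            dim=1,
        )
        fused = fused * self.sse(fused)   # spatial excitation of the fused features
        return self.head(fused)           # pixel-level defect prediction map

stages = [torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
print(SpatialExcitationFusion(32 + 64 + 128)(stages).shape)   # torch.Size([1, 1, 64, 64])
```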

14 citations

Proceedings ArticleDOI
01 Jan 2023
TL;DR: In this article, the authors propose HiFormer, a novel method that efficiently bridges a CNN and a transformer for medical image segmentation; it designs two multi-scale feature representations using the seminal Swin Transformer module and a CNN-based encoder.
Abstract: Convolutional neural networks (CNNs) have been the consensus for medical image segmentation tasks. However, they suffer from the limitation in modeling long-range dependencies and spatial correlations due to the nature of convolution operation. Although transformers were first developed to address this issue, they fail to capture low-level features. In contrast, it is demonstrated that both local and global features are crucial for dense prediction, such as segmenting in challenging contexts. In this paper, we propose HiFormer, a novel method that efficiently bridges a CNN and a transformer for medical image segmentation. Specifically, we design two multi-scale feature representations using the seminal Swin Transformer module and a CNN-based encoder. To secure a fine fusion of global and local features obtained from the two aforementioned representations, we propose a Double-Level Fusion (DLF) module in the skip connection of the encoder-decoder structure. Extensive experiments on various medical image segmentation datasets demonstrate the effectiveness of HiFormer over other CNN-based, transformer-based, and hybrid methods in terms of computational complexity, quantitative and qualitative results. Our code is publicly available at GitHub.

8 citations

Posted Content
TL;DR: nnFormer (Not-aNother transFormer) as discussed by the authors combines self-attention and convolution in an interleaved architecture and learns volumetric representations from 3D local volumes.
Abstract: Transformers, the default model of choice in natural language processing, have drawn scant attention from the medical imaging community. Given their ability to exploit long-term dependencies, transformers are promising in helping typical convolutional neural networks (convnets) overcome their inherent shortcomings of spatial inductive bias. However, most recently proposed transformer-based segmentation approaches simply treat transformers as assisting modules that help encode global context into convolutional representations, without investigating how to optimally combine self-attention (i.e., the core of transformers) with convolution. To address this issue, in this paper we introduce nnFormer (i.e., Not-aNother transFormer), a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution. In practice, nnFormer learns volumetric representations from 3D local volumes. Compared to the naive voxel-level self-attention implementation, such volume-based operations help to reduce the computational complexity by approximately 98% and 99.5% on the Synapse and ACDC datasets, respectively. In comparison to prior-art network configurations, nnFormer achieves tremendous improvements over previous transformer-based methods on the two commonly used datasets Synapse and ACDC. For instance, nnFormer outperforms Swin-UNet by over 7 percent on Synapse. Even when compared to nnUNet, currently the best performing fully-convolutional medical segmentation network, nnFormer still provides slightly better performance on Synapse and ACDC.
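The volume-based self-attention idea can be sketched as below: a 3D feature map is partitioned into non-overlapping local volumes and self-attention runs inside each volume rather than over all voxels, which is what reduces the cost relative to naive voxel-level attention. This is not the authors' code; the class name, window size, and dimensions are illustrative assumptions.

```python
# Minimal sketch of self-attention restricted to local 3D volumes (illustrative).
import torch
import torch.nn as nn

class LocalVolumeSelfAttention(nn.Module):
    def __init__(self, dim: int, window: int = 4, num_heads: int = 4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) with D, H, W divisible by the window size.
        b, c, d, h, w = x.shape
        ws = self.window
        # Partition the volume into (B * num_volumes, ws^3, C) token groups.
        x = x.view(b, c, d // ws, ws, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).reshape(-1, ws ** 3, c)
        y = self.norm(x)
        y, _ = self.attn(y, y, y)     # attention only within each local volume
        x = x + y                     # residual connection
        # Reverse the partition back to (B, C, D, H, W).
        x = x.reshape(b, d // ws, h // ws, w // ws, ws, ws, ws, c)
        return x.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(b, c, d, h, w)

vol = torch.randn(1, 48, 16, 16, 16)
print(LocalVolumeSelfAttention(48)(vol).shape)   # torch.Size([1, 48, 16, 16, 16])
```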

7 citations