
Showing papers in "IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2023"


Journal ArticleDOI
TL;DR: DL-m6A as discussed by the authors uses three encoding schemes that give the required contextual feature representation of the input RNA sequence; these contextual feature vectors individually go through several neural network layers for shallow feature extraction, after which they are concatenated into a single feature vector.
Abstract: N6-methyladenosine (m6A) is a common post-transcriptional alteration that plays a critical role in a variety of biological processes. Although experimental approaches for identifying m6A sites have been developed and deployed, they are currently too expensive for transcriptome-wide m6A identification. Some computational strategies for identifying m6A sites have been presented as an effective complement to the experimental procedure. However, their performance still requires improvement. In this study, we have proposed a novel tool called DL-m6A for the identification of m6A sites in mammals using deep learning based on different encoding schemes. The proposed tool uses three encoding schemes which give the required contextual feature representation of the input RNA sequence. These contextual feature vectors individually go through several neural network layers for shallow feature extraction, after which they are concatenated into a single feature vector. The concatenated feature map is then used by several other layers to extract the deep features so that informative characteristics of the sequence can be used for the prediction of m6A sites. The proposed tool is first evaluated on the tissue-specific dataset and later on a full-transcript dataset. To ensure the generalizability of the tool, we assessed the proposed model by training it on a full-transcript dataset and testing it on the tissue-specific dataset. The results achieved by the proposed model outperform the existing tools. The results demonstrate that the proposed tool can be of great use to biology experts, and therefore a freely accessible web server has been created, which can be accessed at: http://nsclbio.jbnu.ac.kr/tools/DL-m6A/ .

9 citations
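
The abstract above describes a multi-branch design: each encoding of the RNA sequence passes through its own shallow branch, the branch outputs are concatenated, and deeper layers produce the final prediction. The PyTorch sketch below illustrates only that general pattern; the encoding dimensions, layer sizes, and the class name `MultiEncodingNet` are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of a three-branch network in the spirit of DL-m6A:
# each encoding of the RNA sequence passes through its own shallow branch,
# the branch outputs are concatenated, and deeper layers produce the m6A score.
# Encoding dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiEncodingNet(nn.Module):
    def __init__(self, seq_len=41, enc_dims=(4, 3, 8), hidden=32):
        super().__init__()
        # one shallow 1D-CNN branch per encoding scheme
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(d, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),   # -> (batch, hidden, 1)
            )
            for d in enc_dims
        ])
        # deep layers applied to the concatenated feature vector
        self.deep = nn.Sequential(
            nn.Linear(hidden * len(enc_dims), 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),   # probability of an m6A site (via sigmoid)
        )

    def forward(self, encodings):
        # encodings: list of tensors, each (batch, enc_dim_i, seq_len)
        feats = [branch(x).squeeze(-1) for branch, x in zip(self.branches, encodings)]
        fused = torch.cat(feats, dim=1)
        return torch.sigmoid(self.deep(fused))

# toy forward pass with random inputs standing in for the three encodings
model = MultiEncodingNet()
xs = [torch.randn(2, d, 41) for d in (4, 3, 8)]
print(model(xs).shape)   # torch.Size([2, 1])
```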


Journal ArticleDOI
TL;DR: In this article, a two-stage deep learning-based framework leveraging DNA structural features, natural language processing, convolutional neural network, and long short-term memory was proposed to predict enhancer elements accurately in genomics data.
Abstract: An enhancer is a distal cis-regulatory element that controls gene expression. Experimental prediction of enhancer elements is time-consuming and expensive. Consequently, various inexpensive deep learning-based fast methods have been developed for predicting enhancers and determining their strength. In this paper, we have proposed a two-stage deep learning-based framework leveraging DNA structural features, natural language processing, convolutional neural network, and long short-term memory to predict enhancer elements accurately in genomics data. In the first stage, we extracted features from DNA sequence data by using three feature representation techniques, viz., k-mer-based feature extraction along with word2vec-based interpretation of underlying patterns, one-hot encoding, and the DNAshape technique. In the second stage, the strength of enhancers is predicted from the extracted features using a hybrid deep learning model. The method is capable of adapting itself to varying sizes of datasets. Also, as the proposed model can capture long-range sequence patterns, the robustness of the method remains unaffected by minor variations in the genomic sequence. The method outperforms other state-of-the-art methods at both stages in terms of the performance metrics of prediction accuracy, specificity, Matthews correlation coefficient, and area under the ROC curve. In summary, the proposed method is a reliable method for enhancer prediction.

9 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a Deep Attention Neural Network based Drug-Drug Interaction prediction framework, abbreviated as DANN-DDI, to predict unobserved drug-drug interactions.
Abstract: Drug-drug interactions are one of the main concerns in drug discovery. Accurate prediction of drug-drug interactions plays a key role in increasing the efficiency of drug research and safety when multiple drugs are co-prescribed. With various data sources that describe the relationships and properties between drugs, the comprehensive approach that integrates multiple data sources would be considerably effective in making high-accuracy prediction. In this paper, we propose a Deep Attention Neural Network based Drug-Drug Interaction prediction framework, abbreviated as DANN-DDI, to predict unobserved drug-drug interactions. First, we construct multiple drug feature networks and learn drug representations from these networks using the graph embedding method; then, we concatenate the learned drug embeddings and design an attention neural network to learn representations of drug-drug pairs; finally, we adopt a deep neural network to accurately predict drug-drug interactions. The experimental results demonstrate that our model DANN-DDI has improved prediction performance compared with state-of-the-art methods. Moreover, the proposed model can predict novel drug-drug interactions and drug-drug interaction-associated events.

9 citations


Journal ArticleDOI
TL;DR: In this article, an Orthogonal SoftMax Layer (OSL)-based Acute Leukemia detection model was proposed that consists of ResNet18-based deep feature extraction followed by efficient OSL-based classification.
Abstract: For the early diagnosis of hematological disorders like blood cancer, microscopic analysis of blood cells is very important. Traditional deep CNNs tend to overfit when trained on small medical image datasets such as ALLIDB1, ALLIDB2, and ASH. This paper proposes a new and effective model for classifying and detecting Acute Lymphoblastic Leukemia (ALL) or Acute Myelogenous Leukemia (AML) that delivers excellent performance on small medical datasets. Here, we have proposed a novel Orthogonal SoftMax Layer (OSL)-based Acute Leukemia detection model that consists of ResNet18-based deep feature extraction followed by efficient OSL-based classification. OSL is integrated with ResNet18 to improve the classification performance by making the weight vectors orthogonal to each other. Hence, it combines the benefits of ResNet (residual learning and identity mapping) with the benefits of OSL-based classification (improved feature discrimination capability and computational efficiency). Furthermore, we have introduced extra dropout and ReLU layers in the architecture to achieve a faster network with enhanced performance. The performance verification is performed on the standard ALLIDB1, ALLIDB2, and C_NMC_2019 datasets for efficient ALL detection and on the ASH dataset for effective AML detection. The experimental performance demonstrates the superiority of the proposed model over other competing models.

8 citations
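
The central idea above is a classification head whose class weight vectors are kept orthogonal to each other. A common way to approximate that behaviour is to penalize the off-diagonal entries of the Gram matrix of the final layer's weights; the sketch below shows that generic penalty in PyTorch. The class name `OrthogonalHead`, the penalty weight, and the feature size are assumptions, and the paper's exact OSL formulation may differ.

```python
# Illustrative PyTorch sketch: encourage the class weight vectors of the final
# layer to be orthogonal, as in OSL-style classification heads. The penalty
# below (off-diagonal energy of W W^T) is one common approximation; the
# published OSL may enforce orthogonality differently.
import torch
import torch.nn as nn

class OrthogonalHead(nn.Module):
    def __init__(self, in_features=512, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes, bias=False)

    def forward(self, feats):
        return self.fc(feats)                     # logits for a softmax / CE loss

    def orthogonality_penalty(self):
        w = self.fc.weight                        # (num_classes, in_features)
        gram = w @ w.t()                          # pairwise dot products of class weights
        off_diag = gram - torch.diag(torch.diag(gram))
        return (off_diag ** 2).sum()

# usage: total_loss = cross_entropy + lambda_orth * head.orthogonality_penalty()
head = OrthogonalHead()
feats = torch.randn(8, 512)                       # e.g., ResNet-18 features
logits = head(feats)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 4, (8,))) \
       + 1e-4 * head.orthogonality_penalty()
loss.backward()
```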



Journal ArticleDOI
TL;DR: Li et al. as mentioned in this paper proposed a hybrid two-stage teaching-learning-based optimization (TS-TLBO) algorithm to improve the performance of bioinformatics data classification, where potentially informative features, as well as noisy features, are selected to effectively reduce the search space.
Abstract: The “curse of dimensionality” brings new challenges to the feature selection (FS) problem, especially in the bioinformatics field. In this paper, we propose a hybrid Two-Stage Teaching-Learning-Based Optimization (TS-TLBO) algorithm to improve the performance of bioinformatics data classification. In the selection reduction stage, potentially informative features, as well as noisy features, are selected to effectively reduce the search space. In the following comparative self-learning stage, the teacher and the worst student with self-learning evolve together based on the duality of the FS problem to enhance the exploitation capabilities. In addition, an opposition-based learning strategy is utilized to generate initial solutions to rapidly improve the quality of the solutions. We further develop a self-adaptive mutation mechanism to improve the search performance by dynamically adjusting the mutation rate according to the teacher's convergence ability. Moreover, we integrate a differential evolution method with TLBO to boost the exploration ability of our algorithm. We conduct comparative experiments on 31 public datasets with different data dimensions, including 7 bioinformatics datasets, and evaluate our TS-TLBO algorithm against 11 related methods. The experimental results show that the TS-TLBO algorithm obtains a good feature subset with better classification performance, indicating its generality for FS problems.

6 citations
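
For readers unfamiliar with the underlying optimizer, the NumPy sketch below shows the basic teacher phase and learner phase of standard TLBO on a toy continuous objective. It is only the building block that TS-TLBO extends; the paper's two-stage design, opposition-based initialization, self-adaptive mutation, and differential-evolution hybridization are not shown, and the function name `tlbo` and all parameters are illustrative.

```python
# Minimal NumPy sketch of the basic teaching-learning-based optimization (TLBO)
# loop that TS-TLBO builds on: a teacher phase pulls solutions toward the best
# individual, and a learner phase lets pairs of solutions learn from each other.
import numpy as np

def tlbo(objective, dim=10, pop_size=20, iters=100, lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fit = np.array([objective(x) for x in pop])

    for _ in range(iters):
        teacher = pop[fit.argmin()]
        mean = pop.mean(axis=0)
        for i in range(pop_size):
            # teacher phase: move toward the teacher, away from the class mean
            tf = rng.integers(1, 3)                      # teaching factor in {1, 2}
            cand = np.clip(pop[i] + rng.random(dim) * (teacher - tf * mean), lo, hi)
            f = objective(cand)
            if f < fit[i]:
                pop[i], fit[i] = cand, f
            # learner phase: interact with a random classmate
            j = rng.integers(pop_size)
            if j == i:
                continue
            direction = pop[j] - pop[i] if fit[j] < fit[i] else pop[i] - pop[j]
            cand = np.clip(pop[i] + rng.random(dim) * direction, lo, hi)
            f = objective(cand)
            if f < fit[i]:
                pop[i], fit[i] = cand, f
    return pop[fit.argmin()], fit.min()

best_x, best_f = tlbo(lambda x: np.sum(x ** 2))          # toy sphere objective
print(best_f)
```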


Journal ArticleDOI
TL;DR: In this paper, a Prompt Deep Light-weight Vessel Segmentation Network (PLVS-Net) is proposed to improve the performance of the segmentation network while simultaneously decreasing the number of trainable parameters.
Abstract: Achieving accurate retinal vessel segmentation is critical in the progression and diagnosis of vision-threatening diseases such as diabetic retinopathy and age-related macular degeneration. Existing vessel segmentation methods are based on encoder-decoder architectures, which frequently fail to take the context of the retinal vessel structure into account in their analysis. As a result, such methods have difficulty bridging the semantic gap between encoder and decoder features. This paper proposes a Prompt Deep Light-weight Vessel Segmentation Network (PLVS-Net) to address these issues by using prompt blocks. Each prompt block uses a combination of asymmetric kernel convolutions, depth-wise separable convolutions, and ordinary convolutions to extract useful features. This novel strategy improves the performance of the segmentation network while simultaneously decreasing the number of trainable parameters. Our method outperformed competing approaches in the literature on three benchmark datasets, namely DRIVE, STARE, and CHASE.

6 citations
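
The abstract names three kinds of convolutions inside each prompt block. The hedged PyTorch sketch below combines asymmetric (1×3 / 3×1) kernels, a depth-wise separable convolution, and an ordinary 3×3 convolution in parallel branches; the channel counts, the fusion by 1×1 convolution, the residual connection, and the class name `PromptBlock` are assumptions made for illustration, not the published block.

```python
# Hedged PyTorch sketch of a "prompt block"-style unit combining asymmetric
# kernel convolutions, a depth-wise separable convolution, and an ordinary
# 3x3 convolution, with the branch outputs fused by a 1x1 convolution.
import torch
import torch.nn as nn

class PromptBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.asym = nn.Sequential(                       # asymmetric kernels
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
        )
        self.dw_sep = nn.Sequential(                     # depth-wise separable
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),            # point-wise
        )
        self.plain = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        branches = torch.cat([self.asym(x), self.dw_sep(x), self.plain(x)], dim=1)
        return self.fuse(branches) + x                   # residual connection (assumed)

block = PromptBlock()
print(block(torch.randn(1, 32, 64, 64)).shape)           # (1, 32, 64, 64)
```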


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a nonnegative matrix factorization (LSNMF) algorithm for layer-specific graph clustering in multi-layer networks, where the orthogonality constraint is imposed on the specific components to ensure the specificity of features of vertices.
Abstract: Multi-layer networks provide an effective and efficient tool to model and characterize complex systems with multiple types of interactions, which differ greatly from traditional single-layer networks. Graph clustering in multi-layer networks is highly non-trivial since it is difficult to balance the connectivity of clusters and the connections across the various layers. Current algorithms for layer-specific clusters are criticized for low accuracy and sensitivity to network perturbations. To overcome these issues, a novel algorithm for layer-specific modules in multi-layer networks based on nonnegative matrix factorization (LSNMF) is proposed by explicitly exploring the specific features of vertices. LSNMF first extracts features of vertices in multi-layer networks by using nonnegative matrix factorization (NMF) and then decomposes these features into common and specific components. An orthogonality constraint is imposed on the specific components to ensure the specificity of vertex features, which provides a better strategy to characterize and model the structure of layer-specific modules. Extensive experiments demonstrate that the proposed algorithm dramatically outperforms state-of-the-art baselines in terms of various measurements. Furthermore, LSNMF efficiently extracts stage-specific modules, which are more likely to be enriched in known functions and are also associated with the survival time of patients.

5 citations


Journal ArticleDOI
TL;DR: The scaling alignment-based phylogenetic placement (SCAMPP) as mentioned in this paper is a technique to extend the scalability of these likelihood-based placement methods to ultra-large backbone trees.
Abstract: Phylogenetic placement, the problem of placing a “query” sequence into a precomputed phylogenetic “backbone” tree, is useful for constructing large trees, performing taxon identification of newly obtained sequences, and other applications. The most accurate current methods, such as pplacer and EPA-ng, are based on maximum likelihood and require that the query sequence be provided within a multiple sequence alignment that includes the leaf sequences in the backbone tree. This approach enables high accuracy but also makes these likelihood-based methods computationally intensive on large backbone trees, and can even lead to them failing when the backbone trees are very large (e.g., having 50,000 or more leaves). We present SCAMPP (SCaling AlignMent-based Phylogenetic Placement), a technique to extend the scalability of these likelihood-based placement methods to ultra-large backbone trees. We show that pplacer-SCAMPP and EPA-ng-SCAMPP both scale well to ultra-large backbone trees (even up to 200,000 leaves), with accuracy that improves on APPLES and APPLES-2, two recently developed fast phylogenetic placement methods that scale to ultra-large datasets. EPA-ng-SCAMPP and pplacer-SCAMPP are available at https://github.com/chry04/PLUSplacer .

5 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose two deep learning approaches for ADHD classification based on functional magnetic resonance imaging (fMRI): one combining independent component analysis with a convolutional neural network, and a correlation autoencoder method that uses correlations between regions of interest of the brain as the input of an autoencoder to learn latent features, which are then used in the classification task by a new neural network.
Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is a type of mental health disorder that can be seen from children to adults and affects patients’ normal life. Accurate diagnosis of ADHD as early as possible is very important for the treatment of patients in clinical applications. Some traditional classification methods, although having been shown powerful in many other classification tasks, are not as successful in the application of ADHD classification. In this paper, we propose two novel deep learning approaches for ADHD classification based on functional magnetic resonance imaging. The first method incorporates independent component analysis with convolutional neural network. It first extracts independent components from each subject. The independent components are then fed into a convolutional neural network as input features to classify the ADHD patient from typical controls. The second method, called the correlation autoencoder method, uses correlations between regions of interest of the brain as the input of an autoencoder to learn latent features, which are then used in the classification task by a new neural network. These two methods use different ways to extract the inter-voxel information from fMRI, but both use convolutional neural networks to further extract predictive features for the classification task. Empirical experiments show that both methods are able to outperform the classical methods such as logistic regression, support vector machines, and other methods used in previous studies.

4 citations
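
The second approach in the abstract turns each subject's ROI time series into a correlation matrix and learns latent features from it with an autoencoder. The sketch below shows that generic pipeline; the number of ROIs, network sizes, and the classification head are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch of the "correlation autoencoder" idea: compute the ROI-by-ROI
# correlation matrix from fMRI time series, flatten its upper triangle, learn a
# latent representation with a small autoencoder, then classify from the latent code.
import numpy as np
import torch
import torch.nn as nn

def correlation_features(timeseries):
    # timeseries: (n_timepoints, n_rois) for one subject
    corr = np.corrcoef(timeseries.T)                     # (n_rois, n_rois)
    iu = np.triu_indices_from(corr, k=1)
    return corr[iu].astype(np.float32)                   # flattened upper triangle

n_rois = 90
feat_dim = n_rois * (n_rois - 1) // 2

autoencoder = nn.Sequential(                             # encoder followed by decoder
    nn.Linear(feat_dim, 256), nn.ReLU(),
    nn.Linear(256, 64),                                  # latent code
    nn.ReLU(),
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, feat_dim),
)
classifier = nn.Linear(64, 2)                            # ADHD vs. control head

x = torch.tensor(correlation_features(np.random.randn(200, n_rois))).unsqueeze(0)
recon = autoencoder(x)                                   # target of the reconstruction loss
latent = autoencoder[:3](x)                              # reuse the first three layers as encoder
print(recon.shape, classifier(latent).shape)
```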


Journal ArticleDOI
TL;DR: Li et al. as mentioned in this paper proposed an efficient network-based prediction algorithm, namely PPISB, based on a mixed membership stochastic blockmodel, which is able to capture the latent community structures.
Abstract: Protein-protein interactions (PPIs) play an essential role in most biological processes in cells. Many computational algorithms have thus been proposed to predict PPIs. However, most of them rest heavily on the biological information of proteins while ignoring the latent structural features of proteins presented in a PPI network. In this paper, we propose an efficient network-based prediction algorithm, namely PPISB, based on a mixed membership stochastic blockmodel. By simulating the generative process of a PPI network, PPISB is able to capture the latent community structures. The inference procedure adopted by PPISB further optimizes the membership distributions of proteins over different complexes. After that, a distance measure is designed to compute the similarity between two proteins in terms of their likelihoods of being in the same complex, thus verifying whether they interact with each other or not. To evaluate the performance of PPISB, a series of extensive experiments has been conducted with five PPI networks collected from different species, and the results demonstrate that PPISB has a promising performance when applied to predict PPIs in terms of several evaluation metrics. Hence, we reason that PPISB is preferable to state-of-the-art network-based prediction algorithms, especially for predicting potential PPIs.

Journal ArticleDOI
TL;DR: In this paper, graph neural networks (GNNs) are used to predict drug side-effects (DSEs) on a graph dataset integrating heterogeneous data, such as drug molecules and genes.
Abstract: Drug Side–Effects (DSEs) have a high impact on public health, care system costs, and drug discovery processes. Predicting the probability of side–effects, before their occurrence, is fundamental to reduce this impact, in particular on drug discovery. Candidate molecules could be screened before undergoing clinical trials, reducing the costs in time, money, and health of the participants. Drug side–effects are triggered by complex biological processes involving many different entities, from drug structures to protein–protein interactions. To predict their occurrence, it is necessary to integrate data from heterogeneous sources. In this work, such heterogeneous data is integrated into a graph dataset, expressively representing the relational information between different entities, such as drug molecules and genes. The relational nature of the dataset represents an important novelty for drug side–effect predictors. Graph Neural Networks (GNNs) are exploited to predict DSEs on our dataset with very promising results. GNNs are deep learning models that can process graph–structured data, with minimal information loss, and have been applied on a wide variety of biological tasks. Our experimental results confirm the advantage of using relationships between data entities, suggesting interesting future developments in this scope. The experimentation also shows the importance of specific subsets of data in determining associations between drugs and side–effects.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a novel deep neural network architecture involving a transfer learning approach, formed by freezing and concatenating all the layers up to the block4_pool layer of a pre-trained VGG16 model with the layers of a randomly initialized naïve Inception block module.
Abstract: In this paper, we have presented a novel deep neural network architecture involving a transfer learning approach, formed by freezing and concatenating all the layers up to the block4_pool layer of a pre-trained VGG16 model (at the lower level) with the layers of a randomly initialized naïve Inception block module (at the higher level). Further, we have added batch normalization, flatten, dropout, and dense layers in the proposed architecture. Our transfer network, called VGGIN-Net, facilitates the transfer of domain knowledge from the larger ImageNet object dataset to the smaller imbalanced breast cancer dataset. To improve the performance of the proposed model, regularization was used in the form of dropout and data augmentation. A detailed block-wise fine-tuning has been conducted on the proposed deep transfer network for images of different magnification factors. The results of extensive experiments indicate a significant improvement in classification performance after the application of fine-tuning. The proposed deep learning architecture with transfer learning and fine-tuning yields the highest accuracies in comparison to other state-of-the-art approaches for the classification of the BreakHis breast cancer dataset. The architecture is designed in a way that it can be effectively transfer-learned on other breast cancer datasets.
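
The general construction described above (freeze VGG16 up to block4_pool, append a naive Inception-style block, then batch-norm / flatten / dropout / dense layers) can be sketched in tf.keras as below. The filter counts, input size, and two-class head are assumptions made for illustration; the published VGGIN-Net may differ in these details.

```python
# Rough tf.keras sketch of the VGGIN-Net idea: a frozen VGG16 trunk cut at
# block4_pool, followed by a naive Inception block and a small dense head.
import tensorflow as tf
from tensorflow.keras import layers, Model

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False                              # freeze the transferred layers

x = base.get_layer("block4_pool").output                 # cut VGG16 at block4_pool

# naive Inception block: parallel 1x1 / 3x3 / 5x5 convolutions and 3x3 max-pooling
b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
b2 = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
b3 = layers.Conv2D(64, 5, padding="same", activation="relu")(x)
b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
x = layers.Concatenate()([b1, b2, b3, b4])

x = layers.BatchNormalization()(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(2, activation="softmax")(x)           # e.g., benign vs. malignant

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```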

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a novel framework called node2vec-based neural collaborative filtering for predicting miRNA-disease association (NCMD) based on deep neural networks.
Abstract: Numerous studies have reported that micro RNAs (miRNAs) play pivotal roles in disease pathogenesis based on the deregulation of the expressions of target messenger RNAs. Therefore, the identification of disease-related miRNAs is of great significance in understanding human complex diseases, which can also provide insight into the design of novel prognostic markers and disease therapies. Considering the time and cost involved in wet experiments, most recent works have focused on the effective and feasible modeling of computational frameworks to uncover miRNA-disease associations. In this study, we propose a novel framework called node2vec-based neural collaborative filtering for predicting miRNA-disease association (NCMD) based on deep neural networks. Initially, NCMD exploits Node2vec to learn low-dimensional vector representations of miRNAs and diseases. Next, it utilizes a deep learning framework that combines the linear ability of generalized matrix factorization and nonlinear ability of a multilayer perceptron. Experimental results clearly demonstrate the comparable performance of NCMD relative to the state-of-the-art methods according to statistical measures. In addition, case studies on breast cancer, lung cancer and pancreatic cancer validate the effectiveness of NCMD. Extensive experiments demonstrate the benefits of modeling a neural collaborative-filtering-based approach for discovering novel miRNA-disease associations.
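
The neural collaborative filtering core described above combines a generalized matrix factorization (GMF) path with an MLP path over miRNA and disease representations. The PyTorch sketch below illustrates that combination only; in the paper the representations come from node2vec, whereas here they are plain embedding tables, and all sizes and the class name `NCF` are assumptions.

```python
# Illustrative PyTorch sketch of the GMF + MLP fusion used in neural
# collaborative filtering for miRNA-disease association scoring.
import torch
import torch.nn as nn

class NCF(nn.Module):
    def __init__(self, n_mirna, n_disease, dim=32):
        super().__init__()
        self.m_emb = nn.Embedding(n_mirna, dim)
        self.d_emb = nn.Embedding(n_disease, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.out = nn.Linear(dim + 32, 1)                # fuse the GMF and MLP paths

    def forward(self, mirna_idx, disease_idx):
        m, d = self.m_emb(mirna_idx), self.d_emb(disease_idx)
        gmf = m * d                                      # element-wise product (linear part)
        mlp = self.mlp(torch.cat([m, d], dim=1))         # nonlinear part
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=1)))

model = NCF(n_mirna=500, n_disease=300)
score = model(torch.tensor([3, 10]), torch.tensor([7, 42]))
print(score.shape)                                       # torch.Size([2, 1])
```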

Journal ArticleDOI
TL;DR: In this article, a semi-supervised learning model is proposed that uses unlabeled scRNAseq cells and a limited number of labeled scRNAseq cells to perform cell identification.
Abstract: Cell type identification from single-cell transcriptomic data is a common goal of single-cell RNA sequencing (scRNAseq) data analysis. Deep neural networks have been employed to identify cell types from scRNAseq data with high performance. However, they require a large number of individual cells with accurate and unbiased annotated types to train the identification models. Unfortunately, labeling the scRNAseq data is cumbersome and time-consuming as it involves manual inspection of marker genes. To overcome this challenge, we propose a semi-supervised learning model, “SemiRNet”, that uses unlabeled scRNAseq cells and a limited number of labeled scRNAseq cells to perform cell identification. The proposed model is based on recurrent convolutional neural networks (RCNN) and includes a shared network, a supervised network, and an unsupervised network. The proposed model is evaluated on two large-scale single-cell transcriptomic datasets. It is observed that the proposed model is able to achieve encouraging performance by learning from a very limited number of labeled scRNAseq cells together with a large number of unlabeled scRNAseq cells.

Journal ArticleDOI
TL;DR: Zhao et al. as discussed by the authors proposed a deep learning-based model, named AttentionDTA, which uses an attention mechanism to predict drug-target affinities (DTAs), treating prediction as a regression problem rather than the binary classification of drug-target interactions (DTIs).
Abstract: The identification of drug–target relations (DTRs) is substantial in drug development. A large number of methods treat DTRs as drug–target interactions (DTIs), a binary classification problem. The main drawbacks of these methods are the lack of reliable negative samples and the absence of many important aspects of DTRs, including their dose dependence and quantitative affinities. With the increasing number of publications of drug–protein binding affinity data in recent years, DTR prediction can be viewed as a regression problem of drug–target affinities (DTAs), which reflect how tightly the drug binds to the target and can present more detailed and specific information than DTIs. The growth of affinity data enables the use of deep learning architectures, which have been shown to be among the state-of-the-art methods in binding affinity prediction. Although relatively effective, due to the black-box nature of deep learning, these models are less biologically interpretable. In this study, we proposed a deep learning-based model, named AttentionDTA, which uses an attention mechanism to predict DTAs. Different from models using 3D structures of drug–target complexes or graph representations of drugs and proteins, the novelty of our work is to use an attention mechanism to focus on key subsequences which are important in drug and protein sequences when predicting the affinity. We use two separate one-dimensional Convolutional Neural Networks (1D-CNNs) to extract the semantic information of the drug's SMILES string and the protein's amino acid sequence. Furthermore, a two-side multi-head attention mechanism is developed and embedded into our model to explore the relationship between drug features and protein features. We evaluate our model on three established DTA benchmark datasets, Davis, Metz, and KIBA. AttentionDTA outperforms the state-of-the-art deep learning methods under different evaluation metrics. The results show that the attention-based model can effectively extract protein features related to drug information and drug features related to protein information to better predict drug–target affinities. It is worth mentioning that we test our model on an IC50 dataset, which provides the binding sites between drugs and proteins, to evaluate the ability of our model to locate binding sites. Finally, we visualize the attention weights to demonstrate the biological significance of the model. The source code of AttentionDTA can be downloaded from https://github.com/zhaoqichang/AttentionDTA_TCBB .
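
The sketch below illustrates the overall shape of such a pipeline: one 1D-CNN over a tokenized SMILES string, one over the protein sequence, and multi-head attention relating the two before regressing an affinity value. Only one attention direction of the paper's two-side scheme is shown, and the vocabulary sizes, channel widths, pooling, and class name `DTARegressor` are illustrative assumptions.

```python
# Hedged PyTorch sketch of a dual 1D-CNN + cross-attention affinity regressor.
import torch
import torch.nn as nn

class DTARegressor(nn.Module):
    def __init__(self, smiles_vocab=70, prot_vocab=26, dim=64):
        super().__init__()
        self.smi_emb = nn.Embedding(smiles_vocab, dim, padding_idx=0)
        self.pro_emb = nn.Embedding(prot_vocab, dim, padding_idx=0)
        self.smi_cnn = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.pro_cnn = nn.Conv1d(dim, dim, kernel_size=7, padding=3)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, smiles_ids, protein_ids):
        s = torch.relu(self.smi_cnn(self.smi_emb(smiles_ids).transpose(1, 2))).transpose(1, 2)
        p = torch.relu(self.pro_cnn(self.pro_emb(protein_ids).transpose(1, 2))).transpose(1, 2)
        # drug positions attend to protein positions (one side of a two-side scheme)
        s_ctx, _ = self.attn(query=s, key=p, value=p)
        drug_vec = s_ctx.mean(dim=1)                     # pool over SMILES positions
        prot_vec = p.mean(dim=1)                         # pool over residues
        return self.head(torch.cat([drug_vec, prot_vec], dim=1))  # predicted affinity

model = DTARegressor()
print(model(torch.randint(1, 70, (2, 100)), torch.randint(1, 26, (2, 500))).shape)
```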

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a slice grouped domain attention (SGDA) module, which works in the axial, coronal, and sagittal directions, to enhance the generalization capability of pulmonary nodule detection networks.
Abstract: Lung cancer is the leading cause of cancer death worldwide. The best solution for lung cancer is to diagnose the pulmonary nodules in the early stage, which is usually accomplished with the aid of thoracic computed tomography (CT). As deep learning thrives, convolutional neural networks (CNNs) have been introduced into pulmonary nodule detection to help doctors in this labor-intensive task and demonstrated to be very effective. However, the current pulmonary nodule detection methods are usually domain-specific, and cannot satisfy the requirement of working in diverse real-world scenarios. To address this issue, we propose a slice grouped domain attention (SGDA) module to enhance the generalization capability of the pulmonary nodule detection networks. This attention module works in the axial, coronal, and sagittal directions. In each direction, we divide the input feature into groups, and for each group, we utilize a universal adapter bank to capture the feature subspaces of the domains spanned by all pulmonary nodule datasets. Then the bank outputs are combined from the perspective of domain to modulate the input group. Extensive experiments demonstrate that SGDA enables substantially better multi-domain pulmonary nodule detection performance compared with the state-of-the-art multi-domain learning methods.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a two-layer predictor called Enhancer-FRL for identifying enhancers (enhancers or nonenhancers) and their activities (strong and weak).
Abstract: Enhancers are crucial for the precise regulation of gene expression, while enhancer identification and strength prediction are challenging because of their free distribution and the tremendous number of similar fractions in the genome. Although several bioinformatics tools have been developed, shortfalls in these models remain, and their performance needs further improvement. In the present study, a two-layer predictor called Enhancer-FRL was proposed for identifying enhancers (enhancers or nonenhancers) and their activities (strong or weak). More specifically, to build an efficient model, a feature representation learning scheme was applied to generate a 50D probabilistic vector based on 10 feature encodings and five machine learning algorithms. Subsequently, the multiview probabilistic features were integrated to construct the final prediction model. Compared with the single feature-based model, Enhancer-FRL showed significant performance improvement and model robustness. Performance assessment on the independent test dataset indicated that the proposed model outperformed state-of-the-art available toolkits. The web server Enhancer-FRL is freely accessible at http://lab.malab.cn/∼wangchao/softwares/Enhancer-FRL/ . The code and datasets can be downloaded from the web server page or from GitHub at https://github.com/wangchao-malab/Enhancer-FRL/ .
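
The feature-representation-learning step above builds a probabilistic feature vector from (encoding, base classifier) pairs via cross-validation and then trains a final model on it. The scikit-learn sketch below shows that generic stacking pattern with two toy encodings and two base learners instead of the paper's 10 encodings and five algorithms; the data, learners, and final SVM are placeholders, not the published setup.

```python
# Sketch of cross-validated probabilistic feature stacking: each
# (encoding, base classifier) pair contributes one out-of-fold probability
# column, and a final model is trained on the stacked columns.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
encodings = {                                            # stand-ins for k-mer, PseKNC, ...
    "enc_a": rng.normal(size=(200, 40)),
    "enc_b": rng.normal(size=(200, 25)),
}
base_learners = [RandomForestClassifier(n_estimators=100, random_state=0),
                 LogisticRegression(max_iter=1000)]

prob_features = []
for X in encodings.values():
    for clf in base_learners:
        # out-of-fold probability of the positive class, one column per pair
        p = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
        prob_features.append(p)
prob_features = np.column_stack(prob_features)           # (n_samples, n_pairs)

final_model = SVC(probability=True).fit(prob_features, y)
print(prob_features.shape, final_model.score(prob_features, y))
```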

Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview of recent computational methods for the detection of protein complexes and functional modules in protein-protein interaction networks, with a focus on some of their applications.
Abstract: The ability to identify and characterize not only the protein-protein interactions but also their internal modular organization through network analysis is fundamental for understanding the mechanisms of biological processes at the molecular level. Indeed, the detection of the network communities can enhance our understanding of the molecular basis of disease pathology, and promote drug discovery and disease treatment in personalized medicine. This work gives an overview of recent computational methods for the detection of protein complexes and functional modules in protein-protein interaction networks, also providing a focus on some of its applications. We propose a systematic reformulation of frequently adopted taxonomies for these methods, also proposing new categories to keep up with the most recent research. We review the literature of the last five years (2017-2021) and provide links to existing data and software resources. Finally, we survey recent works exploiting module identification and analysis, in the context of a variety of disease processes for biomarker identification and therapeutic target detection. Our review provides the interested reader with an up-to-date and self-contained view of the existing research, with links to state-of-the-art literature and resources, as well as hints on open issues and future research directions in complex detection and its applications.

Journal ArticleDOI
TL;DR: In this article, a 64×21 substitution matrix is fitted to sequence data, automatically learning the genetic code; subtly homologous regions are detected by considering alternative possible alignments between them, and significance (the probability of occurring by chance between random sequences) is calculated.
Abstract: Protein fossils, i.e. noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and > 10× faster. Of the ~7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally. This is an extended version of a conference paper [1].

Journal ArticleDOI
TL;DR: In this article, the authors utilized deep transfer learning with varying experimental analysis for reliable classification of Alzheimer's disease (AD) using the genome-wide association studies (GWAS) dataset.
Abstract: Alzheimer's disease (AD) is a type of brain disorder that is regarded as a degenerative disease because the corresponding symptoms worsen as time progresses. Single nucleotide polymorphisms (SNPs) have been identified as relevant biomarkers for this condition. This study aims to identify SNP biomarkers associated with AD in order to perform a reliable classification of AD. In contrast to existing related works, we utilize deep transfer learning with varying experimental analysis for reliable classification of AD. For this purpose, a convolutional neural network (CNN) is first trained on the genome-wide association studies (GWAS) dataset requested from the AD Neuroimaging Initiative. We then employ deep transfer learning for further training of our CNN (as the base model) on a different AD GWAS dataset, to extract the final set of features. The extracted features are then fed into a Support Vector Machine for classification of AD. Detailed experiments are performed using multiple datasets and varying experimental configurations. The statistical outcomes indicate an accuracy of 89%, which is a significant improvement when benchmarked against existing related works.
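
The final step of the pipeline above feeds CNN-derived features into an SVM. The sketch below shows that generic "deep features + SVM" pattern on toy SNP-like data; the small 1D-CNN, the 0/1/2 genotype encoding, and the class name `SNPFeatureExtractor` are assumptions standing in for the transferred base model.

```python
# Minimal sketch of CNN feature extraction followed by SVM classification.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class SNPFeatureExtractor(nn.Module):
    def __init__(self, n_snps=1000, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):                                 # x: (batch, 1, n_snps)
        return self.net(x)

extractor = SNPFeatureExtractor().eval()                  # would be (transfer-)trained first
genotypes = torch.randint(0, 3, (50, 1, 1000)).float()    # 0/1/2 allele counts (toy data)
labels = torch.randint(0, 2, (50,)).numpy()

with torch.no_grad():
    feats = extractor(genotypes).numpy()                  # deep features for the SVM
svm = SVC(kernel="rbf").fit(feats, labels)
print(svm.score(feats, labels))
```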

Journal ArticleDOI
TL;DR: Gliomanet as discussed by the authors proposes a convolutional neural network (CNN)-based framework for non-invasive grading of tumors from 3D MRI scans, which leverages the spatial and channel attention modules to recalibrate the feature maps across the layers.
Abstract: Glioma has emerged as the deadliest form of brain tumor for human beings. Timely diagnosis of these tumors is a major step towards effective oncological treatment. Magnetic Resonance Imaging (MRI) typically offers a non-invasive inspection of brain lesions. However, manual inspection of tumors from MRI scans requires a large amount of time and is also an error-prone process. Therefore, automated diagnosis of tumors plays a crucial role in clinical management and surgical interventions of gliomas. In this study, we propose a Convolutional Neural Network (CNN)-based framework for non-invasive grading of tumors from 3D MRI scans. The proposed framework incorporates two novel CNN architectures. The first CNN architecture performs the segmentation of tumors from multimodal MRI volumes. The proposed segmentation network leverages spatial and channel attention modules to recalibrate the feature maps across the layers. The second network utilizes a multi-task learning strategy to perform classification based on three glioma grading tasks: characterization of the tumor as low-grade or high-grade, identification of 1p/19q status, and identification of Isocitrate Dehydrogenase (IDH) status. We have carried out several experiments to evaluate the performance of our method. Extensive experimental observations indicate that the proposed framework achieves better performance than several state-of-the-art methods. We have also executed Welch's t-test to show the statistical significance of the grading results. The source code of this study is available at https://github.com/prasunc/Gliomanet.

Journal ArticleDOI
TL;DR: GUSignal as discussed by the authors is a free informatics tool that aims to be fast and systematic during image analysis, since it executes specific, ordered instructions to offer a segmented analysis by areas or regions of interest, providing quantitative results of image intensity levels.
Abstract: The uidA gene codes for a glucuronidase (GUS) enzyme, which has been used as a biotechnological tool over the last years. When the uidA gene is fused to a gene's promoter region, it is possible to evaluate the activity of that promoter in response to a stimulus. Arabidopsis thaliana has served as the biological platform to elucidate molecular and regulatory signaling responses in plants. Transgenic lines of A. thaliana, tagged with the uidA gene, have helped explain how plants modify their hormonal pathways depending on the environmental conditions. Although the information extracted from microscopic images of these transgenic plants is often qualitative and in many publications is not subjected to quantification, in this paper we report the development of an informatics tool focused on computer vision for processing and analysis of digital images in order to analyze the expression of the GUS signal in A. thaliana roots, which is strongly correlated with the intensity of the grayscale images. This means that the presence of the GUS-induced color indicates where the gene has been actively expressed, as our statistical analysis demonstrated after treatment of A. thaliana DR5::GUS with naphthalene-acetic acid (0.0001 mM and 1 mM). GUSignal is a free informatics tool that aims to be fast and systematic during image analysis, since it executes specific, ordered instructions to offer a segmented analysis by areas or regions of interest, providing quantitative results of image intensity levels.

Journal ArticleDOI
TL;DR: In this paper, a computational framework named BRMCF has been proposed for analysing the capability of chemical and biological properties of drugs to predict drug functions, in view of the multi-label nature of the problem.
Abstract: In silico machine learning-based prediction of drug functions considering drug properties would substantially enhance the speed and reduce the cost of identifying promising drug leads. Different drug properties differ in their capability to predict drug functions, so assessing this capability is advantageous in drug discovery. The task of drug function prediction is multi-label in nature because, for several drugs, multiple functions are associated with a single drug. A number of existing works have ignored this inherent multi-label nature of the problem in the context of addressing the issue of class imbalance. In the present work, a computational framework named BRMCF has been proposed for analysing the capability of chemical and biological properties of drugs to predict drug functions in view of the multi-label nature of the problem. It employs the Binary Relevance (BR) approach along with five base classifiers for handling the multi-label prediction task and MLSMOTE for addressing the issue of class imbalance. The proposed framework has been validated and compared with BR, Classifier Chains (CC) and a Deep Neural Network (DNN) method on four drug properties datasets: a SMILES Strings (SS) dataset, a 17 Molecular Descriptors (17MD) dataset, a Protein Sequences (PS) dataset, and a drug-perturbed Gene EXpression Profiles (GEX) dataset. The analysis of results shows that the proposed framework BRMCF has outperformed BR, CC and the DNN method in terms of exact match ratio, precision, recall, F1-score, and ROC-AUC, which signifies the effectiveness of MLSMOTE. Further, the prediction capability of different drug properties is assessed, and they are ranked as SS > GEX > PS > 17MD. Additionally, the visualization and analysis of drug function co-occurrences signify the appropriateness of the proposed framework for drug function co-occurrence detection and for signaling possible new drug leads, where the detection rate varies from 94.34% to 99.61%.
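
Binary Relevance simply trains one independent binary classifier per drug-function label. The scikit-learn sketch below shows that strategy on toy multi-label data; the MLSMOTE oversampling step used in the paper is omitted, and the base learner and data are placeholders.

```python
# Sketch of the Binary Relevance (BR) strategy: one independent binary
# classifier per label column of a multi-label target matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))                            # e.g., molecular descriptors
Y = (rng.random((300, 6)) < 0.2).astype(int)              # 6 drug-function labels

# MultiOutputClassifier fits one LogisticRegression per label column, i.e. BR
br_model = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
Y_pred = br_model.predict(X)
print("micro-F1:", f1_score(Y, Y_pred, average="micro"))
```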

Journal ArticleDOI
TL;DR: In this article, a metaheuristic algorithm named Chemical Reaction Optimization (CRO) is combined with a machine learning method to identify essential proteins; given the large amount of available biological information, such computational methods have become popular for identifying essential proteins.
Abstract: Essential proteins play a crucial role in the survival, development, and reproduction of an organism, as well as in understanding the working processes of the cell, studying diseases, and designing drugs. Due to the large amount of available biological information, computational methods have become popular in recent times for identifying essential proteins. Many computational methods use machine learning techniques, metaheuristic algorithms, etc. to solve the problem. The problem with these methods is that the essential protein class prediction rate is still low. Many of these methods have not considered the imbalanced characteristics of the dataset. In this paper, we have proposed an approach to identify essential proteins using a metaheuristic algorithm named Chemical Reaction Optimization (CRO) and a machine learning method. Both topological and biological features are used here. The Saccharomyces cerevisiae (S. cerevisiae) and Escherichia coli (E. coli) datasets are used in the experiment. Topological features are calculated from the PPI network data. Composite features are calculated from the collected features. The Synthetic Minority Over-sampling Technique with Edited Nearest Neighbor (SMOTE+ENN) is applied to balance the dataset, and then the CRO algorithm is applied to achieve the optimal number of features. Our experiments show that the proposed approach gives better results in both accuracy and F-measure than the existing related methods.

Journal ArticleDOI
TL;DR: C-SHIFT as mentioned in this paper is a covariance shift normalization algorithm that uses optimization techniques together with the blessing of dimensionality philosophy and energy minimization hypothesis for covariance matrix recovery under additive noise.
Abstract: Omics technologies are powerful tools for analyzing patterns in gene expression data for thousands of genes. Due to a number of systematic variations in experiments, the raw gene expression data is often obfuscated by undesirable technical noises. Various normalization techniques were designed in an attempt to remove these non-biological errors prior to any statistical analysis. One of the reasons for normalizing data is the need for recovering the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique, called the covariance shift (C-SHIFT) method. This normalization algorithm uses optimization techniques together with the blessing of dimensionality philosophy and energy minimization hypothesis for covariance matrix recovery under additive noise (in biology, known as the bias). Thus, it is perfectly suited for the analysis of logarithmic gene expression data. Numerical experiments on synthetic data demonstrate the method's advantage over the classical normalization techniques. Namely, the comparison is made with Rank, Quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization methods. We also evaluate the performance of C-SHIFT algorithm on real biological data.

Journal ArticleDOI
TL;DR: DeepPFP-CO as mentioned in this paper uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of Gene Ontology (GO) terms to improve the protein function prediction performance.
Abstract: The understanding of protein functions is critical to many biological problems such as the development of new drugs and new crops. To reduce the huge gap between the increase of protein sequences and annotations of protein functions, many methods have been proposed to deal with this problem. These methods use Gene Ontology (GO) to classify the functions of proteins and consider one GO term as a class label. However, they ignore the co-occurrence of GO terms that is helpful for protein function prediction. We propose a new deep learning model, named DeepPFP-CO, which uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of GO terms to improve the protein function prediction performance. In this way, we can further deduce the protein functions by fusing the predicted propensity of the center function and its co-occurrence functions. We use Fmax and AUPR to evaluate the performance of DeepPFP-CO and compare DeepPFP-CO with state-of-the-art methods such as DeepGOPlus and DeepGOA. The computational results show that DeepPFP-CO outperforms DeepGOPlus and other methods. Moreover, we further analyze our model at the protein level. The results have demonstrated that DeepPFP-CO improves the performance of protein function prediction. DeepPFP-CO is available at https://csuligroup.com/DeepPFP/.
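
One way to picture the GCN ingredient described above is a graph convolution over a GO-term co-occurrence adjacency matrix that produces label embeddings, which are then matched against a protein feature vector to score every GO term. The PyTorch sketch below shows that generic construction; the normalization, matrix sizes, scoring rule, and class name `GOGCNScorer` are assumptions for illustration, not the published DeepPFP-CO model.

```python
# Hedged sketch: two-layer graph convolution over a GO co-occurrence graph,
# followed by a dot-product scoring of protein features against label embeddings.
import torch
import torch.nn as nn

def normalize_adj(adj):
    # symmetric normalization D^{-1/2} (A + I) D^{-1/2}
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class GOGCNScorer(nn.Module):
    def __init__(self, n_terms, term_dim=64, prot_dim=64):
        super().__init__()
        self.term_emb = nn.Parameter(torch.randn(n_terms, term_dim) * 0.01)
        self.gcn1 = nn.Linear(term_dim, term_dim)
        self.gcn2 = nn.Linear(term_dim, prot_dim)

    def forward(self, prot_feat, adj_norm):
        h = torch.relu(adj_norm @ self.gcn1(self.term_emb))   # propagate co-occurrence
        label_emb = adj_norm @ self.gcn2(h)                   # (n_terms, prot_dim)
        logits = prot_feat @ label_emb.t()                    # score every GO term
        return torch.sigmoid(logits)

n_terms = 100
cooc = (torch.rand(n_terms, n_terms) > 0.9).float()          # toy co-occurrence graph
cooc = ((cooc + cooc.t()) > 0).float()
adj_norm = normalize_adj(cooc)

model = GOGCNScorer(n_terms)
protein_features = torch.randn(4, 64)                        # e.g., from a sequence encoder
print(model(protein_features, adj_norm).shape)               # (4, 100)
```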

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a robust and privacy-preserving decentralized deep federated learning (RPDFL) training scheme that uses a ring FL structure and a Ring-Allreduce-based data sharing scheme to improve communication efficiency.
Abstract: Federated learning of deep neural networks has emerged as an evolving paradigm for distributed machine learning, gaining widespread attention due to its ability to update parameters without collecting raw data from users, especially in digital healthcare applications. However, the traditional centralized architecture of federated learning suffers from several problems (e.g., single point of failure, communication bottlenecks, etc.), especially malicious servers inferring gradients and causing gradient leakage. To tackle the above issues, we propose a robust and privacy-preserving decentralized deep federated learning (RPDFL) training scheme. Specifically, we design a novel ring FL structure and a Ring-Allreduce-based data sharing scheme to improve the communication efficiency in RPDFL training. Furthermore, we improve the parameter distribution process of the Chinese remainder theorem to update the execution process of the threshold secret sharing, allowing healthcare edge nodes to drop out during the training process without causing data leakage and ensuring the robustness of RPDFL training under the Ring-Allreduce-based data sharing scheme. Security analysis indicates that RPDFL is provably secure. Experiment results show that RPDFL is significantly superior to standard FL methods in terms of model accuracy and convergence, and is suitable for digital healthcare applications.
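
For readers unfamiliar with the communication pattern the scheme builds on, the pure-Python sketch below simulates plain Ring-Allreduce: each of N nodes repeatedly passes a chunk of its gradient vector to its neighbour so that, after 2(N-1) steps, every node holds the sum of all gradients. The cryptographic protection, secret sharing, and dropout handling described in the abstract are deliberately omitted; this is only the underlying collective operation.

```python
# Toy simulation of Ring-Allreduce: a reduce-scatter phase followed by an
# all-gather phase over a logical ring of nodes.
import numpy as np

def ring_allreduce(grads):
    grads = [g.astype(float).copy() for g in grads]
    n = len(grads)
    chunks = [np.array_split(g, n) for g in grads]        # one chunk per node

    # phase 1: reduce-scatter - accumulate sums chunk by chunk around the ring
    for step in range(n - 1):
        for node in range(n):
            send_idx = (node - step) % n
            dst = (node + 1) % n
            chunks[dst][send_idx] = chunks[dst][send_idx] + chunks[node][send_idx]
    # phase 2: all-gather - circulate the fully reduced chunks around the ring
    for step in range(n - 1):
        for node in range(n):
            send_idx = (node + 1 - step) % n
            dst = (node + 1) % n
            chunks[dst][send_idx] = chunks[node][send_idx].copy()
    return [np.concatenate(c) for c in chunks]

local_grads = [np.full(8, i + 1.0) for i in range(4)]     # 4 nodes, toy gradients
reduced = ring_allreduce(local_grads)
print(reduced[0])                                          # every node holds 1+2+3+4 = 10
```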

Journal ArticleDOI
TL;DR: SA-Net as mentioned in this paper utilizes k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features, which achieves state-of-the-art results on the RBP-24 dataset.
Abstract: Proteins binding to Ribonucleic Acid (RNA) inside cells are called RNA-binding proteins (RBP), which play a crucial role in gene regulation. The identification of RNA-protein binding sites helps to understand the function of RBP better. Although many computational methods have been developed to predict RNA-protein binding sites, their prediction accuracy on small sample datasets needs improvement. To overcome this limitation, we propose a novel model called SA-Net, which utilizes k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features. K-mer embedding assists the model to discover significant subsequence fragments associated with binding sites. The self-attention mechanism captures contextual information from the entire input sequence globally, performing well in small sample sequence learning. Experimental results demonstrate that SA-Net attains state-of-the-art results on the RBP-24 dataset. We find that 4-mer embedding aids the model to achieve optimal performance. We also show that the self-attention network outperforms the commonly used CNN and CNN-BLSTM models in sequence feature extraction.
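
The two ingredients named above, k-mer embedding and self-attention, can be sketched as follows: RNA sequences are tokenized into overlapping 4-mers (the value the authors found optimal), each 4-mer is embedded, and a self-attention layer summarizes the sequence for binding-site classification. The vocabulary indexing, layer sizes, and class name `SelfAttentionClassifier` are illustrative assumptions, not the published SA-Net.

```python
# Minimal PyTorch sketch of k-mer tokenization + self-attention classification.
from itertools import product
import torch
import torch.nn as nn

K = 4
BASES = "ACGU"
VOCAB = {"".join(p): i for i, p in enumerate(product(BASES, repeat=K))}

def kmer_tokenize(seq, k=K):
    # overlapping k-mers mapped to integer ids
    return torch.tensor([VOCAB[seq[i:i + k]] for i in range(len(seq) - k + 1)])

class SelfAttentionClassifier(nn.Module):
    def __init__(self, vocab_size=len(VOCAB), dim=64, heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, 1)

    def forward(self, tokens):                            # tokens: (batch, n_kmers)
        x = self.emb(tokens)
        ctx, _ = self.attn(x, x, x)                       # self-attention over k-mers
        return torch.sigmoid(self.fc(ctx.mean(dim=1)))    # bound / not bound

seq = "AUGGCUACGUAGCUAGCGAU"
model = SelfAttentionClassifier()
print(model(kmer_tokenize(seq).unsqueeze(0)))             # probability for one sequence
```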

Journal ArticleDOI
TL;DR: TransCrispr as mentioned in this paper integrates Transformer and convolutional neural network (CNN) architecture to predict sgRNA knockout efficacy, which outperforms state-of-the-art methods in terms of prediction accuracy.
Abstract: CRISPR/Cas9 is a widely used genome editing tool for site-directed modification of deoxyribonucleic acid (DNA) nucleotide sequences. However, how to accurately predict and evaluate the on- and off-target effects of single guide RNA (sgRNA) is one of the key problems for CRISPR/Cas9 system. Using computational methods to obtain high cell-specific sensitivity and specificity is a prerequisite for the optimal design of sgRNAs. Inspired by the work of predecessors, we found that sgRNA on-target knockout efficacy was not only related to the original sequence but also affected by important biological features. Hence, we introduce a novel approach called TransCrispr, which integrates Transformer and convolutional neural network (CNN) architecture to predict sgRNA knockout efficacy. Firstly, we encode the sequence data and send the transformed sgRNA sequence, positional information, and biological features into the network as input. Then, the convolutional neural network will automatically learn an appropriate feature representation for the sgRNA sequence and combine it with the positional information for self-attention learning of the Transformer. Finally, a regression score is generated by predicting biological features. Experiments on seven public datasets illustrate that TransCrispr outperforms state-of-the-art methods in terms of prediction accuracy and generalization ability.
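
The abstract describes a CNN learning local sgRNA representations, a Transformer adding positional self-attention, and biological features joining before a regression score. The PyTorch sketch below mirrors only that overall flow; all layer sizes, the fusion scheme, and the class name `SgRNARegressor` are assumptions made for illustration, not the published TransCrispr configuration.

```python
# Hedged PyTorch sketch of a Transformer+CNN regressor for sgRNA efficacy.
import torch
import torch.nn as nn

class SgRNARegressor(nn.Module):
    def __init__(self, seq_len=23, dim=64, n_bio_feats=11):
        super().__init__()
        self.emb = nn.Embedding(4, dim)                   # A/C/G/T tokens
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        self.cnn = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Sequential(nn.Linear(dim + n_bio_feats, 64),
                                  nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tokens, bio_feats):
        x = self.emb(tokens) + self.pos                   # (batch, seq_len, dim)
        x = torch.relu(self.cnn(x.transpose(1, 2))).transpose(1, 2)
        x = self.encoder(x).mean(dim=1)                   # pooled sequence representation
        return self.head(torch.cat([x, bio_feats], dim=1))  # knockout efficacy score

model = SgRNARegressor()
tokens = torch.randint(0, 4, (8, 23))                     # eight 23-nt sgRNA sequences
bio = torch.randn(8, 11)                                  # e.g., thermodynamic features
print(model(tokens, bio).shape)                           # torch.Size([8, 1])
```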