Showing papers in "IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2020"


Journal Article
TL;DR: A set-theory-based rule is presented that combines several feature selection methods to exploit their collective strengths; the reduced model, using about half of the original features, performs better than models based on individual feature selection methods and achieves accuracy, sensitivity, and specificity of 1.000 each.
Abstract: Chronic Kidney Disease (CKD) is a menace that is affecting 10 percent of the world population and 15 percent of the South African population. The early and cheap diagnosis of this disease with accuracy and reliability will save 20,000 lives in South Africa per year. Scientists are developing smart solutions with Artificial Intelligence (AI). In this paper, several typical and recent AI algorithms are studied in the context of CKD, and extreme gradient boosting (XGBoost) is chosen as our base model for its high performance. Then, the model is optimized, and the optimal full model trained on all the features achieves a testing accuracy, sensitivity, and specificity of 1.000, 1.000, and 1.000, respectively. Note that, to cover the widest range of people, the time and monetary costs of CKD diagnosis have to be minimized with the fewest patient tests. Thus, a reduced model using fewer features is desirable, provided it still maintains high performance. To this end, a set-theory-based rule is presented that combines several feature selection methods to exploit their collective strengths. The reduced model, using about half of the original features, performs better than the models based on individual feature selection methods and achieves accuracy, sensitivity, and specificity of 1.000, 1.000, and 1.000, respectively.

314 citations
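
The abstract above does not spell out its set-theory based rule, so the following is only a minimal sketch of one plausible reading: treat each feature selection method's output as a set and keep the features chosen by at least a given number of methods (union, majority vote, or intersection). The function name `combine_selections`, the `min_votes` parameter, and the feature names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def combine_selections(selections, min_votes):
    """Keep features picked by at least `min_votes` of the selection methods.

    min_votes=1 gives the union, min_votes=len(selections) gives the
    intersection, and values in between give a voting rule.
    """
    votes = Counter(f for selected in selections for f in set(selected))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three feature selection methods (illustrative only).
method_a = {"hemoglobin", "albumin", "serum_creatinine"}
method_b = {"hemoglobin", "serum_creatinine", "blood_pressure"}
method_c = {"hemoglobin", "albumin", "age"}
selections = [method_a, method_b, method_c]

union = combine_selections(selections, 1)
majority = combine_selections(selections, 2)
intersection = combine_selections(selections, 3)

print("union:", union)
print("majority:", majority)
print("intersection:", intersection)
```

A stricter threshold yields a smaller reduced model; which threshold keeps performance high would have to be checked empirically on the data at hand.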


Journal Article
TL;DR: D-Leaf can be an effective automated system for plant species identification, as shown by the experimental results, and the features extracted by the CNN are found to fit well with the ANN classifier.
Abstract: An automated plant species identification system could help botanists and laymen identify plant species rapidly. Deep learning is robust for feature extraction as it is superior in providing deeper information of images. In this research, a new CNN-based method named D-Leaf was proposed. The leaf images were pre-processed and the features were extracted by using three different Convolutional Neural Network (CNN) models, namely pre-trained AlexNet, fine-tuned AlexNet, and D-Leaf. These features were then classified by using five machine learning techniques, namely, Support Vector Machine (SVM), Artificial Neural Network (ANN), k-Nearest-Neighbor (k-NN), Naive-Bayes (NB), and CNN. A conventional morphometric method, which computed morphological measurements based on the Sobel-segmented veins, was employed for benchmarking purposes. The D-Leaf model achieved a testing accuracy of 94.88 percent, comparable to the AlexNet (93.26 percent) and fine-tuned AlexNet (95.54 percent) models. In addition, CNN models performed better than the traditional morphometric measurements (66.55 percent). The features extracted from the CNN are found to fit well with the ANN classifier. D-Leaf can be an effective automated system for plant species identification, as shown by the experimental results.

119 citations


Journal Article
TL;DR: This work compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data.
Abstract: Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology, which paves the way for measuring RNA levels at single cell resolution to study precise biological functions. One of the main challenges when analyzing scRNA-seq data is the presence of zeros or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian-based method was proposed in 2016. However, these methods have shown very diverse characteristics in terms of model hypothesis and imputation performance. Thus, a large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. Simulated datasets and case studies highlight that no single method performs best in all situations. Some defects of these methods, such as scalability, robustness, and unavailability in some situations, need to be addressed in future studies.

99 citations


Journal Article
Jiawei Luo, Yahui Long
TL;DR: NTSHMDA has the potential to identify novel disease-microbe associations and can also provide valuable information for drug discovery and biological research.
Abstract: Accumulating clinical evidence has demonstrated that the microbes residing in human bodies play a significantly important role in the formation, development, and progression of various complex human diseases. Identifying latent related microbes for diseases could provide insight into human disease mechanisms and promote disease prevention, diagnosis, and treatment. In this paper, we first construct a heterogeneous network by connecting the disease similarity network and the microbe similarity network through the known microbe-disease association network, and then develop a novel computational model to predict human microbe-disease associations based on random walk by integrating network topological similarity (NTSHMDA). Specifically, each microbe-disease association pair is regarded as a distinct relationship level and, thus, assigned a different weight based on network topological similarity. The experimental results show that NTSHMDA outperforms some state-of-the-art methods, with average AUCs of 0.9070 and 0.8896 ± 0.0038 in the frameworks of leave-one-out cross validation and 5-fold cross validation, respectively. In case studies, 9, 18, and 38 of the top-10, 20, and 50 candidate microbes for asthma, and 9, 18, and 45 for inflammatory bowel disease, are verified by recently published literature. In conclusion, NTSHMDA has the potential to identify novel disease-microbe associations and can also provide valuable information for drug discovery and biological research.

55 citations


Journal Article
TL;DR: A deep CNN architecture for automated sleep stage classification of human sleep EEG and EOG signals is designed and it spontaneously discovers signal features such as sleep spindles and slow waves that figure prominently in sleep stage categorization as performed by human experts.
Abstract: Convolutional neural networks (CNN) have demonstrated state-of-the-art classification results in image categorization, but have received comparatively little attention for classification of one-dimensional physiological signals. We design a deep CNN architecture for automated sleep stage classification of human sleep EEG and EOG signals. The CNN proposed in this paper amply outperforms recent work that uses a different CNN architecture over a single-EEG-channel version of the same dataset. We show that the performance gains achieved by our network rely mainly on network depth, and not on the use of several signal channels. Performance of our approach is on par with human expert inter-scorer agreement. By examining the internal activation levels of our CNN, we find that it spontaneously discovers signal features such as sleep spindles and slow waves that figure prominently in sleep stage categorization as performed by human experts.

55 citations


Journal Article
TL;DR: A comprehensive review and assessment of various amino acid encoding methods is presented, showing that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application.
Abstract: As the first step of machine-learning based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Different from protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining it with different algorithms. However, it has not attracted enough attention in the past decades, and there are no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and present a comprehensive review and assessment of various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks by using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and the structure-based and machine-learning encoding methods also show some potential for further application; the neural network based distributed representation of amino acids in particular may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.

55 citations


Journal Article
TL;DR: The experimental results show that the proposed computational method, which is compared with the state-of-the-art SVM classifier and other existing methods on the same benchmark data set, can serve as a useful tool for predicting RNA-protein interactions.
Abstract: Emerging evidence has shown that RNA plays a crucial role in many cellular processes, and its biological functions are primarily achieved by binding with a variety of proteins. High-throughput biological experiments provide a lot of valuable information for the initial identification of RNA-protein interactions (RPIs), but with the increasing complexity of RPI networks, this approach is becoming increasingly expensive and time-consuming. Therefore, there is an urgent need for fast and reliable methods to predict RNA-protein interactions. In this study, we propose a computational method for predicting RNA-protein interactions using sequence information. The deep learning convolutional neural network (CNN) algorithm is utilized to mine the hidden high-level discriminative features from the RNA and protein sequences and feed them into the extreme learning machine (ELM) classifier. The experimental results with 5-fold cross-validation indicate that the proposed method achieves superior performance on benchmark datasets (RPI1807, RPI2241, and RPI369) with accuracies of 98.83, 90.83, and 85.63 percent, respectively. We further evaluate the performance of the proposed model by comparing it with the state-of-the-art SVM classifier and other existing methods on the same benchmark data set. In addition, we predicted the independent NPInter v2.0 data set using the model trained on RPI369. The experimental results show that our model can serve as a useful tool for predicting RNA-protein interactions.

53 citations


Journal Article
TL;DR: A weakly-supervised convolutional neural network architecture (WSCNN), combining multiple-instance learning (MIL) with CNN, is proposed to further boost the performance of predicting protein-DNA binding.
Abstract: Although convolutional neural networks (CNN) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the intrinsic weakly-supervised information of DNA sequences, namely that a bound sequence may contain multiple TFBSs. Here, we propose a weakly-supervised convolutional neural network architecture (WSCNN), combining multiple-instance learning (MIL) with CNN, to further boost the performance of predicting protein-DNA binding. WSCNN first divides each DNA sequence into multiple overlapping subsequences (instances) with a sliding window, then separately models each instance using CNN, and finally fuses the predicted scores of all instances in the same bag using four fusion methods: Max, Average, Linear Regression, and Top-Bottom Instances. The experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy. In addition, through a series of experiments, we give a quantitative analysis of the importance of the reverse-complement mode in predicting in vivo protein-DNA binding, and explain why we do not directly use advanced pooling layers to combine MIL with CNN.

53 citations
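
The multiple-instance pipeline described in the WSCNN abstract (overlapping windows, per-instance scoring, score fusion) can be sketched as follows. This is a toy illustration under stated assumptions: the per-instance CNN is replaced by a placeholder scorer (`toy_instance_score`, GC fraction), the window and stride values are arbitrary, and only the Max and Average fusions are shown because the abstract does not define the Linear Regression and Top-Bottom Instances variants.

```python
def sliding_windows(sequence, window, stride):
    """Split a sequence into overlapping subsequences (the instances of a bag)."""
    if window >= len(sequence):
        return [sequence]
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, stride)]


def toy_instance_score(instance):
    """Placeholder for the per-instance model: fraction of G/C bases.

    A real system would score each instance with a trained CNN instead.
    """
    return sum(base in "GC" for base in instance) / len(instance)


def fuse(scores, method):
    """Combine instance scores into one bag-level score."""
    if method == "max":
        return max(scores)
    if method == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unsupported fusion method: {method}")


bag = sliding_windows("ACGTGGCCAT", window=4, stride=2)
scores = [toy_instance_score(instance) for instance in bag]
bag_max = fuse(scores, "max")
bag_avg = fuse(scores, "average")

print(bag)      # instances in the bag
print(scores)   # one score per instance
print(bag_max, bag_avg)
```

Max fusion lets a single strongly scoring window decide the bag label, while Average fusion rewards sequences in which several windows score well.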


Journal Article
TL;DR: A new feature extractor, called deep manifold preserving autoencoder, is proposed to learn discriminative features from a large amount of unlabeled data by preserving the structure of the input datasets from the manifold learning view and minimizing reconstruction error from the deep learning view.
Abstract: Classifying breast cancer histopathological images automatically is an important task in computer assisted pathology analysis. However, extracting informative and non-redundant features for histopathological image classification is challenging due to the appearance variability caused by the heterogeneity of the disease, the tissue preparation, and staining processes. In this paper, we propose a new feature extractor, called deep manifold preserving autoencoder, to learn discriminative features from unlabeled data. Then, we integrate the proposed feature extractor with a softmax classifier to classify breast cancer histopathology images. Specifically, it learns hierarchical features from unlabeled image patches by minimizing the distance between its input and output, and simultaneously preserving the geometric structure of the whole input data set. After the unsupervised training, we connect the encoder layers of the trained deep manifold preserving autoencoder with a softmax classifier to construct a cascade model and fine-tune this deep neural network with labeled training data. The proposed method learns discriminative features from a large amount of unlabeled data by preserving the structure of the input datasets from the manifold learning view and minimizing reconstruction error from the deep learning view. Extensive experiments on the public breast cancer dataset (BreaKHis) demonstrate the effectiveness of the proposed method.

53 citations


Journal Article
TL;DR: It is found that essential proteins appear in triangular structures in PPI networks significantly more often than nonessential ones, and a novel pure centrality measure, called Neighborhood Closeness Centrality (NCC), is proposed.
Abstract: Identifying essential proteins plays an important role in disease study, drug design, and understanding the minimal requirement for cellular life. Computational methods for essential protein discovery overcome the disadvantages of biological experimental methods, which are often time-consuming, expensive, and inefficient. The topological features of protein-protein interaction (PPI) networks are often used to design computational prediction methods, such as Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), Information Centrality (IC), and Neighborhood Centrality (NC). However, the prediction accuracies of these individual methods still have room for improvement. Studies show that additional information, such as orthologous relations, helps discover essential proteins. Many researchers have proposed different methods that combine multiple information sources to improve prediction accuracy. In this study, we find that essential proteins appear in triangular structures in PPI networks significantly more often than nonessential ones. Based on this phenomenon, we propose a novel pure centrality measure, called Neighborhood Closeness Centrality (NCC). Accordingly, we develop a new combination model, the Extended Pareto Optimality Consensus model (EPOC), to fuse NCC and orthology information, and thereby propose a novel essential protein identification method, NCCO. Compared with seven existing classic centrality methods (DC, BC, IC, CC, SC, EC, and NC) and three consensus methods (PeC, ION, and CSC), our results on S. cerevisiae and E. coli datasets show that NCCO has clear advantages. As a consensus method, EPOC also yields better performance than the random walk model.

45 citations
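
The observation about triangular structures can be made concrete by counting, for each protein, how many triangles it participates in. The sketch below is not the NCC formula, which the abstract does not give; it only illustrates the underlying quantity on a toy undirected network with hypothetical protein names `P1` to `P4`.

```python
def triangles_per_node(edges):
    """Count the triangles each node participates in, for an undirected graph."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    counts = {}
    for node, neighbors in adjacency.items():
        # For each neighbor, count the neighbors it shares with `node`.
        # Every triangle through `node` is seen twice (once per neighbor
        # in the pair), hence the division by 2.
        shared = sum(len(adjacency[n] & neighbors) for n in neighbors)
        counts[node] = shared // 2
    return counts


# Toy interaction network: P1, P2, P3 form a triangle; P4 hangs off P3.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
triangle_counts = triangles_per_node(toy_edges)
print(triangle_counts)
```

On a real PPI network, comparing these counts between known essential and nonessential proteins would be one way to check the reported tendency.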


Journal Article
TL;DR: A deep learning model using a two-dimensional convolutional neural network and position specific scoring matrices is proposed that could identify FMN interacting residues with a sensitivity of 83.7 percent, specificity of 99.2 percent, accuracy of 98.2 percent, and a Matthews correlation coefficient of 0.85, outperforming previous studies.
Abstract: Flavin mono-nucleotides (FMNs) are cofactors that hold responsibility for carrying and transferring electrons in the electron transport chain stage of cellular respiration. Without being facilitated by FMNs, energy production is stagnant due to the interruption in most of the cellular processes. Investigation of FMN functions, therefore, can provide a holistic understanding of human diseases and molecular information on drug targets. We proposed a deep learning model using a two-dimensional convolutional neural network and position specific scoring matrices that could identify FMN interacting residues with a sensitivity of 83.7%, specificity of 99.2%, accuracy of 98.2%, and a Matthews correlation coefficient of 0.85 for an independent dataset containing 141 FMN binding sites and 1,920 non-FMN binding sites. The proposed method outperformed previous studies using similar evaluation metrics. Our positive outcome can also promote the utilization of deep learning in dealing with various problems in bioinformatics and computational biology.

Journal Article
TL;DR: The best practices in gene expression data analysis in terms of analysis of (differential) co-expression, co-expression network, differential networking, and differential connectivity considering both microarray and RNA-seq data along with comparisons are reviewed.
Abstract: Analysis of gene expression data is widely used in transcriptomic studies to understand functions of molecules inside a cell and interactions among molecules. Differential co-expression analysis studies diseases and phenotypic variations by finding modules of genes whose co-expression patterns vary across conditions. We review the best practices in gene expression data analysis in terms of analysis of (differential) co-expression, co-expression network, differential networking, and differential connectivity considering both microarray and RNA-seq data along with comparisons. We highlight hurdles in RNA-seq data analysis using methods developed for microarrays. We include discussion of necessary tools for gene expression analysis throughout the paper. In addition, we shed light on scRNA-seq data analysis by including preprocessing and scRNA-seq in co-expression analysis along with useful tools specific to scRNA-seq. To gain insights, biological interpretation and functional profiling are included. Finally, we provide guidelines for the analyst, along with research issues and challenges which should be addressed.

Journal Article
TL;DR: It is concluded that CONDEL is a powerful tool for detecting copy number variations on single tumor samples even if these are sequenced at low-coverage.
Abstract: Characterizing copy number variations (CNVs) from sequenced genomes is both a feasible and cost-effective way to search for driver genes in cancer diagnosis. A number of existing algorithms for CNV detection only explored part of the features underlying sequence data and copy number structures, resulting in limited performance. Here, we describe CONDEL, a method for detecting CNVs from single tumor samples using high-throughput sequence data. CONDEL utilizes a novel statistic in combination with a peel-off scheme to assess the statistical significance of genome bins, and adopts a Bayesian approach to infer copy number gains, losses, and deletion zygosity based on statistical mixture models. We compare CONDEL to six peer methods on a large number of simulation datasets, showing improved performance in terms of true positive and false positive rates, and further validate CONDEL on three real datasets derived from the 1000 Genomes Project and the EGA archive. CONDEL obtained more consistent results than three other single sample-based methods, and exclusively identified a number of CNVs that were previously associated with cancers. We conclude that CONDEL is a powerful tool for detecting copy number variations on single tumor samples even if these are sequenced at low coverage.

Journal Article
TL;DR: The results of this study reveal that the size of the DNA storage coding set obtained by the DMVO algorithm increased by 4–16 percent, and the variance of the melting temperature decreased by 3–18 percent.
Abstract: At present, huge amounts of data are being produced every second, a situation that will gradually overwhelm current storage technology. DNA is a storage medium that features high storage density and long-term stability and is now considered to be a feasible storage solution. Errors are easily made during the sequencing and synthesis of DNA, however. In order to reduce the error rate, a novel uncorrelated address constraint is reported, and a Damping Multi-Verse Optimizer (DMVO) algorithm is proposed to construct a set of DNA codes, which is used as the non-payload. The DMVO algorithm exchanges objects through black/white holes in order to achieve a stable state and adds damping factors as disturbances. Compared with previous work, the coding set obtained by the DMVO algorithm is larger in size and of higher quality. The results of this study reveal that the size of the DNA storage coding set obtained by the DMVO algorithm increased by 4–16 percent, and the variance of the melting temperature decreased by 3–18 percent.

Journal Article
TL;DR: This paper establishes a novel computational method, named TargetDBP, for accurately targeting DBPs from primary sequences, and constructs a new gold-standard and non-redundant benchmark dataset from the PDB database to evaluate and compare the proposed TargetDBP with other existing predictors.
Abstract: Accurately identifying DNA-binding proteins (DBPs) from protein sequence information is an important but challenging task for protein function annotations. In this paper, we establish a novel computational method, named TargetDBP, for accurately targeting DBPs from primary sequences. In TargetDBP, four single-view features, i.e., AAC (Amino Acid Composition), PsePSSM (Pseudo Position-Specific Scoring Matrix), PsePRSA (Pseudo Predicted Relative Solvent Accessibility), and PsePPDBS (Pseudo Predicted Probabilities of DNA-Binding Sites), are first extracted as the base features. Second, a differential evolution algorithm is employed to learn the weights of the four base features. Using the learned weights, we combine these base features in a weighted manner to form the original super feature. An excellent subset of the super feature is then selected by using a suitable feature selection algorithm, SVM-REF+CBR (Support Vector Machine Recursive Feature Elimination with Correlation Bias Reduction). Finally, the prediction model is learned using a support vector machine on the selected feature subset. We also construct a new gold-standard and non-redundant benchmark dataset from the PDB database to evaluate and compare the proposed TargetDBP with other existing predictors. On this new dataset, TargetDBP can achieve higher performance than other state-of-the-art predictors. The TargetDBP web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/targetdbp/ for academic use.

Journal Article
TL;DR: This study proposes a computational model (named LDICDL) to identify lncRNA-disease associations based on collaborative deep learning that outperforms other state-of-the-art methods in prediction performance.
Abstract: It has been shown that long noncoding RNA (lncRNA) plays critical roles in many human diseases. Therefore, inferring associations between lncRNAs and diseases can contribute to disease diagnosis, prognosis, and treatment. To overcome the limitations of traditional experimental methods, which are expensive and time-consuming, several computational methods have been proposed to predict lncRNA-disease associations by fusing different biological data. However, the prediction performance of lncRNA-disease association identification needs to be improved. In this study, we propose a computational model (named LDICDL) to identify lncRNA-disease associations based on collaborative deep learning. It uses an automatic encoder to denoise multiple lncRNA feature information and multiple disease feature information, respectively. Then, the matrix decomposition algorithm is employed to predict the potential lncRNA-disease associations. In addition, to overcome the limitation of matrix decomposition, a hybrid model is developed to predict associations between new lncRNAs (or diseases) and diseases (or lncRNAs). Ten-fold cross validation and a de novo test are applied to evaluate the performance of the method. The experimental results show that LDICDL outperforms other state-of-the-art methods in prediction performance.

Journal Article
TL;DR: This work proposes a deep learning method for the discovery of breast cancer-related genes by using Capsule Network based Modeling of Multi-omics Data (CapsNetMMD), and identifies breast cancer-related genes with significantly better performance than other existing machine learning methods.
Abstract: Breast cancer is one of the most common cancers worldwide, causing more than 450,000 deaths each year. Although this malignancy has been extensively studied by a large number of researchers, its prognosis is still poor. Since therapeutic advances can be obtained based on gene signatures, there is an urgent need to discover genes related to breast cancer that may help uncover the mechanisms in cancer progression. We propose a deep learning method for the discovery of breast cancer-related genes by using Capsule Network based Modeling of Multi-omics Data (CapsNetMMD). In CapsNetMMD, we make use of known breast cancer-related genes to transform the issue of gene identification into the issue of supervised classification. The features of genes are generated through comprehensive integration of multi-omics data, e.g., mRNA expression, z scores for mRNA expression, DNA methylation, and two forms of DNA copy-number alterations (CNAs). By modeling features based on the capsule network, we identify breast cancer-related genes with significantly better performance than other existing machine learning methods. The predicted genes with prognostic values play potentially important roles in breast cancer and may serve as candidates for biologists and medical scientists in future studies of biomarkers.

Journal Article
TL;DR: In this study, an improved differential evolution with secondary structure and residue-residue contact information, referred to as SCDE, is proposed for protein structure prediction, and experimental results demonstrate that the proposed SCDE is effective and efficient.
Abstract: Ab initio protein tertiary structure prediction is one of the long-standing problems in structural bioinformatics. With the help of residue-residue contact and secondary structure prediction information, the accuracy of ab initio structure prediction can be enhanced. In this study, an improved differential evolution with secondary structure and residue-residue contact information referred to as SCDE is proposed for protein structure prediction. In SCDE, two score models based on secondary structure and contact information are proposed, and two selection strategies, namely, secondary structure-based selection strategy and contact-based selection strategy, are designed to guide conformation space search. A probability distribution function is designed to balance these two selection strategies. Experimental results on a benchmark dataset with 28 proteins and four free model targets in CASP12 demonstrate that the proposed SCDE is effective and efficient.

Journal Article
TL;DR: A method to predict new microbe-disease associations based on similarity and an improved bi-random walk on the disease and microbe networks, which makes reasonable use of the similarity of the microbe network and the disease network, is developed.
Abstract: Many current studies have shown that microbes play important roles in human diseases. Therefore, discovering the associations between microbes and diseases is beneficial to systematically understanding the mechanisms of diseases, and to diagnosing and treating complex diseases. It is well known that finding new potential microbe-disease associations via biological experiments is a time-consuming and expensive process. However, computational methods can provide an opportunity to effectively predict microbe-disease associations. In recent years, efforts toward predicting microbe-disease associations have not been proportional to the importance of microbes to human diseases. In this study, we develop a method (called BRWMDA) to predict new microbe-disease associations based on similarity and an improved bi-random walk on the disease and microbe networks. BRWMDA integrates the microbe network, the disease network, and known microbe-disease associations into a single network. After calculating the Gaussian Interaction Profile (GIP) kernel similarity of microbes based on known microbe-disease associations, the microbe network is obtained by adjusting the similarity with the logistic function. In addition, the disease network is computed by the similarity network fusion (SNF) method with the symptom-based similarity and the GIP kernel similarity based on known microbe-disease associations. Then, these two networks of microbes and diseases are connected by known microbe-disease associations. Based on the assumption that similar microbes are normally associated with similar diseases and vice versa, BRWMDA is employed to predict new potential microbe-disease associations via random walk with different steps on the microbe and disease networks, which makes reasonable use of the similarity of the microbe network and the disease network.
The 5-fold cross validation and Leave One Out Cross Validation (LOOCV) are adopted to assess the prediction performance of our BRWMDA algorithm, as well as other competing methods for comparison. In 5-fold cross validation, BRWMDA obtained the highest AUC value of 0.9087, compared with 0.9025 (NGRHMDA), 0.8797 (LRLSHMDA), 0.8571 (KATZHMDA), 0.7782 (HGBI), and 0.5629 (NBI). BRWMDA also outperforms the other methods in LOOCV, with an AUC value of 0.9397 compared with 0.9111 (NGRHMDA), 0.8909 (LRLSHMDA), 0.8644 (KATZHMDA), 0.7866 (HGBI), and 0.5553 (NBI). Case studies also illustrate that BRWMDA is an effective method to predict microbe-disease associations.
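
The Gaussian Interaction Profile (GIP) kernel similarity mentioned in the BRWMDA abstract can be sketched as below. The abstract gives no formula, so this assumes the commonly used definition: each microbe is represented by its row of the binary association matrix, and similarity is a Gaussian of the squared distance between rows, with the bandwidth normalized by the mean squared row norm. The function name, the `bandwidth` default of 1.0, and the toy matrix are assumptions made for illustration; the logistic adjustment and the bi-random walk are not shown.

```python
import math


def gip_kernel(profiles, bandwidth=1.0):
    """Gaussian Interaction Profile kernel between rows of an association matrix.

    K[i][j] = exp(-gamma * ||p_i - p_j||^2), where
    gamma = bandwidth / mean_i(||p_i||^2)  (assumed normalization).
    """
    squared_norms = [sum(x * x for x in p) for p in profiles]
    mean_squared_norm = sum(squared_norms) / len(profiles)
    if mean_squared_norm == 0:
        raise ValueError("association matrix has no known associations")
    gamma = bandwidth / mean_squared_norm

    n = len(profiles)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * dist2)
    return kernel


# Toy association matrix: 3 microbes (rows) x 3 diseases (columns).
associations = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
similarity = gip_kernel(associations)
for row in similarity:
    print([round(value, 4) for value in row])
```

Microbes with identical association profiles get similarity 1, and the similarity decays toward 0 as the profiles diverge.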

Journal ArticleDOI
TL;DR: A new model-based scheme for the construction of the Spatial and Temporal Active Protein Interaction Network (ST-APIN) by integrating time-course gene expression data and subcellular location information is proposed.
Abstract: The rapid development of proteomics and high-throughput technologies has produced a large amount of Protein-Protein Interaction (PPI) data, which makes it possible to consider the dynamic properties of protein interaction networks (PINs) instead of their static properties. Identifying protein complexes from dynamic PINs has become a vital scientific problem for understanding cellular life in the post-genome era. Many models and methods have been proposed for constructing dynamic PINs to identify protein complexes. However, most of the constructed dynamic PINs focus only on temporal dynamic information and thus overlook the spatial dynamic information of complex biological systems. To address this limitation of existing dynamic PIN analysis approaches, in this paper we propose a new model-based scheme for the construction of the Spatial and Temporal Active Protein Interaction Network (ST-APIN) by integrating time-course gene expression data and subcellular location information. To evaluate the efficiency of ST-APIN, the commonly used classical clustering algorithm MCL is adopted to identify protein complexes from ST-APIN and three other dynamic PINs: NF-APIN, DPIN, and TC-PIN. The experimental results show that MCL performs better on ST-APIN than on the other three dynamic PINs in terms of matching with known complexes, sensitivity, specificity, and f-measure. Furthermore, we evaluate the identified protein complexes by Gene Ontology (GO) function enrichment analysis. The validation shows that the protein complexes identified from ST-APIN are more biologically significant. This study provides a general paradigm for constructing ST-APINs, which is essential for further understanding of molecular systems and the biomedical mechanisms of complex diseases.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a new method called ModuleSim to measure associations between diseases by using disease-gene association data and protein-protein interaction network (PPIN) data based on disease module theory.
Abstract: Quantifying the associations between diseases now plays an important role in modern biology and medicine. Discovering associations between diseases could help us gain deeper insights into the pathogenic mechanisms of complex diseases, and thus could lead to improvements in disease diagnosis, drug repositioning, and drug development. Owing to the growing body of high-throughput biological data, a number of methods have been developed for computing similarity between diseases during the past decade. However, these methods rarely consider the interconnections of genes related to each disease in the protein-protein interaction network (PPIN). Recently, the disease module theory has been proposed, which states that disease-related genes or proteins tend to interact with each other in the same neighborhood of a PPIN. In this study, we propose a new method called ModuleSim to measure associations between diseases by using disease-gene association data and PPIN data based on disease module theory. The experimental results show that by considering the interactions between disease modules and their modularity, the disease similarity calculated by ModuleSim has a significant correlation with the disease classification of Disease Ontology (DO). Furthermore, ModuleSim outperforms four other popular methods, all of which use disease-gene association data and PPIN data to measure disease-disease associations. In addition, the disease similarity network constructed by ModuleSim suggests that ModuleSim is capable of finding potential associations between diseases.

Journal ArticleDOI
TL;DR: A coevolutionary pattern-based prediction model for HIV-1 PR cleavage sites, namely EvoCleave, is proposed by integrating the coevolving information obtained from substrate sequences with a linear SVM classifier, and demonstrated that the coevolutionary patterns offered valuable insights into the understanding of substrate specificity of HIV-1 PR.
Abstract: Human immunodeficiency virus 1 (HIV-1) protease (PR) plays a crucial role in the maturation of the virus. The study of substrate specificity of HIV-1 PR as a new endeavor strives to increase our ability to understand how HIV-1 PR recognizes its various cleavage sites. To predict HIV-1 PR cleavage sites, most of the existing approaches have been developed solely based on the homogeneity of substrate sequence information with supervised classification techniques. Although efficient, these approaches are found to be restricted to the ability of explaining their results and probably provide few insights into the mechanisms by which HIV-1 PR cleaves the substrates in a site-specific manner. In this work, a coevolutionary pattern-based prediction model for HIV-1 PR cleavage sites, namely EvoCleave, is proposed by integrating the coevolving information obtained from substrate sequences with a linear SVM classifier. The experiment results showed that EvoCleave yielded a very promising performance in terms of ROC analysis and $f$-measure. We also prospectively assessed the biological significance of coevolutionary patterns by applying them to study three fundamental issues of HIV-1 PR cleavage site. The analysis results demonstrated that the coevolutionary patterns offered valuable insights into the understanding of substrate specificity of HIV-1 PR.

Journal ArticleDOI
TL;DR: A classifier based on an ensemble learning model, LightGBM, to estimate the interaction propensities of drugs and targets is constructed and it is indicated that GANDTI outperforms several state-of-the-art methods for DTI prediction.
Abstract: The computational prediction of novel drug-target interactions (DTIs) may hasten drug repositioning while reducing costs. Most previous methods integrated multiple kinds of connections between drugs and targets by constructing shallow models. These methods failed to learn low-dimension feature vectors for drugs and targets and ignored their distribution. We proposed a graph convolutional network and generative adversarial network-based method, GANDTI, to predict drug-target interactions. We constructed a drug-target heterogeneous network to integrate various connections between drugs and targets. A graph convolutional autoencoder was established to learn the network embedding of the drug and target nodes in a low-dimensional feature space. This encoder deeply integrated the various connections and attributes of drug and target nodes. A generative adversarial network was introduced to regularize the feature vectors into a Gaussian distribution. Severe class imbalance exists between known and unknown DTIs. We built a classifier based on the ensemble learning model, LightGBM, to evaluate the interaction propensities of drugs and targets. This classifier completely exploited all unknown DTIs and counteracted the negative effects of class imbalance. The experimental results indicated that GANDTI performs better than several state-of-the-art methods. Additionally, case studies of five drugs demonstrated the ability of GANDTI to discover potential drugs' targets.

Journal ArticleDOI
TL;DR: The results show that the proposed method can obtain better performance than traditional methods either in the stage of cell detection or cell life stage recognition, and it encourages and suggests the application in the development of new anticancer drug and cytopathology analysis of cancer patients in the near future.
Abstract: Detecting cancer cells and recognizing their life-cycle stages is an important step in analyzing cellular dynamics in automated cell-based experiments. In this work, a two-stage hierarchical method is proposed to detect and recognize different life stages of bladder cells by using two cascaded Convolutional Neural Networks (CNNs). First, a hybrid object proposal algorithm (called EdgeSelective) that combines EdgeBoxes and Selective Search is proposed to generate candidate object proposals, replacing the single Selective Search method used in Region-CNN (R-CNN). It exploits the advantages of different proposal-generation mechanisms so that each cell in the image is fully contained by at least one proposed region during detection. Then, the cells obtained in the previous step are used to train CNNs and extract features for cell life stage recognition. Finally, a series of comparison experiments is carried out. The results show that the proposed method performs better than traditional methods in both cell detection and cell life stage recognition, which suggests that it could be applied to the development of new anticancer drugs and to the cytopathology analysis of cancer patients in the near future.

Journal ArticleDOI
TL;DR: A framework that combines k-means clustering, t-test, sensitivity analysis, self-organizing map (SOM) neural network, and hierarchical clustering methods to classify LUAD into four subtypes provides a foundation for subtype-specific therapy of LUAD.
Abstract: As one of the most common malignancies in the world, lung adenocarcinoma (LUAD) is currently difficult to cure. However, the advent of precision medicine provides an opportunity to improve the treatment of lung cancer. Subtyping lung cancer plays an important role in performing a specific treatment. Here, we developed a framework that combines k-means clustering, t-test, sensitivity analysis, self-organizing map (SOM) neural network, and hierarchical clustering methods to classify LUAD into four subtypes. We determined that 24 differentially expressed genes could be used as therapeutic targets, and five genes (i.e., RTKN2, ADAM6, SPINK1, COL3A1, and COL1A2) could be potential novel markers for LUAD. Multivariate analysis showed that the four subtypes could serve as prognostic subtypes. Representative genes of each subtype were also identified, which could be potentially targetable markers for the different subtypes. The function and pathway enrichment analyses of these representative genes showed that the four subtypes have different pathological mechanisms. Mutations associated with the subtypes, e.g., epidermal growth factor receptor (EGFR) mutations in subtype 4 and tumor protein p53 (TP53) mutations in subtypes 1 and 2, could serve as potential markers for drug development. The four subtypes provide a foundation for subtype-specific therapy of LUAD.

Journal ArticleDOI
TL;DR: A deep matrix factorization model to predict lncRNA-disease associations (DMFLDA), which uses a cascade of non-linear hidden layers to learn latent representation to represent lncRNAs and diseases and performs better than the existing methods.
Abstract: A growing amount of evidence suggests that long non-coding RNAs (lncRNAs) play important roles in the regulation of biological processes in many human diseases. However, the number of experimentally verified lncRNA-disease associations is very limited. Thus, various computational approaches have been proposed to predict lncRNA-disease associations. Current matrix factorization-based methods cannot capture the complex non-linear relationship between lncRNAs and diseases, and traditional machine learning-based methods are not sufficiently powerful to learn the representation of lncRNAs and diseases. Thus, we propose a deep matrix factorization model to predict lncRNA-disease associations (DMFLDA for short). DMFLDA uses a cascade of non-linear hidden layers to learn latent semantic vectors that represent lncRNAs and diseases. By using non-linear hidden layers, DMFLDA captures a more complex non-linear relationship between lncRNAs and diseases than traditional matrix factorization-based methods. The low-dimensional representations of the lncRNAs and diseases are fused to estimate the new interaction value. To evaluate the performance of DMFLDA, we perform leave-one-out cross-validation on known experimentally verified lncRNA-disease associations. The experimental results show that DMFLDA performs better than the existing methods. The case studies show that many predicted interactions for colorectal cancer, prostate cancer, and renal cancer have been verified by recent biomedical literature.

Journal ArticleDOI
TL;DR: A novel framework is proposed that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis.
Abstract: Genome-Wide Association Studies (GWAS) are used to identify statistically significant genetic variants in case-control studies. The main objective is to find single nucleotide polymorphisms (SNPs) that influence a particular phenotype (i.e., disease trait). GWAS typically use a p-value threshold of $5 \times 10^{-8}$ to identify highly ranked SNPs. While this approach has proven useful for detecting disease-susceptible SNPs, evidence has shown that many of these are, in fact, false positives. Consequently, there is some ambiguity about the most suitable threshold for claiming genome-wide significance. Many believe that using lower p-values will allow us to investigate the joint epistatic interactions between SNPs and provide better insights into phenotype expression. One example that uses this approach is multifactor dimensionality reduction (MDR), which identifies combinations of SNPs that interact to influence a particular outcome. However, computational complexity increases exponentially as a function of higher-order combinations, making approaches like MDR difficult to implement. Even so, understanding epistatic interactions in complex diseases is a fundamental component of robust genotype-phenotype mapping. In this paper, we propose a novel framework that combines GWAS quality control and logistic regression with deep learning stacked autoencoders to abstract higher-order SNP interactions from large, complex genotyped data for case-control classification tasks in GWAS analysis. We focus on the challenging problem of classifying preterm births, which has a strong genetic component with unexplained heritability reportedly between 20 and 40 percent. A GWAS data set obtained from dbGaP is utilised, which contains predominantly urban low-income African-American women who had normal and preterm deliveries.
Epistatic interactions from the original SNP sequences were extracted through a deep learning stacked autoencoder model and used to fine-tune a classifier for discriminating between term and preterm birth observations. All models are evaluated using standard binary classifier performance metrics. The findings show that important information pertaining to SNPs and epistasis can be extracted from 4,666 raw SNPs generated using logistic regression (p-value = $5 \times 10^{-3}$) and used to fit a highly accurate classifier model. The following results (Sen = 0.9562, Spec = 0.8780, Gini = 0.9490, Logloss = 0.5901, AUC = 0.9745, and MSE = 0.2010) were obtained using 50 hidden nodes and (Sen = 0.9289, Spec = 0.9591, Gini = 0.9651, Logloss = 0.3080, AUC = 0.9825, and MSE = 0.0942) using 500 hidden nodes. The results were compared with a Support Vector Machine (SVM), a Random Forest (RF), and a Fisher's Linear Discriminant Analysis classifier, which all failed to improve on the deep learning approach.
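
As a reading aid for the metrics quoted in this abstract, the sketch below shows how sensitivity and specificity follow from confusion-matrix counts and how a Gini value relates to AUC. It assumes the reported Gini is the usual binary-classifier Gini coefficient, 2 * AUC - 1, which the abstract does not state; the function names and the confusion-matrix counts are hypothetical and are not taken from the paper.

```python
def sensitivity(true_pos, false_neg):
    """True-positive rate: share of actual positives that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True-negative rate: share of actual negatives that were rejected."""
    return true_neg / (true_neg + false_pos)


def gini_from_auc(auc):
    """Gini coefficient of a binary classifier, assumed here to be 2 * AUC - 1."""
    return 2.0 * auc - 1.0


# Hypothetical confusion-matrix counts, for illustration only.
example_sen = sensitivity(true_pos=90, false_neg=10)
example_spec = specificity(true_neg=80, false_pos=20)

# AUC values quoted in the abstract for 50 and 500 hidden nodes.
gini_50_nodes = gini_from_auc(0.9745)
gini_500_nodes = gini_from_auc(0.9825)
```

Under that assumption, the AUC values of 0.9745 and 0.9825 give Gini values of about 0.9490 and 0.9650, which match the reported 0.9490 and 0.9651 to within rounding.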

Journal ArticleDOI
TL;DR: This work proposes to use some deep-learning based predictive models in a stacked ensemble framework to improve the prognosis prediction of breast cancer from available multi-modal data sets and shows that this model produces better result than already existing approaches.
Abstract: Breast cancer is a highly aggressive type of cancer generally formed in the cells of the breast. A good predictive model can help in correct prognosis prediction of breast cancer. Previous works rely mostly on uni-modal data (selected gene expression) for predictive model design. In recent years, however, multi-modal cancer data sets have become available (gene expression, copy number alteration, and clinical). Motivated by the advances in deep-learning based models, in the current study we propose to use deep-learning based predictive models in a stacked ensemble framework to improve the prognosis prediction of breast cancer from available multi-modal data sets. One of the unique advantages of the proposed approach lies in the architecture of the model. It is a two-stage model. Stage one uses a convolutional neural network for feature extraction, while stage two uses the extracted features as input to the stack-based ensemble model. The predictive performance, evaluated using different performance measures, shows that this model produces better results than existing approaches. The model achieves an AUC of 0.93 and an accuracy of 90.2% at the medium stringency level (specificity = 95% and threshold = 0.4).

Journal ArticleDOI
TL;DR: NAPOLI as mentioned in this paper is a web server that combines large-scale analysis of conserved interactions in protein-ligand complexes at the atomic level, interactive visual representations, and comprehensive reports of the interacting residues/atoms to detect and explore conserved non-covalent interactions.
Abstract: Essential roles in biological systems depend on protein-ligand recognition, which is mostly driven by specific non-covalent interactions. Consequently, investigating these interactions contributes to understanding how molecular recognition occurs. A large-scale data set of protein-ligand complexes is now available in the Protein Data Bank, which has led to several tools being proposed to elucidate protein-ligand interactions. Nonetheless, there is no all-in-one tool that couples large-scale statistical, visual, and interactive analysis of conserved protein-ligand interactions. Therefore, we propose nAPOLI (Analysis of PrOtein-Ligand Interactions), a web server that combines large-scale analysis of conserved interactions in protein-ligand complexes at the atomic level, interactive visual representations, and comprehensive reports of the interacting residues/atoms to detect and explore conserved non-covalent interactions. We demonstrate the potential of nAPOLI in detecting important conserved interacting residues through four case studies: two involving human cyclin-dependent kinase 2 (CDK2), one related to ricin, and another related to the human nuclear receptor subfamily 3 (hNR3). nAPOLI proved suitable for identifying conserved interactions reported in the literature, as well as for highlighting additional interactions. Finally, we illustrate, with a virtual screening ligand selection, how nAPOLI can be widely applied in structural biology and drug design. nAPOLI is freely available at bioinfo.dcc.ufmg.br/napoli/ .

Journal ArticleDOI
TL;DR: The architecture of deep learning, which obtains high-level representations and handles noises and outliers presented in large-scale biological datasets, is introduced into the side information of genes in the Deep Collaborative Filtering (DCF) model and achieves substantially improved performance over other state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database.
Abstract: Accurate prioritization of potential disease genes is a fundamental challenge in biomedical research. Various algorithms have been developed to solve such problems. Inductive Matrix Completion (IMC) is one of the most reliable models owing to its well-established framework and its superior performance in predicting gene-disease associations. However, the IMC method does not hierarchically extract deep features, which might limit the quality of recovery. To address this, a deep learning architecture, which obtains high-level representations and handles the noise and outliers present in large-scale biological datasets, is applied to the side information of genes in our Deep Collaborative Filtering (DCF) model. Further, because negative examples are lacking, we also apply a Positive-Unlabeled (PU) learning formulation to low-rank matrix completion. Our approach achieves substantially improved performance over other state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database. Our approach is 10 percent more efficient than standard IMC in detecting a true association, and significantly outperforms other alternatives in terms of the precision-recall metric at the top-k predictions. Moreover, we also validate our approach on diseases with no previously known gene associations and on newly reported OMIM associations. The experimental results show that DCF is still satisfactory for ranking novel disease phenotypes as well as mining unexplored relationships. The source code and the data are available at https://github.com/xzenglab/DCF .