
Showing papers in "Journal of Bioinformatics and Computational Biology in 2020"


Journal ArticleDOI
TL;DR: Reanalysis of Readhead et al.'s data using highly sensitive and specific alternative methods finds no HHV7 reads in their samples; HHV6A reads were found in only 2 of their top 15 samples sorted by reported HHV6A abundance; recreation of Readhead et al.'s modified Viromescan method identifies reasons for its low specificity.
Abstract: Readhead et al. recently reported in Neuron the detection and association of human herpesviruses 6A (HHV6A) and 7 (HHV7) with Alzheimer's disease by shotgun sequencing. I was skeptical of the specificity of their modified Viromescan bioinformatics method and subsequent analysis for numerous reasons. Using their supplementary data, the prevalence of variola virus, the etiological agent of the eradicated disease smallpox, can be calculated at 97.5% of their Mount Sinai Brain Bank dataset. Reanalysis of Readhead et al.'s data using highly sensitive and specific alternative methods finds no HHV7 reads in their samples; HHV6A reads were found in only 2 out of their top 15 samples sorted by reported HHV6A abundance. Finally, recreation of Readhead et al.'s modified Viromescan method identifies reasons for its low specificity.

17 citations


Journal ArticleDOI
TL;DR: A quantitative spatio-temporal Ca2+ dynamic model is developed which includes the Ca2+ releasing channels ER leak and voltage-gated Ca2+ channel, buffering and re-uptaking mechanism in the T lymphocytes, and the coordinated combination of the incorporated parameters plays a significant role in Ca2+.
Abstract: T lymphocytes are white blood cells that play a central role in cell-mediated immunity. Ca2+ has its major signaling function when it is elevated in the cytosolic compartment. The free cytosolic Ca...

17 citations


Journal ArticleDOI
TL;DR: The updated protein block (PB) sequence database PDB-2-PBv3.0 contains PB sequences for 147,602 PDB structures comprising 400,355 protein chains and allows the user to download multiple PB records by parameter search and/or by a given list.
Abstract: Our protein block (PB) sequence database PDB-2-PBv1.0 provides PB sequences and dihedral angles for 74,297 protein structures comprising 103,252 protein chains of the Protein Data Bank (PDB) as of 2011. Since PBs have many practical applications and the PDB keeps growing, it becomes necessary to provide PB sequences for all PDB protein structures. The current updated PDB-2-PBv3.0 contains PB sequences for 147,602 PDB structures comprising 400,355 protein chains as of October 2019. Compared to our previous version PDB-2-PBv1.0, the current PDB-2-PBv3.0 represents a 2- and 4-fold increase in the number of protein structures and chains, respectively. Notably, it provides PB information for any protein chain, regardless of missing atom records in the PDB structure data. It includes protein interaction information with DNA and RNA along with their corresponding functional classes from the Nucleic Acid Database (NDB) and PDB. The updated version now allows the user to download multiple PB records by parameter search and/or by a given list. This database is freely accessible at http://bioinfo.bdu.ac.in/pb3.

15 citations


Journal ArticleDOI
TL;DR: The PROSPECT tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier and is able to accurately predict histidine phosphorylation sites from sequence information.
Abstract: Background: Phosphorylation of histidine residues plays crucial roles in signaling pathways and cell metabolism in prokaryotes such as bacteria. While evidence has emerged that protein histidine phosphorylation also occurs in more complex organisms, its role in mammalian cells has remained largely uncharted. Thus, it is highly desirable to develop computational tools that are able to identify histidine phosphorylation sites. Results: Here, we introduce PROSPECT, which enables fast and accurate prediction of proteome-wide histidine phosphorylation substrates and sites. Our tool is based on a hybrid method that integrates the outputs of two convolutional neural network (CNN)-based classifiers and a random forest-based classifier. Three feature encodings, namely one-of-K coding, enhanced grouped amino acids content (EGAAC) and composition of k-spaced amino acid group pairs (CKSAAGP), were taken as the inputs to the three classifiers, respectively. Our results show that it is able to accurately predict histidine phosphorylation sites from sequence information. Our PROSPECT web server is user-friendly and publicly available at http://PROSPECT.erc.monash.edu/. Conclusions: PROSPECT is superior to other pHis predictors in both running speed and prediction accuracy, and we anticipate that the PROSPECT web server will become a popular tool for identifying pHis sites in bacteria.

14 citations


Journal ArticleDOI
TL;DR: A computational method that uses recurrent neural networks (RNNs) to predict suitable strains for influenza vaccine recommendation; the results show significant matches of the recommended vaccine strains to the circulating strains.
Abstract: Influenza viruses are persistently threatening public health, causing annual epidemics and sporadic pandemics due to rapid viral evolution. Vaccines are used to prevent influenza infections but the...

11 citations


Journal ArticleDOI
TL;DR: Experimental results show that with the reconstructed PINs obtained by the proposed denoising approach, complex detection performance is clearly improved, in most cases by over 5%, sometimes even by 200%.
Abstract: Identifying protein complexes is an important issue in computational biology, as it benefits the understanding of cellular functions and the design of drugs. In the past decades, many computational methods have been proposed that mine dense subgraphs in Protein-Protein Interaction Networks (PINs). However, the high rate of false positive/negative interactions in PINs prevents accurate detection of complexes directly from the raw PINs. In this paper, we propose a denoising approach for protein complex detection using a variational graph auto-encoder. First, we embed a PIN into a vector space with a stacked graph convolutional network (GCN), then decide which interactions in the PIN are credible. If the probability of an interaction being credible is less than a threshold, we delete the interaction. In this way, we reconstruct a reliable PIN. Following that, we detect protein complexes in the reconstructed PIN using several typical detection methods, including CPM, Coach, DPClus, GraphEntropy, IPCA and MCODE, and compare the results with those obtained directly from the original PIN. We conduct the empirical evaluation on four yeast PPI datasets (Gavin, Krogan, DIP and Wiphi) and two human PPI datasets (Reactome and Reactomekb), against two yeast complex benchmarks (CYC2008 and MIPS) and three human complex benchmarks (REACT, REACT_uniprotkb and CORE_COMPLEX_human), respectively. Experimental results show that with the reconstructed PINs obtained by our denoising approach, complex detection performance is clearly improved, in most cases by over 5%, sometimes even by 200%. Furthermore, we compare our approach with two existing denoising methods (RWS and RedNemo) while varying different matching rates on separate complex distributions. Our results show that in most cases (over 2/3), the proposed approach outperforms the existing methods.
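
The edge-deletion step described in this abstract can be illustrated with a minimal sketch. It assumes the credibility probabilities have already been produced by some model (in the paper, a variational graph auto-encoder); the function name `denoise_network`, the default threshold and the example probabilities are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def denoise_network(edge_prob: Dict[Edge, float], threshold: float = 0.5) -> Set[Edge]:
    """Return the interactions whose credibility probability is not below the threshold.

    Interactions with probability < threshold are deleted, as the abstract describes.
    Edges are treated as undirected, so each pair is stored in sorted order.
    """
    kept: Set[Edge] = set()
    for (u, v), prob in edge_prob.items():
        if prob >= threshold:
            a, b = sorted((u, v))
            kept.add((a, b))
    return kept


# Hypothetical probabilities for three interactions (illustrative values only).
example = {("P1", "P2"): 0.91, ("P2", "P3"): 0.12, ("P3", "P1"): 0.55}
reliable = denoise_network(example, threshold=0.5)
print(sorted(reliable))
```

The reconstructed edge set would then be passed to any complex detection method in place of the raw network.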

11 citations


Journal ArticleDOI
TL;DR: A review on the major existing scRNA-seq data clustering methods, and a comprehensive performance comparison among them from multiple perspectives, shows that the existing methods are very diverse in performance.
Abstract: Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were compared with only some of their counterparts, usually using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.

10 citations


Journal ArticleDOI
TL;DR: A deep learning-based approach (EDeepSSP) that employs convolutional neural networks (CNNs) architecture for automatic feature extraction and effectively predicts splice sites is proposed and has outperformed many state-of-the-art approaches.
Abstract: Splice site prediction is crucial for understanding underlying gene regulation and gene function, and for better genome annotation. Many computational methods exist for recognizing splice sites. Although most of the methods achieve competent performance, their interpretability remains challenging. Moreover, all traditional machine learning methods extract features manually, which is a tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs a convolutional neural network (CNN) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, opens up the opaque nature of the CNN by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments have been conducted on six benchmark acceptor and donor datasets of human, cress, and fly. The results show that EDeepSSP has outperformed many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26% on human donor datasets, respectively. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against known motifs of the JASPAR splice site database.

9 citations


Journal ArticleDOI
TL;DR: The easyAmber takes the molecular dynamics to the next level in terms of usability for complex processing of large volumes of data, thus supporting the recent trend away from inefficient "static" approaches in biology toward a deeper understanding of the dynamics in protein structures.
Abstract: Conformational plasticity of the functionally important regions and binding sites in protein/enzyme structures is one of the key factors affecting their function and interaction with substrates/lig...

7 citations


Journal ArticleDOI
TL;DR: Inverse covariance matrix is constructed based on the use of PCCs when the normality assumption can be moderately or severely violated for capturing a wide range of distributional features and complex dependency structure for breast cancer analysis.
Abstract: Many biological and biomedical research areas such as drug design require analyzing Gene Regulatory Networks (GRNs) to provide clear insight into and understanding of the cellular processes in live cells. Under a normality assumption for the genes, GRNs can be constructed by assessing the nonzero elements of the inverse covariance matrix. Nevertheless, such techniques are unable to deal with the non-normality, multi-modality and heavy-tailedness that are commonly seen in current massive genetic data. To relax this restrictive constraint, one can apply a copula function, which is a multivariate cumulative distribution function with uniform marginal distributions. However, since the dependency structures of different pairs of genes in a multivariate problem are very different, the regular multivariate copula will not allow for the construction of an appropriate model. The solution to this problem is to use Pair-Copula Constructions (PCCs), which are decompositions of a multivariate density into a cascade of bivariate copulas and therefore assign a different bivariate copula function to each local term. In this paper, we construct the inverse covariance matrix based on PCCs when the normality assumption can be moderately or severely violated, capturing a wide range of distributional features and complex dependency structures. To learn the non-Gaussian model for the considered GRN with non-Gaussian genomic data, we apply a modified version of the copula-based PC algorithm in which the normality assumption on the marginal densities is dropped. This paper also considers the Dynamic Time Warping (DTW) algorithm to determine the existence of a time delay relation between two genes. Breast cancer is one of the most common diseases in the world, and GRN analysis of its subtypes is considerably important, since, by revealing the differences in the GRNs of these subtypes, new therapies and drugs can be found. The findings of our research are used to construct GRNs with high performance for various subtypes of breast cancer, rather than simply using previous models.
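
The abstract mentions Dynamic Time Warping (DTW) for detecting a time delay relation between two genes. The following is a minimal sketch of the textbook DTW distance with an absolute-difference local cost; it is not the authors' implementation, and the two expression profiles are made-up values chosen so that the second is the first delayed by one time point.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-programming DTW distance between two time series."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again with b[j-1] after b[:j] consumed a[:i-1]
                cost[i][j - 1],      # b[j-1] matched again with a[i-1]
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# Hypothetical expression profiles: gene_y repeats gene_x with a one-step delay.
gene_x = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
gene_y = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

pointwise = sum(abs(x - y) for x, y in zip(gene_x, gene_y))
print(pointwise, dtw_distance(gene_x, gene_y))
```

A pointwise comparison penalizes the shift, while the warped alignment absorbs it, which is why DTW is a natural candidate for spotting delayed co-expression.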

7 citations


Journal ArticleDOI
TL;DR: BENIN is a general framework that jointly considers different types of prior knowledge with expression datasets to improve the network inference and uses a popular penalized regression method, the Elastic net, combined with bootstrap resampling to solve it.
Abstract: Gene regulatory network inference is one of the central problems in computational biology. We need models that integrate the variety of data available in order to use their complementary information to overcome the issues of noisy and limited data. BENIN (Biologically Enhanced Network INference) is our proposal to integrate data and infer more accurate networks. BENIN is a general framework that jointly considers different types of prior knowledge with expression datasets to improve the network inference. The method states the network inference as a feature selection problem and uses a popular penalized regression method, the Elastic net, combined with bootstrap resampling to solve it. BENIN significantly outperforms the state-of-the-art methods on the simulated data from the DREAM 4 challenge when combining genome-wide location data, knockout gene expression data, and time series expression data.

Journal ArticleDOI
TL;DR: It is found that genes whose final products are associated with the cytosolic ribosome have expressions that are highly stable with respect to the total RNA content, and these genes appear to be stable in bulk measurements as well.
Abstract: Motivation: In single-cell RNA-sequencing (scRNA-seq) experiments, RNA transcripts are extracted and measured from isolated cells to understand gene expression at the cellular level. Measurements from this technology are affected by many technical artifacts, including batch effects. In analogous bulk gene expression experiments, external references, e.g. synthetic gene spike-ins often from the External RNA Controls Consortium (ERCC), may be incorporated into the experimental protocol for use in adjusting measurements for technical artifacts. In scRNA-seq experiments, the use of external spike-ins is controversial due to dissimilarities with endogenous genes and uncertainty about sufficient precision of their introduction. Instead, endogenous genes with highly stable expression could be used as references within scRNA-seq to help normalize the data. First, however, a specific notion of stable expression at the single-cell level needs to be formulated; genes could be stable in absolute expression, in proportion to cell volume, or in proportion to total gene expression. Different types of stable genes will be useful for different normalizations and will need different methods for discovery. Results: We compile gene sets whose products are associated with cellular structures and record these gene sets for future reuse and analysis. We find that genes whose final products are associated with the cytosolic ribosome have expressions that are highly stable with respect to the total RNA content. Notably, these genes appear to be stable in bulk measurements as well. Supplementary information: Supplementary data are available through GitHub (johanngb/sc-stable).

Journal ArticleDOI
TL;DR: A new POS-based tagging schema that subdivides the dominant class into smaller, more balanced units is developed to solve the problem of the unbalanced dataset produced by the BMEWO tagging schema and to enforce sequence modeling.
Abstract: The automatic extraction of disease named entities is a challenging research problem that has attracted attention from the biomedical text mining community. Handcrafted feature methods were employed for this task with little success, since they are limited by the scope of the expert. Lately, deep learning-based methods have been employed to solve this issue. However, most architectures used for this task take into consideration long dependencies only. The proposed method is a two-stage deep neural network model. We start by discovering local dependencies and creating high-level features from word embedding inputs using a deep convolutional neural network. Then we identify long dependencies using a bi-directional recurrent neural network. To solve the problem of the unbalanced dataset produced by the BMEWO tagging schema and to enforce sequence modeling, we developed a new POS-based tagging schema that subdivides the dominant class into smaller, more balanced units. The proposed system was trained and tested on NCBI and achieved an F1-score of 85.59, outperforming the current state-of-the-art methods. Our research results show the effectiveness of using both long and short dependencies. The results also illustrate the benefits of combining different word embedding techniques and of incorporating morphological features in this task.

Journal ArticleDOI
TL;DR: It appears feasible that mutation at the -2 nucleotide does not impede promoter activity yet alters its physical properties, thus affecting differential RNA polymerase/promoter interaction, and it is suggested that mutations at this position therefore do not cause significant changes in terms of promoter activity.
Abstract: RNA polymerase/promoter recognition represents a basic problem of molecular biology. Decades-long efforts were made in the area, and yet certain challenges persist. The usage of certain most suitab...

Journal ArticleDOI
TL;DR: The findings indicate that divergence in paralogous interaction networks reflects a shared genetic origin, and that this approach may be useful for investigating structural similarity in the interaction networks of paralogous genes.
Abstract: Current high-throughput experimental techniques make it feasible to infer gene regulatory interactions at the whole-genome level with reasonably good accuracy. Such experimentally inferred regulato...

Journal ArticleDOI
TL;DR: This tutorial details some classical computational methods, from a computational perspective, with the transcription in an algorithmic format towards an easy access by researchers.
Abstract: Cancer is a complex disease caused by the accumulation of genetic alterations during the individual’s life. Such alterations are called genetic mutations and can be divided into two groups: (1) Pas...

Journal ArticleDOI
TL;DR: Methods associated with TE paradigm are more robust compared to TC methods, obtaining trees with more similar topologies in relation to reference trees, and the criteria used to infer the reference evolutionary hypotheses are compared.
Abstract: Phylogenetic inference proposes an evolutionary hypothesis for a group of taxa, which is usually represented as a phylogenetic tree. The use of several distinct kinds of biological evidence has been shown to produce more resolved phylogenies than single-evidence approaches. Currently, two conflicting paradigms are applied to combine biological evidence: taxonomic congruence (TC) and total evidence (TE). Although the literature recommends the application of these paradigms depending on the congruence of the input data, the resultant evolutionary hypotheses could vary according to the strategy used to combine the biological evidence, biasing the resultant topologies of the trees. In this work, we evaluate the ability of different strategies associated with both paradigms to produce integrated evolutionary hypotheses by considering different features of the data: missing biological evidence, diversity among sequences, complexity, and congruence. Using datasets from the literature, we compare the resultant trees with reference hypotheses obtained by applying two inference criteria: maximum parsimony and likelihood. The results show that methods associated with the TE paradigm are more robust than TC methods, obtaining trees whose topologies are more similar to those of the reference trees. These results are obtained regardless of (1) the features of the data, (2) the estimated evolutionary rates, and (3) the criteria used to infer the reference evolutionary hypotheses.

Journal ArticleDOI
TL;DR: Period control of the mammalian cell cycle via coupling with the cellular clock is studied, and a hypothesis of a Growth Factor (GF)-responsive clock, involving a pathway of the non-essential cell cycle complex cyclin D/CDK4, is proposed.
Abstract: In this work, we study period control of the mammalian cell cycle via coupling with the cellular clock. For this, we make use of the oscillators' synchronization dynamics and investigate methods of slowing down the cell cycle with the use of clock inputs. Clock control of the cell cycle is well established via identified molecular mechanisms, such as the CLOCK:BMAL1-mediated induction of the wee1 gene, resulting in the WEE1 kinase that represses the active form of mitosis promoting factor (MPF), the essential cell cycle component. To investigate the coupling dynamics of these systems, we use previously developed models of the clock and cell cycle oscillators and center our studies on unidirectional clock → cell cycle coupling. Moreover, we propose a hypothesis of a Growth Factor (GF)-responsive clock, involving a pathway of the non-essential cell cycle complex cyclin D/CDK4. We observe a variety of rational ratios of clock to cell cycle period, such as 1:1, 3:2, 4:3, and 5:4. Finally, our protocols of period control are successful in effectively slowing down the cell cycle by the use of clock-modulating inputs, some of which correspond to existing drugs.
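
To make the "rational ratios of clock to cell cycle period" concrete, here is a small sketch that maps two measured periods to the nearest small-integer ratio. The helper name `period_ratio`, the denominator bound and the example periods (in hours) are assumptions for illustration; the abstract does not say how the ratios were identified.

```python
from fractions import Fraction


def period_ratio(clock_period: float, cycle_period: float, max_denominator: int = 5) -> Fraction:
    """Nearest rational p:q (with q <= max_denominator) to clock_period / cycle_period."""
    return Fraction(clock_period / cycle_period).limit_denominator(max_denominator)


# Hypothetical periods in hours, not values from the paper.
print(period_ratio(24.0, 16.0))  # exact 3:2 ratio
print(period_ratio(24.0, 18.1))  # close to a 4:3 ratio
```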

Journal ArticleDOI
TL;DR: Results suggest that optimizing the order in which taxa are added improves the likelihood of the resulting trees.
Abstract: Taxon addition order and branch lengths are optimized by genetic algorithms (GAs) within the fastDNAml algorithm for constructing phylogenetic trees of high likelihood. Results suggest that optimiz...

Journal ArticleDOI
TL;DR: The results indicate that the MSI is superior to the Youden index and odds ratio for describing resolving power, and can provide a better quantifiable evaluation of the resolving power of biomarkers with different cardinal numbers.
Abstract: Biomarkers are used for clinical diagnostic purposes, but existing indexes exhibit limitations in terms of the resolving power of biomarkers. This paper proposes a new index, the magnitude-standardized index (MSI), to describe the quantitative variations and resolving powers of different biomarkers. In MSI analysis models, variation scales for ratios and differences are considered simultaneously, and a higher MSI value implies a stronger risk or effect for a biological factor. We explain the rationale for the MSI via hybrid and geometric methods and verify its efficacy through simulation experiments. Our results indicate that the MSI is superior to the Youden index and odds ratio for describing resolving power. When two biomarkers with similar Youden index values, odds ratios, or MSI values but different positive test rates (or cardinal numbers) were combined, all three index values increased; however, only the MSI value remained relatively stable. For a very small cardinal number, such as that of a single nucleotide polymorphism, the MSI value is at most half of the maximum value (0.5), allowing comparisons between MSI values for biomarkers with different cardinal numbers. The MSI can thus provide a better quantifiable evaluation of the resolving power of biomarkers with different cardinal numbers.
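
For reference, the two existing indexes that the MSI is compared against can be computed from a 2x2 diagnostic table as sketched below, using their standard definitions (Youden index = sensitivity + specificity - 1; odds ratio = TP*TN / (FP*FN)). The MSI itself is not defined in this abstract, so it is deliberately not reproduced; the counts are made up.

```python
def youden_index(tp: int, fn: int, fp: int, tn: int) -> float:
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def odds_ratio(tp: int, fn: int, fp: int, tn: int) -> float:
    """Diagnostic odds ratio (TP * TN) / (FP * FN); undefined when FP or FN is zero."""
    if fp == 0 or fn == 0:
        raise ValueError("odds ratio is undefined when FP or FN is zero")
    return (tp * tn) / (fp * fn)


# Hypothetical counts: 50 cases (40 test-positive), 50 controls (20 test-positive).
j = youden_index(tp=40, fn=10, fp=20, tn=30)
ratio = odds_ratio(tp=40, fn=10, fp=20, tn=30)
print(round(j, 3), ratio)
```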

Journal ArticleDOI
TL;DR: An approach is reported that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets from functional annotation databases such as Gene Ontology; it achieves better performance, and the gene sets prioritized by the method are biologically meaningful.
Abstract: Clustering analysis of gene expression data is essential for understanding complex biological data, and is widely used in important biological applications such as the identification of cell subpopulations and disease subtypes. In commonly used methods such as hierarchical clustering (HC) and consensus clustering (CC), holistic expression profiles of all genes are often used to assess the similarity between samples for clustering. While these methods have proven successful in identifying sample clusters in many areas, they do not provide information about which gene sets (functions) contribute most to the clustering, thus limiting the interpretability of the resulting clusters. We hypothesize that integrating prior knowledge of annotated gene sets would not only achieve satisfactory clustering performance but also, more importantly, enable potential biological interpretation of clusters. Here we report ClusterMine, an approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets from functional annotation databases such as Gene Ontology. In addition to the cluster membership of each sample as provided by conventional approaches, it also outputs the gene sets that most likely contribute to the clustering, thus facilitating biological interpretation. We compare ClusterMine with conventional approaches on nine real-world experimental datasets that represent different application scenarios in biology. We find that ClusterMine achieves better performance and that the gene sets prioritized by our method are biologically meaningful. ClusterMine is implemented as an R package and is freely available at: www.genemine.org/clustermine.php.

Journal ArticleDOI
TL;DR: A multi-level hierarchical classifier framework to automatically assign taxonomy labels to DNA sequences using an alignment-free approach called spectrum kernel method for feature extraction and it is shown that the proposed framework is more robust to mutations and noise in sequence data than the non-hierarchical classifiers.
Abstract: Accurately identifying organisms based on their partially available genetic material is an important task for exploring the phylogenetic diversity in an environment. Specific fragments in the DNA sequence of a living organism have been defined as DNA barcodes and can be used as markers to identify species efficiently and effectively. The existing DNA barcode-based classification approaches suffer from three major issues: (i) most of them assume that the classification is done within a given taxonomic class and/or that input sequences are pre-aligned, (ii) high-performing classifiers, such as SVM, cannot scale to large taxonomies due to high memory requirements, and (iii) mutations and noise in input DNA sequences greatly reduce the taxonomic classification score. In order to address these issues, we propose a multi-level hierarchical classifier framework to automatically assign taxonomy labels to DNA sequences. We utilize an alignment-free approach called the spectrum kernel method for feature extraction. We build a proof-of-concept hierarchical classifier with two levels and evaluate it on real DNA sequence data from the Barcode of Life Data Systems. We demonstrate that the proposed framework provides a higher F1-score than regular classifiers. Besides, the hierarchical framework scales better to large datasets, enabling researchers to employ classifiers with high classification performance and high memory requirements on large datasets. Furthermore, we show that the proposed framework is more robust to mutations and noise in sequence data than non-hierarchical classifiers.
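
The spectrum kernel mentioned in this abstract represents a sequence by the counts of all its length-k substrings (k-mers) and compares two sequences by the inner product of those count vectors, so no alignment is needed. The sketch below shows that general idea only; the value of k and the toy sequences are assumptions, and the abstract does not specify the authors' settings.

```python
from collections import Counter


def kmer_spectrum(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a: str, seq_b: str, k: int) -> int:
    """Inner product of the two k-mer count vectors."""
    spec_a, spec_b = kmer_spectrum(seq_a, k), kmer_spectrum(seq_b, k)
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


# Toy sequences: they share the 2-mers "AC" and "CG".
print(spectrum_kernel("ACGT", "ACGA", k=2))
```

In a classifier, the k-mer count vectors (or the kernel values between sequence pairs) serve as the features, which is what makes the approach independent of sequence pre-alignment.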

Journal ArticleDOI
TL;DR: A new pipeline embedding DACs together with bona fide footprints resulting in the generation of a Predictive gene regulatory Network (PreNet) simply from ATAC-seq data is developed and it is demonstrated that PreNet can be used to unveil meaningful molecular regulatory pathways in a given cell type.
Abstract: Assay for transposase-accessible chromatin sequencing (ATAC-seq) provides an innovative approach to study chromatin status in multiple cell types. Moreover, it is also possible to efficiently extr...

Journal ArticleDOI
TL;DR: This work introduces a new type of cost function, which is related to the amount of fragmentation caused by a rearrangement, and presents some results about the lower and upper bounds for the fragmentation-weighted problems and the relation between the unweighted and the fragmentation-weighted approach.
Abstract: One of the main problems in Computational Biology is to find the evolutionary distance among species. In most approaches, such distance only involves rearrangements, which are mutations that alter ...

Journal ArticleDOI
TL;DR: A statistical analysis was carried out on a large dataset of experimentally characterized secondary structure elements to find over- or under-occurrences of specific amino acids defining the boundaries of helical moieties.
Abstract: The secondary and tertiary structure of a protein has a primary role in determining its function. Even though many folding prediction algorithms have been developed in the past decades - mainly based on the assumption that folding instructions are encoded within the protein sequence - experimental techniques remain the most reliable to establish protein structures. In this paper, we searched for signals related to the formation of α-helices. We carried out a statistical analysis on a large dataset of experimentally characterized secondary structure elements to find over- or under-occurrences of specific amino acids defining the boundaries of helical moieties. To validate our hypothesis, we trained various Machine Learning models, each equipped with an attention mechanism, to predict the occurrence of α-helices. The attention mechanism allows one to interpret the model's decision by weighing the importance the predictor gives to each part of the input. The experimental results show that different models focus on the same subsequences, which can be seen as codes driving the secondary structure formation.

Journal ArticleDOI
TL;DR: This research presents a novel approach to TMB topology prediction with the use of a cascading classifier and results show that the proposed methodology predicts TMB protein topologies with high accuracy for randomly selected proteins.
Abstract: Membrane proteins are a major focus for new drug discovery. Transmembrane beta-barrel (TMB) proteins play key roles in the translocation machinery, pore formation, membrane anchoring and ion exchange. Given their key roles and the difficulty in membrane protein structure determination, the use of computational modeling is essential. This paper focuses on the topology prediction of TMB proteins. In the field of bioinformatics, many years of research have been spent on the topology prediction of transmembrane alpha-helices. Efforts toward TMB protein topology prediction have been overshadowed, and the prediction accuracy could be improved with further research. Various methodologies have been developed in the past for the prediction of TMB protein topology; however, the use of a cascading classifier has never been fully explored. This research presents a novel approach to TMB topology prediction with the use of a cascading classifier. The MATLAB computer simulation results show that the proposed methodology predicts TMB protein topologies with high accuracy for randomly selected proteins. By using the cascading classifier approach, the best overall accuracy is 76.3% with a precision of 0.831 and recall or probability of detection of 0.799 for TMB topology prediction. The accuracy of 76.3% is achieved using a two-layer cascading classifier.

Journal ArticleDOI
TL;DR: An end-to-end prediction model using deep neural networks with long short-term memory units and attention mechanism, to predict the ectodomain shedding events of membrane proteins only by sequence information is designed.
Abstract: Membrane proteins play essential roles in modern medicine. In recent studies, some membrane proteins involved in ectodomain shedding events have been reported as potential drug targets and biomarkers of some serious diseases. However, there are few effective tools for identifying the shedding events of membrane proteins, so an effective prediction tool is needed. In this study, we design an end-to-end prediction model using deep neural networks with long short-term memory (LSTM) units and an attention mechanism to predict the ectodomain shedding events of membrane proteins from sequence information alone. First, evolutionary profiles are encoded from the original protein sequences by Position-Specific Iterated BLAST (PSI-BLAST) on the Uniref50 database. Then, the LSTM units, which contain memory cells, are used to hold information from past inputs to the network, and the attention mechanism is applied to detect sorting signals in proteins regardless of their position in the sequence. Finally, a fully connected dense layer and a softmax layer are used to obtain the final prediction results. Additionally, we try to reduce overfitting by using dropout, L2 regularization, and bagging ensemble learning during model training. To ensure a fair performance comparison, we first use cross-validation on a training dataset obtained from an existing paper. The average accuracy and area under the receiver operating characteristic curve (AUC) of five-fold cross-validation are 81.19% and 0.835 using our proposed model, compared to 75% and 0.78 by a previously published tool, respectively. To better validate the proposed model, we also evaluate its performance on an independent test dataset. The accuracy, sensitivity, and specificity are 83.14%, 84.08%, and 81.63% using our proposed model, compared to 70.20%, 71.97%, and 67.35% by the existing model. The experimental results validate that the proposed model can be regarded as a general tool for predicting ectodomain shedding events of membrane proteins. The pipeline of the model and prediction results can be accessed at the following URL: http://www.csbg-jlu.info/DeepSMP/.
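
The abstract above says the attention mechanism detects signals regardless of their position in the sequence. A minimal sketch of that idea, assuming plain dot-product attention over per-position vectors, might look as follows. The function name `attention_pool`, the query vector, and the toy inputs are hypothetical and are not taken from the paper's implementation.

```python
import math


def attention_pool(hidden_states, query):
    """Return (weights, context) for dot-product attention over positions.

    hidden_states: list of per-position vectors (e.g. LSTM outputs).
    query: a vector of the same dimension (assumed learned elsewhere).
    """
    # Score each position by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of the per-position vectors.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context


# Toy input: the "signal" vector sits at position 1 of 3.
weights, context = attention_pool([[0.0, 0.0], [5.0, 1.0], [0.0, 0.0]], [1.0, 0.0])
print(weights.index(max(weights)))
```

Because the context vector is a weighted sum over all positions, moving the signal vector to a different position changes which weight is largest but leaves the pooled context unchanged, which is one way to read "regardless of their position".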

Journal ArticleDOI
Lei Tian, Shu-Lin Wang
TL;DR: A new method to explore potential miRNA sponge interactions (EPMSIs) for breast cancer by applying a clustering algorithm called BCPlaid, which demonstrates powerful classification performance in each module.
Abstract: MicroRNA (miRNA) sponges’ regulatory mechanisms play an important role in developing human cancer. Herein, we develop a new method to explore potential miRNA sponge interactions (EPMSIs) for breast...

Journal ArticleDOI
TL;DR: This paper systematically studies the impact of low-confidence PPIs on the performance of complex detection methods using GO-based semantic similarity measures and finds that each complex detection algorithm significantly improves its performance after the filtration of low-similarity scored PPIs.
Abstract: Protein complexes are the cornerstones of most of the biological processes. Identifying protein complexes is crucial in understanding the principles of cellular organization with several important applications, including in disease diagnosis. Several computational techniques have been developed to identify protein complexes from protein-protein interaction (PPI) data (equivalently, from PPI networks). These PPI data have a significant amount of false positives, which is a bottleneck in identifying protein complexes correctly. Gene ontology (GO)-based semantic similarity measures can be used to assign a confidence score to PPIs. Consequently, low-confidence PPIs are highly likely to be false positives. In this paper, we systematically study the impact of low-confidence PPIs on the performance of complex detection methods using GO-based semantic similarity measures. We consider five state-of-the-art complex detection algorithms and nine GO-based similarity measures in the evaluation. We find that each complex detection algorithm significantly improves its performance after the filtration of low-similarity scored PPIs. It is also observed that the percentage improvement and the filtration percentage (of low-confidence PPIs) are highly correlated.
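
The filtration step described in the abstract above can be pictured as thresholding a scored edge list. The sketch below assumes each interaction already carries a GO-based similarity score in [0, 1]. The function name `filter_ppis`, the protein labels, the scores, and the 0.5 cutoff are all hypothetical illustrations, not values from the paper.

```python
def filter_ppis(scored_edges, threshold):
    """Drop interactions whose similarity score falls below the threshold.

    scored_edges: list of (protein_a, protein_b, score) tuples.
    Returns the kept edges and the percentage of edges filtered out.
    """
    kept = [(a, b, s) for a, b, s in scored_edges if s >= threshold]
    if not scored_edges:
        return kept, 0.0
    removed_pct = 100.0 * (len(scored_edges) - len(kept)) / len(scored_edges)
    return kept, removed_pct


# Hypothetical scored PPI network.
edges = [("P1", "P2", 0.9), ("P1", "P3", 0.2), ("P2", "P4", 0.6), ("P3", "P4", 0.1)]
kept, removed_pct = filter_ppis(edges, threshold=0.5)
print(len(kept), removed_pct)
```

The filtered edge list would then be passed to a complex detection algorithm in place of the raw network, and the removed percentage corresponds to the filtration percentage the abstract relates to the performance gain.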

Journal ArticleDOI
TL;DR: This work proposes a multiplex network-based framework that incorporates multiple protein interaction data from their physical, coexpression and phylogenetic profiles and outperformed many recent essential protein prediction techniques in the literature.
Abstract: Cell survival requires the presence of essential proteins. Detection of essential proteins is relevant not only because of the critical biological functions they perform but also the role played by them as a drug target against pathogens. Several computational techniques are in place to identify essential proteins based on protein-protein interaction (PPI) network. Essential protein detection using only physical interaction data of proteins is challenging due to its inherent uncertainty. Hence, in this work, we propose a multiplex network-based framework that incorporates multiple protein interaction data from their physical, coexpression and phylogenetic profiles. An extended version termed as multiplex eigenvector centrality (MEC) is used to identify essential proteins from this network. The methodology integrates the score obtained from the multiplex analysis with subcellular localization and Gene Ontology information and is implemented using Saccharomyces cerevisiae datasets. The proposed method outperformed many recent essential protein prediction techniques in the literature.
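
As a rough illustration of ranking proteins across several interaction layers, the sketch below sums the layers into one weighted network and runs power iteration for eigenvector centrality. This is a deliberately simplified stand-in and is not the paper's MEC formulation. The function name, the layer contents, and the protein labels are hypothetical.

```python
def aggregated_eigenvector_centrality(layers, nodes, iterations=100):
    """Eigenvector centrality on the sum of several undirected edge layers."""
    # An edge present in k layers gets weight k in the aggregated network.
    adj = {n: {} for n in nodes}
    for layer in layers:
        for a, b in layer:
            adj[a][b] = adj[a].get(b, 0) + 1
            adj[b][a] = adj[b].get(a, 0) + 1
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Iterate with (A + I) rather than A: same leading eigenvector,
        # but no oscillation on bipartite graphs such as stars.
        new = {n: score[n] + sum(w * score[m] for m, w in adj[n].items()) for n in nodes}
        top = max(new.values())
        score = {n: v / top for n, v in new.items()}
    return score


# Hypothetical layers: physical, coexpression, phylogenetic-profile edges.
layers = [[("A", "B"), ("A", "C")], [("A", "B"), ("A", "D")], [("A", "C")]]
scores = aggregated_eigenvector_centrality(layers, ["A", "B", "C", "D"])
print(max(scores, key=scores.get))
```

In this toy network, protein A is connected in every layer and ranks highest, while D, supported by a single layer, ranks below B and C. A fuller pipeline would combine such a score with subcellular localization and Gene Ontology information, as the abstract describes.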