
Showing papers in "Journal of Bioinformatics and Computational Biology in 2018"


Journal ArticleDOI
TL;DR: A hybrid deep learning framework, two-dimensional convolutional bidirectional recurrent neural network (2C-BRNN), for improving the accuracy of 8-class secondary structure prediction and demonstrating that the proposed models can extract more meaningful features from the matrix of proteins, and the feature vector dimension is also useful for PSSP.
Abstract: Protein secondary structure prediction (PSSP) is an important research field in bioinformatics. The representation of protein sequence features could be treated as a matrix, which includes the amino-acid residue (time-step) dimension and the feature vector dimension. Common approaches to predict secondary structures only focus on the amino-acid residue dimension. However, the feature vector dimension may also contain useful information for PSSP. To integrate the information on both dimensions of the matrix, we propose a hybrid deep learning framework, two-dimensional convolutional bidirectional recurrent neural network (2C-BRNN), for improving the accuracy of 8-class secondary structure prediction. The proposed hybrid framework is to extract the discriminative local interactions between amino-acid residues by two-dimensional convolutional neural networks (2DCNNs), and then further capture long-range interactions between amino-acid residues by bidirectional gated recurrent units (BGRUs) or bidirectional long short-term memory (BLSTM). Specifically, our proposed 2C-BRNNs framework consists of four models: 2DConv-BGRUs, 2DCNN-BGRUs, 2DConv-BLSTM and 2DCNN-BLSTM. Among these four models, the 2DConv- models only contain two-dimensional (2D) convolution operations. Moreover, the 2DCNN- models contain 2D convolutional and pooling operations. Experiments are conducted on four public datasets. The experimental results show that our proposed 2DConv-BLSTM model performs significantly better than the benchmark models. Furthermore, the experiments also demonstrate that the proposed models can extract more meaningful features from the matrix of proteins, and the feature vector dimension is also useful for PSSP. The codes and datasets of our proposed methods are available at https://github.com/guoyanb/JBCB2018/ .

45 citations


Journal ArticleDOI
TL;DR: The results suggest that the computational approach used in this study is highly efficient for prediction of antifungal peptides, which can save time and money in AFP screening and synthesis of novel peptides.
Abstract: With the increase in immunocompromised patients in the recent years, fungal infections have emerged as new and serious threat in hospitals. This, and the insufficiency of current antifungal therapies alongside their toxic effects on patients, has led to the increased interest in seeking new antifungal peptides. In the present study, we have developed a prediction method for screening of antifungal peptides. For this, we have chosen Chou's pseudo amino acid composition (PseAAC) to translate peptide sequences into numeric values. Thus, the SVM classifier was performed for binomial classification of antifungal peptides. The performance of the classifier was evaluated via ten-fold cross-validation and an independent dataset. For further validation of the model developed, 22 P24-derived peptides were predicted using the classifier and in vitro assays were performed on the three peptides with the highest prediction score. The results showed that the PseAAC + SVM method is able to predict AFPs with ACC of 94.76%. In vitro results also validate the SEN and SPC of the classifier. The results suggest that the computational approach used in this study is highly efficient for prediction of antifungal peptides, which can save time and money in AFP screening and synthesis of novel peptides.

31 citations


Journal ArticleDOI
TL;DR: This paper proposes a TL approach for cancer drug sensitivity prediction that combines three techniques, and evaluates it against baseline approaches using the Area Under the receiver operating characteristic (ROC) Curve (AUC) on real clinical trial datasets pertaining to multiple myeloma, nonsmall cell lung cancer, triple-negative breast cancer, and breast cancer.
Abstract: Transfer learning (TL) algorithms aim to improve the prediction performance in a target task (e.g. the prediction of cisplatin sensitivity in triple-negative breast cancer patients) via transferring knowledge from auxiliary data of a related task (e.g. the prediction of docetaxel sensitivity in breast cancer patients), where the distribution and even the feature space of the data pertaining to the tasks can be different. In real-world applications, we sometimes have a limited training set in a target task while we have auxiliary data from a related task. To obtain a better prediction performance in the target task, supervised learning requires a sufficiently large training set in the target task to perform well in predicting future test examples of the target task. In this paper, we propose a TL approach for cancer drug sensitivity prediction, where our approach combines three techniques. First, we shift the representation of a subset of examples from auxiliary data of a related task to a representation closer to a target training set of a target task. Second, we align the shifted representation of the selected examples of the auxiliary data to the target training set to obtain examples with representation aligned to the target training set. Third, we train machine learning algorithms using both the target training set and the aligned examples. We evaluate the performance of our approach against baseline approaches using the Area Under the receiver operating characteristic (ROC) Curve (AUC) on real clinical trial datasets pertaining to multiple myeloma, nonsmall cell lung cancer, triple-negative breast cancer, and breast cancer. Experimental results show that our approach is better than the baseline approaches in terms of performance and statistical significance.

30 citations
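
The AUC used as the evaluation metric in the entry above can be computed directly from prediction scores and binary labels as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The following is a minimal Python sketch of that definition only; the function name `auc` and the toy scores are illustrative assumptions and do not come from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: predicted sensitivity scores (higher means more likely positive).
    labels: 1 for positive examples, 0 for negative examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from any clinical dataset.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
toy_labels = [1, 0, 1, 1, 0]
print(auc(toy_scores, toy_labels))
```

The pairwise form is quadratic in the number of examples, which is acceptable for small clinical cohorts; a rank-based implementation would be preferable for large datasets.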


Journal ArticleDOI
TL;DR: This paper proposes a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning-based weighted approach and achieves the best prediction accuracy compared with existing methods, and heavy overfitting cannot be observed.
Abstract: A popular approach for predicting RNA secondary structure is the thermodynamic nearest-neighbor model that finds a thermodynamically most stable secondary structure with minimum free energy (MFE). For further improvement, an alternative approach that is based on machine learning techniques has been developed. The machine learning-based approach can employ a fine-grained model that includes much richer feature representations with the ability to fit the training data. Although a machine learning-based fine-grained model achieved extremely high performance in prediction accuracy, a risk of overfitting for such a model has been reported. In this paper, we propose a novel algorithm for RNA secondary structure prediction that integrates the thermodynamic approach and the machine learning-based weighted approach. Our fine-grained model combines the experimentally determined thermodynamic parameters with a large number of scoring parameters for detailed contexts of features that are trained by the structured support vector machine (SSVM) with L1 regularization to avoid overfitting. Our benchmark shows that our algorithm achieves the best prediction accuracy compared with existing methods, and no heavy overfitting is observed. The implementation of our algorithm is available at https://github.com/keio-bioinformatics/mxfold .

30 citations


Journal ArticleDOI
TL;DR: Fayyaz et al. developed machine learning models to predict inter-species protein-protein interactions (PPIs), with a special interest in host-pathogen interactions.
Abstract: Detection of protein-protein interactions (PPIs) plays a vital role in molecular biology. Particularly, pathogenic infections are caused by interactions of host and pathogen proteins. It is important to identify host-pathogen interactions (HPIs) to discover new drugs to counter infectious diseases. Conventional wet lab PPI detection techniques have limitations in terms of cost and large-scale application. Hence, computational approaches are developed to predict PPIs. This study aims to develop machine learning models to predict inter-species PPIs with a special interest in HPIs. Specifically, we focus on seeking answers to three questions that arise while developing an HPI predictor: (1) How should negative training examples be selected? (2) Does assigning sample weights to individual negative examples based on their similarity to positive examples improve generalization performance? and, (3) What should be the size of negative samples as compared to the positive samples during training and evaluation? We compare two available methods for negative sampling: random versus DeNovo sampling and our experiments show that DeNovo sampling offers better accuracy. However, our experiments also show that generalization performance can be improved further by using a soft DeNovo approach that assigns sample weights to negative examples inversely proportional to their similarity to known positive examples during training. Based on our findings, we have also developed an HPI predictor called HOPITOR (Host-Pathogen Interaction Predictor) that can predict interactions between human and viral proteins. The HOPITOR web server can be accessed at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#HoPItor .

24 citations
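
The soft DeNovo idea in the entry above, weighting each negative training example inversely to its similarity to known positives, can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity values are taken as given, and the function name `soft_negative_weights`, the `eps` smoothing term, and the normalization to a mean weight of 1 are hypothetical choices, not details from the paper.

```python
def soft_negative_weights(similarities, eps=1e-6):
    """Weights for negative examples, inversely proportional to similarity.

    similarities: for each negative example, a non-negative similarity to
    the known positive examples (how it is computed is an assumption here).
    eps: small constant that avoids division by zero (hypothetical choice).
    """
    if any(s < 0 for s in similarities):
        raise ValueError("similarities must be non-negative")
    raw = [1.0 / (s + eps) for s in similarities]
    # Rescale so the weights average to 1 (an illustrative normalization).
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical similarities of three negative pairs to the positive set.
weights = soft_negative_weights([0.1, 0.5, 0.9])
print(weights)
```

Negatives that closely resemble known positives receive small weights, so a possibly mislabeled negative contributes less to the training loss than a clearly dissimilar one.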


Journal ArticleDOI
TL;DR: The proposed approach obviates the need to understand the underlying drug mechanism to predict drug combination synergy and results show that the Random forest models, in comparison to other models, have shown significant performance.
Abstract: Combination drug therapy is considered a better treatment option for various diseases, such as cancer, HIV, hypertension, and infections as compared to targeted drug therapies. Combination or synergism helps to overcome drug resistance, reduction in drug toxicity and dosage. Considering the complexity and heterogeneity among cancer types, drug combination provides promising treatment strategy. Increase in drug combination data raises a challenge for developing a computational approach that can effectively predict drugs synergism. There is a need to model the combination drug screening data to predict new synergistic drug combinations for successful cancer treatment. In such a scenario, machine learning approaches can be used to alleviate the process of drugs synergy prediction. Experimental data from a single-agent or multi-agent drug screens provides feature data for model training. On the contrary, identification of effective drug combination using clinical trials is a time consuming and resource intensive task. This paper attempts to address the aforementioned challenges by developing a computational approach to effectively predict drug synergy. Single-drug efficacy is used for predicting drug synergism. Our approach obviates the need to understand the underlying drug mechanism to predict drug combination synergy. For this purpose, nine machine learning algorithms are trained. It is observed that the Random forest models, in comparison to other models, have shown significant performance. The K -fold cross-validation is performed to evaluate the robustness of the best predictive model. The proposed approach is applied to mutant-BRAF melanoma and further validated using melanoma cell-lines from AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge dataset.

16 citations


Journal ArticleDOI
TL;DR: A methodology, Multi Label Protein Function Prediction (ML_PFP) is proposed which is based on Neighborhood analysis empowered with physico-chemical features of constituent amino acids to predict the functional group of unannotated protein.
Abstract: Protein Function Prediction from Protein–Protein Interaction Network (PPIN) and physico-chemical features using the Gene Ontology (GO) classification are indeed very useful for assigning biological

16 citations


Journal ArticleDOI
TL;DR: An in silico search algorithm designed to discover toxin-like proteins containing AMPs was developed based on the evaluation of the properties and structural peculiarities of amino acid sequences, and three peptides exhibited antimicrobial activity against bacterial strains, suggesting that the method can be applied to reveal new AMPs in the venoms of other predators as well.
Abstract: As essential conservative component of the innate immune systems of living organisms, antimicrobial peptides (AMPs) could complement pharmaceuticals that increasingly fail to combat various pathogens exhibiting increased resistance to microbial antibiotics. Among the properties of AMPs that suggest their potential as therapeutic agents, diverse peptides in the venoms of various predators demonstrate antimicrobial activity and kill a wide range of microorganisms. To identify potent AMPs, the study reported here involved a transcriptomic profiling of the tentacle secretion of the sea anemone Cnidopus japonicus. An in silico search algorithm designed to discover toxin-like proteins containing AMPs was developed based on the evaluation of the properties and structural peculiarities of amino acid sequences. The algorithm revealed new proteins of the anemone containing antimicrobial candidate sequences, and 10 AMPs verified using high-throughput proteomics were synthesized. The antimicrobial activity of the candidate molecules was experimentally estimated against Gram-positive and -negative bacteria. Ultimately, three peptides exhibited antimicrobial activity against bacterial strains, which suggests that the method can be applied to reveal new AMPs in the venoms of other predators as well.

13 citations


Journal ArticleDOI
TL;DR: Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method.
Abstract: Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.

12 citations
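
The Box-Cox transformation mentioned in the entry above maps a positive value x to (x^λ - 1)/λ for λ ≠ 0 and to ln(x) for λ = 0. The following minimal Python sketch applies that standard formula; the function name `box_cox` and the fixed λ values are illustrative assumptions, since the abstract does not say how λ was chosen.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox power transformation to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # Limiting case of (x**lam - 1) / lam as lam approaches 0.
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# Hypothetical expression values; lam = 0.5 is an arbitrary example.
print(box_cox([1.0, 4.0, 9.0], 0.5))
```

In practice λ is usually estimated from the data, for example by maximum likelihood, rather than fixed by hand.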


Journal ArticleDOI
TL;DR: The methods of network pharmacology were used to select 83 potential pharmacological targets linked to the selected genes and the most promising targets were chosen based on analysis of published data, giving additional evidence that the approach applied is rather promising.
Abstract: Epilepsy is the fourth most common neurological disease after migraine, stroke, and Alzheimer’s disease. Approximately one-third of all epilepsy cases are refractory to the existing anticonvulsants. Thus, there is an unmet need for newer antiepileptic drugs (AEDs) to manage refractory epilepsy (RE). Discovery of novel AEDs for the treatment of RE further retards for want of potential pharmacological targets, unavailable due to unclear etiology of this disease. In this regard, network pharmacology as an area of bioinformatics is gaining popularity. It combines the methods of network biology and polypharmacology, which makes it a promising approach for finding new molecular targets. This work is aimed at discovering new pharmacological targets for the treatment of RE using network pharmacology methods. In the framework of our study, the genes associated with the development of RE were selected based on analysis of available data. The methods of network pharmacology were used to select 83 potential pharmacol...

12 citations


Journal ArticleDOI
TL;DR: This study constructed computational models for host tropism prediction on human-adapted subtypes of influenza HA proteins using random forest, and identified secondary structure and normalized Van der Waals volume as the more important physicochemical signatures in determining host tropism.
Abstract: Avian influenza viruses from migratory birds have managed to cross host species barriers and infected various hosts like human and swine. Epidemics and pandemics might occur when influenza viruses are adapted to humans, causing deaths and enormous economic loss. Receptor-binding specificity of the virus is one of the key factors for the transmission of influenza viruses across species. The determination of host tropism and understanding of molecular properties would help identify the mechanism why zoonotic influenza viruses can cross species barrier and infect humans. In this study, we have constructed computational models for host tropism prediction on human-adapted subtypes of influenza HA proteins using random forest. The feature vectors of the prediction models were generated based on seven physicochemical properties of amino acids from influenza sequences of three major hosts. Feature aggregation and associative rules were further applied to select top 20 features and extract host-associated physicochemical signatures on the combined model of nonspecific subtypes. The prediction model achieved high performance (Accuracy = 0.948, Precision = 0.954, MCC = 0.922). Support and confidence rates were calculated for the host class-association rules. The results indicated that secondary structure and normalized Van der Waals volume were identified as more important physicochemical signatures in determining the host tropism.
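
For readers unfamiliar with the reported metrics, accuracy, precision, and the Matthews correlation coefficient (MCC) can all be derived from a confusion matrix. The sketch below shows the binary-class formulas only, as a simplifying assumption; the study above covers three hosts, and the abstract does not state how its multi-class values were aggregated. The function name `binary_metrics` and the counts are hypothetical.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and MCC from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "mcc": mcc}


# Hypothetical counts, not results from the paper.
print(binary_metrics(tp=45, tn=40, fp=5, fn=10))
```

MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced, which is one reason it is often reported alongside accuracy.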

Journal ArticleDOI
TL;DR: This work proposes to build task-specific SFs that model binding affinities as well as conformations using the root mean square deviation of a ligand pose from the native pose, and finds that ensemble models based on NNs surpass SFs based on other state-of-the-art ML algorithms.
Abstract: Predicting the native poses of ligands correctly is one of the most important steps towards successful structure-based drug design. Binding affinities (BAs) estimated by traditional scoring functions (SFs) are typically used to score and rank-order poses to select the most promising conformation. This BA-based approach is widely applied and some success has been reported, but it is inconsistent and still far from perfect. The main reason for this is that SFs are trained on experimental BA values of only native poses found in co-crystallized structures of protein-ligand complexes (PLCs). However, during docking, they are needed to discriminate between native and decoy poses, a task for which they have not been specifically designed. To overcome this limitation, we propose to build task-specific SFs that model binding affinities (scoring task) as well as conformations (docking task) using the root mean square deviation (RMSD) of a ligand pose from the native pose. Our models are based on boosted ensembles of neural networks and other state-of-the-art machine learning (ML) algorithms in conjunction with multi-perspective interaction modeling techniques for accurate characterization of PLCs. We assess the docking and scoring/ranking accuracies of the proposed ML SFs as well as three conventional SFs in the context of the 2014 CSAR benchmark exercise that encompasses three high-quality protein systems and a diverse set of drug-like molecules. Our proposed docking-specific SFs provide a substantial improvement in the docking task. We find that RMSD-based SFs for BsN, an ensemble neural networks (NN) model based on boosting, and six other ML models provide more than 120% improvement, on average, over their BA-based counterparts. In terms of scoring/ranking accuracy, we find that the approach of using RMSD-based BsN to select the top ligand pose followed by applying BA-based BsN to rank ligands using predicted BA scores leads to consistent and correctly ranked ligands for the two protein targets Spleen Tyrosine Kinase (SYK) and tRNA (m1G37) methyltransferase (TrmD). In addition, the ensemble NN SF BsN is at least 250% more accurate than a single neural network (SNN) model. We further find that ensemble models based on NNs surpass SFs based on other state-of-the-art ML algorithms such as BRT, RF, SVM, and kNN. Finally, our RF model fitted to PLCs characterized by multiple sets of descriptors from four different sources (X-Score, AffiScore, RF-Score, and GOLD) substantially outperforms the SF RF-Score that uses only one set of features, underlining the value of multi-perspective modeling.
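
The RMSD target used by the docking-specific scoring functions above measures how far a candidate ligand pose lies from the native pose. A minimal Python sketch of the quantity follows, assuming both poses list the same atoms in the same order and already share a coordinate frame; the function name `rmsd` and the toy coordinates are hypothetical, and the sketch does not handle superposition or symmetry-equivalent atoms.

```python
import math


def rmsd(pose, native):
    """Root mean square deviation between two equally ordered atom lists.

    pose, native: sequences of (x, y, z) coordinates, one tuple per atom.
    """
    if len(pose) != len(native) or not pose:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (a - b) ** 2
        for atom_p, atom_n in zip(pose, native)
        for a, b in zip(atom_p, atom_n)
    )
    return math.sqrt(squared / len(pose))


# Hypothetical three-atom ligand; the decoy is the native pose shifted by 1 along x.
native_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
decoy_pose = [(x + 1.0, y, z) for x, y, z in native_pose]
print(rmsd(decoy_pose, native_pose))
```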

Journal ArticleDOI
TL;DR: A new statistical approach for gene-based GGI analysis, "Hierarchical structural CoMponent analysis of Gene-Gene Interactions" (HisCoM-GGI), which is based on generalized structured component analysis, and can consider hierarchical structural relationships between genes and SNPs.
Abstract: Although genome-wide association studies (GWAS) have successfully identified thousands of single nucleotide polymorphisms (SNPs) associated with common diseases, these observations are limited for fully explaining "missing heritability". Determining gene-gene interactions (GGI) is one possible avenue for addressing the missing heritability problem. While many statistical approaches have been proposed to detect GGI, most of these focus primarily on SNP-to-SNP interactions. While there are many advantages of gene-based GGI analyses, such as reducing the burden of multiple-testing correction, and increasing power by aggregating multiple causal signals across SNPs in specific genes, only a few methods are available. In this study, we proposed a new statistical approach for gene-based GGI analysis, "Hierarchical structural CoMponent analysis of Gene-Gene Interactions" (HisCoM-GGI). HisCoM-GGI is based on generalized structured component analysis, and can consider hierarchical structural relationships between genes and SNPs. For a pair of genes, HisCoM-GGI first effectively summarizes all possible pairwise SNP-SNP interactions into a latent variable, from which it then performs GGI analysis. HisCoM-GGI can evaluate both gene-level and SNP-level interactions. Through simulation studies, HisCoM-GGI demonstrated higher statistical power than existing gene-based GGI methods. In analyzing a GWAS of a Korean population to identify GGI associated with body mass index, HisCoM-GGI successfully identified 14 potential GGI, two of which, (NCOR2 × SPOCK1) and (LINGO2 × ZNF385D), were successfully replicated in independent datasets. We conclude that the HisCoM-GGI method may be a valuable tool for identifying GGI in missing heritability, allowing us to better understand the biological genetic mechanisms of complex traits. An implementation of HisCoM-GGI can be downloaded from the website ( http://statgen.snu.ac.kr/software/hiscom-ggi ).

Journal ArticleDOI
TL;DR: A new computational pipeline, ASSA, is described that combines sequence alignment and thermodynamics-based tools for efficient prediction of RNA-RNA interactions between long transcripts; the study also emphasizes a unique property of the [Formula: see text] repeats with respect to the RNA-RNA interactions in the human transcriptome.
Abstract: The discovery of thousands of long noncoding RNAs (lncRNAs) in mammals raises a question about their functionality. It has been shown that some of them are involved in post-transcriptional regulation of other RNAs and form inter-molecular duplexes with their targets. Sequence alignment tools have been used for transcriptome-wide prediction of RNA–RNA interactions. However, such approaches have poor prediction accuracy since they ignore RNA’s secondary structure. Application of the thermodynamics-based algorithms to long transcripts is not computationally feasible on a large scale. Here, we describe a new computational pipeline ASSA that combines sequence alignment and thermodynamics-based tools for efficient prediction of RNA–RNA interactions between long transcripts. To measure the hybridization strength, the sum energy of all the putative duplexes is computed. The main novelty implemented in ASSA is the ability to quickly estimate the statistical significance of the observed interaction energies. Most o...

Journal ArticleDOI
TL;DR: Results show that the proposed system is competitive against other systems for the task of extracting DDIs, and that significant improvements can be achieved by learning from word features and using a deep-learning approach.
Abstract: Information on changes in a drug's effect when taken in combination with a second drug, known as drug-drug interaction (DDI), is relevant in the pharmaceutical industry. DDIs can delay, decrease, or enhance absorption of either drug and thus decrease or increase their action or cause adverse effects. Information Extraction (IE) can be of great benefit in allowing identification and extraction of relevant information on DDIs. We here propose an approach for the extraction of DDI from text using neural word embedding to train a machine learning system. Results show that our system is competitive against other systems for the task of extracting DDIs, and that significant improvements can be achieved by learning from word features and using a deep-learning approach. Our study demonstrates that machine learning techniques such as neural networks and deep learning methods can efficiently aid in IE from text. Our proposed approach is well suited to play a significant role in future research.

Journal ArticleDOI
TL;DR: Numerical results on BAliBASE benchmark have shown the effectiveness of the proposed PSOSA method and its ability to achieve good quality solutions when compared with those given by other existing methods.
Abstract: In this work, a novel hybrid model called PSOSA for solving multiple sequence alignment (MSA) problem is proposed. The developed approach is a combination between particle swarm optimization (PSO) ...

Journal ArticleDOI
TL;DR: A systematic approach is presented that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices and demonstrates that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.
Abstract: Predicting promoter activity of DNA fragment is an important task for computational biology. Approaches using physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only sequence and static physical properties of DNA fragment, but also dynamic properties of DNA open states. Therefore, the best performing models with accuracy values up to 90% for all types of sequences were obtained. Furthermore, we have demonstrated that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.

Journal ArticleDOI
TL;DR: A simple extension of the standard HMM in which the current observed symbol (amino acid residue) depends both on the current state and on a series of observed previous symbols, using an extended alphabet.
Abstract: Hidden Markov Models (HMMs) are probabilistic models widely used in computational molecular biology. However, the Markovian assumption regarding transition probabilities which dictates that the observed symbol depends only on the current state may not be sufficient for some biological problems. In order to overcome the limitations of the first order HMM, a number of extensions have been proposed in the literature to incorporate past information in HMMs conditioning either on the hidden states, or on the observations, or both. Here, we implement a simple extension of the standard HMM in which the current observed symbol (amino acid residue) depends both on the current state and on a series of observed previous symbols. The major advantage of the method is the simplicity in the implementation, which is achieved by properly transforming the observation sequence, using an extended alphabet. Thus, it can utilize all the available algorithms for the training and decoding of HMMs. We investigated the use of several encoding schemes and performed tests in a number of important biological problems previously studied by our team (prediction of transmembrane proteins and prediction of signal peptides). The evaluation shows that, when enough data are available, the performance increased by 1.8%-8.2% and the existing prediction methods may improve using this approach. The methods, for which the improvement was significant (PRED-TMBB2, PRED-TAT and HMM-TM), are available as web-servers freely accessible to academic users at www.compgen.org/tools/ .
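
The observation-sequence transformation described in the entry above can be illustrated with a short sketch: each residue is replaced by a composite symbol that also carries its preceding residues, so a standard HMM trained on the extended alphabet implicitly conditions emissions on past observations. This is one hypothetical encoding only, since the abstract mentions several schemes without detailing them; the function name `extend_alphabet`, the `order` parameter, and the `-` padding symbol are assumptions.

```python
def extend_alphabet(sequence, order=1, pad="-"):
    """Re-encode a sequence so each symbol includes its `order` predecessors.

    sequence: string of one-letter residue codes.
    order: number of previous symbols folded into each new symbol.
    pad: placeholder for positions before the start of the sequence.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    padded = pad * order + sequence
    # The symbol at position i spans the previous `order` residues plus residue i.
    return [padded[i:i + order + 1] for i in range(len(sequence))]


# Hypothetical peptide fragment.
print(extend_alphabet("MKV", order=1))  # ['-M', 'MK', 'KV']
```

The transformed sequence has the same length as the original, so existing training and decoding algorithms apply unchanged. The cost is a larger emission alphabet (roughly 20^(order+1) symbols for proteins), which is consistent with the abstract's remark that gains appear when enough data are available.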

Journal ArticleDOI
TL;DR: Functional comparison of the core genes of the two genera revealed a significant difference in the categories "amino acid transport and metabolism", representing their difference in niche specificity; the findings validate the bias-resilient definition of the core genome.
Abstract: The commensal genus Bifidobacterium has probiotic properties. We prepared a public library of the gene functions of the genus Bifidobacterium for its online annotation. Orthologous gene cluster analysis showed that the pan genomes of Bifidobacterium and Lactobacillus exhibit striking similarities when mapped to the Clusters of Orthologous Group (COG) database of proteins. When the core genes in each genus were selected based on our statistical definition of "core genome", core genes were present in at least 92% of 52 Bifidobacterium and in 97% of 178 Lactobacillus genomes. Functional comparison of the core genes of the two genera revealed a significant difference in the categories "amino acid transport and metabolism" representing their difference in niche specificity. Over-represented Bifidobacterium protein families were primarily involved in host interactions, the complex compound metabolism, and in stress responses. These findings coincide with the published information and validate our bias-resilient definition of the core genome.

Journal ArticleDOI
TL;DR: An efficient and scalable network Motif Discovery algorithm based on Expansion Tree (MODET) is proposed, which outperforms most of the existing network motif discovery algorithms.
Abstract: Networks are a powerful representation of topological features in biological systems such as protein interaction and gene regulation. In order to understand the design principles of such complex networks, the concept of network motifs emerged. Network motifs are recurrent patterns with statistical significance that can be seen as basic building blocks of complex networks. Identification of network motifs leads to many important applications, such as understanding the modularity and the large-scale structure of biological networks, classification of networks into super-families, protein function annotation, etc. However, identification of network motifs is challenging as it involves graph isomorphism, which is computationally hard. Though this problem has been studied extensively in the literature using different computational approaches, we are far from satisfactory results. Motivated by the challenges involved in this field, an efficient and scalable network Motif Discovery algorithm based on Expansion Tree (MODET) is proposed. A pattern growth approach is used in this motif-centric algorithm. Each node of the expansion tree represents a non-isomorphic pattern. The embeddings corresponding to a child node of the expansion tree are obtained from the embeddings of the parent node through vertex addition and edge addition. Further, the proposed algorithm does not involve any graph isomorphism check, and the time complexities of vertex addition and edge addition are O(n) and O(1), respectively. The proposed algorithm has been tested on a Protein-Protein Interaction (PPI) network obtained from the MINT database. The proposed algorithm is computationally more efficient than most of the existing network motif discovery algorithms.

Journal ArticleDOI
TL;DR: A new variable selection strategy based on a selection probability that measures selection frequency of individual variables selected by both SGL and network-based regularization is proposed and applied to identify differentially methylated CpG sites and their corresponding genes from ovarian cancer data.
Abstract: In genetic association studies, regularization methods are often used due to their computational efficiency for analysis of high-dimensional genomic data. DNA methylation data generated from the Infinium HumanMethylation450 BeadChip Kit have a group structure where an individual gene consists of multiple Cytosine-phosphate-Guanine (CpG) sites. Consequently, group-based regularization can precisely detect outcome-related CpG sites. Representative examples are sparse group lasso (SGL) and network-based regularization. The former is powerful when most of the CpG sites within the same gene are associated with a phenotype outcome. In contrast, the latter is preferred when only a few of the CpG sites within the same gene are related to the outcome. In this paper, we propose a new variable selection strategy based on a selection probability that measures the selection frequency of individual variables selected by both SGL and network-based regularization. In an extensive simulation study, we demonstrated that the proposed strategy shows better selection performance across the simulated situations than either SGL or network-based regularization alone. We also applied the proposed strategy to identify differentially methylated CpG sites and their corresponding genes from ovarian cancer data.

Journal ArticleDOI
TL;DR: Consistent with all previous studies, proteins encoded by multifunctional genes, based on the proposed method, are involved in protein-protein interactions significantly more ([Formula: see text]) than other proteins.
Abstract: Multifunctional genes are important genes because of their essential roles in human cells. Studying and analyzing multifunctional genes can help understand disease mechanisms and drug discovery. We propose a computational method for scoring gene multifunctionality based on functional annotations of the target gene from the Gene Ontology. The method is based on identifying pairs of GO annotations that represent semantically different biological functions, and any gene annotated with two annotations from one pair is considered multifunctional. The proposed method can be employed to identify multifunctional genes in the entire human genome using solely the GO annotations. We evaluated the proposed method in scoring multifunctionality of all human genes using four criteria: gene-disease associations; protein-protein interactions; gene studies with PubMed publications; and published known multifunctional gene sets. The evaluation results confirm the validity and reliability of the proposed method for identifying multifunctional human genes. The results across all four evaluation criteria were statistically significant in determining multifunctionality. For example, the method confirmed that multifunctional genes tend to be associated with diseases more than other genes, with significance [Formula: see text]. Moreover, consistent with all previous studies, proteins encoded by multifunctional genes, based on our method, are involved in protein-protein interactions significantly more ([Formula: see text]) than other proteins.
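The pair-based rule in this abstract can be sketched in a few lines. The example below assumes that a set of semantically different annotation pairs has already been identified by some other means. The function names, the placeholder term labels (`term_a`, `term_b`, `term_c`), and the choice to use the number of matched pairs as a score are all assumptions for illustration; the abstract does not state how the score is computed.

```python
def matched_pairs(annotations, dissimilar_pairs):
    """Return the dissimilar pairs whose two terms both annotate the gene."""
    terms = set(annotations)
    return [(a, b) for a, b in dissimilar_pairs if a in terms and b in terms]


def is_multifunctional(annotations, dissimilar_pairs):
    """A gene counts as multifunctional if it carries both terms of any pair."""
    return len(matched_pairs(annotations, dissimilar_pairs)) > 0


def multifunctionality_score(annotations, dissimilar_pairs):
    """Hypothetical score: how many dissimilar pairs the gene covers."""
    return len(matched_pairs(annotations, dissimilar_pairs))


if __name__ == "__main__":
    # Placeholder labels, not real GO identifiers.
    pairs = [("term_a", "term_b"), ("term_a", "term_c")]
    print(is_multifunctional(["term_a", "term_b"], pairs))
    print(multifunctionality_score(["term_a", "term_b", "term_c"], pairs))
```

In this form the only inputs are the per-gene annotation lists and the pair set, which mirrors the abstract's point that the method relies solely on GO annotations.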

Journal ArticleDOI
TL;DR: Cross-validation experiments on two difficult benchmarks demonstrate that the dimension of the input space can be reduced substantially while maintaining the prediction accuracy, enabling the incorporation of additional informative features derived for predicting the structural properties of proteins without reducing the accuracy due to overfitting.
Abstract: Secondary structure and solvent accessibility prediction provide valuable information for estimating the three dimensional structure of a protein. As new feature extraction methods are developed th...

Journal ArticleDOI
TL;DR: This special issue presents materials from the 9th International Young Scientists School on Systems Biology and Bioinformatics (SBB'2017), organized in June 2017 in Yalta, Russia, and the works of Belyaev Conference-2017 and SBB'2017 School on computational biology were recently covered in special issues of several international journals.
Abstract: This special issue presents materials from the 9th International Young Scientists School on Systems Biology and Bioinformatics (SBB'2017), organized in June 2017 in Yalta, Russia. The Institute of Cytology and Genetics of the Siberian Branch of the Russian Academy of Sciences (ICG SB RAS) hosts the International Multi-conference on Bioinformatics of Genome Regulation and Structure\Systems Biology (BGRS\SB) every two years beginning from 1998. From BGRS\SB'2008 onwards, the Young Scientists School on Systems Biology and Bioinformatics (SBB) runs as a satellite event following the BGRS\SB conference or as a standalone annual event (http://conf.bionet.nsc.ru/sbb2017/en/archive/). Since the first meeting, the SBB has grown to a large international event. Gradually, the initial focus has been extended from systems biology and classical bioinformatics topics to gene network analysis and reconstruction, and omics technologies. The Journal of Bioinformatics and Computational Biology (JBCB) publishes special issues on bioinformatics, algorithms, network analysis dedicated to BGRS\SB. The first JBCB special issue in 2006 highlighted BGRS\SB-2006. Then JBCB published special issues on the 2012, 2014, and 2016 conferences. Additionally, the journal publishes reports from the SBB schools. For instance, JBCB has published proceedings of SBB-2015 on modeling of gene network based on material presented at earlier BGRS\SB meetings. To continue traditions of the BGRS conference series in 2017, the Institute of Cytology and Genetics SB RAS organized the Belyaev Conference-2017 on genetics and evolution, dedicated to the 100th anniversary of Academician, Professor Dmitry K. Belyaev (1917–1985), an outstanding scientist, evolutionist and geneticist.
The works of Belyaev Conference-2017 and SBB'2017 School on computational biology were recently covered in special issues of several international journals: the BMC Evolutionary Biology, BMC Genetics, BMC Plant Biology, BMC Genomics, BMC Neuroscience and Vavilov Journal of Selection and Breeding (http://vavilov.elpub.ru/jour/issue/view/32/showToc). Journal of Bioinformatics and Computational Biology Vol. 16, No. 1 (2018) 1802001 (5 pages) © World Scientific Publishing Europe Ltd. DOI: 10.1142/S0219720018020018

Journal ArticleDOI
TL;DR: A new method called GI-Cluster is developed, which provides an effective way to integrate multiple GI-related features via consensus clustering and is widely applicable, either to complete and incomplete genomes or to initial GI predictions from other programs.
Abstract: The accurate detection of genomic islands (GIs) in microbial genomes is important for both evolutionary study and medical research, because GIs may promote genome evolution and contain genes involved in pathogenesis. Various computational methods have been developed to predict GIs over the years. However, most of them cannot make full use of GI-associated features to achieve desirable performance. Additionally, many methods cannot be directly applied to newly sequenced genomes. We develop a new method called GI-Cluster, which provides an effective way to integrate multiple GI-related features via consensus clustering. GI-Cluster does not require training datasets or existing genome annotations, but it can still achieve comparable or better performance than supervised learning methods in comprehensive evaluations. Moreover, GI-Cluster is widely applicable, either to complete and incomplete genomes or to initial GI predictions from other programs. GI-Cluster also provides plots to visualize the distribution of predicted GIs and related features. GI-Cluster is available at https://github.com/icelu/GI_Cluster

Journal ArticleDOI
TL;DR: This analysis shows that introgression of alleles from Neanderthals and Denisovans to Papuans occurred independently and retention of these alleles may carry specific adaptive advantages.
Abstract: Sequencing of complete nuclear genomes of Neanderthal and Denisovan stimulated studies about their relationship with modern humans demonstrating, in particular, that DNA alleles from both Neanderth...

Journal ArticleDOI
TL;DR: Eight hits able to mimic pharmacophore properties of bNAb 10E8 by specific and effective interactions with the MPER region of the HIV-1 protein gp41 were selected as the most probable 10E8-mimetic candidates.
Abstract: An integrated computational approach to in silico drug design was used to identify novel HIV-1 fusion inhibitor scaffolds mimicking broadly neutralizing antibody (bNAb) 10E8 targeting the membrane proximal external region (MPER) of the HIV-1 gp41 protein. This computer-based approach included (i) generation of pharmacophore models representing 3D-arrangements of chemical functionalities that make bNAb 10E8 active towards the gp41 MPER segment, (ii) shape and pharmacophore-based identification of the 10E8-mimetic candidates by a web-oriented virtual screening platform pepMMsMIMIC, (iii) high-throughput docking of the identified compounds with the gp41 MPER peptide, and (iv) molecular dynamics simulations of the docked structures followed by binding free energy calculations. As a result, eight hits able to mimic pharmacophore properties of bNAb 10E8 by specific and effective interactions with the MPER region of the HIV-1 protein gp41 were selected as the most probable 10E8-mimetic candidates. Similar to 10E8, the predicted compounds target the critically important residues of a highly conserved hinge region of the MPER peptide that provides the conformational flexibility necessary for its functioning in the cell-virus membrane fusion process. In light of the data obtained, the identified small molecules may present promising HIV-1 fusion inhibitor scaffolds for the design of novel potent antiviral drugs.

Journal ArticleDOI
TL;DR: This paper revisits the genomic scaffold filling problem by considering this important case when a scaffold is given, and presents a simple NP-completeness proof, and a factor-2 approximation algorithm.
Abstract: The genomic scaffold filling problem has attracted a lot of attention recently. The problem is on filling an incomplete sequence (scaffold) I into I′, with respect to a complete reference genome G,...

Journal ArticleDOI
TL;DR: This study proposes a novel ensemble computational framework, termed ProBAPred (Protein-protein Binding Affinity Predictor), for quantitative estimation of protein-protein binding affinity, and develops regression models that can facilitate computational characterization and experimental studies of protein-protein binding affinity.
Abstract: Protein-protein binding interaction is the most prevalent biological activity that mediates a great variety of biological processes. The increasing availability of experimental data of protein–prot...

Journal ArticleDOI
TL;DR: Experimental results show that the method provides similar or better accuracy than other algorithms reported over a wider dynamic range, and takes fluorescence compensation in duplex assays into account.
Abstract: In the digital polymerase chain reaction (dPCR) detection process, discriminating positive droplets from negative ones directly affects the final concentration and is one of the most important factors affecting accuracy. Current automated classification methods usually discuss single-channel detections, whereas duplex detection experiments are less discussed. In this paper, we designed a classification method by estimating the upper limit of the negative droplets. The right tail of the negative droplets is approximated using a generalized Pareto distribution. Furthermore, our method takes fluorescence compensation in duplex assays into account. We also demonstrate the method on Bio-Rad's mutant detection dataset. Experimental results show that the method provides similar or better accuracy than other algorithms reported over a wider dynamic range.
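The idea of bounding the negative droplets with a generalized Pareto tail can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the negative-cluster amplitudes and a tail starting point (`tail_start`) are already known, fits the distribution with the method of moments (an estimator chosen here for simplicity), and takes a high quantile `p` as the upper limit. All function and parameter names are hypothetical, and fluorescence compensation for duplex assays is not covered.

```python
import math
import statistics


def fit_gpd_moments(excesses):
    """Method-of-moments estimates (shape, scale) for a generalized Pareto fit."""
    if len(excesses) < 2:
        raise ValueError("need at least two excesses over the tail start")
    mean = statistics.mean(excesses)
    var = statistics.variance(excesses)
    if var <= 0:
        raise ValueError("excesses must not all be identical")
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (ratio + 1.0)
    return shape, scale


def gpd_quantile(shape, scale, p):
    """Quantile of the generalized Pareto distribution at probability p."""
    if abs(shape) < 1e-9:
        # Shape near zero reduces to the exponential distribution.
        return -scale * math.log(1.0 - p)
    return scale / shape * ((1.0 - p) ** (-shape) - 1.0)


def negative_upper_limit(negatives, tail_start, p=0.999):
    """Estimate an upper amplitude limit for negative droplets.

    Amplitudes above `tail_start` are treated as the right tail; the
    limit is the p-quantile of the fitted tail, shifted back by tail_start.
    """
    excesses = [x - tail_start for x in negatives if x > tail_start]
    shape, scale = fit_gpd_moments(excesses)
    return tail_start + gpd_quantile(shape, scale, p)
```

Droplets with amplitude above the returned limit would then be called positive. In practice the choice of `tail_start` and `p` strongly affects the result, so both would need to be tuned or validated against reference data.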