Showing papers in "Journal of Bioinformatics and Computational Biology in 2015"


Journal ArticleDOI
TL;DR: A new method called PSOVina is presented, which combines the particle swarm optimization (PSO) algorithm with the efficient Broyden-Fletcher-Goldfarb-Shanno local search method adopted in AutoDock Vina to tackle the conformational search problem in docking.
Abstract: Protein-ligand docking is an essential step in the modern drug discovery process. The challenge here is to accurately predict and efficiently optimize the position and orientation of ligands in the binding pocket of a target protein. In this paper, we present a new method called PSOVina, which combines the particle swarm optimization (PSO) algorithm with the efficient Broyden-Fletcher-Goldfarb-Shanno (BFGS) local search method adopted in AutoDock Vina to tackle the conformational search problem in docking. Using a diverse data set of 201 protein-ligand complexes from the PDBbind database and a full set of ligands and decoys for four representative targets from the Directory of Useful Decoys (DUD) virtual screening data set, we assessed the docking performance of PSOVina in comparison to the original Vina program. Our results showed that PSOVina achieves a remarkable execution time reduction of 51-60% without compromising the prediction accuracies in the docking and virtual screening experiments. This improvement in time efficiency makes PSOVina a better choice of docking tool in large-scale protein-ligand docking applications. Our work lays the foundation for the future development of swarm-based algorithms in molecular docking programs. PSOVina is freely available to non-commercial users at http://cbbio.cis.umac.mo .

62 citations
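
The hybrid search idea in the abstract above (a global swarm move followed by a local refinement of each particle) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy objective `sphere` stands in for a docking score, the swarm parameters are arbitrary, and a plain finite-difference gradient descent is swapped in for the BFGS local search. It is not the PSOVina implementation.

```python
import random

def sphere(x):
    """Toy objective standing in for a docking score (lower is better)."""
    return sum(v * v for v in x)

def local_search(f, x, steps=20, lr=0.1, h=1e-6):
    """Finite-difference gradient descent; a simple stand-in for BFGS."""
    x = list(x)
    for _ in range(steps):
        fx = f(x)
        grad = []
        for i in range(len(x)):
            shifted = x[:]
            shifted[i] += h
            grad.append((f(shifted) - fx) / h)
        candidate = [xi - lr * g for xi, g in zip(x, grad)]
        if f(candidate) < fx:
            x = candidate      # accept an improving step
        else:
            lr *= 0.5          # otherwise shrink the step size
    return x

def pso_with_local_search(f, dim=3, n_particles=8, iters=30, seed=0):
    """Particle swarm search where every particle is refined locally."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]           # best position seen by each particle
    gbest = min(pbest, key=f)[:]          # best position seen by the swarm
    w, c1, c2 = 0.7, 1.5, 1.5             # illustrative inertia and pull weights
    for _ in range(iters):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                pos[k][d] += vel[k][d]
            pos[k] = local_search(f, pos[k])   # local refinement step
            if f(pos[k]) < f(pbest[k]):
                pbest[k] = pos[k][:]
            if f(pbest[k]) < f(gbest):
                gbest = pbest[k][:]
    return gbest, f(gbest)

if __name__ == "__main__":
    best, score = pso_with_local_search(sphere)
    print(best, score)
```

In a real docking setting the particle would encode ligand position, orientation and torsions, and the objective would be the scoring function; the sketch only shows how the two search levels interleave.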


Journal ArticleDOI
TL;DR: ExAtlas compares multi-component data sets and generates results for all combinations and provides a variety of tools for meta-analyses, including heatmaps, scatter-plots, bar-charts, and three-dimensional images.
Abstract: We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users' own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher's methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein-protein interaction) are pre-loaded and can be used for functional annotations.

54 citations
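
Of the standard meta-analysis options listed in the abstract above, Fisher's method is the easiest to show in a few lines. The sketch below is a generic textbook version, not ExAtlas code; the function name `fisher_combined_pvalue` is hypothetical. It combines k independent p-values into the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. For an even number of degrees of freedom the chi-square tail probability has a closed form, so no external library is needed.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). The tail probability uses the closed
    form for a chi-square distribution with 2k degrees of freedom:
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    half = stat / 2.0
    term = 1.0     # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total

if __name__ == "__main__":
    print(fisher_combined_pvalue([0.04, 0.10, 0.20]))
```

A single p-value is returned unchanged, which is a quick sanity check on the formula.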


Journal ArticleDOI
Yifan Nie1, Wenge Rong1, Yiyuan Zhang1, Yuanxin Ouyang1, Zhang Xiong1 
TL;DR: A word embedding assisted neural network prediction model is proposed to conduct event trigger identification and it is believed that this study could offer researchers insights into semantic-aware solutions for event trigger identification.
Abstract: Molecular events normally have significant meanings since they describe important biological interactions or alterations, such as the binding of a protein. As a crucial step of biological event extraction, event trigger identification has attracted much attention and many methods have been proposed. Traditionally, these methods can be categorised into rule-based and machine learning-based approaches, and machine learning-based approaches have demonstrated their potential and outperformed rule-based approaches in many situations. However, machine learning-based approaches still face several challenges, a notable one being how to model the semantic and syntactic information of different words and incorporate it into the prediction model. There exist many ways to model semantic and syntactic information, among which word embedding is an effective one. Therefore, to address this challenge, a word embedding assisted neural network prediction model is proposed in this study to conduct event trigger identification. The experimental study on a commonly used dataset has shown its potential. It is believed that this study could offer researchers insights into semantic-aware solutions for event trigger identification.

38 citations


Journal ArticleDOI
TL;DR: The Pantograph method is introduced, as a toolbox for genome-scale model reconstruction, curation and validation, and scripts for evaluating the model with respect to experimental data are automatically generated, to aid curators in iterative improvement.
Abstract: Genome-scale metabolic models are a powerful tool to study the inner workings of biological systems and to guide applications. The advent of cheap sequencing has brought the opportunity to create metabolic maps of biotechnologically interesting organisms. While this drives the development of new methods and automatic tools, network reconstruction remains a time-consuming process where extensive manual curation is required. This curation introduces specific knowledge about the modeled organism, either explicitly in the form of molecular processes, or indirectly in the form of annotations of the model elements. Paradoxically, this knowledge is usually lost when reconstruction of a different organism is started. We introduce the Pantograph method for metabolic model reconstruction. This method combines a template reaction knowledge base, orthology mappings between two organisms, and experimental phenotypic evidence, to build a genome-scale metabolic model for a target organism. Our method infers implicit knowledge from annotations in the template, and rewrites these inferences to include them in the resulting model of the target organism. The generated model is well suited for manual curation. Scripts for evaluating the model with respect to experimental data are automatically generated, to aid curators in iterative improvement. We present an implementation of the Pantograph method, as a toolbox for genome-scale model reconstruction, curation and validation. This open source package can be obtained from: http://pathtastic.gforge.inria.fr.

33 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel method PRIPU that employs biased-support vector machine (SVM) for predicting Protein-RNA Interactions using only Positive and Unlabeled examples, and is the first work that predicts PRIs using only positive and unlabeled samples.
Abstract: Protein-RNA interactions (PRIs) are considerably important in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to the active defense of the host against viruses. With the development of high-throughput technology, large amounts of PRI information are available for computationally predicting unknown PRIs. In recent years, a number of computational methods for predicting PRIs have been developed in the literature, which usually artificially construct negative samples based on verified nonredundant datasets of PRIs to train classifiers. However, such negative samples are not real negative samples, and some may even be unknown positive samples. Consequently, the classifiers trained with such training datasets cannot achieve satisfactory prediction performance. In this paper, we propose a novel method PRIPU that employs biased-support vector machine (SVM) for predicting Protein-RNA Interactions using only Positive and Unlabeled examples. To the best of our knowledge, this is the first work that predicts PRIs using only positive and unlabeled samples. We first collect known PRIs as our benchmark datasets and extract sequence-based features to represent each PRI. To reduce the dimension of the feature vectors and thereby lower the computational cost, we select a subset of features by a filter-based feature selection method. Then, biased-SVM is employed to train prediction models with different PRI datasets. To evaluate the new method, we also propose a new performance measure called explicit positive recall (EPR), which is specifically suitable for the task of learning from positive and unlabeled data. Experimental results over three datasets show that our method not only outperforms four existing methods, but also is able to predict unknown PRIs. Source code, datasets and related documents of PRIPU are available at: http://admis.fudan.edu.cn/projects/pripu.htm .

31 citations


Journal ArticleDOI
TL;DR: Nine machine learning methods with six physicochemical properties are explored to predict the Root Mean Square Deviation, Template Modeling score, and Global Distance Test score of a modeled protein structure in the absence of its true native state, and it is found that the Random Forest method outperforms the other machine learning methods.
Abstract: Physicochemical properties of proteins help determine the quality of a protein structure, and have therefore been rigorously used to distinguish native or native-like structures from other predicted structures. In this work, we explore nine machine learning methods with six physicochemical properties to predict the Root Mean Square Deviation (RMSD), Template Modeling score (TM-score), and Global Distance Test score (GDT_TS-score) of a modeled protein structure in the absence of its true native state. The physicochemical properties used are total surface area, Euclidean distance (ED), total empirical energy, secondary structure penalty (SS), sequence length (SL), and pair number (PN). There are a total of 95,091 modeled structures of 4896 native targets. A real-coded Self-adaptive Differential Evolution algorithm (SaDE) is used to determine the feature importance. K-fold cross-validation is used to measure the robustness of the best predictive method. Through intensive experiments, it is found that the Random Forest method outperforms the other machine learning methods. This work makes the prediction faster and less expensive. On the testing data set, the predictions of RMSD, TM-score, and GDT_TS-score have Root Mean Square Errors (RMSE) of 1.20, 0.06, and 0.06, respectively; correlation scores of 0.96, 0.92, and 0.91, respectively; R² values of 0.92, 0.85, and 0.84, respectively; and accuracies of 78.82%, 86.56%, and 87.37% (each with ± 0.1 err), respectively. The data set used in the study is available as a supplement at http://bit.ly/RF-PCP-DataSets.

30 citations
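
The abstract above reports three regression metrics: RMSE, a correlation score, and R². As a reminder of how these are usually computed, here is a minimal sketch using the standard textbook definitions. Which exact variants the authors used is not stated in the abstract, so treating the correlation as Pearson's r and R² as the coefficient of determination is an assumption, and the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (assumed meaning of 'correlation score')."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]    # made-up values for illustration
    predicted = [2.0, 3.0, 4.0, 5.0]
    print(rmse(observed, predicted),
          pearson_r(observed, predicted),
          r_squared(observed, predicted))
```

The example shows why the three numbers are reported together: a prediction that is perfectly correlated with the truth can still carry a constant offset, which RMSE and the coefficient of determination penalize but the correlation does not.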


Journal ArticleDOI
TL;DR: This paper proposes a hybridized method to predict synthetic lethality pairs of genes that combines a data-driven model with knowledge of signalling pathways to simulate the influence of single-gene and double-gene knock-down on cell death.
Abstract: A major goal of personalized anti-cancer therapy is to increase the drug effects while reducing the side effects as much as possible. A novel therapeutic strategy called synthetic lethality (SL) provides a great opportunity to achieve this goal. SL arises if mutations of both genes lead to cell death while mutation of either single gene does not. Hence, the SL partner of a gene mutated only in cancer cells could be a promising drug target, and the identification of SL pairs of genes is of great significance in the pharmaceutical industry. In this paper, we propose a hybridized method to predict SL pairs of genes. We combine a data-driven model with knowledge of signalling pathways to simulate the influence of single-gene and double-gene knock-down on cell death. A pair of genes is considered an SL candidate when double knock-down increases the probability of cell death significantly, but single knock-down does not. The single-gene knock-down is confirmed according to the human essential genes database. Our validation against the literature shows that the predicted SL candidates agree well with wet-lab experiments. A few novel reliable SL candidates are also predicted by our model.

26 citations


Journal ArticleDOI
TL;DR: It was demonstrated that in the simplest genetic system model the formation of the alternatively spliced isoforms with opposite functions (activators and repressors) could be a cause of transition to chaotic dynamics.
Abstract: Alternative splicing is a widespread phenomenon in higher eukaryotes, where it serves as a mechanism to increase the functional diversity of proteins. This phenomenon has been described for different classes of proteins, including transcription regulatory proteins. We demonstrated that in the simplest genetic system model the formation of the alternatively spliced isoforms with opposite functions (activators and repressors) could be a cause of transition to chaotic dynamics. Under the simplest genetic system we understand a system consisting of a single gene encoding the structure of a transcription regulatory protein whose expression is regulated by a feedback mechanism. As demonstrated by numerical analysis of the models, if the synthesized isoforms regulate the expression of their own gene acting through different sites and independently of each other, for the generation of chaotic dynamics it is sufficient that the regulatory proteins have a dimeric structure. If regulatory proteins act through one site, the chaotic dynamics is generated in the system only when the repressor protein is either a tetrameric or a higher-dimensional multimer. In this case the activator can be a dimer. It was also demonstrated that if the transcription factor isoforms exhibit either activating or inhibiting activity and are lower-dimensional multimers (< 4), independently of the regulation type the model demonstrates either cyclic or stationary trajectories.

23 citations


Journal ArticleDOI
TL;DR: A new mathematical representation of amino acid chains is proposed, derived using a similarity measure based on the PAM250 amino acid substitution matrix and that generates 20 signals for each protein sequence, which can be integrated into Chou's pseudo amino acid composition (PseAAC) and constitute a useful alternative to amino acid physicochemical properties in Chou's PseAAC.
Abstract: Most of the algorithms used for information extraction and for processing the amino acid chains that make up proteins treat them as symbolic chains. Fewer algorithms exploit signal processing techniques that require a numerical representation of amino acid chains. However, these algorithms are very powerful for extracting regularities that cannot be detected when working with a symbolic chain, which may be important for understanding the biological meaning of a sequence or in classification tasks. In this study, a new mathematical representation of amino acid chains is proposed, which is derived using a similarity measure based on the PAM250 amino acid substitution matrix and that generates 20 signals for each protein sequence. Using this representation 20 consensus spectra for a protein family are determined and the relevance of the frequency peaks is established, obtaining a group of significant frequency peaks that manifest common periodicities of the amino acid sequences that belong to a protein family. We also show that the proposed representation in 20 signals can be integrated into Chou's pseudo amino acid composition (PseAAC) and constitute a useful alternative to amino acid physicochemical properties in Chou's PseAAC.

22 citations


Journal ArticleDOI
TL;DR: An R package, MeSHSim, is developed, which can compute nine similarity measures between MeSH nodes, by which similarity between MeSH headings as well as MEDLINE documents can be computed.
Abstract: Currently, all MEDLINE documents are indexed by medical subject headings (MeSH). Computing semantic similarity between two MeSH headings as well as two documents has become very important for many biomedical text mining applications. We develop an R package, MeSHSim, which can compute nine similarity measures between MeSH nodes, by which similarity between MeSH headings as well as MEDLINE documents can be easily computed. Also, MeSHSim supports querying hierarchy information of a MeSH heading and retrieving MeSH headings of a query document, and can be easily integrated into pipelines for any biomedical text analysis tasks. MeSHSim is released under general public license (GPL), and available through Bioconductor and from Github at https://github.com/JingZhou2015/MeSHSim.

21 citations


Journal ArticleDOI
TL;DR: This work presents a powerful method, ESSNet, that is able to identify subnetworks consistently across independent datasets of the same disease phenotypes even under very small sample sizes and is shown to be superior when sample size is large.
Abstract: Transcript-level quantification is often measured across two groups of patients to aid the discovery of biomarkers and detection of biological mechanisms involving these biomarkers. Statistical tests lack power and false discovery rate is high when sample size is small. Yet, many experiments have very few samples (≤ 5). This creates the impetus for a method to discover biomarkers and mechanisms under very small sample sizes. We present a powerful method, ESSNet, that is able to identify subnetworks consistently across independent datasets of the same disease phenotypes even under very small sample sizes. The key idea of ESSNet is to fragment large pathways into smaller subnetworks and compute a statistic that discriminates the subnetworks in two phenotypes. We do not greedily select genes to be included based on differential expression but rely on gene-expression-level ranking within a phenotype, which is shown to be stable even under extremely small sample sizes. We test our subnetworks on null distributions obtained by array rotation; this preserves the gene–gene correlation structure and is suitable for datasets with small sample size allowing us to consistently predict relevant subnetworks even when sample size is small. For most other methods, this consistency drops to less than 10% when we test them on datasets with only two samples from each phenotype, whereas ESSNet is able to achieve an average consistency of 58% (72% when we consider genes within the subnetworks) and continues to be superior when sample size is large. We further show that the subnetworks identified by ESSNet are highly correlated to many references in the biological literature. ESSNet and supplementary material are available at: http://compbio.ddns.comp.nus.edu.sg:8080/essnet.

Journal ArticleDOI
TL;DR: It is shown that the insight into the disparity between the static interactome and dynamic protein complexes can be used to improve the performance of complex discovery, and that many existing complex-discovery algorithms have trouble predicting such complexes.
Abstract: Protein interactions and complexes behave in a dynamic fashion, but this dynamism is not captured by interaction screening technologies, and not preserved in protein–protein interaction (PPI) networks. The analysis of static interaction data to derive dynamic protein complexes leads to several challenges, of which we identify three. First, many proteins participate in multiple complexes, leading to overlapping complexes embedded within highly-connected regions of the PPI network. This makes it difficult to accurately delimit the boundaries of such complexes. Second, many condition- and location-specific PPIs are not detected, leading to sparsely-connected complexes that cannot be picked out by clustering algorithms. Third, the majority of complexes are small complexes (made up of two or three proteins), which are extra sensitive to the effects of extraneous edges and missing co-complex edges. We show that many existing complex-discovery algorithms have trouble predicting such complexes, and show that our ...

Journal ArticleDOI
TL;DR: Using mass spectrometry proteome analysis, the permanent constituent of the urine was examined in the Mars-500 experiment and it was shown that the identified proteins may be independent markers of the various conditions and processes in healthy humans and that they can be used as standards in determination of the concentration of other proteins in the urine.
Abstract: Urinary proteins serve as indicators of various conditions in human normal physiology and disease pathology. Using mass spectrometry proteome analysis, the permanent constituent of the urine was examined in the Mars-500 experiment (520 days isolation of healthy volunteers in a terrestrial complex with an autonomous life support system). Seven permanent proteins with predominant distribution in the liver and blood plasma as well as extracellular localization were identified. Analysis of the overrepresentation of the molecular functions and biological processes based on Gene Ontology revealed that the functional association among these proteins was low. The results showed that the identified proteins may be independent markers of the various conditions and processes in healthy humans and that they can be used as standards in determination of the concentration of other proteins in the urine.

Journal ArticleDOI
TL;DR: The TGTCNC consensus of 111 known natural and artificially mutated Auxin response elements (AuxREs) is found with measured auxin-caused relative increase in genes' transcription levels, so-called either "a response to auxin" or "an auxin response."
Abstract: Auxin is one of the main regulators of growth and development in plants. Prediction of auxin response based on gene sequence is of high importance. We found the TGTCNC consensus of 111 known natural and artificially mutated auxin response elements (AuxREs) with measured auxin-caused relative increase in genes' transcription levels, so-called either "a response to auxin" or "an auxin response." This consensus was identical to the most cited AuxRE motif. Also, we found several DNA sequence features that correlate with auxin-caused increase in genes' transcription levels, namely: number of matches with TGTCNC, homology score based on nucleotide frequencies at the consensus positions, abundances of five trinucleotides and five B-helical DNA features around these known AuxREs. We combined these correlations using a four-step empirical model of auxin response based on a gene's sequence, namely: (1) search for AuxREs with no auxin; (2) stop at the found AuxRE; (3) repression of the basal transcription of the gene having this AuxRE; and (4) manifold increase of this gene's transcription in response to auxin. Independently measured increases in transcription levels in response to auxin for 70 Arabidopsis genes were found to significantly correlate with predictions of this equation (r = 0.44, p < 0.001) as well as with TATA-binding protein (TBP)'s affinity to promoters of these genes and with nucleosome packing of these promoters (both, p < 0.025). Finally, we improved our equation for prediction of a gene's transcription increase in response to auxin by taking into account TBP-binding and nucleosome packing (r = 0.53, p < 10^-6). Fisher's F-test validated the significant impact of both TBP/promoter-affinity and promoter nucleosome on auxin response in addition to those of AuxRE, F = 4.07, p < 0.025. This means that both TATA-box and nucleosome should be taken into account to recognize transcription factor binding sites in DNA sequences: in the case of TATA-less nucleosome-rich promoters, recognition scores must be higher than in the case of TATA-containing nucleosome-free promoters at the same transcription activity.
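
One of the sequence features named in the abstract above is the number of matches with the TGTCNC consensus. A minimal sketch of such a count is shown below, assuming that N stands for any of A, C, G or T and that overlapping matches are counted. The function name `count_consensus_matches` and the optional reverse-strand scan are illustrative choices, not the authors' procedure.

```python
import re

# TGTCNC with N expanded to any nucleotide; the lookahead allows
# overlapping matches to be counted.
AUXRE_PATTERN = re.compile(r"(?=TGTC[ACGT]C)")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_consensus_matches(seq, both_strands=False):
    """Count TGTCNC matches on the given strand, optionally on both strands."""
    seq = seq.upper()
    count = len(AUXRE_PATTERN.findall(seq))
    if both_strands:
        count += len(AUXRE_PATTERN.findall(reverse_complement(seq)))
    return count

if __name__ == "__main__":
    promoter = "AATGTCTCGGATTGTCACAA"   # made-up sequence for illustration
    print(count_consensus_matches(promoter))
```

Whether the reverse strand should be scanned depends on the biological question; the flag defaults to the forward strand only so that the simplest behaviour is the default.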

Journal ArticleDOI
Bo Liao1, Sumei Ding1, Haowen Chen1, Zejun Li1, Lijun Cai1 
TL;DR: A new diffusion-based method (NDBM) is introduced to explore global network similarity for miRNA-disease association inference, and some associations strongly predicted by the method are confirmed by public databases, suggesting that NDBM could be an effective and important tool for biomedical research.
Abstract: Identifying the microRNA-disease relationship is vital for investigating the pathogenesis of various diseases. However, experimental verification of disease-related microRNAs remains a considerable challenge to many researchers, particularly given that numerous new microRNAs are discovered every year. As such, development of computational methods for disease-related microRNA prediction has recently gained considerable attention. In this paper, we first construct a miRNA functional network and a disease similarity network by integrating different information sources. Then, we introduce a new diffusion-based method (NDBM) to explore global network similarity for miRNA-disease association inference. Even though known miRNA-disease associations in the database are rare, NDBM still achieves an area under the ROC curve (AUC) of 85.62% in leave-one-out cross-validation, significantly improving on the prediction accuracy of previous methods. Moreover, our method is applicable to diseases with no known related miRNAs as well as new miRNAs with unknown target diseases. Some associations strongly predicted by our method are confirmed by public databases. These superior performances suggest that NDBM could be an effective and important tool for biomedical research.

Journal ArticleDOI
TL;DR: Application of SAMSVM to actual sequencing data resulted in filtration of misaligned reads and correction of variant calling, indicating that the model built using SAMSVM was accurate in misalignment detection.
Abstract: Sequence alignment/map (SAM) formatted sequences [Li H, Handsaker B, Wysoker A et al., Bioinformatics 25(16):2078-2079, 2009.] have taken on a main role in bioinformatics since the development of massive parallel sequencing. However, because misalignment of sequences poses a significant problem in analysis of sequencing data that could lead to false positives in variant calling, the exclusion of misaligned reads is a necessity in analysis. In this regard, the multiple features of SAM-formatted sequences can be treated as vectors in a multi-dimension space to allow the application of a support vector machine (SVM). Applying the LIBSVM tools developed by Chang and Lin [Chang C-C, Lin C-J, ACM Trans Intell Syst Technol 2:1-27, 2011.] as a simple interface for support vector classification, the SAMSVM package has been developed in this study to enable misalignment filtration of SAM-formatted sequences. Cross-validation between two simulated datasets processed with SAMSVM yielded accuracies that ranged from 0.89 to 0.97 with F-scores ranging from 0.77 to 0.94 in 14 groups characterized by different mutation rates from 0.001 to 0.1, indicating that the model built using SAMSVM was accurate in misalignment detection. Application of SAMSVM to actual sequencing data resulted in filtration of misaligned reads and correction of variant calling.
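
The idea of treating the fields of a SAM record as a vector for a classifier can be sketched as follows. The eleven mandatory tab-separated columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) come from the SAM format itself, but the particular features chosen here are assumptions for illustration; the abstract does not list the features SAMSVM uses, and no SVM training is shown.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def sam_features(line):
    """Turn one SAM alignment line into a small numeric feature vector.

    Hypothetical feature set: properly-paired flag, mapping quality,
    absolute template length, aligned bases, clipped bases, indel bases.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    flag = int(fields[1])
    mapq = int(fields[4])
    cigar = fields[5]
    tlen = int(fields[8])
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    aligned = sum(n for n, op in ops if op in "M=X")
    clipped = sum(n for n, op in ops if op in "SH")
    indels = sum(n for n, op in ops if op in "ID")
    properly_paired = 1.0 if (flag & 0x2) else 0.0   # FLAG bit 0x2
    return [properly_paired, float(mapq), float(abs(tlen)),
            float(aligned), float(clipped), float(indels)]

if __name__ == "__main__":
    # Made-up record for illustration only.
    record = "r1\t99\tchr1\t100\t60\t8M2S\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII"
    print(sam_features(record))
```

Vectors of this kind could then be scaled and passed to any support vector classifier, with labels taken from simulated reads whose true positions are known.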

Journal ArticleDOI
TL;DR: This study presents FQC, a fastq compression method that, in addition to providing significantly higher compression gains over GZIP, incorporates features necessary for universal adoption by data repositories/end-users and proposes a novel archival strategy which allows sequence repositories to simultaneously store and disseminate lossless as well as (multiple) lossy variants of fastq files, without necessitating any additional storage requirements.
Abstract: Sequence data repositories archive and disseminate fastq data in compressed format. In spite of having relatively lower compression efficiency, data repositories continue to prefer GZIP over available specialized fastq compression algorithms. Ease of deployment, high processing speed and portability are the reasons for this preference. This study presents FQC, a fastq compression method that, in addition to providing significantly higher compression gains over GZIP, incorporates features necessary for universal adoption by data repositories/end-users. This study also proposes a novel archival strategy which allows sequence repositories to simultaneously store and disseminate lossless as well as (multiple) lossy variants of fastq files, without necessitating any additional storage requirements. For academic users, Linux, Windows, and Mac implementations (both 32 and 64-bit) of FQC are freely available for download at: https://metagenomics.atc.tcs.com/compression/FQC .

Journal ArticleDOI
TL;DR: The Human microbiome can no longer be ignored as not only is there enough evidence correlating microbiome alterations and disease states, but also the return to healthy states once these alterations are reversed.
Abstract: Microbial communities thrive in close association among themselves and with the host, establishing protein-protein interactions (PPIs) with the latter, and thus being able to benefit (positively impact) or disturb (negatively impact) biological events in the host. Despite major collaborative efforts to sequence the Human microbiome, there is still a great lack of understanding their impact. We propose a computational methodology to predict the impact of microbial proteins in human biological events, taking into account the abundance of each microbial protein and its relation to all other microbial and human proteins. This alternative methodology is centered on an improved impact estimation algorithm that integrates PPIs between human and microbial proteins with Reactome pathway data. This methodology was applied to study the impact of 24 microbial phyla over different cellular events, within 10 different human microbiomes. The results obtained confirm findings already described in the literature and explore new ones. We believe the Human microbiome can no longer be ignored as not only is there enough evidence correlating microbiome alterations and disease states, but also the return to healthy states once these alterations are reversed.

Journal ArticleDOI
TL;DR: This work employs a reverse-engineering approach to construct the most detailed computational model of the p16-mediated pathway in higher eukaryotes, implements experimental data from the literature to validate the model, and, under various assumptions, predicts the dynamic behavior of p16 and other biological components by interpreting the simulation results.
Abstract: p16 is recognized as a tumor suppressor gene due to the prevalence of its genetic inactivation in all types of human cancers. Additionally, the p16 gene plays a critical role in controlling aging, regulating cellular senescence, and the detection and maintenance of DNA damage. The molecular mechanism behind these events involves the p16-mediated signaling pathway (or p16-Rb pathway), the focus of our study. Understanding the functional dependence between the dynamic behavior of the biological components involved in the p16-mediated pathway and the aforesaid molecular-level events might suggest possible implications in the diagnosis, prognosis and treatment of human cancer. In the present work, we employ a reverse-engineering approach to construct the most detailed computational model of the p16-mediated pathway in higher eukaryotes. We implement experimental data from the literature to validate the model, and under various assumptions predict the dynamic behavior of p16 and other biological components by interpreting the simulation results. The quantitative model of the p16-mediated pathway is created in a systematic manner in terms of Petri net technologies.

Journal ArticleDOI
TL;DR: The stringent protocol EnzDP can cover up to 90% of enzyme families available in Swiss-Prot and achieves a high accuracy of 94.5% based on five-fold cross-validation, and serves as a reliable automated tool for enzyme annotation and metabolic network reconstruction.
Abstract: Determining the entire complement of enzymes and their enzymatic functions is a fundamental step for reconstructing the metabolic network of cells. High quality enzyme annotation helps in enhancing metabolic networks reconstructed from the genome, especially by reducing gaps and increasing the enzyme coverage. Currently, structure-based and network-based approaches can only cover a limited number of enzyme families, and the accuracy of homology-based approaches can be further improved. Bottom-up homology-based approach improves the coverage by rebuilding Hidden Markov Model (HMM) profiles for all known enzymes. However, its clustering procedure relies firmly on BLAST similarity score, ignoring protein domains/patterns, and is sensitive to changes in cut-off thresholds. Here, we use functional domain architecture to score the association between domain families and enzyme families (Domain-Enzyme Association Scoring, DEAS). The DEAS score is used to calculate the similarity between proteins, which is then used in clustering procedure, instead of using sequence similarity score. We improve the enzyme annotation protocol using a stringent classification procedure, and by choosing optimal threshold settings and checking for active sites. Our analysis shows that our stringent protocol EnzDP can cover up to 90% of enzyme families available in Swiss-Prot. It achieves a high accuracy of 94.5% based on five-fold cross-validation. EnzDP outperforms existing methods across several testing scenarios. Thus, EnzDP serves as a reliable automated tool for enzyme annotation and metabolic network reconstruction. Available at: www.comp.nus.edu.sg/~nguyennn/EnzDP.

Journal ArticleDOI
TL;DR: Comparison results demonstrate that HeteSim-SEQ is superior to existing methods including BDT, SVM and iGPS, suggesting the effectiveness of the proposed network-based method in predicting potential KSRs.
Abstract: Protein phosphorylation catalyzed by kinases plays essential roles in various intracellular processes. With an increasing number of phosphorylation sites verified experimentally by high-throughput ...

Journal ArticleDOI
TL;DR: A novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal-to-noise ratio (SNR); the results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation.
Abstract: To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal-to-noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction by the Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
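For readers unfamiliar with the 3-periodicity SNR mentioned in this abstract, the following is a minimal sketch for the baseline 4D binary representation only, not the GCC mapping. It assumes one common definition of the SNR (total spectral power at frequency index N/3 divided by the mean power over indices 1..N-1); the abstract does not state which definition the authors use, and the function names here are hypothetical.

```python
import cmath


def indicator_sequences(dna):
    """4D binary representation: one 0/1 indicator sequence per nucleotide."""
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}


def power_at(signal, k):
    """Squared magnitude of DFT coefficient k of a real-valued sequence."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(signal))
    return abs(coeff) ** 2


def period3_snr(dna):
    """Power at k = N/3 divided by the mean power over k = 1..N-1.

    Assumed SNR definition; the DC term (k = 0) is excluded from the mean.
    """
    n = len(dna)
    if n < 6 or n % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3, at least 6")
    seqs = list(indicator_sequences(dna).values())
    # Total power spectrum: sum of the four indicator spectra at each k.
    total = [sum(power_at(s, k) for s in seqs) for k in range(1, n)]
    mean_power = sum(total) / len(total)
    return total[n // 3 - 1] / mean_power if mean_power > 0 else 0.0
```

A sequence with a strict period of three, such as `"ATG" * 20`, concentrates its power at N/3 and yields a high ratio, whereas a period-four sequence such as `"ACGT" * 15` yields a ratio near zero. The direct DFT here is quadratic in sequence length and is meant for illustration rather than genome-scale use.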

Journal ArticleDOI
TL;DR: The iterative approach initially proposes an idea that iteratively incorporates the interaction information of unannotated proteins into the protein function prediction and can be applied on existing prediction algorithms to improve prediction performance.
Abstract: Protein-protein interaction networks constructed by high throughput technologies provide opportunities for predicting protein functions. A lot of approaches and algorithms have been applied to PPI networks to predict functions of unannotated proteins over recent decades. However, most existing algorithms and approaches do not consider unannotated proteins and their corresponding interactions in the prediction process. On the other hand, algorithms which make use of unannotated proteins have limited prediction performance. Moreover, current algorithms are usually one-off predictions. In this paper, we propose an iterative approach that utilizes unannotated proteins and their interactions in prediction. We conducted experiments to evaluate the performance and robustness of the proposed iterative approach. The iterative approach maximally improved the prediction performance by 50%-80% when there was a high proportion of unannotated neighborhood proteins in the network. The iterative approach also showed robustness in various types of protein interaction network. Importantly, our iterative approach initially proposes an idea that iteratively incorporates the interaction information of unannotated proteins into the protein function prediction and can be applied to existing prediction algorithms to improve prediction performance.
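The contrast between a one-off prediction and an iterative one can be illustrated with a small sketch. It assumes a simple neighbor majority vote as the base predictor and a single function label per protein; neither assumption comes from the abstract, which does not specify the underlying algorithm, and `iterative_predict` is a hypothetical name.

```python
from collections import Counter


def iterative_predict(neighbors, known, max_rounds=10):
    """Propagate function labels through a PPI network round by round.

    neighbors: dict mapping each protein to a list of interacting proteins.
    known: dict mapping annotated proteins to their function label.
    Labels predicted in one round become evidence in the next round.
    """
    labels = dict(known)
    for _ in range(max_rounds):
        updates = {}
        for protein, adjacent in neighbors.items():
            if protein in known:
                continue  # never overwrite experimental annotations
            votes = Counter(labels[n] for n in adjacent if n in labels)
            if not votes:
                continue
            # Majority vote; ties are broken alphabetically for determinism.
            best = min(votes, key=lambda f: (-votes[f], f))
            if labels.get(protein) != best:
                updates[protein] = best
        if not updates:
            break  # predictions have stabilized
        labels.update(updates)
    return {p: f for p, f in labels.items() if p not in known}
```

On a chain A-B-C-D where only A is annotated, a single round (`max_rounds=1`) labels only B, while further rounds reach C and D through previously unannotated neighbors. The `max_rounds` cap guards against networks in which the votes oscillate.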

Journal ArticleDOI
TL;DR: In the present work, rotational oscillations of nitrogenous bases in the DNA with the sequence of the gene coding interferon alpha 17 (IFNA17), are investigated using a system of two coupled nonlinear partial differential equations that takes into account effects of dissipation, action of external fields and dependence of the equation coefficients on the sequence of bases.
Abstract: In the present work, rotational oscillations of nitrogenous bases in the DNA with the sequence of the gene coding interferon alpha 17 (IFNA17), are investigated. As a mathematical model simulating oscillations of the bases, we use a system of two coupled nonlinear partial differential equations that takes into account effects of dissipation, action of external fields and dependence of the equation coefficients on the sequence of bases. We apply the methods of the theory of oscillations to solve the equations in the linear approach and to construct the dispersive curves determining the dependence of the frequency of the plane waves (ω) on the wave vector (q). In the nonlinear case, the solutions in the form of kink are considered, and the main characteristics of the kink: the rest energy (E0), the rest mass (m0), the size (d) and sound velocity (C0), are calculated. With the help of the energetic method, the kink velocity (υ), the path (S), and the lifetime (τ) are also obtained.

Journal ArticleDOI
TL;DR: Findings of the present study confirm that the newly developed EP-RTF outperforms (in terms of classification accuracy, sensitivity, and specificity) the previously applied methods over four datasets in the field of human miRNA target.
Abstract: MicroRNAs (miRNAs) are small non-coding RNAs that have important functions in gene regulation. Since finding miRNA targets experimentally is costly and time-consuming, the use of machine learning methods is a growing research area for miRNA target prediction. In this paper, a new approach is proposed by using two popular ensemble strategies, i.e. Ensemble Pruning and Rotation Forest (EP-RTF), to predict human miRNA target. For EP, the approach utilizes Genetic Algorithm (GA). In other words, a subset of classifiers from the heterogeneous ensemble is first selected by GA. Next, the selected classifiers are trained based on the RTF method and then are combined using weighted majority voting. In addition to seeking a better subset of classifiers, the parameter of RTF is also optimized by GA. Findings of the present study confirm that the newly developed EP-RTF outperforms (in terms of classification accuracy, sensitivity, and specificity) the previously applied methods over four datasets in the field of human miRNA target. Diversity-error diagrams reveal that the proposed ensemble approach constructs individual classifiers which are more accurate and usually more diverse than those of the other ensemble approaches. Given these experimental results, we highly recommend EP-RTF for improving the performance of miRNA target prediction.
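The combination step named in this abstract, weighted majority voting, can be sketched as follows. This is a generic illustration only: how EP-RTF derives its classifier weights is not stated in the abstract, so the weights below are hypothetical inputs (for example, validation accuracies), and the function name is an assumption.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per classifier.
    weights: one non-negative weight per classifier (hypothetical source,
    e.g. validation accuracy).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(totals), key=lambda label: totals[label])
```

With predictions `[1, 0, 0]` and weights `[0.9, 0.3, 0.4]`, label 1 wins despite being outnumbered, because its single vote carries more weight than the two opposing votes combined.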

Journal ArticleDOI
TL;DR: This work provides a context to the fine detail of individual gene expression differences in murine peritoneal macrophages during immunological challenge with high throughput RNA-Seq.
Abstract: Comprehensive and simultaneous analysis of all genes in a biological sample is a capability of RNA-Seq technology. Analysis of the entire transcriptome benefits from summarization of genes at the functional level. As a cellular response of interest not previously explored with RNA-Seq, peritoneal macrophages from mice under two conditions (control and immunologically challenged) were analyzed for gene expression differences. Quantification of individual transcripts modeled RNA-Seq read distribution and uncertainty (using a Beta Negative Binomial distribution), then tested for differential transcript expression (False Discovery Rate-adjusted p-value < 0.05). Enrichment of functional categories utilized the list of differentially expressed genes. A total of 2079 differentially expressed transcripts representing 1884 genes were detected. Enrichment of 92 categories from Gene Ontology Biological Processes and Molecular Functions, and KEGG pathways were grouped into 6 clusters. Clusters included defense and inflammatory response (Enrichment Score = 11.24) and ribosomal activity (Enrichment Score = 17.89). Our work provides a context to the fine detail of individual gene expression differences in murine peritoneal macrophages during immunological challenge with high throughput RNA-Seq.
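The abstract reports a False Discovery Rate-adjusted p-value threshold without naming the adjustment procedure. As an illustration, the sketch below implements the widely used Benjamini-Hochberg adjustment; treating it as the procedure used in the study is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A transcript would then be called differentially expressed when its adjusted value falls below 0.05, for example `[q < 0.05 for q in benjamini_hochberg(raw_p)]`, where `raw_p` is a list of per-transcript p-values.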

Journal ArticleDOI
TL;DR: This study introduces a novel network-based computational method, site-modification network- based inference (SMNBI) to predict tyrosine phosphorylation, and extensively compare it with other sequence-based methods including SVM and Bayesian decision theory.
Abstract: Phosphorylation plays a great role in regulating a variety of cellular processes and the identification of tyrosine phosphorylation sites is fundamental for understanding the post-translational modification (PTM) regulation processes. Although a lot of computational methods have been developed, most of them only concern local sequence information and few studies focus on the tyrosine sites with in situ PTM information, which refers to different types of PTM occurring on the same modification site. In this study, by constructing the site-modification network that efficiently incorporates in situ PTM information, we introduce a novel network-based computational method, site-modification network-based inference (SMNBI) to predict tyrosine phosphorylation. In order to verify the effectiveness of the proposed method, we compare it with other network-based computational methods. The results clearly show the superior performance of SMNBI. Besides, we extensively compare SMNBI with other sequence-based methods including SVM and Bayesian decision theory. The evaluation demonstrates the power of site-modification network in predicting tyrosine phosphorylation. The proposed method is freely available at http://bioinformatics.ustc.edu.cn/smnbi/.

Journal ArticleDOI
TL;DR: By choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, single seeds are derived that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature.
Abstract: Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained using many seeds simultaneously: This is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory and the available RAM is a practical limit to the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we derive single seeds that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature. In addition, the choice of the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage.
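The idea of placing don't care symbols by quadratic residues and requiring at least t nearby hits can be sketched as below. The exact construction used in the paper is not given in the abstract, so this sketch makes several assumptions: the seed has length p, don't care symbols ('0') sit exactly at the nonzero quadratic residues modulo p, hits are counted along a single ungapped diagonal, and "sufficiently close" means within a fixed `window` of start offsets. All function names are hypothetical.

```python
def quadratic_residues(p):
    """Nonzero quadratic residues modulo p."""
    return {(x * x) % p for x in range(1, p)}


def qr_seed(p):
    """Seed of length p: '0' (don't care) at residues, '1' (match) elsewhere."""
    residues = quadratic_residues(p)
    return "".join("0" if i in residues else "1" for i in range(p))


def seed_hits(seed, a, b):
    """Start offsets where a and b agree at every '1' position of the seed."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    limit = min(len(a), len(b)) - len(seed) + 1
    return [s for s in range(max(limit, 0))
            if all(a[s + i] == b[s + i] for i in care)]


def is_homologous(seed, a, b, t, window):
    """True if at least t hits start within `window` consecutive offsets."""
    hits = seed_hits(seed, a, b)
    return any(hits[i + t - 1] - hits[i] < window
               for i in range(len(hits) - t + 1))
```

For p = 7 the residues are {1, 2, 4}, so `qr_seed(7)` gives `"1001011"`. Raising `t` or shrinking `window` after the hits have been collected trades sensitivity for selectivity without rebuilding the seed, which is the a posteriori adjustment the abstract describes.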

Journal ArticleDOI
TL;DR: This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP), based on a recent metaheuristic inspired by the mathematics of biogeography named biogeography-based optimization (BBO).
Abstract: Multiple sequence alignment (MSA) is one of the most challenging problems in bioinformatics; it involves discovering similarity between a set of protein or DNA sequences. This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP). It is based on a recent metaheuristic inspired by the mathematics of biogeography named biogeography-based optimization (BBO). To improve the exploration ability of BBO, we have introduced a new concept allowing better exploration of the search space. It consists of manipulating multiple populations, each having its own parameters. These parameters are used to build up progressive alignments allowing more diversity. At each iteration, the best found solution is injected into each population. Moreover, to improve solution quality, six operators are defined. These operators are selected with a dynamic probability which changes according to the operators' efficiency. To test the performance of the proposed approach, we have considered a set of datasets from Balibase 2.0 and compared it with many recent algorithms such as GAPAM, MSA-GA, QEAMSA and RBT-GA. The results show that the proposed approach achieves a better average score than the previously cited methods.

Journal ArticleDOI
TL;DR: A new technique presented here makes use of the fuzzy set logic for the initial gene selection and of the machine learning algorithm AdaBoost to retrieve a set of genes where expression profiles are the most different between the resistant and susceptible classes.
Abstract: The search for fast and reliable methods allowing for extraction of biomarker genes, e.g. responsible for a plant resistance to a certain pathogen, is one of the most important and highly exploited data mining problems in bioinformatics. Here we describe a simple and efficient method suitable for combining results from multiple single-channel microarray experiments for meta-analysis. A new technique presented here makes use of the fuzzy set logic for the initial gene selection and of the machine learning algorithm AdaBoost to retrieve a set of genes where expression profiles are the most different between the resistant and susceptible classes. As a proof of concept, our method has been applied to the analysis of a gene expression dataset composed of many independent microarray experiments on wheat head tissue, to identify genes that are biomarkers of resistance to the fungus Fusarium graminearum. We used microarray data from many experiments performed on wheat lines of various resistance levels. The resulting set of genes was validated by qPCR experiments.