
Showing papers in "Journal of Bioinformatics and Computational Biology in 2014"


Journal ArticleDOI
TL;DR: PhD7Faster, an ensemble predictor based on support vector machine (SVM), is proposed for predicting clones with growth advantage from the Ph.D.-7 phage display peptide library.
Abstract: Phage display can rapidly discover peptides binding to any given target; thus, it has been widely used in basic and applied research. Each round of panning consists of two basic processes: Selection and amplification. However, recent studies have shown that the amplification step decreases the diversity of phage display libraries due to the different propagation capacities of phage clones. This may allow phages with a growth advantage, rather than specific affinity, to appear in the final experimental results. The peptides displayed by such phages are termed propagation-related target-unrelated peptides (PrTUPs). They would mislead further analysis and research if not removed. In this paper, we describe PhD7Faster, an ensemble predictor based on the support vector machine (SVM) for predicting clones with growth advantage from the Ph.D.-7 phage display peptide library. Using reduced dipeptide composition (ReDPC) as features, an accuracy (Acc) of 79.67% and a Matthews correlation coefficient (MCC) of 0.595 were achieved in 5-fold cross-validation. In addition, the SVM-based model was demonstrated to perform better than several representative machine learning algorithms. We anticipate that PhD7Faster can help biologists exclude potential PrTUPs and accelerate the discovery of specific binders from the popular Ph.D.-7 library. The web server of PhD7Faster can be freely accessed at http://immunet.cn/sarotup/cgi-bin/PhD7Faster.pl.

39 citations
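The reduced dipeptide composition (ReDPC) features can be illustrated with a short sketch. The residue grouping below is hypothetical (the actual reduction used by PhD7Faster may differ); the point is that dipeptide counts over a reduced alphabet yield a fixed-length vector suitable for an SVM.

```python
from itertools import product

# Hypothetical reduced amino acid alphabet: each of the 20 residues is
# mapped to one of a few physicochemical groups (PhD7Faster's actual
# grouping may differ).
GROUPS = {"AGV": "a", "ILFP": "b", "YMTS": "c",
          "HNQW": "d", "RK": "e", "DE": "f", "C": "g"}
RESIDUE_TO_GROUP = {aa: g for aas, g in GROUPS.items() for aa in aas}
ALPHABET = sorted(set(RESIDUE_TO_GROUP.values()))

def reduced_dipeptide_composition(peptide):
    """Normalized counts of reduced-alphabet dipeptides in a peptide."""
    reduced = [RESIDUE_TO_GROUP[aa] for aa in peptide]
    pairs = ["".join(p) for p in zip(reduced, reduced[1:])]
    features = {"".join(p): 0.0 for p in product(ALPHABET, repeat=2)}
    for pair in pairs:
        features[pair] += 1.0 / len(pairs)
    return features

vec = reduced_dipeptide_composition("ACDEFGH")  # a Ph.D.-7-style 7-mer
print(len(vec))                       # 49 features for a 7-letter reduced alphabet
print(round(sum(vec.values()), 6))    # 1.0: frequencies sum to one
```

Each peptide thus maps to the same 49-dimensional vector regardless of its length, which is what makes the representation usable as SVM input.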


Journal ArticleDOI
TL;DR: It is shown in this document that the SAW requirement considered when proving NP-completeness is different from the SAW requirement used in various prediction programs, and that both are different from the real biological requirement.
Abstract: Determining the 3D conformation of proteins is necessary to understand their functions and interactions with other molecules. It is commonly admitted that, when proteins fold from their primary linear structures to their final 3D conformations, they tend to choose the conformations that minimize their free energy. To find the 3D conformation of a protein knowing its amino acid sequence, bioinformaticians use various models of different resolutions and artificial intelligence tools, as the protein folding prediction problem is an NP-complete one. More precisely, determining the backbone structure of the protein in the low-resolution models (2D HP square and 3D HP cubic) by finding the conformation that minimizes free energy is intractable. Both proofs of NP-completeness and the 2D prediction consider that acceptable conformations have to satisfy a self-avoiding walk (SAW) requirement, as two different amino acids cannot occupy the same position in the lattice. It is shown in this document that the SAW requirement considered when proving NP-completeness is different from the SAW requirement used in various prediction programs, and that both are different from the real biological requirement. Indeed, the proof of NP-completeness and the predictions in silico consider conformations that are not possible in practice. Consequences of this fact are investigated in this research work.

35 citations
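The self-avoiding walk (SAW) requirement is easy to state computationally: a lattice conformation is acceptable only if no two residues occupy the same node. A minimal sketch on the 2D square lattice:

```python
# 2D HP-model conformation encoded as a sequence of moves on the square
# lattice; the SAW requirement demands that no lattice node is visited twice.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def satisfies_saw(moves):
    """True if the walk encoded by the moves never revisits a position."""
    pos, visited = (0, 0), {(0, 0)}
    for m in moves:
        dx, dy = MOVES[m]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in visited:
            return False
        visited.add(pos)
    return True

print(satisfies_saw("RRUL"))  # True: a legal conformation
print(satisfies_saw("RULD"))  # False: the walk returns to the origin
```

The paper's point is that different formalizations disagree on which walks count as acceptable; the check above is only the simplest of them.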


Journal ArticleDOI
TL;DR: The proposed PNMF is shown to outperform the deterministic NMF and the sparse NMF algorithms in clustering stability and classification accuracy when applied to cluster and classify DNA microarray data.
Abstract: Non-negative matrix factorization (NMF) has proven to be a useful decomposition technique for multivariate data, where the non-negativity constraint is necessary to have a meaningful physical interpretation. NMF reduces the dimensionality of non-negative data by decomposing it into two smaller non-negative factors with physical interpretation for class discovery. The NMF algorithm, however, assumes a deterministic framework. In particular, the effect of the data noise on the stability of the factorization and the convergence of the algorithm are unknown. Collected data, on the other hand, is stochastic in nature due to measurement noise and sometimes inherent variability in the physical process. This paper presents new theoretical and applied developments to the problem of non-negative matrix factorization (NMF). First, we generalize the deterministic NMF algorithm to include a general class of update rules that converges towards an optimal non-negative factorization. Second, we extend the NMF framework to the probabilistic case (PNMF). We show that the maximum a posteriori (MAP) estimate of the non-negative factors is the solution to a weighted regularized non-negative matrix factorization problem. We subsequently derive update rules that converge towards an optimal solution. Third, we apply the PNMF to cluster and classify DNA microarray data. The proposed PNMF is shown to outperform the deterministic NMF and the sparse NMF algorithms in clustering stability and classification accuracy.

25 citations
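The deterministic baseline that the paper generalizes can be sketched with the classic Lee–Seung multiplicative updates for the Frobenius-norm objective (the paper's probabilistic update rules differ from these):

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, rank, n_iter=500, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F^2;
    non-negativity of W and H is preserved at every step."""
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Factor a small matrix that is exactly of non-negative rank 2.
V = rng.random((6, 2)) @ rng.random((2, 5))
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)  # relative reconstruction error, close to zero
```

Because both factors stay element-wise non-negative, the result retains the "parts-based" interpretation that motivates using NMF for class discovery in microarray data.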


Journal ArticleDOI
TL;DR: This review article presents, under a bottom-up perspective, a hierarchy of approaches to modeling gene regulatory network dynamics, from microscopic descriptions at the single-molecule level in the spatial context of an individual cell to macroscopic models providing phenomenological description at the population-average level.
Abstract: A promising alternative for unraveling the principles under which the dynamic interactions among genes lead to cellular phenotypes relies on mathematical and computational models at different levels of abstraction, from the molecular level of protein-DNA interactions to the system level of functional relationships among genes. This review article presents, under a bottom–up perspective, a hierarchy of approaches to modeling gene regulatory network dynamics, from microscopic descriptions at the single-molecule level in the spatial context of an individual cell to macroscopic models providing phenomenological descriptions at the population-average level. The reviewed modeling approaches include Molecular Dynamics, Particle-Based Brownian Dynamics, the Master Equation approach, Ordinary Differential Equations, and the Boolean logic abstraction. Each of these frameworks is motivated by a particular biological context and the nature of the insight being pursued. The setting of gene network dynamic models from such frameworks involves assumptions and mathematical artifacts often ignored by the non-specialist. This article aims at providing an entry point for biologists new to the field and computer scientists not acquainted with some recent biophysically-inspired models of gene regulation. The connections promoting intuition between different abstraction levels and the role that approximations play in the modeling process are highlighted throughout the paper.

19 citations
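As a taste of the ODE level of the reviewed hierarchy, consider the simplest gene-expression model: constant transcription and first-order degradation, dm/dt = k_tx − γ·m, whose steady state is k_tx/γ. A minimal forward-Euler sketch with hypothetical rate values:

```python
# Simplest ODE abstraction of gene expression: constant transcription,
# first-order degradation; analytic steady state m* = k_tx / gamma.
k_tx, gamma = 2.0, 0.5   # hypothetical rates
m, dt = 0.0, 0.01
for _ in range(5000):    # forward-Euler integration up to t = 50
    m += (k_tx - gamma * m) * dt
print(round(m, 3))       # approaches the steady state k_tx / gamma = 4.0
```

Finer-grained frameworks in the hierarchy (Master Equation, Brownian Dynamics) recover this deterministic trajectory as the population average over many stochastic realizations.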


Journal ArticleDOI
TL;DR: The inhibitory effect of H-NS is demonstrated using a Δhns mutant of Escherichia coli, and it is shown that deletion of dps, encoding another protein of the bacterial nucleoid, tended to decrease rather than increase the amount of island-specific transcripts, precluding consideration of promoter islands merely as sites for targeted heterochromatization.
Abstract: Seventy-eight promoter islands with an extraordinarily high density of potential promoters have been recently found in the genome of Escherichia coli. It has been shown that RNA polymerase binds internal promoters of these islands and produces short oligonucleotides, while the synthesis of normal mRNAs is suppressed. This quenching may be biologically relevant, as most islands are associated with foreign genes, whose expression may deplete cellular resources. However, a molecular mechanism of silencing with the participation of these promoter-rich regions remains obscure. It has been demonstrated that all islands interact with the histone-like protein H-NS — a specific sentinel of foreign genes. In this study, we demonstrated the inhibitory effect of H-NS using a Δhns mutant of Escherichia coli and showed that deletion of dps, encoding another protein of the bacterial nucleoid, tended to decrease rather than increase the amount of island-specific transcripts. This observation precluded considering promoter islands merely as sites for targeted heterochromatization. A computer search for the binding sites of 53 transcription factors (TFs) revealed six proteins that may specifically regulate their transcriptional output.

17 citations


Journal ArticleDOI
TL;DR: The efficacy of the approach is shown on eight repeat families annotated in UniProt, comprising both solenoid and nonsolenoid repeats with varied secondary structure architectures and repeat lengths, and the performance is compared with two repeat identification methods.
Abstract: Repetition of a structural motif within a protein is associated with a wide range of structural and functional roles. In most cases the repeating units are well conserved at the structural level, while at the sequence level they are mostly undetectable, suggesting the need for structure-based methods. Since most known methods require a training dataset, a de novo approach is desirable. Here, we propose an efficient graph-based approach for detecting structural repeats in proteins. In a protein structure represented as a graph, interactions between inter- and intra-repeat units are well captured by the eigen spectra of the adjacency matrix of the graph. These conserved interactions give rise to similar connections and a unique profile of the principal eigen spectra for each repeating unit. The efficacy of the approach is shown on eight repeat families annotated in UniProt, comprising both solenoid and nonsolenoid repeats with varied secondary structure architectures and repeat lengths. The performance of the approach is also tested on other known benchmark datasets and compared with that of two repeat identification methods. For a known repeat type, the algorithm also identifies the type of repeat present in the protein. A web tool implementing the algorithm is available at the URL http://bioinf.iiit.ac.in/PRIGSA/.

16 citations
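The idea that repeating units produce similar principal-eigenvector profiles can be sketched on a toy contact map (the real method works on adjacency matrices derived from protein structures, and additionally classifies the repeat type):

```python
import numpy as np

# Toy contact map: three identical 4-residue repeat units with identical
# intra-unit contacts, plus one contact linking consecutive units
# (a stand-in for a real structure-derived adjacency matrix).
unit = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
])
n_units, u = 3, 4
A = np.zeros((n_units * u, n_units * u))
for i in range(n_units):
    A[i * u:(i + 1) * u, i * u:(i + 1) * u] = unit
    if i + 1 < n_units:
        A[(i + 1) * u - 1, (i + 1) * u] = 1  # inter-unit contact
        A[(i + 1) * u, (i + 1) * u - 1] = 1

eigvals, eigvecs = np.linalg.eigh(A)
principal = np.abs(eigvecs[:, -1])        # eigenvector of the largest eigenvalue
profiles = principal.reshape(n_units, u)  # one profile per repeat unit

# Repeating units give similar eigenvector profiles; print their correlation.
print(np.corrcoef(profiles[0], profiles[1])[0, 1])
```

Since the graph is connected, the principal eigenvector is strictly positive (Perron–Frobenius), and the chain's mirror symmetry makes the first and last unit profiles exact mirror images; the repeated, near-identical profile per unit is the signal a repeat detector can look for.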


Journal ArticleDOI
TL;DR: Comparative analysis of the community composition and bacterial diversity present in the Byron glacier in Alaska with other environments showed larger overlap with an Arctic soil than with a high Arctic lake, indicating patterns of community exchange and suggesting that these bacteria may play an important role in soil development during glacial retreat.
Abstract: The temperature in the Arctic region has been increasing in the recent past, accompanied by melting of its glaciers. We took a snapshot of the current microbial inhabitation of an Alaskan glacier (which can be considered one of the simplest possible ecosystems) by using metagenomic sequencing of 16S rRNA recovered from ice/snow samples. Somewhat contrary to our expectations and earlier estimates, a rich and diverse microbial population of more than 2,500 species was revealed, including several species of Archaea that have been identified for the first time in the glaciers of the Northern hemisphere. The most prominent bacterial groups found were Proteobacteria, Bacteroidetes, and Firmicutes. Firmicutes were not reported in large numbers in a previously studied Alpine glacier but were dominant in an Antarctic subglacial lake. Representatives of Cyanobacteria, Actinobacteria and Planctomycetes were among the most numerous, likely reflecting the dependence of the ecosystem on the energy obtained through photosynthesis and close links with the microbial community of the soil. Principal component analysis (PCA) of nucleotide word frequency revealed distinct sequence clusters for different taxonomic groups in the Alaskan glacier community and separate clusters for the glacial communities from other regions of the world. Comparative analysis of the community composition and bacterial diversity present in the Byron glacier in Alaska with other environments showed larger overlap with an Arctic soil than with a high Arctic lake, indicating patterns of community exchange and suggesting that these bacteria may play an important role in soil development during glacial retreat.

16 citations
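The PCA-of-word-frequencies step can be sketched as follows; the sequences and composition biases below are synthetic stand-ins for real metagenomic reads:

```python
import numpy as np
from itertools import product

def word_freqs(seq, k=2):
    """Frequency vector over all 4**k nucleotide words of length k."""
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    counts = {w: 0 for w in words}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = max(len(seq) - k + 1, 1)
    return np.array([counts[w] / total for w in words])

rng = np.random.default_rng(1)

def random_seq(n, p):  # p: probabilities of A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=n, p=p))

# Two synthetic "taxonomic groups" with different composition biases.
seqs = ([random_seq(2000, [0.4, 0.1, 0.1, 0.4]) for _ in range(5)]
        + [random_seq(2000, [0.1, 0.4, 0.4, 0.1]) for _ in range(5)])
X = np.array([word_freqs(s) for s in seqs])

# PCA via SVD of the centered word-frequency matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# The two groups fall into separate clusters along the first component.
separated = pc1[:5].max() < pc1[5:].min() or pc1[5:].max() < pc1[:5].min()
print(separated)
```

Because word-frequency signatures are roughly genome-specific, such clustering works without any alignment, which is why it can separate taxonomic groups in a metagenomic sample.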


Journal ArticleDOI
TL;DR: A new and simple network-based approach using a reverse k-nearest neighbor (RkNN) search identifies novel IBD-related proteins, which were found to be over-represented in the IBD pathway and enriched in functionally important pathways in IBD.
Abstract: Inflammatory bowel disease (IBD) is a chronic disease whose incidence and prevalence increase every year; however, the pathogenesis of IBD is still unclear. Thus, identifying IBD-related proteins is important for understanding its complex disease mechanism. Here, we propose a new and simple network-based approach using a reverse k-nearest neighbor (RkNN) search to identify novel IBD-related proteins. Protein-protein interactions (PPI) and Genome-Wide Association Studies (GWAS) were used in this study. After constructing the PPI network, the RkNN search was applied to all of the proteins to identify sets of influenced proteins among their reverse k-nearest neighbors (RkNNs). An observed protein whose influenced proteins were mostly known IBD-related proteins was statistically identified as a novel IBD-related protein. Our method outperformed a random baseline, the kNN search, and centrality measures based on the network topology. A total of 39 proteins were identified as IBD-related proteins. Of these proteins, 71% were reported at least once in the literature as related to IBD. Additionally, these proteins were found to be over-represented in the IBD pathway and enriched in functionally important pathways in IBD. In conclusion, the RkNN search with the statistical enrichment test is a useful tool for identifying IBD-related proteins and better understanding the complex mechanism of this disease.

13 citations
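The reverse k-NN idea can be sketched in a few lines; here random Euclidean points stand in for proteins (the actual method operates on distances in the PPI network):

```python
import numpy as np

rng = np.random.default_rng(2)
points = rng.random((30, 2))  # stand-ins for proteins in some embedding

def knn(i, k):
    """Indices of the k nearest neighbors of point i (excluding i)."""
    d = np.linalg.norm(points - points[i], axis=1)
    return set(np.argsort(d)[1:k + 1])

def rknn(i, k):
    """Reverse k-NN of i: every point that counts i among its own k nearest.
    Unlike the kNN set, this set may be empty or contain more than k members."""
    return {j for j in range(len(points)) if j != i and i in knn(j, k)}

influenced = rknn(0, k=3)
print(len(influenced))
```

The asymmetry is the point: a protein's RkNN set captures who is influenced by it, not whom it is close to, and its size itself is informative for the enrichment test.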


Journal ArticleDOI
TL;DR: A general heuristic for several problems in the genome rearrangement field is presented, able to improve results on the sorting by transpositions problem, which is a very special case because many efforts have been made to generate algorithms with good results in practice and some of these algorithms provide results that equal the optimum solutions in many cases.
Abstract: In this paper, we present a general heuristic for several problems in the genome rearrangement field. Our heuristic does not solve any problem directly; it is rather used to improve the solutions provided by any non-optimal algorithm that solves them. Therefore, we have implemented several algorithms described in the literature and several algorithms developed by ourselves. As a whole, we implemented 23 algorithms for 9 well-known problems in the genome rearrangement field. A total of 13 algorithms were implemented for problems that use the notions of prefix and suffix operations. In addition, we worked on 5 algorithms for the classic problem of sorting by transpositions, and we conclude the experiments by presenting results for 3 approximation algorithms for the sorting by reversals and transpositions problem and 2 approximation algorithms for the sorting by reversals problem. Another algorithm with a better approximation ratio exists for the last genome rearrangement problem, but it is purely theoretical with no practical implementation. The algorithms we implemented, combined with our heuristic, lead to the best practical results in each case. In particular, we were able to improve results on the sorting by transpositions problem, which is a very special case because many efforts have been made to generate algorithms with good results in practice, and some of these algorithms provide results that equal the optimum solutions in many cases. Our source codes and benchmarks are freely available upon request from the authors so that it will be easier to compare new approaches against our results.

13 citations
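The paper's heuristic itself is not specified here, but the flavor of practical genome-rearrangement algorithms can be illustrated with a standard greedy breakpoint-removal heuristic for (unsigned) sorting by reversals — not the authors' method, and not guaranteed optimal:

```python
def breakpoints(perm):
    """Number of adjacent pairs that are not consecutive integers,
    a standard lower-bound ingredient for sorting by reversals."""
    ext = [0] + perm + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)

def greedy_reversal_sort(perm):
    """Greedy heuristic: repeatedly apply the reversal that removes the
    most breakpoints; stops if no reversal improves (may not fully sort)."""
    perm, ops = perm[:], 0
    while breakpoints(perm):
        best = None
        for i in range(len(perm)):
            for j in range(i + 1, len(perm) + 1):
                cand = perm[:i] + perm[i:j][::-1] + perm[j:]
                if best is None or breakpoints(cand) < breakpoints(best):
                    best = cand
        if breakpoints(best) >= breakpoints(perm):
            break  # no improving reversal; bail out of the greedy loop
        perm, ops = best, ops + 1
    return perm, ops

sorted_perm, ops = greedy_reversal_sort([3, 1, 2, 4, 6, 5])
print(sorted_perm, ops)
```

A single reversal can remove at most two breakpoints, so the breakpoint count divided by two lower-bounds the reversal distance; heuristics like the one in the paper are judged by how close they get to such bounds in practice.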


Journal ArticleDOI
TL;DR: A hypothetical model of pair-wise protein interactions within the viral envelope was proposed and it was hypothesized that the amino acid residues located at the interface of two different proteins are under physical constraints and thus probably co-evolve.
Abstract: Interactions between the integral membrane proteins hemagglutinin (HA), neuraminidase (NA), M2 and the membrane-associated matrix protein M1 of influenza A virus are thought to be crucial for assembly of functionally competent virions. We hypothesized that the amino acid residues located at the interface of two different proteins are under physical constraints and thus probably co-evolve. To predict co-evolving residue pairs, the EvFold ( http://evfold.org ) program, which searches the (nontransitive) Direct Information scores, was applied to large samplings of amino acid sequences from the Influenza Research Database ( http://www.fludb.org/ ). Having focused on the HA, NA, and M2 cytoplasmic tails as well as the C-terminal domain of M1 (the least conserved among the protein domains), we captured six pairs of correlated positions. Among them, there were one, two, and three position pairs for the HA-M2, HA-M1, and M2-M1 protein pairs, respectively. As expected, no co-varying positions were found for the NA-HA, NA-M1, and NA-M2 pairs, obviously due to high conservation of the NA cytoplasmic tail. The sum of frequencies calculated for the two major amino acid patterns observed in pairs of correlated positions was up to 0.99, indicating their high evolutionary sustainability. Based on the predictions, a hypothetical model of pairwise protein interactions within the viral envelope was proposed.

11 citations
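Direct Information methods such as EvFold start from covariation statistics between alignment columns. A minimal sketch using plain mutual information (DI additionally removes transitive couplings) on hypothetical sequences:

```python
import math
from collections import Counter

# Hypothetical alignment of concatenated tail residues from two proteins;
# columns 0 and 3 co-vary perfectly, column 1 is invariant.
seqs = ["ARND", "ARND", "KRNE", "KRNE", "ARND", "KRNE"]

def mutual_information(ci, cj, seqs):
    """MI between two alignment columns. Direct-Information methods start
    from such covariation counts, then remove transitive couplings."""
    n = len(seqs)
    pi = Counter(s[ci] for s in seqs)
    pj = Counter(s[cj] for s in seqs)
    pij = Counter((s[ci], s[cj]) for s in seqs)
    return sum((c / n) * math.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

# The co-varying pair scores high; the invariant column scores zero.
print(mutual_information(0, 3, seqs), mutual_information(0, 1, seqs))
```

This also illustrates why the NA tail yielded no predictions: a perfectly conserved column carries zero covariation signal, whatever its partner.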


Journal ArticleDOI
TL;DR: Computer verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reasonable time, enabling the characterization of the power-law regime of sequenced genomes via maximal repeats identification and classification, an important task for the derivation of models that would help elucidate sequence duplication and genome evolution.
Abstract: We propose and implement a method to obtain all duplicated sequences (repeats) from a chromosome or whole genome. Unlike existing approaches, our method makes it possible to simultaneously identify and classify repeats into super, local, and non-nested local maximal repeats. Computational verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reasonable time, enabling us to identify these maximal repeats for any sequenced genome. The algorithm used for the identification relies on the enhanced suffix array data structure to achieve practical space and time efficiency, to identify and classify the maximal repeats, and to perform further post-processing on the identified duplicated sequences. The simplicity and effectiveness of the implementation make the method readily extendible to more sophisticated computations. Maxmers can be exhaustively accounted for in a few minutes for genome sequences tens of megabases in length and in less than a day or two for genome sequences a few gigabases in length. One application of duplicated sequence identification is the study of duplicated sequence length distributions, which we found to exhibit a persistent power-law behavior at large lengths. Variations of the estimated exponents of this power law are studied among different species and among successive assembly release versions of the same species. This makes the characterization of the power-law regime of sequenced genomes via maximal repeat identification and classification an important task for the derivation of models that would help us elucidate sequence duplication and genome evolution.
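The core suffix-array idea can be sketched naively (the paper's enhanced suffix array achieves linear time and also classifies repeats into super, local, and non-nested local maximal repeats, which this sketch does not):

```python
def suffix_array(s):
    """Naive suffix array: starting indices of suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_len(s, a, b):
    """Length of the longest common prefix of suffixes at a and b."""
    n = 0
    while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
        n += 1
    return n

def maximal_repeats(s, min_len=2):
    """Right-maximal repeated substrings, read off neighboring suffixes:
    every LCP of adjacent sorted suffixes is such a repeat."""
    sa = suffix_array(s)
    reps = set()
    for i in range(1, len(sa)):
        l = lcp_len(s, sa[i - 1], sa[i])
        if l >= min_len:
            reps.add(s[sa[i]:sa[i] + l])
    return reps

print(sorted(maximal_repeats("AGAGCGAGC")))  # ['AG', 'AGC', 'GAGC', 'GC']
```

The enhanced suffix array replaces the quadratic comparisons above with precomputed LCP tables, which is what makes gigabase-scale genomes tractable.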

Journal ArticleDOI
TL;DR: The results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) does not apply to high coverage aligned data.

Abstract: With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply to high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.
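The benefit of splitting aligned data into homogeneous streams before compression can be sketched with a generic compressor; the records below are toy stand-ins, not the paper's format:

```python
import zlib

# Toy "aligned reads": (position, CIGAR-like string, read sequence).
records = [(100 + i, "50M", "ACGTACGTAC" * 5) for i in range(200)]

# Whole-record compression vs. per-field streams: separating fields
# groups similar data together, and each stream can also be decoded
# independently by a downstream application.
whole = "\n".join(f"{p}\t{c}\t{s}" for p, c, s in records).encode()
pos_stream = "\n".join(str(p) for p, _, _ in records).encode()
cig_stream = "\n".join(c for _, c, _ in records).encode()
seq_stream = "\n".join(s for _, _, s in records).encode()

split_size = sum(len(zlib.compress(st, 9))
                 for st in (pos_stream, cig_stream, seq_stream))
whole_size = len(zlib.compress(whole, 9))
print(split_size, whole_size)

# Splitting is lossless: the original records can be reconstructed.
fields = [zlib.decompress(zlib.compress(st, 9)).decode().split("\n")
          for st in (pos_stream, cig_stream, seq_stream)]
rebuilt = [(int(p), c, s) for p, c, s in zip(*fields)]
```

The paper's compressors go further by fitting statistical models to each stream, but the stream separation itself is what enables queries in the compressed domain.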

Journal ArticleDOI
TL;DR: This paper shows that supervised methods for predicting chromatin boundary elements are much more effective than the currently popular unsupervised methods, and can make accurate predictions of insulator positions.
Abstract: In eukaryotic cells, the DNA material is densely packed inside the nucleus in the form of a DNA-protein complex structure called chromatin. Since the actual conformation of the chromatin fiber defines the possible regulatory interactions between genes and their regulatory elements, it is very important to understand the mechanisms governing folding of chromatin. In this paper, we show that supervised methods for predicting chromatin boundary elements are much more effective than the currently popular unsupervised methods. Using boundary locations from published Hi-C experiments and modEncode tracks as features, we can tell the insulator elements from randomly selected background sequences with great accuracy. In addition to accurate predictions of the training boundary elements, our classifiers make new predictions. Many of them correspond to the locations of known insulator elements. The key features used for predicting boundary elements do not depend on the prediction method. Because of its minuscule size, chromatin state cannot be measured directly; we need to rely on indirect measurements, such as ChIP-Seq, and fill in the gaps with computational models. Our results show that currently, at least in the model organisms, where we have many measurements including ChIP-Seq and Hi-C, we can make accurate predictions of insulator positions.

Journal ArticleDOI
TL;DR: Both simulated data and real experimental data suggest that STS04 provides the highest true positive rate (TPR) or F1 score, while BY01 has the highest positive predictive value (PPV) in network construction; no significant effect of the network structure on the FDR methods is found.
Abstract: The Gaussian graphical model (GGM)-based method, a key approach to reverse engineering biological networks, uses partial correlation to measure conditional dependence between two variables by controlling the contribution from other variables. After estimating partial correlation coefficients, one of the most critical processes in network construction is to control the false discovery rate (FDR) to assess the significant associations among variables. Various FDR methods have been proposed mainly for biomarker discovery, but it still remains unclear which FDR method performs better for network construction. Furthermore, no study has examined the effect of the network structure on network construction. We selected six FDR methods — the linear step-up procedure (BH95), the adaptive linear step-up procedure (BH00), Efron's local FDR (LFDR), Benjamini–Yekutieli's step-up procedure (BY01), Storey's q-value procedure (Storey01), and Storey–Taylor–Siegmund's adaptive step-up procedure (STS04) — to evaluate their performances on network construction. We further considered two network structures, random and scale-free networks, to investigate their influence on network construction. Both simulated data and real experimental data suggest that STS04 provides the highest true positive rate (TPR) or F1 score, while BY01 has the highest positive predictive value (PPV) in network construction. In addition, no significant effect of the network structure is found on the FDR methods.
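Among the procedures compared, the BH95 linear step-up procedure is the simplest to sketch; the p-values below are the classic worked example from Benjamini and Hochberg (1995):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """BH95 linear step-up procedure: reject the hypotheses with the k
    smallest p-values, for the largest k with p_(k) <= (k / m) * alpha;
    returns a boolean rejection mask in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

# The 15 p-values of the worked example in Benjamini and Hochberg (1995).
pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344,
         0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
print(sum(benjamini_hochberg(pvals)))  # 4 hypotheses rejected at FDR 0.05
```

The other five procedures differ mainly in how they estimate the proportion of true nulls (BH00, STS04, Storey01) or in the correction term for dependence (BY01), which is what drives the TPR/PPV trade-offs reported in the paper.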

Journal ArticleDOI
TL;DR: Two algorithms, midpoint approximation and interval approximation, are presented for constructing efficient model abstractions under data uncertainty; computational feasibility is evaluated by posing computation tree logic (CTL) queries on a prototype of the extracellular-signal-regulated kinase (ERK) pathway.
Abstract: We describe a novel formalism representing a system of chemical reactions, with imprecise rates of reactions and concentrations of chemicals, and describe a model reduction method, pruning, based on the chemical properties. We present two algorithms, midpoint approximation and interval approximation, for construction of efficient model abstractions with uncertainty in data. We evaluate computational feasibility by posing queries in computation tree logic (CTL) on a prototype of extracellular-signal-regulated kinase (ERK) pathway.
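The two abstractions can be contrasted on a single imprecise rate law (a hypothetical one-step reaction, not the ERK model):

```python
def interval_mul(a, b):
    """Exact product of two intervals: extremes over endpoint products."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))

# Imprecise rate law d[P]/dt = k * [S], with hypothetical bounds.
k = (0.8, 1.2)   # imprecise rate constant
S = (4.0, 5.0)   # imprecise substrate concentration

rate_interval = interval_mul(k, S)        # interval approximation
rate_mid = (sum(k) / 2) * (sum(S) / 2)    # midpoint approximation

# The midpoint estimate always lies inside the interval bound.
print(rate_interval, rate_mid)
```

The midpoint abstraction keeps the model cheap to check, while the interval abstraction keeps it sound: a CTL property verified over the interval semantics holds for every concrete rate in the bounds.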

Journal ArticleDOI
TL;DR: A new parallel implementation of nucleotide BLAST (MPI-blastn) and a new tool for taxonomic attachment of Basic Local Alignment Search Tool (BLAST) results that supports the NCBI taxonomy (NCBI-TaxCollector) are described.
Abstract: Metagenomic sequencing technologies are advancing rapidly and the size of output data from high-throughput genetic sequencing has increased substantially over the years. This brings us to a scenario where advanced computational optimizations are required to perform a metagenomic analysis. In this paper, we describe a new parallel implementation of nucleotide BLAST (MPI-blastn) and a new tool for taxonomic attachment of Basic Local Alignment Search Tool (BLAST) results that supports the NCBI taxonomy (NCBI-TaxCollector). MPI-blastn obtained high performance when compared to mpiBLAST and ScalaBLAST. In our best case, MPI-blastn was able to run 408 times faster on 384 cores. Our evaluations demonstrated that NCBI-TaxCollector is able to perform taxonomic attachments 125 times faster and needs 120 times less RAM than the previous TaxCollector. Through our optimizations, a multiple sequence search that currently takes 37 hours can be performed in less than 6 min, and a post processing with NCBI taxonomic data attachment, which takes 48 hours, can now run in 23 min.

Journal ArticleDOI
TL;DR: A new efficient heuristic algorithm is presented for inferring hybridization networks from evolutionary distance matrices between species, using the well-known Neighbor-Joining concept and the least-squares criterion for building networks.
Abstract: Several algorithms and software have been developed for inferring phylogenetic trees. However, there exist some biological phenomena such as hybridization, recombination, or horizontal gene transfer which cannot be represented by a tree topology. We need to use phylogenetic networks to adequately represent these important evolutionary mechanisms. In this article, we present a new efficient heuristic algorithm for inferring hybridization networks from evolutionary distance matrices between species. The famous Neighbor-Joining concept and the least-squares criterion are used for building networks. At each step of the algorithm, before joining two given nodes, we check if a hybridization event could be related to one of them or to both of them. The proposed algorithm finds the exact tree solution when the considered distance matrix is a tree metric (i.e. it is representable by a unique phylogenetic tree). It also provides very good hybrids recovery rates for large trees (with 32 and 64 leaves in our simulations) for both distance and sequence types of data. The results yielded by the new algorithm for real and simulated datasets are illustrated and discussed in detail.
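The Neighbor-Joining selection criterion at the heart of such algorithms can be sketched as follows, using a standard additive (tree-metric) distance matrix in which the first two taxa are siblings:

```python
import numpy as np

def nj_pair(D):
    """Pick the pair to join under the Neighbor-Joining criterion
    Q(i, j) = (n - 2) * D[i, j] - sum_k D[i, k] - sum_k D[j, k]."""
    n = D.shape[0]
    row = D.sum(axis=1)
    Q = (n - 2) * D - row[:, None] - row[None, :]
    np.fill_diagonal(Q, np.inf)       # never join a taxon with itself
    return np.unravel_index(np.argmin(Q), Q.shape)

# Classic additive example: taxa 0 and 1 are siblings in the true tree.
D = np.array([
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
], dtype=float)
print(nj_pair(D))  # (0, 1): the sibling pair is selected first
```

The paper's algorithm inserts an extra test at this step — before joining the selected nodes, it checks whether a hybridization event better explains the distances, using the least-squares fit as the criterion.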

Journal ArticleDOI
TL;DR: ODEion is a software module for structural identification of ordinary differential equations that implements computationally efficient algorithms that have been shown to efficiently handle sparse and noisy data and can run a range of realistic problems that previously required a supercomputer.
Abstract: In the systems biology field, algorithms for structural identification of ordinary differential equations (ODEs) have mainly focused on fixed model spaces like S-systems and/or on methods that require sufficiently good data so that derivatives can be accurately estimated. There is therefore a lack of methods and software that can handle more general models and realistic data. We present ODEion, a software module for structural identification of ODEs. Main characteristic features of the software are:
• The model space is defined by arbitrary user-defined functions that can be nonlinear in both variables and parameters, such as for example chemical rate reactions.
• ODEion implements computationally efficient algorithms that have been shown to efficiently handle sparse and noisy data. It can run a range of realistic problems that previously required a supercomputer.
• ODEion is easy to use and provides SBML output.
We describe the mathematical problem, the ODEion system itself, and provide several examples of how the system can be used. Available at: http://www.odeidentification.org.

Journal ArticleDOI
TL;DR: A novel ontology, named PIERO, is developed for annotating biochemical transformations, allowing the extraction of common partial reaction characteristics from given sets of orthologous genes and the elucidation of possible enzymes from the given transformations.
Abstract: Genomics is faced with the issue of many partially annotated putative enzyme-encoding genes for which activities have not yet been verified, while metabolomics is faced with the issue of many putative enzyme reactions for which full equations have not been verified. Knowledge of enzymes has been collected by IUBMB, and has been made public as the Enzyme List. To date, however, the terminology of the Enzyme List has not been assessed comprehensively by bioinformatics studies. Instead, most of the bioinformatics studies simply use the identifiers of the enzymes, i.e. the Enzyme Commission (EC) numbers. We investigated the actual usage of terminology throughout the Enzyme List, and demonstrated that the partial characteristics of reactions cannot be retrieved by simply using EC numbers. Thus, we developed a novel ontology, named PIERO, for annotating biochemical transformations as follows. First, the terminology describing enzymatic reactions was retrieved from the Enzyme List, and was grouped into those related to overall reactions and biochemical transformations. Consequently, these terms were mapped onto the actual transformations taken from enzymatic reaction equations. This ontology was linked to Gene Ontology (GO) and EC numbers, allowing the extraction of common partial reaction characteristics from given sets of orthologous genes and the elucidation of possible enzymes from the given transformations. Further future development of the PIERO ontology should enhance the Enzyme List to promote the integration of genomics and metabolomics.

Journal ArticleDOI
TL;DR: Results suggest that different PSSM-based methods differ in their capability to identify different patterns of functional sites, and that better combining PSSMs with the specific conservation patterns of residues would largely facilitate prediction.
Abstract: Evolutionary conservation information included in the position-specific scoring matrix (PSSM) has been widely adopted by sequence-based methods for identifying protein functional sites, because all functional sites, whether in ordered or disordered proteins, are found to be conserved to some extent. However, different functional sites have different conservation patterns: some are linearly contextual, some are mingled with highly variable residues, and others seem to be conserved independently. Each value in a PSSM is calculated independently of the others, without carrying the contextual information of residues in the sequence. Therefore, adopting the direct output of the PSSM for prediction fails to consider the relationship between the conservation patterns of residues and the distribution of conservation scores in the PSSM. In order to demonstrate the importance of combining PSSMs with the specific conservation patterns of functional sites for prediction, three different PSSM-based methods for identifying three kinds of functional sites have been analyzed. Results suggest that different PSSM-based methods differ in their capability to identify different patterns of functional sites, and that better combining PSSMs with the specific conservation patterns of residues would largely facilitate prediction.
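One common way to let a predictor see conservation context rather than isolated PSSM values is to use a sliding window of PSSM rows as the feature vector for each residue; a minimal sketch (the window size and toy matrix are illustrative, not taken from the methods analyzed in the paper):

```python
def pssm_window_features(pssm, i, w=2):
    """Flatten the PSSM rows in a +/-w window around residue i,
    zero-padding past the sequence ends, so the feature vector for
    position i carries the conservation pattern of its neighbors."""
    n_cols = len(pssm[0])
    rows = []
    for j in range(i - w, i + w + 1):
        rows.append(pssm[j] if 0 <= j < len(pssm) else [0] * n_cols)
    return [v for row in rows for v in row]

# Toy 5-residue PSSM with 3 columns (a real PSSM has 20 columns).
pssm = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 1, 1], [2, 2, 2]]
features = pssm_window_features(pssm, 0)   # first residue: left side padded
```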

Journal ArticleDOI
TL;DR: Analysis of the repertoire of kinases in the zebrafish proteome provides insights into various cellular components and reveals high conservation of functionally important residues, with a few organism-specific variations.
Abstract: In recent times, zebrafish has garnered a lot of popularity as a model organism to study human cancers. Despite high evolutionary divergence from humans, zebrafish develops almost all types of human tumors when induced. However, mechanistic details of tumor formation have remained largely unknown. The present study is aimed at analysis of the repertoire of kinases in the zebrafish proteome to provide insights into various cellular components. Annotation using highly sensitive remote homology detection methods revealed a "substantial expansion" of the Ser/Thr/Tyr kinase family in zebrafish compared to humans, constituting over 3% of the proteome. Subsequent classification of kinases into subfamilies revealed the presence of a large number of CAMK group kinases, with massive representation of PIM kinases, important for cell cycle regulation and growth. Extensive sequence comparison between human and zebrafish PIM kinases revealed high conservation of functionally important residues with a few organism-specific variations. There are about 300 PIM kinases in the zebrafish kinome, while the human genome codes for only about 500 kinases altogether. PIM kinases have been implicated in various human cancers and are currently being targeted to explore their therapeutic potential. Hence, in-depth analysis of PIM kinases in zebrafish has opened up new avenues of research to verify the model organism status of zebrafish.

Journal ArticleDOI
TL;DR: This work presents a novel computational algorithm to efficiently predict signaling pathways from PPI networks given a starting protein and an ending protein and shows that the approach has higher accuracy and efficiency than previous methods.
Abstract: Reconstruction of signaling pathways is crucial for understanding cellular mechanisms. A pathway is represented as a path of a signaling cascade involving a series of proteins to perform a particular function. Since a protein pair involved in signaling and response have a strong interaction, putative pathways can be detected from protein-protein interaction (PPI) networks. However, predicting directed pathways from the undirected genome-wide PPI networks has been challenging. We present a novel computational algorithm to efficiently predict signaling pathways from PPI networks given a starting protein and an ending protein. Our approach integrates topological analysis of PPI networks and semantic analysis of PPIs using Gene Ontology data. An advanced semantic similarity measure is used for weighting each interacting protein pair. Our distance-wise algorithm iteratively selects an adjacent protein from a PPI network to build a pathway based on a distance condition. On each iteration, the strength of a hypothetical path passing through a candidate edge is estimated by a local heuristic. We evaluate the performance by comparing the resultant paths to known signaling pathways on yeast. The results show that our approach has higher accuracy and efficiency than previous methods.
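The shortest-path intuition behind this kind of pathway search can be sketched with a plain Dijkstra traversal, taking each edge cost as one minus the semantic similarity of the interacting pair so that strongly related proteins are cheap to traverse. The paper's actual algorithm adds a distance condition and a local heuristic on hypothetical paths, which are not reproduced here; the protein names and costs below are purely illustrative:

```python
import heapq

def best_path(graph, start, end):
    """Dijkstra-style search over a weighted PPI network.
    graph: {protein: {neighbor: cost}}, cost = 1 - semantic similarity."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == end:
            break
        if d > dist.get(u, float("inf")):
            continue                       # stale queue entry
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [end], end                # walk predecessors back to start
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Hypothetical yeast-like fragment with similarity-derived edge costs.
ppi = {
    "Ste2": {"Ste4": 0.1, "Far1": 0.8},
    "Ste4": {"Ste5": 0.2},
    "Far1": {"Ste5": 0.1},
    "Ste5": {"Fus3": 0.1},
}
```

The semantically coherent route through Ste4 (total cost 0.4) wins over the weakly related detour through Far1 (total cost 1.0).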

Journal ArticleDOI
TL;DR: This analysis shows that there is a clear distinction between conserved epitopes and nonconserved epitopes in terms of AAACS, and this method provides an excellent classification performance on an independent dataset.
Abstract: A conserved epitope is an epitope retained by multiple strains of influenza as the key target of a broadly neutralizing antibody. Identification of conserved epitopes is of strong interest to help design broad-spectrum vaccines against influenza. A conservation score measures the evolutionary conservation of an amino acid position in a protein based on the phylogenetic relationships observed amongst homologous sequences. Here, the Average Amino Acid Conservation Score (AAACS) is proposed as a method to identify HA's conserved epitopes. Our analysis shows that there is a clear distinction between conserved epitopes and nonconserved epitopes in terms of AAACS. This method also provides excellent classification performance on an independent dataset. In contrast, alignment-based comparison methods do not work well for this problem, because conserved epitopes to the same broadly neutralizing antibody are usually not identical or similar. Location-based methods are not successful either, because conserved epitopes are located at both the less-conserved globular head (HA1) and the more-conserved stem (HA2). As a case study, two conserved epitopes on HA are predicted for the influenza A virus H7N9: One should match the broadly neutralizing antibodies CR9114 or FI6v3, while the other is new and requires validation by wet-lab experiments.
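Averaging per-position conservation scores over an epitope's residues is straightforward; a sketch, assuming per-residue scores have already been computed by a phylogeny-aware conservation tool (the positions and values below are made up):

```python
def aaacs(conservation, epitope_positions):
    """Average Amino Acid Conservation Score: the mean per-residue
    conservation over the positions forming an epitope."""
    scores = [conservation[p] for p in epitope_positions]
    return sum(scores) / len(scores)

# Made-up per-position scores for a short HA fragment.
cons = {10: 0.9, 11: 0.8, 12: 0.95, 30: 0.2, 31: 0.3, 32: 0.1}

conserved = aaacs(cons, [10, 11, 12])   # high average: conserved candidate
variable = aaacs(cons, [30, 31, 32])    # low average: non-conserved region
```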

Journal ArticleDOI
TL;DR: A novel method to identify PPIs through semantic similarity measures among protein mentions based on the page counts retrieved from the MEDLINE database is proposed and the results suggest that the approach could extract novel protein-protein interactions.
Abstract: Protein–protein interactions (PPIs) are involved in the majority of biological processes. Identification of PPIs is therefore one of the key aims of biological research. Although there are many databases of PPIs, many other unidentified PPIs could be buried in the biomedical literature. Therefore, automated identification of PPIs from biomedical literature repositories could be used to discover otherwise hidden interactions. Search engines, such as Google, have been successfully applied to measure the relatedness among words. Inspired by such approaches, we propose a novel method to identify PPIs through semantic similarity measures among protein mentions. We define six semantic similarity measures as features based on the page counts retrieved from the MEDLINE database. A machine learning classifier, Random Forest, is trained using the above features. The proposed approach achieves an averaged micro-F of 71.28% and an averaged macro-F of 64.03% over five PPI corpora, an improvement over the results of using only the conventional co-occurrence feature (averaged micro-F of 68.79% and averaged macro-F of 60.49%). A relation-word reinforcement further improves the averaged micro-F to 71.3% and the averaged macro-F to 65.12%. Comparing the results of the current work with other studies on the AIMed corpus (ranging from 77.58% to 85.1% in micro-F, 62.18% to 76.27% in macro-F), we show that the proposed approach achieves a micro-F of 81.88% and a macro-F of 64.01% without the use of sophisticated feature extraction. Finally, we manually examine the newly discovered PPI pairs based on a literature review, and the results suggest that our approach could extract novel protein–protein interactions.
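The paper's six page-count measures are not spelled out in the abstract; a well-known representative of this family is the Normalized Google Distance, computed here from hypothetical MEDLINE-style hit counts:

```python
from math import log

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts: fx and fy are the
    numbers of documents mentioning each protein, fxy the number
    mentioning both, and n the corpus size (e.g. all of MEDLINE)."""
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Hypothetical counts for two protein mentions.
d = ngd(fx=4000, fy=3000, fxy=1200, n=20_000_000)    # often co-mentioned
d_far = ngd(fx=4000, fy=3000, fxy=10, n=20_000_000)  # rarely co-mentioned
```

Frequently co-mentioned pairs score a smaller distance, which is the signal such features feed to the classifier.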

Journal ArticleDOI
TL;DR: It is found that the cis-regulatory region of the hunchback gene tends to readily evolve modularity, and the CRM-domain correspondence seen in Drosophila evolves with a high probability in the model, supporting the biological relevance of the approach.
Abstract: Biological development depends on the coordinated expression of genes in time and space. Developmental genes have extensive cis-regulatory regions which control their expression. These regions are organized in a modular manner, with different modules controlling expression at different times and locations. Both how modularity evolved and what function it serves are open questions. We present a computational model for the cis-regulation of the hunchback (hb) gene in the fruit fly (Drosophila). We simulate evolution (using an evolutionary computation approach from computer science) to find the optimal cis-regulatory arrangements for fitting experimental hb expression patterns. We find that the cis-regulatory region tends to readily evolve modularity. These cis-regulatory modules (CRMs) do not tend to control single spatial domains, but show a multi-CRM/multi-domain correspondence. We find that the CRM-domain correspondence seen in Drosophila evolves with a high probability in our model, supporting the biological relevance of the approach. The partial redundancy resulting from multi-CRM control may confer some biological robustness against corruption of regulatory sequences. The technique developed on hb could readily be applied to other multi-CRM developmental genes.

Journal ArticleDOI
TL;DR: An analysis of bacterial cell-cycle models implementing different strategies to coordinately regulate genome replication and cell growth dynamics shows that the problem of coupling these processes does not depend directly on the dynamics of cell volume expansion, but does depend on the type of cell growth law.
Abstract: In this paper, we perform an analysis of bacterial cell-cycle models implementing different strategies to coordinately regulate genome replication and cell growth dynamics. It has been shown that the problem of coupling these processes does not depend directly on the dynamics of cell volume expansion, but does depend on the type of cell growth law. Our analysis has distinguished two types of cell growth laws, "exponential" and "linear", each of which may include both exponential and linear patterns of cell growth. If a cell grows following a law of the "exponential" type, including the exponential V(t) = V(0) exp (kt) and linear V(t) = V(0)(1 + kt) dynamic patterns, then the cell encounters the problem of coupling growth rates and replication. It has been demonstrated that to solve the problem, it is sufficient for a cell to have a repressor mechanism to regulate DNA replication initiation. For a cell expanding its volume by a law of the "linear" type, including exponential V(t) = V(0) + V(1) exp (kt) and linear V(t) = V(0) + kt dynamic patterns, the problem of coupling growth rates and replication does not exist. In other words, in the context of the coupling problem, a repressor mechanism to regulate DNA replication, and cell growth laws of the "linear" type displays the attributes of universality. The repressor-type mechanism allows a cell to follow any growth dynamic pattern, while the "linear" type growth law allows a cell to use any mechanism to regulate DNA replication.
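The four dynamic patterns named in the abstract are simple enough to write down directly; a sketch with illustrative parameter values (V0, V1, and k are placeholders):

```python
from math import exp

# "Exponential"-type growth laws (these create the coupling problem):
def v_exp_type_exponential(t, v0=1.0, k=0.1):
    return v0 * exp(k * t)              # V(t) = V0 * exp(kt)

def v_exp_type_linear(t, v0=1.0, k=0.1):
    return v0 * (1 + k * t)             # V(t) = V0 * (1 + kt)

# "Linear"-type growth laws (no coupling problem):
def v_lin_type_exponential(t, v0=1.0, v1=0.5, k=0.1):
    return v0 + v1 * exp(k * t)         # V(t) = V0 + V1 * exp(kt)

def v_lin_type_linear(t, v0=1.0, k=0.1):
    return v0 + k * t                   # V(t) = V0 + kt
```

One way to read the distinction is that the initial volume V(0) enters multiplicatively in the "exponential" type but additively in the "linear" type, regardless of whether the resulting curve looks exponential or linear.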

Journal ArticleDOI
TL;DR: The proposed weight coefficient method learns a weight coefficient for each node in the network from a quantitative measure such as gene expression data, and compares the predictive performance of each network marker group across gene expression datasets.
Abstract: A network is a powerful structure which reveals valuable characteristics of the underlying data. However, previous work on evaluating the predictive performance of network-based biomarkers does not take nodal connectedness into account. We argue that it is necessary to maximize the benefit from the network structure by employing appropriate techniques. To address this, we aim to learn a weight coefficient for each node in the network from a quantitative measure such as gene expression data. The weight coefficients are computed from an optimization problem which minimizes the total weighted difference between nodes in a network structure; this can be expressed in terms of the graph Laplacian. After obtaining the coefficient vector for the network markers, we can then compute the corresponding network predictor. We demonstrate the effectiveness of the proposed method by conducting experiments using published breast cancer biomarkers with three patient cohorts. Network markers are first grouped based on GO terms related to cancer hallmarks. We compare the predictive performance of each network marker group across gene expression datasets. We also evaluate the network predictor against the average method for feature aggregation. The reported results show that the predictive performance of network markers is generally not consistent across patient cohorts.
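The quadratic form behind such an optimization can be shown on a toy network: with Laplacian L = D - A, the objective w^T L w equals the sum of squared differences of node weights across edges (the adjacency matrix and test vectors below are illustrative, not the paper's data):

```python
import numpy as np

# Toy undirected network over four genes (hypothetical adjacency).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

def total_weighted_difference(w, L):
    """w^T L w = sum over edges (i, j) of (w_i - w_j)^2: the smoothness
    term that the node weight coefficients are chosen to minimize."""
    return float(w @ L @ w)
```

A constant weight vector lies in the Laplacian's null space (difference zero), while a vector that singles out one node pays a penalty equal to that node's degree.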

Journal ArticleDOI
TL;DR: A novel method, Repeated Simulated Annealing of Partitions of Proteins (ReSAPP), predicts protein complexes from weighted PPIs by repeatedly applying a simulated annealing based optimization algorithm to the PPIs.
Abstract: Many proteins are known to perform their own functions when they form particular groups of proteins, called protein complexes. With the advent of large-scale protein–protein interaction (PPI) studies, it has been a challenging problem in systems biology to predict protein complexes from PPIs. In this paper, we propose a novel method, called Repeated Simulated Annealing of Partitions of Proteins (ReSAPP), which predicts protein complexes from weighted PPIs. ReSAPP, in the first stage, generates multiple (possibly different) partitions of all proteins of given PPIs by repeatedly applying a simulated annealing based optimization algorithm to the PPIs. In the second stage, all different clusters of size two or more in those multiple partitions are merged into a collection of those clusters, which are outputted as predicted protein complexes. In performance comparison of ReSAPP with our previous algorithm, PPSampler2, as well as other various tools, MCL, MCODE, DPClus, CMC, COACH, RRW, NWE, and PPSampler1, ReS...
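The following is a generic simulated-annealing sketch of the kind of partition optimization the first stage repeats; the objective, move set, and cooling schedule are stand-ins, not ReSAPP's actual choices:

```python
import math
import random

def intra_cluster_weight(label, edges, weight):
    """Objective: total weight of edges whose endpoints share a cluster."""
    return sum(weight[e] for e in edges if label[e[0]] == label[e[1]])

def anneal_partition(proteins, edges, weight, steps=2000, t0=1.0, seed=0):
    """One annealing run: start from singleton clusters, repeatedly move
    a random protein into an existing cluster, and accept worsening moves
    with the Metropolis probability under a linearly cooling temperature."""
    rng = random.Random(seed)
    label = {p: i for i, p in enumerate(proteins)}   # singletons
    score = intra_cluster_weight(label, edges, weight)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9           # linear cooling
        p = rng.choice(proteins)
        old = label[p]
        label[p] = rng.choice(sorted(set(label.values())))
        new = intra_cluster_weight(label, edges, weight)
        if new >= score or rng.random() < math.exp((new - score) / t):
            score = new                               # accept the move
        else:
            label[p] = old                            # revert it
    return label

# Tiny demonstration network with unit edge weights.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
w = {e: 1.0 for e in edges}
labels = anneal_partition(["a", "b", "c", "d"], edges, w, steps=500)
```

In ReSAPP's setting, multiple such runs yield (possibly different) partitions, and all distinct clusters of size two or more are merged into the final collection of predicted complexes.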

Journal ArticleDOI
TL;DR: The experimental results showed that the reliability scores assigned by the data fusion method can effectively identify highly reliable PPIs from multiple information sources, with substantial improvement in scoring over conventional approaches such as the Adjust CD-Distance approach.
Abstract: Protein–protein interactions (PPIs) are important for understanding the cellular mechanisms of biological functions, but the reliability of PPIs extracted by high-throughput assays is known to be low. To address this, many current methods use multiple evidence from different sources of information to compute reliability scores for such PPIs. However, they often combine the evidence without taking into account the uncertainty of the evidence values, potential dependencies between the information sources used and missing values from some information sources. We propose to formulate the task of scoring PPIs using multiple information sources as a multi-criteria decision making problem that can be solved using data fusion to model potential interactions between the multiple information sources. Using data fusion, the amount of contribution from each information source can be proportioned accordingly to systematically score the reliability of PPIs. Our experimental results showed that the reliability scores as...

Journal ArticleDOI
TL;DR: This work tested an optimization strategy on a Markov chain and a recently introduced Hidden Markov Model (HMM) with reduced state-space topology, demonstrating that the fold classification accuracy of the optimized HMM was substantially higher than that of the Markov chain or the reduced state-space HMM approaches.
Abstract: Protein fold classification is a challenging task strongly associated with the determination of proteins' structure. In this work, we tested an optimization strategy on a Markov chain and a recently introduced Hidden Markov Model (HMM) with reduced state-space topology. The proteins with unknown structure were scored against both these models. Then the derived scores were optimized following a local optimization method. The Protein Data Bank (PDB) and the annotation of the Structural Classification of Proteins (SCOP) database were used for the evaluation of the proposed methodology. The results demonstrated that the fold classification accuracy of the optimized HMM was substantially higher compared to that of the Markov chain or the reduced state-space HMM approaches. The proposed methodology achieved an accuracy of 41.4% on fold classification, while Sequence Alignment and Modeling (SAM), which was used for comparison, reached an accuracy of 38%.
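A plain first-order Markov chain scorer of the kind used as the baseline can be sketched as follows (training sequences and add-one smoothing are illustrative; the paper's reduced state-space HMM and the local score optimization step are not reproduced):

```python
import math

AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def train_markov(sequences, alphabet=AA):
    """Estimate first-order transition log-probabilities from sequences
    of one fold class, with add-one smoothing for unseen transitions."""
    counts = {a: {b: 0 for b in alphabet} for a in alphabet}
    for s in sequences:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    logp = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)
        logp[a] = {b: math.log((counts[a][b] + 1) / total) for b in alphabet}
    return logp

def loglik(seq, logp):
    """Log-likelihood of a sequence under the chain: the raw score that
    a subsequent optimization step could refine before classification."""
    return sum(logp[a][b] for a, b in zip(seq, seq[1:]))

# Toy fold model trained on two short (made-up) sequences.
m = train_markov(["ACACAC", "ACA"])
```

A query whose transitions resemble the training fold scores higher than one that does not, which is the basis for assigning proteins of unknown structure to fold classes.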