Showing papers in "Journal of Bioinformatics and Computational Biology in 2008"


Journal ArticleDOI
TL;DR: A method in which all direct and indirect interactions are first weighted using topological weight (FS-Weight), which estimates the strength of functional association and can be used to improve the precision of clusters predicted by various existing clustering algorithms.
Abstract: Protein complexes are fundamental for understanding principles of cellular organizations. As the sizes of protein–protein interaction (PPI) networks are increasing, accurate and fast protein complex prediction from these PPI networks can serve as a guide for biological experiments to discover novel protein complexes. However, it is not easy to predict protein complexes from PPI networks, especially in situations where the PPI network is noisy and still incomplete. Here, we study the use of indirect interactions between level-2 neighbors (level-2 interactions) for protein complex prediction. We know from previous work that proteins which do not interact but share interaction partners (level-2 neighbors) often share biological functions. We have proposed a method in which all direct and indirect interactions are first weighted using topological weight (FS-Weight), which estimates the strength of functional association. Interactions with low weight are removed from the network, while level-2 interactions with high weight are introduced into the interaction network. Existing clustering algorithms can then be applied to this modified network. We have also proposed a novel algorithm that searches for cliques in the modified network and merges cliques to form clusters using a “partial clique merging” method. Experiments show that (1) the use of indirect interactions and topological weight to augment protein–protein interactions can be used to improve the precision of clusters predicted by various existing clustering algorithms; and (2) our complex-finding algorithm performs very well on interaction networks modified in this way. Since no other information except the original PPI network is used, our approach would be very useful for protein complex prediction, especially for prediction of novel protein complexes.

156 citations


Journal ArticleDOI
TL;DR: New algorithms for matching two polygonal chains in two dimensions to minimize their discrete Fréchet distance under translation and rotation, and an effective heuristic for matching two polygonal chains in three dimensions, are presented.
Abstract: Matching two geometric objects in two-dimensional (2D) and three-dimensional (3D) spaces is a central problem in computer vision, pattern recognition, and protein structure prediction. In particular, the problem of aligning two polygonal chains under translation and rotation to minimize their distance has been studied using various distance measures. It is well known that the Hausdorff distance is useful for matching two point sets, and that the Fréchet distance is a superior measure for matching two polygonal chains. The discrete Fréchet distance closely approximates the (continuous) Fréchet distance, and is a natural measure for the geometric similarity of the folded 3D structures of biomolecules such as proteins. In this paper, we present new algorithms for matching two polygonal chains in two dimensions to minimize their discrete Fréchet distance under translation and rotation, and an effective heuristic for matching two polygonal chains in three dimensions. We also describe our empirical results on the application of the discrete Fréchet distance to protein structure-structure alignment.

90 citations
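
For the entry above, the discrete Fréchet distance between two fixed chains can be computed with the standard dynamic program of Eiter and Mannila. The sketch below is illustrative only and is not the authors' matching algorithm: it evaluates the distance for two chains as given and does not optimize over translation or rotation. The function name `discrete_frechet` is hypothetical.

```python
import math


def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences.

    p and q are sequences of equal-dimension coordinate tuples (2D or 3D).
    Uses the classic O(len(p) * len(q)) dynamic program.
    """
    n, m = len(p), len(q)
    if n == 0 or m == 0:
        raise ValueError("both chains need at least one vertex")
    # ca[i][j] = coupling distance between prefixes p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Two parallel chains one unit apart.
chain_a = [(0, 0), (1, 0), (2, 0)]
chain_b = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(chain_a, chain_b))
```

A matching procedure of the kind described in the abstract would call such a routine inside an outer search over rigid transformations; that outer search is not shown here.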


Journal ArticleDOI
TL;DR: A Bayesian coupling scheme for learning gene regulatory networks from related data sets is proposed as a compromise between learning networks from the different subsets separately, whereby no information is shared between the experiments, and learning networks from a monolithic fusion of the data sets, which provides no mechanism for uncovering differences between the network structures associated with the different experimental conditions.
Abstract: There have been various attempts to improve the reconstruction of gene regulatory networks from microarray data by the systematic integration of biological prior knowledge. Our approach is based on pioneering work by Imoto et al. [11], where the prior knowledge is expressed in terms of energy functions, from which a prior distribution over network structures is obtained in the form of a Gibbs distribution. The hyperparameters of this distribution represent the weights associated with the prior knowledge relative to the data. We have derived and tested a Markov chain Monte Carlo (MCMC) scheme for sampling networks and hyperparameters simultaneously from the posterior distribution, thereby automatically learning how to trade off information from the prior knowledge and the data. We have extended this approach to a Bayesian coupling scheme for learning gene regulatory networks from a combination of related data sets, which were obtained under different experimental conditions and are therefore potentially associated with different active subpathways. The proposed coupling scheme is a compromise between (1) learning networks from the different subsets separately, whereby no information between the different experiments is shared; and (2) learning networks from a monolithic fusion of the individual data sets, which does not provide any mechanism for uncovering differences between the network structures associated with the different experimental conditions. We have assessed the viability of all proposed methods on data related to the Raf signaling pathway, generated both synthetically and in cytometry experiments.

67 citations


Journal ArticleDOI
TL;DR: CLePAPS distinguishes itself from other existing algorithms by the use of conformational letters, which are discretized states of 3D segmental structure; the transformation that best superimposes a highly similar aligned fragment pair can then be used to superimpose the structure pairs under comparison.
Abstract: Fast, efficient, and reliable algorithms for pairwise alignment of protein structures are in ever-increasing demand for analyzing the rapidly growing data on protein structures. CLePAPS is a tool developed for this purpose. It distinguishes itself from other existing algorithms by the use of conformational letters, which are discretized states of 3D segmental structural states. A letter corresponds to a cluster of combinations of the three angles formed by Cα pseudobonds of four contiguous residues. A substitution matrix called CLESUM is available to measure the similarity between any two such letters. CLePAPS regards an aligned fragment pair (AFP) as an ungapped string pair with a high sum of pairwise CLESUM scores. Using CLESUM scores as the similarity measure, CLePAPS searches for AFPs by simple string comparison. The transformation which best superimposes a highly similar AFP can be used to superimpose the structure pairs under comparison. A highly scored AFP which is consistent with several other AFPs determines an initial alignment. CLePAPS then joins consistent AFPs guided by their similarity scores to extend the alignment by several "zoom-in" iteration steps. A follow-up refinement produces the final alignment. CLePAPS does not implement dynamic programming. The utility of CLePAPS is tested on various protein structure pairs.

37 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose a network inference methodology based on the directed information (DTI) criterion that incorporates the biology of transcription within the framework so as to enable experimentally verifiable inference.
Abstract: The systematic inference of biologically relevant influence networks remains a challenging problem in computational biology. Even though the availability of high-throughput data has enabled the use of probabilistic models to infer the plausible structure of such networks, their true interpretation of the biology of the process is questionable. In this work, we propose a network inference methodology, based on the directed information (DTI) criterion, that incorporates the biology of transcription within the framework so as to enable experimentally verifiable inference. We use publicly available embryonic kidney and T-cell microarray datasets to demonstrate our results. We present two variants of network inference via DTI — supervised and unsupervised — and the inferred networks relevant to mammalian nephrogenesis and T-cell activation. Conformity of the obtained interactions with the literature as well as comparison with the coefficient of determination (CoD) method are demonstrated. Apart from network inference, the proposed framework enables the exploration of specific interactions, not just those revealed by data. To illustrate the latter point, a DTI-based framework to resolve interactions between transcription factor modules and target coregulated genes is proposed. Additionally, we show that DTI can be used in conjunction with mutual information to infer higher-order influence networks involving cooperative gene interactions.

37 citations


Journal ArticleDOI
TL;DR: A hidden Markov model (HMM) approach for predicting the LPXTG-anchored cell wall proteins of Gram-positive bacteria was developed and compared against existing methods, finding a number that is significantly higher compared to those obtained by other available methods.
Abstract: Surface proteins in Gram-positive bacteria are frequently implicated in virulence. We have focused on a group of extracellular cell wall-attached proteins (CWPs), containing an LPXTG motif for clea...

36 citations


Journal ArticleDOI
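TL;DR: A nonlinear vector autoregressive (NVAR) model is applied to estimate nonlinear gene regulatory networks based entirely on gene expression profiles from DNA microarray experiments, with results shown for several simulations and for three gene regulatory networks (p53, NF-kappaB, and c-Myc) in HeLa cells.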
Abstract: In cells, molecular networks such as gene regulatory networks are the basis of biological complexity. Therefore, gene regulatory networks have become the core of research in systems biology. Understanding the processes underlying the several extracellular regulators, signal transduction, protein-protein interactions, and differential gene expression processes requires detailed molecular description of the protein and gene networks involved. To understand better these complex molecular networks and to infer new regulatory associations, we propose a statistical method based on vector autoregressive models and Granger causality to estimate nonlinear gene regulatory networks from time series microarray data. Most of the models available in the literature assume linearity in the inference of gene connections; moreover, these models do not infer directionality in these connections. Thus, a priori biological knowledge is required. However, in pathological cases, no a priori biological information is available. To overcome these problems, we present the nonlinear vector autoregressive (NVAR) model. We have applied the NVAR model to estimate nonlinear gene regulatory networks based entirely on gene expression profiles obtained from DNA microarray experiments. We show the results obtained by NVAR through several simulations and by the construction of three actual gene regulatory networks (p53, NF-kappaB, and c-Myc) for HeLa cells.

30 citations


Journal ArticleDOI
TL;DR: The inference of the evolutionary tree of class II aminoacyl-tRNA synthetase shows the potential for TALI in estimating protein structural evolution and in identifying structural divergence among homologous structures.
Abstract: Torsion angle alignment (TALI) is a novel approach to local structural motif alignment, based on backbone torsion angles (ϕ, ψ) rather than the more traditional atomic distance matrices. Representa...

29 citations


Journal ArticleDOI
TL;DR: It is shown in this paper that glycan de novo sequencing is NP-hard, and a heuristic algorithm is provided and a software program is developed to solve the problem in practical cases.
Abstract: Determining glycan structures is vital to comprehend cell-matrix, cell-cell, and even intracellular biological events. Glycan sequencing, which determines the primary structure of a glycan using tandem mass spectrometry (MS/MS), remains one of the most important tasks in proteomics. Analogous to peptide de novo sequencing, glycan de novo sequencing determines the structure without the aid of a known glycan database. We show in this paper that glycan de novo sequencing is NP-hard. We then provide a heuristic algorithm and develop a software program to solve the problem in practical cases. Experiments on real MS/MS data of glycopeptides demonstrate that our heuristic algorithm gives satisfactory results on practical data.

28 citations


Journal ArticleDOI
TL;DR: A novel statistical method for inferring the relative abundance of related members of protein families from tryptic peptide intensities is implemented, and this pipeline has been used to analyze quantitative LC-MS data from multiple biomarker discovery projects.
Abstract: Liquid chromatography-mass spectrometry (LC-MS)-based proteomics is becoming an increasingly important tool in characterizing the abundance of proteins in biological samples of various types and across conditions. Effects of disease or drug treatments on protein abundance are of particular interest for the characterization of biological processes and the identification of biomarkers. Although state-of-the-art instrumentation is available to make high-quality measurements and commercially available software is available to process the data, the complexity of the technology and data presents challenges for bioinformaticians and statisticians. Here, we describe a pipeline for the analysis of quantitative LC-MS data. Key components of this pipeline include experimental design (sample pooling, blocking, and randomization) as well as deconvolution and alignment of mass chromatograms to generate a matrix of molecular abundance profiles. An important challenge in LC-MS-based quantitation is to be able to accurately identify and assign abundance measurements to members of protein families. To address this issue, we implement a novel statistical method for inferring the relative abundance of related members of protein families from tryptic peptide intensities. This pipeline has been used to analyze quantitative LC-MS data from multiple biomarker discovery projects. We illustrate our pipeline here with examples from two of these studies, and show that the pipeline constitutes a complete workable framework for LC-MS-based differential quantitation. Supplementary material is available at http://iec01.mie.utoronto.ca/~thodoros/Bukhman/.

25 citations


Journal ArticleDOI
TL;DR: It is demonstrated here that protein compactness, which is defined as the ratio of the accessible surface area of a protein to that of the ideal sphere of the same volume, is one of the factors determining the mechanism of protein folding.
Abstract: We have demonstrated here that protein compactness, which we define as the ratio of the accessible surface area of a protein to that of the ideal sphere of the same volume, is one of the factors determining the mechanism of protein folding. Proteins with multi-state kinetics, on average, are more compact (compactness is 1.49 ± 0.02 for proteins within the size range of 101–151 amino acid residues) than proteins with two-state kinetics (compactness is 1.59 ± 0.03 for proteins within the same size range of 101–151 amino acid residues). We have shown that compactness for homologous proteins can explain both the difference in folding rates and the difference in folding mechanisms.
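
The compactness ratio defined in the abstract above can be written out directly: a sphere of volume V has surface area (36πV²)^(1/3), so compactness is the accessible surface area divided by that quantity. The sketch below assumes the accessible surface area and the volume have already been computed by some external tool and are given in consistent units (for example Å² and Å³); the abstract does not say how the volume is obtained, and the function names are hypothetical.

```python
import math


def sphere_area_from_volume(volume):
    """Surface area of the sphere that has the given volume: (36*pi*V^2)^(1/3)."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return (36.0 * math.pi * volume ** 2) ** (1.0 / 3.0)


def compactness(accessible_surface_area, volume):
    """Ratio of a body's accessible surface area to that of an equal-volume sphere.

    A perfect sphere scores 1.0; larger values mean a less compact shape.
    """
    return accessible_surface_area / sphere_area_from_volume(volume)


# Sanity check with a unit cube (area 6, volume 1), which is less compact than a sphere.
print(compactness(6.0, 1.0))
```

Under this definition a lower value means a more compact shape, which is why the multi-state group (1.49) is described as more compact than the two-state group (1.59).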

Journal ArticleDOI
TL;DR: The feasibility of using formal concept analysis (FCA) as a tool for microarray data analysis is investigated, and the preliminary results show the promise of the method as a tool for microarray data analysis.
Abstract: Microarray technologies, which can measure tens of thousands of gene expression values simultaneously in a single experiment, have become a common research method for biomedical researchers. Computational tools to analyze microarray data for biological discovery are needed. In this paper, we investigate the feasibility of using formal concept analysis (FCA) as a tool for microarray data analysis. The method of FCA builds a (concept) lattice from the experimental data together with additional biological information. For microarray data, each vertex of the lattice corresponds to a subset of genes that are grouped together according to their expression values and some biological information related to gene function. The lattice structure of these gene sets might reflect biological relationships in the dataset. Similarities and differences between experiments can then be investigated by comparing their corresponding lattices according to various graph measures. We apply our method to microarray data derived from influenza-infected mouse lung tissue and healthy controls. Our preliminary results show the promise of our method as a tool for microarray data analysis.

Journal ArticleDOI
TL;DR: This work developed a method to predict the topology and sequence alignment for the skeleton helices of protein complexes using the Rosetta ab initio structure prediction method, and analyzed the use of the skeletons as a clustering tool for the decoy structures generated by Rosetta.
Abstract: Cryoelectron microscopy (cryoEM) is an experimental technique to determine the three-dimensional (3D) structure of large protein complexes. Currently, this technique is able to generate protein den...

Journal ArticleDOI
TL;DR: This paper reviews statistical and combinatorial haplotyping algorithms using pedigree data, unrelated individuals, or pooled samples for tightly linked markers such as SNPs.
Abstract: Two grand challenges in the postgenomic era are to develop a detailed understanding of heritable variation in the human genome, and to develop robust strategies for identifying the genetic contribution to diseases and drug responses. Haplotypes of single nucleotide polymorphisms (SNPs) have been suggested as an effective representation of human variation, and various haplotype-based association mapping methods for complex traits have been proposed in the literature. However, humans are diploid and, in practice, genotype data instead of haplotype data are collected directly. Therefore, efficient and accurate computational methods for haplotype reconstruction are needed and have recently been investigated intensively, especially for tightly linked markers such as SNPs. This paper reviews statistical and combinatorial haplotyping algorithms using pedigree data, unrelated individuals, or pooled samples.

Journal ArticleDOI
TL;DR: It is shown that finding the closest lattice approximation of a given three-dimensional fold of a protein chain is NP-complete for the cubic lattice with side close to 3.8 Å and coordinate root mean square deviation.
Abstract: It is known that folding a protein chain into a cubic lattice is an NP-complete problem. We consider a seemingly easier problem: given a three-dimensional (3D) fold of a protein chain (coordinates of its Cα atoms), we want to find the closest lattice approximation of this fold. This problem has been studied under names such as "lattice approximation of a protein chain", "the protein chain fitting problem", and "building of protein lattice models". We show that this problem is NP-complete for the cubic lattice with side close to 3.8 Å and coordinate root mean square deviation.

Journal ArticleDOI
TL;DR: A novel integrative domain-based method for predicting PPIs using inductive logic programming (ILP), which predicts PPIs better than other computational methods in terms of typical performance measures and can be applied to predict DDIs with high sensitivity and specificity.
Abstract: Protein-protein interactions (PPIs) are intrinsic to almost all cellular processes. Different computational methods offer new chances to study PPIs. To predict PPIs, while the integrative methods use multiple data sources instead of a single source, the domain-based methods often use only protein domain features. Integration of both protein domain features and genomic/proteomic features from multiple databases can more effectively predict PPIs. Moreover, it allows discovering the reciprocal relationships between PPIs and biological features of their interacting partners. We developed a novel integrative domain-based method for predicting PPIs using inductive logic programming (ILP). Two principal domain features used were domain fusions and domain-domain interactions (DDIs). Various relevant features of proteins were exploited from five popular genomic and proteomic databases. By integrating these features, we constructed biologically significant ILP background knowledge of more than 278,000 ground facts. The experimental results through multiple 10-fold cross-validations demonstrated that our method predicts PPIs better than other computational methods in terms of typical performance measures. The proposed ILP framework can be applied to predict DDIs with high sensitivity and specificity. The induced ILP rules gave us many interesting, biologically reciprocal relationships among PPIs, protein domains, and PPI-related genomic/proteomic features. Supplementary material is available at (http://www.jaist.ac.jp/~s0560205/PPIandDDI/).

Journal ArticleDOI
TL;DR: A complete catalytic cycle has been reconstructed based on available information on the oligomeric structure of the enzyme and kinetic mechanism of its monomer and the model developed can be used in the kinetic modeling of biochemical pathways containing phosphofructokinase-1.
Abstract: This paper presents a kinetic model of phosphofructokinase-1 from Escherichia coli. A complete catalytic cycle has been reconstructed based on available information on the oligomeric structure of the enzyme and kinetic mechanism of its monomer. By applying the generalization of the Monod–Wyman–Changeux approach proposed by Popova and Sel'kov [35–37] to the reconstructed catalytic cycle, a rate equation has been derived. Dependence of the reaction rate on pH, magnesium, and effectors has been taken into account. Kinetic parameters have been estimated via fitting the rate equation against experimentally measured dependencies of initial rate on substrates, products, effectors, and pH available from the literature. The model of phosphofructokinase-1 predicts (1) cooperativity of binding both fructose-6-phosphate and ATPMg²⁻, (2) significant inhibition of the enzyme resulting from an increase in total concentration of ATP under the condition of fixed concentration of Mg²⁺ ions, and (3) dual effect of ADP consisting of ...

Journal ArticleDOI
TL;DR: This paper extends MSOAR to multiple (closely related) genomes and proposes an ortholog clustering method, called MultiMSOAR, to infer main orthologs in multiple genomes, which gives more detailed and accurate orthology information.
Abstract: The identification of orthologous genes shared by multiple genomes is critical for both functional and evolutionary studies in comparative genomics. While it is usually done by sequence similarity search and reconciled tree construction in practice, recently a new combinatorial approach and high-throughput system MSOAR for ortholog identification between closely related genomes based on genome rearrangement and gene duplication has been proposed in Fu et al. [1]. MSOAR assumes that orthologous genes correspond to each other in the most parsimonious evolutionary scenario, minimizing the number of genome rearrangement and (postspeciation) gene duplication events. However, the parsimony approach used by MSOAR limits it to pairwise genome comparisons. In this paper, we extend MSOAR to multiple (closely related) genomes and propose an ortholog clustering method, called MultiMSOAR, to infer main orthologs in multiple genomes. As a preliminary experiment, we apply MultiMSOAR to rat, mouse, and human genomes, and val...

Journal ArticleDOI
TL;DR: A computational survey of structured non-coding RNAs in teleost genomes focuses on the fate of fish-specific duplicates, finding evidence of a large number of structured RNAs, most of which are clade-specific or evolve so fast that their tetrapod homologs cannot be detected.
Abstract: Teleost fishes share a duplication of their entire genomes. We report here on a computational survey of structured non-coding RNAs (ncRNAs) in teleost genomes, focusing on the fate of fish-specific duplicates. As in other metazoan groups, we find evidence of a large number (11,543) of structured RNAs, most of which (~86%) are clade-specific or evolve so fast that their tetrapod homologs cannot be detected. In surprising contrast to protein-coding genes, the fish-specific genome duplication did not lead to a large number of paralogous ncRNAs: only 188 candidates, mostly microRNAs, appear in a larger copy number in teleosts than in tetrapods, suggesting that large-scale gene duplications do not play a major role in the expansion of the vertebrate ncRNA inventory.

Journal ArticleDOI
TL;DR: The results suggest that transport mechanisms in this transporter family should probably not be assumed to be conserved simply based on standard structural homology considerations, and raise the possibility that, while the "rocker switch" may apply to certain MFS transporters, intermediate "tilted" states may exist under certain circumstances or as transitional structures.
Abstract: Many major facilitator superfamily (MFS) transporters have similar 12-transmembrane alpha-helical topologies with two six-helix halves connected by a long loop. In humans, these transporters participate in key physiological processes and are also, as in the case of members of the organic anion transporter (OAT) family, of pharmaceutical interest. Recently, crystal structures of two bacterial representatives of the MFS family, the glycerol-3-phosphate transporter (GlpT) and lac-permease (LacY), have been solved and, because of assumptions regarding the high structural conservation of this family, there is hope that the results can be applied to mammalian transporters as well. Based on crystallography, it has been suggested that a major conformational "switching" mechanism accounts for ligand transport by MFS proteins. This conformational switch would then allow periodic changes in the overall transporter configuration, resulting in its cyclic opening to the periplasm or cytoplasm. Following this lead, we have modeled a possible "switch" mechanism in GlpT, using the concept of rotation of protein domains as in the DynDom program [17] and membranephilic constraints predicted by the MAPAS program [23]. We found that the minima of energies of intersubunit interactions support two alternate positions consistent with their transport properties. Thus, for GlpT, a "tilt" of 9°–10° rotation had the most favorable energetics of electrostatic interaction between the two halves of the transporter; moreover, this conformation was sufficient to suggest transport of the ligand across the membrane. We conducted steered molecular dynamics simulations of the GlpT-ligand system to explore how glycerol-3-phosphate would be handled by the "tilted" structure, and obtained results generally consistent with experimental mutagenesis data.
While biochemical data remain most consistent with a single-site alternating access model, our results raise the possibility that, while the "rocker switch" may apply to certain MFS transporters, intermediate "tilted" states may exist under certain circumstances or as transitional structures. Although wet lab experimental confirmation is required, our results suggest that transport mechanisms in this transporter family should probably not be assumed to be conserved simply based on standard structural homology considerations. Furthermore, steered molecular dynamics elucidating energetic interactions of ligands with amino acid residues in an appropriately modeled transporter may have predictive value in understanding the impact of mutations and/or polymorphisms on transporter function.

Journal ArticleDOI
TL;DR: It is demonstrated that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8.
Abstract: Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.

Journal ArticleDOI
TL;DR: PPiClust is presented, to systematically encode, cluster, and analyze similar 3D interface patterns in protein complexes efficiently, and is effective in discovering visually consistent and statistically significant clusters of interfaces, and sufficiently time-efficient to be performed on a single computer.
Abstract: The biological mechanisms through which proteins interact with one another are best revealed by studying the structural interfaces between interacting proteins. Protein–protein interfaces can be extracted from three-dimensional (3D) structural data of protein complexes and then clustered to derive biological insights. However, conventional protein interface clustering methods lack computational scalability and statistical support. In this work, we present a new method named "PPiClust" to systematically encode, cluster, and analyze similar 3D interface patterns in protein complexes efficiently. Experimental results showed that our method is effective in discovering visually consistent and statistically significant clusters of interfaces, and at the same time sufficiently time-efficient to be performed on a single computer. The interface clusters are also useful for uncovering the structural basis of protein interactions. Analysis of the resulting interface clusters revealed groups of structurally diverse proteins having similar interface patterns. We also found, in some of the interface clusters, the presence of well-known linear binding motifs which were noncontiguous in the primary sequences. These results suggest that PPiClust can discover not only statistically significant, but also biologically significant, protein interface clusters from protein complex structural data.

Journal ArticleDOI
TL;DR: This paper presents a reassortment identification method based on distance measurement using complete composition vector (CCV) and segment clustering using a minimum spanning tree (MST) algorithm that identified 34 potential reassortment clusters among 2,641 PB2 segments of influenza A viruses.
Abstract: The influenza A virus is a negative-stranded RNA virus composed of eight segmented RNA molecules, including polymerases (PB2, PB1, PA), hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), matrix protein (MP), and nonstructure gene (NS). The influenza A viruses are notorious for rapid mutations, frequent reassortments, and possible recombinations. Among these evolutionary events, reassortments refer to exchanges of discrete RNA segments between co-infected influenza viruses, and they have facilitated the generation of pandemic and epidemic strains. Thus, identification of reassortments will be critical for pandemic and epidemic prevention and control. This paper presents a reassortment identification method based on distance measurement using complete composition vector (CCV) and segment clustering using a minimum spanning tree (MST) algorithm. By applying this method, we identified 34 potential reassortment clusters among 2,641 PB2 segments of influenza A viruses. Among the 83 serotypes tested, at least 56 (67.46%) exchanged their fragments with another serotype of influenza A viruses. These identified reassortments involve the 1957 H2N2 and 1968 H3N2 influenza pandemic strains as well as H5N1 avian influenza virus isolates, which have generated the potential for a future pandemic threat. More frequent reassortments were found to occur in wild birds, especially migratory birds. This MST clustering program is written in Java and will be available upon request.
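
One common way to cluster with a minimum spanning tree is to build the tree over all items and then cut its long edges, so that each remaining connected piece becomes a cluster. The sketch below illustrates that general idea under stated assumptions: the pairwise distance matrix is taken as given (the CCV distance is not implemented here), the cut rule is a plain threshold, and the names `mst_clusters` and `threshold` are hypothetical. The abstract does not describe how the authors' Java program cuts the tree, so this should not be read as their procedure.

```python
def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def mst_clusters(dist, threshold):
    """Cluster items by cutting MST edges longer than `threshold`.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n = len(dist)
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))

    # Kruskal's algorithm: keep an edge only if it joins two different components.
    parent = list(range(n))
    mst = []
    for w, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # Re-link items using only the short MST edges; long edges are "cut".
    parent = list(range(n))
    for w, i, j in mst:
        if w <= threshold:
            parent[_find(parent, i)] = _find(parent, j)

    groups = {}
    for v in range(n):
        groups.setdefault(_find(parent, v), []).append(v)
    return sorted(groups.values())


# Four items on a line at positions 0, 1, 10, 11.
toy = [[0, 1, 10, 11], [1, 0, 9, 10], [10, 9, 0, 1], [11, 10, 1, 0]]
print(mst_clusters(toy, threshold=5))
```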

Journal ArticleDOI
TL;DR: The ultimate goal is to construct an affinity database that will provide crucial information obtained using the affinity evaluation and prediction system to cell biologists and drug designers.
Abstract: A system was developed to evaluate and predict the interaction between protein pairs by using the widely used shape complementarity search method as the algorithm for docking simulations between the proteins. This system, which we call the affinity evaluation and prediction (AEP) system, was used to evaluate the interaction between 20 protein pairs. The system first executes a "round robin" shape complementarity search of the target protein group, and evaluates the interaction of the complex structures obtained by shape complementarity search. These complex structures are selected by using a statistical procedure that we developed called "grouping". At a low prevalence of 5.0%, our AEP system predicted protein–protein interaction with 65.0% recall, 15.1% precision, 80.0% accuracy, and had an area under the curve (AUC) of 0.74. By optimizing the grouping process, our AEP system successfully predicted 13 protein pairs (among 20 pairs) that were biologically significant combinations. Our ultimate goal is to construct an affinity database that will provide crucial information obtained using our AEP system to cell biologists and drug designers.
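
The reported figures can be cross-checked against the standard definitions of prevalence, recall, precision, and accuracy. The abstract does not give a confusion matrix, so the counts below are hypothetical: they assume 400 candidate pairs from a 20 × 20 round-robin search with 20 true pairs, and were chosen only because they reproduce the stated percentages.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
    }


# Hypothetical counts (not taken from the paper): 13 of 20 true pairs
# recovered, 73 false positives among the 380 non-interacting pairs.
metrics = classification_metrics(tp=13, fp=73, fn=7, tn=307)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, 13 true positives out of 20 true pairs gives the 65.0% recall, which agrees with the 13 correctly predicted pairs mentioned in the abstract.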

Journal ArticleDOI
TL;DR: A new algorithm, called ClusFCM, is introduced, which combines techniques of clustering and fuzzy cognitive maps (FCM) for prediction of protein functions, and predicts protein functions with high recall while not lowering precision.
Abstract: We introduce a new algorithm, called ClusFCM, which combines techniques of clustering and fuzzy cognitive maps (FCM) for prediction of protein functions. ClusFCM takes advantage of protein homologies and protein interaction network topology to improve low recall predictions associated with existing prediction methods. ClusFCM exploits the fact that proteins of known function tend to cluster together and deduces functions not only through their direct interaction with other proteins, but also from other proteins in the network. We use ClusFCM to annotate protein functions for Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), and Drosophila melanogaster (fly) using protein–protein interaction data from the General Repository for Interaction Datasets (GRID) database and functional labels from Gene Ontology (GO) terms. The algorithm's performance is compared with four state-of-the-art methods for function prediction (Majority, χ² statistics, Markov random field (MRF), and FunctionalFlow) using measures of Matthews correlation coefficient, harmonic mean, and area under the receiver operating characteristic (ROC) curves. The results indicate that ClusFCM predicts protein functions with high recall while not lowering precision. Supplementary information is available at .
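
For context, the simplest of the baselines named above, Majority, can be sketched as neighbor voting: a protein is assigned the function labels that occur most often among its direct interaction partners. This is a minimal illustration of that baseline, not of ClusFCM itself. The data layout, the `top_k` parameter, and the protein and label names are all hypothetical.

```python
from collections import Counter


def majority_predict(protein, interactions, annotations, top_k=1):
    """Rank function labels by how many direct neighbors carry them.

    interactions: {protein: set of interacting proteins}
    annotations:  {protein: set of function labels}
    Ties are broken alphabetically so the output is deterministic.
    """
    votes = Counter()
    for neighbor in interactions.get(protein, set()):
        votes.update(annotations.get(neighbor, ()))
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:top_k]]


# Toy network with placeholder names (not real proteins or GO terms).
interactions = {"P1": {"P2", "P3", "P4"}}
annotations = {
    "P2": {"transport"},
    "P3": {"transport", "signaling"},
    "P4": {"transport", "signaling"},
}
prediction = majority_predict("P1", interactions, annotations, top_k=2)
```

Because only direct neighbors vote, a protein whose partners are all unannotated gets no prediction, which is one source of the low recall the abstract refers to.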

Journal ArticleDOI
TL;DR: The proposed modification of the empirical Bayes method leads to significant improvements in its performance and the new paradigm arising from the existence of the delta-sequence in biological data offers considerable scope for future developments.
Abstract: The currently practiced methods of significance testing in microarray gene expression profiling are highly unstable and tend to be very low in power. These undesirable properties are due to the nature of multiple testing procedures, as well as extremely strong and long-ranged correlations between gene expression levels. In an earlier publication, we identified a special structure in gene expression data that produces a sequence of weakly dependent random variables. This structure, termed the δ-sequence, lies at the heart of a new methodology for selecting differentially expressed genes in nonoverlapping gene pairs. The proposed method has two distinct advantages: (1) it leads to dramatic gains in terms of the mean numbers of true and false discoveries, and in the stability of the results of testing; and (2) its outcomes are entirely free from the log-additive array-specific technical noise. We demonstrate the usefulness of this approach in conjunction with the nonparametric empirical Bayes method. The proposed modification of the empirical Bayes method leads to significant improvements in its performance. The new paradigm arising from the existence of the δ-sequence in biological data offers considerable scope for future developments in this area of methodological research.

Journal ArticleDOI
TL;DR: This paper proposes a general preprocessing scheme for peptide sequencing that performs binning, pseudo-peak introduction, and noise removal, and presents theoretical and experimental analyses of each component.
Abstract: Peptide sequencing plays a fundamental role in proteomics. Tandem mass spectrometry, being sensitive and efficient, is one of the most commonly used techniques in peptide sequencing. Many computational models and algorithms have been developed for peptide sequencing using tandem mass spectrometry. In this paper, we investigate general issues in de novo sequencing, and present results that can be used to improve current de novo sequencing algorithms. We propose a general preprocessing scheme that performs binning, pseudo-peak introduction, and noise removal, and present theoretical and experimental analyses on each of the components. Then, we study the antisymmetry problem and current assumptions related to it, and propose a more realistic way to handle the antisymmetry problem based on analysis of some datasets. We integrate our findings on preprocessing and the antisymmetry problem with some current models for peptide sequencing. Experimental results show that our findings help to improve accuracies for de novo sequencing.

Journal ArticleDOI
TL;DR: A program, OrthoFocus, is developed, which employs an extended reciprocal best hit approach to quickly search for orthologs in a pair of genomes and generates a multiple alignment of orthologs so that it can further be used in phylogenetic analysis.
Abstract: The identification of orthologs to a set of known genes is often the starting point for evolutionary studies focused on gene families of interest. To date, the existing orthology detection tools (C...
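
The plain reciprocal best hit (RBH) criterion that OrthoFocus extends can be sketched as follows. This shows only the basic criterion, not the extended approach, which is not described in the text available here. The score dictionaries stand in for similarity-search output, and all names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): similarity score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores with placeholder gene names.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b1"): 180, ("a2", "b2"): 60}
b_vs_a = {("b1", "a1"): 195, ("b1", "a2"): 170, ("b2", "a2"): 55}
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` has `b1` as its best hit, but `b1` prefers `a1`, so only the pair (`a1`, `b1`) passes the reciprocal test.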

Journal ArticleDOI
TL;DR: An algorithm is suggested that inputs a protein sequence and outputs a decomposition of the protein chain into a regular part including secondary structures and a nonregular part corresponding to loop regions that can be used to find patterns of rigid and flexible loops as possible candidates to play a structure/function role as well as a role of antigenic determinants.
Abstract: We suggest an algorithm that inputs a protein sequence and outputs a decomposition of the protein chain into a regular part including secondary structures and a nonregular part corresponding to loo...

Journal ArticleDOI
TL;DR: It was found that the nuclear wave packet motion induced on the potential energy surface of the excited state of the primary electron donor P* by approximately 20 fs excitation leads to a coherent formation of the states P⁺Φ_B⁻ and P⁺B_A⁻ (B_A is a bacteriochlorophyll monomer in the A-branch of cofactors).
Abstract: Transient absorption difference spectroscopy with ~20 femtosecond (fs) resolution was applied to study the time and spectral evolution of low-temperature (90 K) absorbance changes in isolated react...