
Showing papers in "Journal of Bioinformatics and Computational Biology in 2003"


Journal ArticleDOI
TL;DR: A large-scale benchmark test for fold recognition shows that RAPTOR significantly outperforms other programs at the fold similarity level; RAPTOR also performs very well in recognizing hard Homology Modeling (HM) targets.
Abstract: This paper presents a novel linear programming approach to protein 3-dimensional (3D) structure prediction via threading. Based on the contact map graph of the protein 3D structure template, the protein threading problem is formulated as a large scale integer programming (IP) problem. The IP formulation is then relaxed to a linear programming (LP) problem and solved by the canonical branch-and-bound method. The final solution is globally optimal with respect to the energy function. In particular, our energy function includes pairwise interaction preferences and allows variable gaps, which are two key factors in making the protein threading problem NP-hard. A surprising result is that, most of the time, the relaxed linear programs generate integral solutions directly. Our algorithm has been implemented as a software package, RAPTOR (RApid Protein Threading by Operation Research technique). A large-scale benchmark test for fold recognition shows that RAPTOR significantly outperforms other programs at the fold similarity level. The CAFASP3 evaluation, a blind and public test by the protein structure prediction community, ranks RAPTOR first among individual prediction servers in terms of recognition capability and alignment accuracy for Fold Recognition (FR) family targets. RAPTOR also performs very well in recognizing the hard Homology Modeling (HM) targets. RAPTOR was implemented at the University of Waterloo and can be accessed at http://www.cs.uwaterloo.ca/~j3xu/RAPTOR_form.htm.
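To make the threading formulation concrete, here is a minimal brute-force sketch, not RAPTOR's LP/branch-and-bound machinery: it places a small number of cores at strictly increasing template positions and scores singleton plus pairwise contact energies. The function name, arguments, and toy energy tables are all hypothetical.

```python
from itertools import combinations

def thread_brute_force(n_cores, n_positions, singleton, pairwise):
    """Place n_cores at strictly increasing template positions (variable
    gaps allowed), minimizing singleton + pairwise contact energies.

    singleton[i][j]: energy of core i at template position j.
    pairwise[(i, k)][(j, l)]: contact energy of cores i < k at positions j < l.
    Returns (best_energy, best_placement)."""
    best_energy, best_placement = float("inf"), None
    for placement in combinations(range(n_positions), n_cores):
        energy = sum(singleton[i][placement[i]] for i in range(n_cores))
        for i in range(n_cores):
            for k in range(i + 1, n_cores):
                energy += pairwise.get((i, k), {}).get(
                    (placement[i], placement[k]), 0.0)
        if energy < best_energy:
            best_energy, best_placement = energy, placement
    return best_energy, best_placement
```

The pairwise dictionary is what couples placement decisions; without it the search would decompose into independent per-core minimization, which is why pairwise terms (together with variable gaps) drive the NP-hardness noted in the abstract.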

273 citations


Journal ArticleDOI
TL;DR: A new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network is proposed and a new graph selection criterion from Bayesian approach in general situations is theoretically derived.
Abstract: We propose a new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. Selecting the optimal graph, which gives the best representation of the system among genes, is still a problem to be solved. We theoretically derive a new graph selection criterion from a Bayesian approach in general situations. The proposed method includes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.

169 citations


Journal ArticleDOI
TL;DR: This work describes a methodology, as well as some related data mining tools, for analyzing sequence data, and discusses how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not.
Abstract: We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
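The k-gram feature generation and signal-to-noise ranking described above can be sketched as follows. The scoring formula (absolute difference of class means over summed class standard deviations) is one common variant, and the toy data are illustrative, not the paper's benchmark.

```python
from collections import Counter
from statistics import mean, pstdev

def kgram_features(seq, k):
    """Counts of overlapping k-grams, used as candidate features."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def signal_to_noise(pos_vals, neg_vals):
    """One common signal-to-noise variant for feature selection:
    |difference of class means| / (sum of class standard deviations)."""
    denom = pstdev(pos_vals) + pstdev(neg_vals)
    return abs(mean(pos_vals) - mean(neg_vals)) / denom if denom else float("inf")
```

Features scored this way (per k-gram, its counts across positive vs. negative ATG contexts) can then be ranked and the top ones fed to a classifier such as C4.5, SVM, or Naive Bayes, as the methodology's third step describes.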

89 citations


Journal ArticleDOI
TL;DR: This work has analyzed trace data from conventional sequencing equipment and found an applicable rule to discern SNPs from noise and developed software that integrates this function to automatically identify SNPs.
Abstract: The single nucleotide polymorphism (SNP) is the difference of the DNA sequence between individuals and provides abundant information about genetic variation. Large scale discovery of high frequency SNPs is being undertaken using various methods. However, the publicly available SNP data sometimes need to be verified. If only a particular gene locus is concerned, locus-specific polymerase chain reaction amplification may be useful. A problem with this method is that the secondary peak has to be measured. We have analyzed trace data from conventional sequencing equipment and found an applicable rule to discern SNPs from noise. The rule is applied to multiply aligned sequences with a trace, and the peak heights of the traces are compared between samples. We have developed software that integrates this function to automatically identify SNPs. The software works accurately for high quality sequences and also can detect SNPs in low quality sequences. Further, it can determine allele frequency, display this information as a bar graph and assign corresponding nucleotide combinations. It is also designed for a person to verify and edit sequences easily on the screen. It is very useful for identifying de novo SNPs in a DNA fragment of interest.
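A secondary-peak rule of the kind the abstract alludes to can be sketched as follows. The abstract does not give the actual rule, so the ratio threshold here is purely illustrative, as are the function names.

```python
def call_snp(primary_height, secondary_height, ratio_threshold=0.3):
    """Flag a position as a candidate heterozygous SNP when the secondary
    trace peak is a substantial fraction of the primary peak. The 0.3
    threshold is illustrative only, not the rule derived in the paper."""
    if primary_height <= 0:
        return False
    return secondary_height / primary_height >= ratio_threshold

def allele_frequency(calls):
    """Fraction of aligned samples showing the variant at one position,
    as would back the software's allele-frequency bar graph."""
    return sum(calls) / len(calls)
```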

81 citations


Journal ArticleDOI
TL;DR: It is proved that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard, the first complexity result concerning the problem to the authors' knowledge.
Abstract: We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the problem to our knowledge. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with a large number of markers, especially for those data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2. By using a simple method based on Gaussian elimination, we can obtain all feasible haplotype configurations. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large scale haplotype inference projects.
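The zero-recombinant step, solving a linear system over Z2, can be illustrated with a compact Gaussian elimination over GF(2). Bitmask rows stand in for the constraint equations, which in the real setting are derived from Mendelian consistency; this sketch returns a single solution rather than enumerating all feasible configurations.

```python
def solve_gf2(equations, n):
    """Gaussian elimination over the field Z2.

    equations: list of (mask, rhs); bit i of mask set means variable x_i
    occurs in the equation, rhs is 0 or 1. Returns one solution as a list
    of bits (free variables set to 0), or None if inconsistent."""
    rows = list(equations)
    pivot_row_of = {}          # column -> row holding its pivot
    next_row = 0
    for col in range(n):
        pivot = next((i for i in range(next_row, len(rows))
                      if rows[i][0] >> col & 1), None)
        if pivot is None:
            continue
        rows[next_row], rows[pivot] = rows[pivot], rows[next_row]
        for i in range(len(rows)):
            if i != next_row and rows[i][0] >> col & 1:
                rows[i] = (rows[i][0] ^ rows[next_row][0],
                           rows[i][1] ^ rows[next_row][1])
        pivot_row_of[col] = next_row
        next_row += 1
    if any(mask == 0 and rhs == 1 for mask, rhs in rows):
        return None            # 0 = 1: no zero-recombinant configuration
    solution = [0] * n
    for col, r in pivot_row_of.items():
        solution[col] = rows[r][1]
    return solution
```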

80 citations


Journal ArticleDOI
TL;DR: A family of signed permutations is described which proves a quadratic lower bound on the number of affected vertices in the overlap/interleaving graph during any optimal sorting scenario, which implies, in particular, an Omega(n^3) lower bound for Bergeron's algorithm.
Abstract: A central problem in genome rearrangement is finding a most parsimonious rearrangement scenario using certain rearrangement operations. An important problem of this type is sorting a signed genome by reversals and translocations (SBRT). Hannenhalli and Pevzner presented a duality theorem for SBRT which leads to a polynomial time algorithm for sorting a multi-chromosomal genome using a minimum number of reversals and translocations. However, there is one case for which their theorem and algorithm fail. We describe that case and suggest a correction to the theorem and the polynomial algorithm. The solution of SBRT uses a reduction to the problem of sorting a signed permutation by reversals (SBR). The best extant algorithms for SBR require quadratic time. The common approach to solve SBR is by finding a safe reversal using the overlap graph or the interleaving graph of a permutation. We describe a family of signed permutations which proves a quadratic lower bound on the number of affected vertices in the overlap/interleaving graph during any optimal sorting scenario. This implies, in particular, an Omega(n^3) lower bound for Bergeron's algorithm.
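To make the SBR objective concrete, a brute-force BFS over signed reversals computes exact reversal distances for tiny permutations; it is exponential, which is exactly why the polynomial-time theory discussed above matters.

```python
from collections import deque

def reversal_distance(perm):
    """Minimum number of signed reversals turning perm into (1, ..., n).
    Exhaustive BFS over permutation states; feasible only for tiny n."""
    n = len(perm)
    target = tuple(range(1, n + 1))
    start = tuple(perm)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == target:
            return dist
        for i in range(n):
            for j in range(i, n):
                # A signed reversal flips both the order and the signs
                # of the segment [i, j].
                nxt = (state[:i]
                       + tuple(-x for x in reversed(state[i:j + 1]))
                       + state[j + 1:])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None
```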

63 citations


Journal ArticleDOI
TL;DR: The technological capabilities of mass spectrometry and bioinformatics for mining the cellular proteome in the context of discovery programs aimed at trace-level protein identification and expression from microgram amounts of protein extracts from human tissues are presented.
Abstract: Proteomics research programs typically comprise the identification of the protein content of any given cell, their isoforms, splice variants, post-translational modifications, interacting partners and higher-order complexes under different conditions. These studies present significant analytical challenges owing to the high proteome complexity and the low abundance of the corresponding proteins, which often requires highly sensitive, high-resolution techniques. Mass spectrometry plays an important role in proteomics and has become an indispensable tool for molecular and cellular biology. However, the analysis of mass spectrometry data can be a daunting task in view of the complexity of the information to decipher, the accuracy and dynamic range of quantitative analysis, the availability of appropriate bioinformatics software and the overwhelming size of data files. The past ten years have witnessed significant technological advances in mass spectrometry-based proteomics, and synergy with bioinformatics is vital to fulfill the expectations of biological discovery programs. We present here the technological capabilities of mass spectrometry and bioinformatics for mining the cellular proteome in the context of discovery programs aimed at trace-level protein identification and expression from microgram amounts of protein extracts from human tissues.

62 citations


Journal ArticleDOI
TL;DR: A heuristic algorithm of computing a constrained multiple sequence alignment (CMSA for short) for guaranteeing that the generated alignment satisfies the user-specified constraints that some particular residues should be aligned together.
Abstract: In this paper, we design a heuristic algorithm for computing a constrained multiple sequence alignment (CMSA for short) that guarantees the generated alignment satisfies user-specified constraints requiring that particular residues be aligned together. If the number of residues needed to be aligned together is a constant alpha, then the time complexity of our CMSA algorithm for aligning K sequences is O(alpha K n^4), where n is the maximum of the lengths of the sequences. In addition, we have built such a CMSA software system and run several experiments on RNase sequences, which mainly function in catalyzing the degradation of RNA molecules. The resulting alignments illustrate the practicability of our method.

52 citations


Journal ArticleDOI
TL;DR: This review summarizes some of the common themes in DNA microarray data analysis, including data normalization and detection of differential expression, which contributes to the implementation of more efficient computational protocols for the given data obtained through microarray experiments.
Abstract: Microarray analysis has become a widely used method for generating gene expression data on a genomic scale. Microarrays have been enthusiastically applied in many fields of biological research, even though several open questions remain about the analysis of such data. A wide range of approaches are available for computational analysis, but no general consensus exists as to a standard protocol for microarray data analysis. Consequently, the choice of data analysis technique is a crucial element depending both on the data and on the goals of the experiment. Therefore, a basic understanding of bioinformatics is required for optimal experimental design and meaningful interpretation of the results. This review summarizes some of the common themes in DNA microarray data analysis, including data normalization and detection of differential expression. Algorithms are demonstrated by analyzing cDNA microarray data from an experiment monitoring gene expression in T helper cells. Several computational biology strategies, along with their relative merits, are overviewed, and potential areas for additional research are discussed. The goal of the review is to provide a computational framework for applying and evaluating such bioinformatics strategies. Solid knowledge of microarray informatics contributes to the implementation of more efficient computational protocols for the given data obtained through microarray experiments.
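Two of the themes the review names, per-array normalization and detection of differential expression, can be sketched minimally. Median-centering of log intensities and a plain fold-change cutoff are simple illustrative choices, not the review's prescribed protocol.

```python
from statistics import median

def median_center(log_intensities):
    """Per-array normalization: shift one array's log-intensities so
    their median is zero (one simple normalization scheme)."""
    m = median(log_intensities)
    return [x - m for x in log_intensities]

def flag_differential(mean_log2_ratios, threshold=1.0):
    """Indices of genes whose mean log2-ratio exceeds a fold-change
    cutoff; threshold=1.0 corresponds to two-fold change. A bare cutoff
    like this ignores variance, which is one reason the review also
    covers statistical tests."""
    return [i for i, r in enumerate(mean_log2_ratios) if abs(r) >= threshold]
```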

51 citations


Journal ArticleDOI
Sven Rahmann1
TL;DR: A fast method that selects oligonucleotide probes for microarray experiments on a truly large scale and shows how to incorporate constraints such as oligo length, melting temperature, and self-complementarity into the selection process at a postprocessing stage is presented.
Abstract: We present a fast method that selects oligonucleotide probes (such as DNA 25-mers) for microarray experiments on a truly large scale. For example, reliable oligos for human genes can be found within four days, a speedup of one to two orders of magnitude compared to previous approaches. This speed is attained by using the longest common substring as a specificity measure for candidate oligos. We present a space- and time-efficient algorithm, based on a suffix array with additional information, to compute matching statistics (lengths of longest matches) between all candidate oligos and all remaining sequences. With the matching statistics available, we show how to incorporate constraints such as oligo length, melting temperature, and self-complementarity into the selection process at a postprocessing stage. As a result, we can now design custom oligos for any sequenced genome, just as the technology for on-site chip synthesis is becoming increasingly mature.
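The longest-common-substring specificity measure can be illustrated with a quadratic dynamic program; the paper's contribution is computing these matching statistics at scale with suffix arrays, which this sketch does not attempt. The `max_match` threshold is hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b.
    Quadratic DP; the paper achieves large-scale speed with suffix
    arrays instead."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ch == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def is_specific(oligo, background, max_match=15):
    """Accept an oligo if no background sequence shares a common
    substring longer than max_match (threshold illustrative)."""
    return all(longest_common_substring(oligo, s) <= max_match
               for s in background)
```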

50 citations


Journal ArticleDOI
TL;DR: A novel algorithm for solving the problem of identifying data clusters from a noisy background is presented and it is proved that a cluster identification problem can be rigorously and efficiently solved through searching for substrings with special properties in a linear sequence.
Abstract: Transcription factor binding sites are short fragments in the upstream regions of genes, to which transcription factors bind to regulate the transcription of genes into mRNA. Computational identification of transcription factor binding sites remains an unsolved, challenging problem, though a great amount of effort has been put into its study. We have recently developed a novel technique for identification of binding sites from a set of upstream regions of genes that could possibly be transcriptionally co-regulated and hence might share similar transcription factor binding sites. By utilizing two key features of such binding sites (i.e., their high sequence similarities and their relatively high frequencies compared to other sequence fragments), we have formulated this problem as a cluster identification problem: that is, to identify and extract data clusters from a noisy background. While the classical data clustering problem (partitioning a data set into clusters sharing common or similar features) has been extensively studied, there is no general algorithm for solving the problem of identifying data clusters from a noisy background. In this paper, we present a novel algorithm for solving such a problem. We have proved that a cluster identification problem, under our definition, can be rigorously and efficiently solved through searching for substrings with special properties in a linear sequence. We have also developed a method for assessing the statistical significance of each identified cluster, which can be used to rule out accidental data clusters. We have implemented the cluster identification algorithm and the statistical significance analysis method as a computer software, CUBIC. Extensive testing of CUBIC has been carried out. We present here a few applications of CUBIC to challenging cases of binding site identification.

Journal ArticleDOI
TL;DR: This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date and demonstrates that many of the SVD-derived right basis vectors represent particular conserved protein families, while many ofThe corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps).
Abstract: As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Methods introduced to date have generally utilized incomplete and likely insufficient subsets of the available data. We have developed an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of a large sparse data matrix in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Quantitative pairwise estimates of species similarity were obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. Evolutionary trees produced using this method confirmed many accepted prokaryotic relationships. However, several unconventional relationships were also noted. In addition, we demonstrate that many of the SVD-derived right basis vectors represent particular conserved protein families, while many of the corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.
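The tetrapeptide-vector and cosine-similarity steps described above can be sketched directly, omitting the SVD projection, which is the paper's key dimensionality-reduction step. The toy sequences are illustrative.

```python
from collections import Counter
from math import sqrt

def tetrapeptide_vector(protein):
    """Overlapping tetrapeptide (4-mer) counts for one protein."""
    return Counter(protein[i:i + 4] for i in range(len(protein) - 3))

def species_vector(proteins):
    """Sum of a species' protein vectors, as the paper describes."""
    total = Counter()
    for p in proteins:
        total += tetrapeptide_vector(p)
    return total

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Pairwise species cosines computed this way can feed a standard distance-based tree-building method; in the paper the vectors are first projected through the SVD basis rather than used raw.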

Journal ArticleDOI
TL;DR: The improved FEATURE system is reported which recognizes functional sites in protein structures and can characterize and recognize geometrically complex and asymmetric sites such as ATP-binding sites and disulfide bond-forming sites.
Abstract: The increase in known three-dimensional protein structures enables us to build statistical profiles of important functional sites in protein molecules. These profiles can then be used to recognize sites in large-scale automated annotations of new protein structures. We report an improved FEATURE system which recognizes functional sites in protein structures. FEATURE defines multi-level physico-chemical properties and recognizes sites based on the spatial distribution of these properties in the sites' microenvironments. It uses a Bayesian scoring function to compare a query region with the statistical profile built from known examples of sites and control nonsites. We have previously shown that FEATURE can accurately recognize calcium-binding sites and have reported interesting results scanning for calcium-binding sites in the entire Protein Data Bank. Here we report the ability of the improved FEATURE to characterize and recognize geometrically complex and asymmetric sites such as ATP-binding sites and disulfide bond-forming sites. FEATURE does not rely on conserved residues or conserved residue geometry of the sites. We also demonstrate that, in the absence of a statistical profile of the sites, FEATURE can use an artificially constructed profile based on a priori knowledge to recognize the sites in new structures, using redoxin active sites as an example.

Journal ArticleDOI
TL;DR: A new method for identifying and validating drug targets by using gene networks, which are estimated from cDNA microarray gene expression profile data, is proposed, which uses the Bayesian network model.
Abstract: We propose a new method for identifying and validating drug targets by using gene networks, which are estimated from cDNA microarray gene expression profile data. We created novel gene disruption and drug response microarray gene expression profile data libraries for the purpose of drug target elucidation. We use two types of microarray gene expression profile data for estimating gene networks and then identifying drug targets. The estimated gene networks play an essential role in understanding drug response data and this information is unattainable from clustering methods, which are the standard for gene expression analysis. In the construction of gene networks, we use the Bayesian network model. We use an actual example from analysis of the Saccharomyces cerevisiae gene expression profile data to express a concrete strategy for the application of gene network information to drug discovery.

Journal ArticleDOI
TL;DR: The approach involves object identification, reference resolution, ontology and synonym discovery, and extracting object-object relationships, and results are promising for multi-object identification and relationship finding from biological documents.
Abstract: The biological literature databases continue to grow rapidly with vital information that is important for conducting sound biomedical research and development. The current practices of manually searching for information and extracting pertinent knowledge are tedious, time-consuming tasks even for motivated biological researchers. Accurate and computationally efficient approaches to discovering relationships between biological objects from text documents are important for biologists to develop biological models. The term "object" refers to any biological entity such as a protein, gene, cell cycle, etc., and "relationship" refers to any dynamic action one object has on another, e.g. a protein inhibiting another protein, or one object belonging to another, such as the cells composing an organ. This paper presents a novel approach to extract relationships between multiple biological objects that are present in a text document. The approach involves object identification, reference resolution, ontology and synonym discovery, and extracting object-object relationships. Hidden Markov Models (HMMs), dictionaries, and N-Gram models are used to set the framework to tackle the complex task of extracting object-object relationships. Experiments were carried out using a corpus of one thousand Medline abstracts. Intermediate results were obtained for the object identification process, synonym discovery, and finally the relationship extraction. For the thousand abstracts, 53 relationships were extracted of which 43 were correct, giving a specificity of 81 percent. These results are promising for multi-object identification and relationship finding from biological documents.
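A drastically simplified version of the object-identification and relationship-extraction steps can be sketched with dictionary lookup; the paper's framework uses HMMs, dictionaries, and N-gram models rather than this bare matching, and every lexicon entry below is hypothetical.

```python
import re

# Toy lexicons standing in for curated dictionaries; all entries are
# hypothetical examples, not the paper's resources.
OBJECTS = {"p53", "MDM2", "BRCA1"}
RELATION_VERBS = {"inhibits", "activates", "binds"}

def extract_relations(sentence):
    """Return (subject, relation, object) triples where known biological
    objects flank a known relation verb within one sentence."""
    tokens = re.findall(r"\w+", sentence)
    triples = []
    for i, tok in enumerate(tokens):
        if tok in RELATION_VERBS:
            subjects = [t for t in tokens[:i] if t in OBJECTS]
            targets = [t for t in tokens[i + 1:] if t in OBJECTS]
            if subjects and targets:
                triples.append((subjects[-1], tok, targets[0]))
    return triples
```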

Journal ArticleDOI
TL;DR: This paper focuses on coarse grained protein contact maps, a representation that describes the spatial neighborhood relation between secondary structure elements such as helices, beta sheets, and random coils, and its methodology is based on searching the graph space.
Abstract: Prediction of topological representations of proteins that are geometrically invariant can contribute towards the solution of fundamental open problems in structural genomics, such as folding. In this paper we focus on coarse grained protein contact maps, a representation that describes the spatial neighborhood relation between secondary structure elements such as helices, beta sheets, and random coils. Our methodology is based on searching the graph space. The search algorithm is guided by an adaptive evaluation function computed by a specialized noncausal recursive connectionist architecture. The neural network is trained using candidate graphs generated during examples of successful searches. Our results demonstrate the viability of the approach for predicting coarse contact maps.

Journal ArticleDOI
TL;DR: Current approaches to the mathematical modeling of biological systems are addressed and the potential impact of predictive biosimulation on drug discovery and development is assessed.
Abstract: Systems biology is creating a context for interpreting the vast amounts of genomic and proteomic data being produced by pharmaceutical companies in support of drug development. While major data collection efforts capitalize on technical advances in miniaturization and automation and represent an industrialization of existing laboratory research, the transition from mental models to predictive computer simulations is setting the pace for advances in this field. This article addresses current approaches to the mathematical modeling of biological systems and assesses the potential impact of predictive biosimulation on drug discovery and development.

Journal ArticleDOI
TL;DR: Software is developed that fully utilizes lookup-tables to detect the start- and endpoints of an EST within a given DNA sequence efficiently, and subsequently promptly identify exons and introns, and simultaneously attains high sensitivity and accuracy against a clean dataset of documented genes.
Abstract: There is a pressing need to align the growing set of expressed sequence tags (ESTs) with the newly sequenced human genome. However, the problem is complicated by the exon/intron structure of eukaryotic genes, misread nucleotides in ESTs, and the millions of repetitive sequences in genomic sequences. To solve this problem, algorithms that use dynamic programming have been proposed. In reality, however, these algorithms require an enormous amount of processing time. In an effort to improve the computational efficiency of these classical DP algorithms, we developed software that fully utilizes lookup-tables to detect the start- and endpoints of an EST within a given DNA sequence efficiently, and subsequently promptly identify exons and introns. In addition, the locations of all splice sites must be calculated correctly with high sensitivity and accuracy, while retaining high computational efficiency. This goal is hard to accomplish in practice, due to misread nucleotides in ESTs and repetitive sequences in the genome. Nevertheless, we present two heuristics that effectively settle this issue. Experimental results confirm that our technique improves the overall computation time by orders of magnitude compared with common tools, such as SIM4 and BLAT, and simultaneously attains high sensitivity and accuracy against a clean dataset of documented genes.
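The lookup-table idea for anchoring an EST in genomic sequence can be sketched as a k-mer index with offset voting. Real spliced alignment must additionally handle introns, misread nucleotides, and repeats, which this sketch ignores; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(genome, k):
    """Lookup table mapping each k-mer to its genomic start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def locate_est(est, index, k):
    """Vote for the offset (genome_pos - est_pos) supported by the most
    shared k-mers; returns the winning genomic start, or None if no
    k-mer of the EST occurs in the genome."""
    votes = Counter()
    for i in range(len(est) - k + 1):
        for g in index.get(est[i:i + k], ()):
            votes[g - i] += 1
    return votes.most_common(1)[0][0] if votes else None
```

For a spliced EST, hits would fall on several offsets (one per exon), and the gaps between the anchored blocks would be the candidate introns to be refined at splice sites.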

Journal ArticleDOI
TL;DR: It is shown that the NN approach is able to yield promising prediction results despite using only the most fundamental network structures, and a gene marker involved in breaking down lipids has been found to be the most correlated to CAD.
Abstract: This paper presents a novel approach for complex disease prediction that we have developed, exemplified by a study on risk of coronary artery disease (CAD). This multi-disciplinary approach straddles fields of microarray technology and genetics, neural networks (NN), data mining and machine learning, as well as traditional statistical analysis techniques, namely principal components analysis (PCA) and factor analysis (FA). A description of the biological background of the study is given, followed by a detailed description of how the problem has been modeled for analyses by neural networks and FA. A committee learning approach for NN has been used to improve generalization rates. We show that our NN approach is able to yield promising prediction results despite using only the most fundamental network structures. More interestingly, through the statistical analysis process, genes of similar biological functions have been clustered. In addition, a gene marker involved in breaking down lipids has been found to be the most correlated to CAD.

Journal ArticleDOI
TL;DR: Results demonstrate that amino acid motif databases like BLOCKS and InterPro are useful tools for investigating how alternative transcript structure affects gene function.
Abstract: Understanding how alternative splicing affects gene function is an important challenge facing modern-day molecular biology. Using homology-based, protein sequence analysis methods, it should be possible to investigate how transcript diversity impacts protein function. To test this, high-quality exon-intron structures were deduced for over 8000 human genes, including over 1300 (17 percent) that produce multiple transcript variants. A data mining technique (DiffMotif) was developed to identify genes in which transcript variation coincides with changes in conserved motifs between variants. Applying this method, we found that 30 percent of the multi-variant genes in our test set exhibited a differential profile of conserved InterPro and/or BLOCKS motifs across different mRNA variants. To investigate these, a visualization tool (ProtAnnot) that displays amino acid motifs in the context of genomic sequence was developed. Using this tool, genes revealed by the DiffMotif method were analyzed, and when possible, hypotheses regarding the potential role of alternative transcript structure in modulating gene function were developed. Examples of these, including: MEOX1, a homeobox-containing protein; AIRE, involved in auto-immune disease; PLAT, tissue type plasminogen activator; and CD79b, a component of the B-cell receptor complex, are presented. These results demonstrate that amino acid motif databases like BLOCKS and InterPro are useful tools for investigating how alternative transcript structure affects gene function.
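The core DiffMotif test, whether transcript variants of one gene differ in their conserved-motif profiles, reduces to a set comparison. This is a minimal sketch; the motif identifiers below are hypothetical placeholders for InterPro/BLOCKS hits.

```python
def diff_motif(variant_motifs):
    """Return True when at least two transcript variants of a gene carry
    different conserved-motif profiles.

    variant_motifs: {variant_id: set of motif identifiers}."""
    profiles = {frozenset(m) for m in variant_motifs.values()}
    return len(profiles) > 1
```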

Journal ArticleDOI
TL;DR: The function of the immune system is a complicated balancing act based on the ability to respond to previously seen as well as unknown foreign agents; its study is of critical importance for basic and applied life sciences, particularly for health care.
Abstract: The immune system comprises a complex network of organs, specialized tissues, cells, and molecules. Its main function is to protect the organism from external and internal challenges and to provide the interface between the organism and its environment. Foreign agents such as viruses, bacteria, fungi, or parasites may cause infection and disease. Foreign chemicals can cause toxic effects and pathogenic mutations. A malfunction of the immune system may lead to cancers, autoimmunity, or susceptibility to infections. The function of the immune system is a complicated balancing act based on the ability to respond to previously seen as well as unknown foreign agents. The study of the immune system is of critical importance for basic and applied life sciences, particularly for health care. Immunological data are growing at an exponential rate. The main sources of immunological data are public databases, various “omics” data, and published articles. The majority of entries in public databases have relevance to immune system processes. Foreign antigens represent potential targets of the immune response and can be classified into pathogenic antigens and tolerable environmental antigens. Some environmental antigens, such as allergens, carry the potential for causing undesirable immune responses. Specialist immunological databases contain well-annotated data of immunological interest. Genomics and proteomics have provided enormous stimuli to the biological sciences. They have provided huge amounts of new biological data and induced a major paradigm shift in the modern life sciences. Classic hypothesis-driven research has been complemented with various omics approaches which focus on the large-scale study of biological molecules in aggregate. The great scientific conquest of the 20th

Journal ArticleDOI
TL;DR: This paper presents a new computational method based on the combination of a suite of algorithms for automating the assignment process, particularly the process of backbone resonance peak assignment, formulated as a constrained weighted bipartite matching problem.
Abstract: NMR resonance assignment is one of the key steps in solving an NMR protein structure. The assignment process links resonance peaks to individual residues of the target protein sequence, providing the prerequisite for establishing intra- and inter-residue spatial relationships between atoms. The assignment process is tedious and time-consuming, which could take many weeks. Though there exist a number of computer programs to assist the assignment process, many NMR labs are still doing the assignments manually to ensure quality. This paper presents a new computational method based on the combination of a suite of algorithms for automating the assignment process, particularly the process of backbone resonance peak assignment. We formulate the assignment problem as a constrained weighted bipartite matching problem. While the problem, in the most general situation, is NP-hard, we present an efficient solution based on a branch-and-bound algorithm with effective bounding techniques using two recently introduced approximation algorithms. We also devise a greedy filtering algorithm for reducing the search space. Our experimental results on 70 instances of (pseudo) real NMR data derived from 14 proteins demonstrate that the new solution runs much faster than a recently introduced (exhaustive) two-layer algorithm and recovers more correct peak assignments than the two-layer algorithm. Our result demonstrates that integrating different algorithms can achieve a good tradeoff between backbone assignment accuracy and computation time.
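The core step the abstract describes — matching resonance peaks to residues — can be viewed as a maximum-weight bipartite matching problem. The sketch below is illustrative only (it is not the paper's branch-and-bound algorithm, and the score matrix is hypothetical): for tiny instances the optimal assignment can simply be found by brute force over all permutations.

```python
from itertools import permutations

def best_assignment(score):
    """Brute-force maximum-weight bipartite matching for tiny instances.

    score[i][j] is a hypothetical match score between spin system i and
    residue j; returns (best total score, tuple mapping spin system -> residue).
    Real assignment problems need branch-and-bound or the Hungarian algorithm.
    """
    n = len(score)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best, best_perm

# Toy 3x3 score matrix: higher = better fit of a spin system to a residue.
scores = [
    [5, 1, 0],
    [2, 4, 1],
    [0, 2, 6],
]
total, mapping = best_assignment(scores)  # mapping[i] = residue for spin system i
```

The constrained version in the paper additionally restricts which peaks may map to which residues, which prunes the permutation space; the exhaustive sketch above ignores such constraints.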

Journal ArticleDOI
TL;DR: This method shows the utility of using model parameters as a metric in cluster analysis, and reveals how a set of genes influence the expression of other genes activated during different cell cycle phases.
Abstract: Cluster analysis has proven to be a valuable statistical method for analyzing whole genome expression data. Although clustering methods have great utility, they represent a lower-level statistical analysis that is not directly tied to a specific model. To extend such methods and to allow for more sophisticated lines of inference, we use cluster analysis in conjunction with a specific model of gene expression dynamics. This model provides phenomenological dynamic parameters on both linear and non-linear responses of the system. The analysis determines the parameters of two different transition matrices (linear and nonlinear) that describe the influence of one gene expression level on another. Using yeast cell cycle microarray data as a test set, we calculated the transition matrices and used these dynamic parameters as a metric for cluster analysis. Hierarchical cluster analysis of this transition matrix reveals how a set of genes influences the expression of other genes activated during different cell cycle phases. Most strikingly, genes in different stages of the cell cycle preferentially activate or inactivate genes in other stages, and this relationship can be readily visualized in a two-way clustering image. This observation emerges without any prior knowledge of the chronological characteristics of the cell cycle process. This method shows the utility of using model parameters as a metric in cluster analysis.
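The idea of "model parameters as a metric" can be sketched as follows: treat each gene's row of the fitted transition matrix as its feature vector and cluster genes by the distance between rows. This is a minimal illustration (pure-Python single-linkage on a made-up 4-gene matrix), not the paper's actual pipeline.

```python
def row_distance(a, b):
    # Euclidean distance between two genes' transition-matrix rows.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(rows, n_clusters):
    """Naive agglomerative (single-linkage) clustering on matrix rows."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(row_distance(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical transition matrix: row g holds gene g's influence on the others.
T = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.1, 0.0, 0.6, 0.3],
]
groups = single_linkage(T, 2)  # genes with similar influence patterns co-cluster
```

Genes whose rows encode similar influence patterns (here 0/1 and 2/3) end up in the same cluster, mirroring how the paper groups genes acting in the same cell cycle phase.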

Journal ArticleDOI
TL;DR: This work presents the design and motivation behind a domain specific language, called ΦLOG, to enable biologists to program solutions to phylogenetic inference problems at a very high level of abstraction.
Abstract: Domain experts think and reason at a high level of abstraction when they solve problems in their domain of expertise. We present the design and motivation behind a domain specific language, called ΦLOG, to enable biologists to program solutions to phylogenetic inference problems at a very high level of abstraction. The implementation infrastructure (interpreter, compiler, debugger) for the DSL is automatically obtained through a software engineering framework based on Denotational Semantics and Logic Programming.

Journal ArticleDOI
TL;DR: A web server BTEVAL is developed for assessing the performance of a newly developed beta-turn prediction method and its ranking with respect to other existing beta-turn prediction methods.
Abstract: This paper describes a web server, BTEVAL, developed for assessing the performance of a newly developed beta-turn prediction method and its ranking with respect to other existing beta-turn prediction methods. Evaluation of a method can be carried out on a single protein or a number of proteins. The server consists of a clean data set of 426 non-homologous proteins with seven subsets of these proteins. Users can evaluate their method on any subset or on the complete data set. The method is assessed at the amino acid level and performance is evaluated in terms of the Qtotal, Qpredicted, Qobserved and MCC measures. The server also compares the performance of the method with other existing beta-turn prediction methods such as the Chou-Fasman algorithm, Thornton's algorithm, GORBTURN, the 1-4 and 2-3 Correlation model, the Sequence coupled model and BTPRED. The server is accessible from http://imtech.res.in/raghava/bteval/
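The four per-residue measures named in the abstract reduce to counts of true/false positives and negatives over a binary turn/non-turn labelling. A minimal sketch of how they are computed (with made-up example labels, not BTEVAL's code):

```python
def turn_prediction_measures(actual, predicted):
    """Per-residue measures for binary beta-turn predictions.

    actual and predicted are equal-length sequences of 0/1
    (1 = residue predicted/observed to be in a beta-turn).
    Returns Qtotal, Qpredicted, Qobserved and the Matthews
    correlation coefficient (MCC).
    """
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    q_total = (tp + tn) / len(actual)                  # overall accuracy
    q_predicted = tp / (tp + fp) if tp + fp else 0.0   # precision on turns
    q_observed = tp / (tp + fn) if tp + fn else 0.0    # recall on turns
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q_total, q_predicted, q_observed, mcc

# Hypothetical 8-residue labelling: 1 = beta-turn residue.
actual    = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 1, 0, 1]
qt, qp, qo, mcc = turn_prediction_measures(actual, predicted)
```

MCC is the most informative of the four here, since Qtotal can look high on the heavily imbalanced turn/non-turn classes even for a weak predictor.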

Journal ArticleDOI
TL;DR: It is pointed out that the more variable the sequences within a multiple alignment, the more informative the multiple alignment is; the results support the usefulness of multiple alignments for predicting structural features.
Abstract: We present an original strategy, built on a bioinformatic software framework, to perform an exhaustive and objective statistical analysis of three-dimensional protein structures. We establish the relationship between multiple sequence alignments and various structural features of proteins. Amino acids involved in disulfide bonds, salt bridges and hydrophobic interactions have been studied. Furthermore, we point out that the more variable the sequences within a multiple alignment, the more informative the multiple alignment is. The results support the usefulness of multiple alignments for predicting structural features.

Journal ArticleDOI
Yuan Liu1, Yuhong Wang1, Kimberly Folander1, Guochun Xie1, Richard A. Blevins1 
TL;DR: The study described in this paper demonstrates the use of GenoA to study human brain hyperpolarization-activated cation channel genes HCN1 and HCN3.
Abstract: Genome Analyzer (GenoA), with a relational database back-end, was developed to extract information from mammalian genomic sequences. This data mining and visualization tool-set enables laboratory bench scientists to identify and assemble virtual cDNA from genomic exon sequences, and provides a starting point to identify potential alternative splice variants and polymorphisms in silico. The study described in this paper demonstrates the use of GenoA to study the human brain hyperpolarization-activated cation channel genes HCN1 and HCN3.