
Showing papers in "Bioinformatics in 1998"


Journal ArticleDOI
TL;DR: The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data.
Abstract: Summary: The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data. Availability: The MODELTEST package, including the source code and some documentation, is available at http://bioag.byu.edu/zoology/crandall_lab/modeltest.html. Contact: dp47@email.byu.edu.
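MODELTEST's model choice rests on hierarchical likelihood ratio tests between nested substitution models: the statistic delta = 2(lnL1 - lnL0) is compared to a chi-square distribution with degrees of freedom equal to the number of extra free parameters. A minimal sketch of that test, not MODELTEST's actual implementation; the log-likelihood values and model names below are hypothetical:

```python
import math

def chi2_sf(x, df):
    # Chi-square survival function for even df, via the closed form
    # P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def lrt(lnL_null, lnL_alt, df):
    # Likelihood-ratio test between nested models.
    delta = 2.0 * (lnL_alt - lnL_null)
    return delta, chi2_sf(delta, df)

# Hypothetical scores: JC69 (null) vs HKY85 (4 extra free parameters).
delta, p = lrt(-1124.3, -1101.8, df=4)
```

A significant p-value rejects the simpler model; MODELTEST walks a hierarchy of such comparisons to settle on a best-fitting model.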

20,105 citations


Journal ArticleDOI
TL;DR: Profile HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise and complement standard pairwise comparison methods for large-scale sequence analysis.
Abstract: Summary: The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations and two large libraries of profile HMMs of common protein domains are available. Profile HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise. Contact: eddy@genetics.wustl.edu.

5,171 citations


Journal ArticleDOI
TL;DR: The system SOSUI for the discrimination of membrane proteins and soluble ones together with the prediction of transmembrane helices was developed, in which the accuracy of the classification of proteins was 99% and the corresponding value for the transmembrane helix prediction was 97%.
Abstract: UNLABELLED The system SOSUI for the discrimination of membrane proteins and soluble ones together with the prediction of transmembrane helices was developed, in which the accuracy of the classification of proteins was 99% and the corresponding value for the transmembrane helix prediction was 97%. AVAILABILITY The system SOSUI is available through internet access: http://www.tuat.ac.jp/mitaku/sosui/. CONTACT sosui@biophys.bio.tuat.ac.jp.

1,871 citations


Journal ArticleDOI
TL;DR: SplitsTree is an interactive program, for analyzing and visualizing evolutionary data, that implements the method of split decomposition, and supports a number of distance transformations, the computation of parsimony splits, spectral analysis and bootstrapping.
Abstract: Motivation Real evolutionary data often contain a number of different and sometimes conflicting phylogenetic signals, and thus do not always clearly support a unique tree. To address this problem, Bandelt and Dress (Adv. Math., 92, 47-105, 1992) developed the method of split decomposition. For ideal data, this method gives rise to a tree, whereas less ideal data are represented by a tree-like network that may indicate evidence for different and conflicting phylogenies. Results SplitsTree is an interactive program, for analyzing and visualizing evolutionary data, that implements this approach. It also supports a number of distance transformations, the computation of parsimony splits, spectral analysis and bootstrapping.

1,476 citations


Journal ArticleDOI
TL;DR: It is demonstrated that sorting sequences by this p-value effectively combines the information present in multiple motifs, leading to highly accurate and sensitive sequence homology searches.
Abstract: Motivation: To illustrate an intuitive and statistically valid method for combining independent sources of evidence that yields a p-value for the complete evidence, and to apply it to the problem of detecting simultaneous matches to multiple patterns in sequence homology searches. Results: In sequence analysis, two or more (approximately) independent measures of the membership of a sequence (or sequence region) in some class are often available. We would like to estimate the likelihood of the sequence being a member of the class in view of all the available evidence. An example is estimating the significance of the observed match of a macromolecular sequence (DNA or protein) to a set of patterns (motifs) that characterize a biological sequence family. An intuitive way to do this is to express each piece of evidence as a p-value, and then use the product of these p-values as the measure of membership in the family. We derive a formula and algorithm (QFAST) for calculating the statistical distribution of the product of n independent p-values. We demonstrate that sorting sequences by this p-value effectively combines the information present in multiple motifs, leading to highly accurate and sensitive sequence homology searches.
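The distribution QFAST computes has a well-known closed form: for the product p of n independent uniform p-values, P(product <= p) = p * sum over k from 0 to n-1 of (-ln p)^k / k!. A small sketch of that formula, as an illustration of the published result rather than the authors' code:

```python
import math

def qfast(pvalues):
    # Combined p-value for a product of n independent p-values:
    # P(prod <= p) = p * sum_{k=0}^{n-1} (-ln p)^k / k!
    n = len(pvalues)
    p = math.prod(pvalues)  # requires Python 3.8+
    if p == 0.0:
        return 0.0
    log_inv = -math.log(p)
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= log_inv / k
        total += term
    return p * total
```

Note that the raw product alone is anti-conservative (two p-values of 0.01 multiply to 1e-4, but the correctly combined value is about 1e-3); the correction factor grows with n.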

1,194 citations


Journal ArticleDOI
TL;DR: A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated, which is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships.
Abstract: MOTIVATION A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (HMM) from the sequence and homologs found using the HMM for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases. METHODS We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against WU-BLASTP and against DOUBLE-BLAST, a two-step method similar to ISS, but using BLAST instead of FASTA. RESULTS SAM-T98 had the fewest errors in all tests, dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP (Structural Classification of Proteins)-domains test, SAM-T98 got 880 true positives and 68 false positives, DOUBLE-BLAST got 533 true positives with 71 false positives, and WU-BLASTP got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships. One key to the performance of the HMM method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model. AVAILABILITY A World Wide Web server, as well as information on obtaining the Sequence Alignment and Modeling (SAM) software suite, can be found at http://www.cse.ucsc.edu/research/compbio/ CONTACT karplus@cse.ucsc.edu; http://www.cse.ucsc.edu/karplus

1,169 citations


Journal ArticleDOI
TL;DR: An interactive protein secondary structure prediction Internet server is presented that simplifies the use of current prediction algorithms and allows conservation patterns important to structure and function to be identified.
Abstract: UNLABELLED An interactive protein secondary structure prediction Internet server is presented. The server allows a single sequence or multiple alignment to be submitted, and returns predictions from six secondary structure prediction algorithms that exploit evolutionary information from multiple sequences. A consensus prediction is also returned which improves the average Q3 accuracy of prediction by 1% to 72.9%. The server simplifies the use of current prediction algorithms and allows conservation patterns important to structure and function to be identified. AVAILABILITY http://barton.ebi.ac.uk/servers/jpred.html CONTACT geoff@ebi.ac.uk

1,044 citations


Journal ArticleDOI
TL;DR: A generic approach to combine numerical optimization methods with biochemical kinetic simulations is described, suitable for use in the rational design of improved metabolic pathways with industrial significance and for solving the inverse problem of metabolic pathways.
Abstract: MOTIVATION The simulation of biochemical kinetic systems is a powerful approach that can be used for: (i) checking the consistency of a postulated model with a set of experimental measurements, (ii) answering 'what if?' questions and (iii) exploring possible behaviours of a model. Here we describe a generic approach to combine numerical optimization methods with biochemical kinetic simulations, which is suitable for use in the rational design of improved metabolic pathways with industrial significance (metabolic engineering) and for solving the inverse problem of metabolic pathways, i.e. the estimation of parameters from measured variables. RESULTS We discuss the suitability of various optimization methods, focusing especially on their ability or otherwise to find global optima. We recommend that a suite of diverse optimization methods should be available in simulation software as no single one performs best for all problems. We describe how we have implemented such a simulation-optimization strategy in the biochemical kinetics simulator Gepasi and present examples of its application. AVAILABILITY The new version of Gepasi (3.20), incorporating the methodology described here, is available on the Internet at http://gepasi.dbs.aber.ac.uk/softw/Gepasi.html. CONTACT prm@aber.ac.uk

722 citations


Journal ArticleDOI
Isidore Rigoutsos1, Aris Floratos
TL;DR: A new algorithm for the discovery of rigid patterns (motifs) in biological sequences that is combinatorial in nature and able to produce all patterns that appear in at least a (user-defined) minimum number of sequences, yet it manages to be very efficient by avoiding the enumeration of the entire pattern space.
Abstract: Motivation The discovery of motifs in biological sequences is an important problem. Results This paper presents a new algorithm for the discovery of rigid patterns (motifs) in biological sequences. Our method is combinatorial in nature and able to produce all patterns that appear in at least a (user-defined) minimum number of sequences, yet it manages to be very efficient by avoiding the enumeration of the entire pattern space. Furthermore, the reported patterns are maximal: any reported pattern cannot be made more specific and still keep on appearing at the exact same positions within the input sequences. The effectiveness of the proposed approach is showcased on a number of test cases which aim to: (i) validate the approach through the discovery of previously reported patterns; (ii) demonstrate the capability to identify automatically highly selective patterns particular to the sequences under consideration. Finally, experimental analysis indicates that the algorithm is output sensitive, i.e. its running time is quasi-linear to the size of the generated output.

607 citations


Journal ArticleDOI
TL;DR: A model for a new type of topic-specific overview resource that provides efficient access to distributed information is developed, which is a freely accessible Web resource that offers one hypertext 'card' for each of the more than 7000 human genes that have an approved gene symbol published by the HUGO/GDB nomenclature committee.
Abstract: Motivation: Modern biology is shifting from the 'one gene one postdoc' approach to genomic analyses that include the simultaneous monitoring of thousands of genes. The importance of efficient access to concise and integrated biomedical information to support data analysis and decision making is therefore increasing rapidly, in both academic and industrial research. However, knowledge discovery in the widely scattered resources relevant for biomedical research is often a cumbersome and non-trivial task, one that requires a significant amount of training and effort. Results: To develop a model for a new type of topic-specific overview resource that provides efficient access to distributed information, we designed a database called 'GeneCards'. It is a freely accessible Web resource that offers one hypertext 'card' for each of the more than 7000 human genes that currently have an approved gene symbol published by the HUGO/GDB nomenclature committee. The presented information aims at giving immediate insight into current knowledge about the respective gene, including a focus on its functions in health and disease. It is compiled by Perl scripts that automatically extract relevant information from several databases, including SWISS-PROT, OMIM, Genatlas and GDB. Analyses of the interactions of users with the Web interface of GeneCards triggered development of easy-to-scan displays optimized for human browsing. Also, we developed algorithms that offer 'ready-to-click' query reformulation support, to facilitate information retrieval and exploration. Many of the long-term users turn to GeneCards to quickly access information about the function of very large sets of genes, for example in the realm of large-scale expression studies using 'DNA chip' technology or two-dimensional protein electrophoresis. Availability: Freely available at http://bioinformatics.weizmann.ac.il/cards/ Contact: cards@bioinformatics.weizmann.ac.il.

402 citations


Journal ArticleDOI
TL;DR: The JOY representation now constitutes an essential part of the two databases of protein structure alignments: HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/homstrad) and CAMPASS (http://www-cryst.bioc.cam.ac.uk/campass).
Abstract: Motivation JOY is a program to annotate protein sequence alignments with three-dimensional (3D) structural features. It was developed to display 3D structural information in a sequence alignment and to help understand the conservation of amino acids in their specific local environments. Results: The JOY representation now constitutes an essential part of the two databases of protein structure alignments: HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/homstrad) and CAMPASS (http://www-cryst.bioc.cam.ac.uk/campass). It has also been successfully used for identifying distant evolutionary relationships. Availability The program can be obtained via anonymous ftp from torsa.bioc.cam.ac.uk from the directory /pub/joy/. The address for the JOY server is http://www-cryst.bioc.cam.ac.uk/cgi-bin/joy.cgi. Contact kenji@cryst.bioc.cam.ac.uk

Journal ArticleDOI
TL;DR: MView is a tool for converting the results of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query; alternatively, an existing multiple alignment can be processed.
Abstract: UNLABELLED: MView is a tool for converting the results of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query. Alternatively, an existing multiple alignment can be processed. In either case, the output is simply HTML, so the result is platform independent and does not require a separate application or applet to be loaded. AVAILABILITY: Free from http://www.sander.ebi.ac.uk/mview/ subject to copyright restrictions. CONTACT: brown@ebi.ac.uk

Journal ArticleDOI
TL;DR: Three servers have been made available to the scientific community: a homology modeling server, a model quality evaluation server and a server that evaluates models built of proteins for which the structure is already known, thereby implicitly evaluating the quality of the modeling program.
Abstract: MOTIVATION Homology modeling is rapidly becoming the method of choice for obtaining three-dimensional coordinates for proteins because genome projects produce sequences at a much higher rate than NMR and X-ray laboratories can solve the three-dimensional structures. The quality of protein models will not be immediately clear to novices and support with the evaluation seems to be needed. Expert users are sometimes interested in evaluating the quality of modeling programs rather than the quality of the models themselves. RESULTS Three servers have been made available to the scientific community: a homology modeling server, a model quality evaluation server and a server that evaluates models built of proteins for which the structure is already known, thereby implicitly evaluating the quality of the modeling program. AVAILABILITY The modeling-related servers and several structure analysis servers are freely available at http://swift.embl-heidelberg.de/servers/ CONTACT gert.vriend@embl-heidelberg.de

Journal ArticleDOI
TL;DR: This work clusters closely similar sequences to yield a covering of sequence space by a representative subset of sequences, derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters.
Abstract: MOTIVATION To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation. RESULTS These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy. AVAILABILITY A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/holm/nrdb90. CONTACT holm@embl-ebi.ac.uk
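The core of such a representative-set construction can be sketched as a greedy pass over the sequences, keeping each one only if it falls below the identity threshold against every representative retained so far. This is a toy version: the `identity` function below is a crude positional match standing in for nrdb90's alignment-free peptide-composition filters, and the example sequences are invented.

```python
def identity(a, b):
    # Crude fraction of matching positions between two sequences.
    # (A stand-in for nrdb90's deca-/pentapeptide filtering; real
    # identity estimation would handle alignment and length differences.)
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def representative_set(sequences, threshold=0.90):
    # Greedy clustering, longest sequences first, so each cluster
    # representative is at least as long as the members it absorbs.
    reps = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, r) < threshold for r in reps):
            reps.append(seq)
    return reps
```

Sequences at or above 90% identity to an existing representative are absorbed into its cluster; everything else becomes a new representative.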

Journal ArticleDOI
TL;DR: A bioinformatic method was developed for the prediction of peptide binding to MHC class II molecules and its application to the identification of potential immunotherapeutic peptides illustrates the synergy between experimentation and computer modeling.
Abstract: Motivation: Prediction methods for identifying binding peptides could minimize the number of peptides required to be synthesized and assayed, and thereby facilitate the identification of potential T-cell epitopes. We developed a bioinformatic method for the prediction of peptide binding to MHC class II molecules. Results: Experimental binding data and expert knowledge of anchor positions and binding motifs were combined with an evolutionary algorithm (EA) and an artificial neural network (ANN): binding data extraction --> peptide alignment --> ANN training and classification. This method, termed PERUN, was implemented for the prediction of peptides that bind to HLA-DR4(B1*0401). The respective positive predictive values of PERUN predictions of high-, moderate-, low- and zero-affinity binders were assessed as 0.8, 0.7, 0.5 and 0.8 by cross-validation, and 1.0, 0.8, 0.3 and 0.7 by experimental binding. This illustrates the synergy between experimentation and computer modeling, and its application to the identification of potential immunotherapeutic peptides.

Journal ArticleDOI
TL;DR: A new probabilistic model of the evolution of RNA-, DNA-, or protein-like sequences and a software tool, Rose, that implements this model, suitable for the evaluation of methods in multiple sequence alignment computation and the prediction of phylogenetic relationships is presented.
Abstract: Motivation We present a new probabilistic model of the evolution of RNA-, DNA-, or protein-like sequences and a software tool, Rose, that implements this model. Guided by an evolutionary tree, a family of related sequences is created from a common ancestor sequence by insertion, deletion and substitution of characters. During this artificial evolutionary process, the 'true' history is logged and the 'correct' multiple sequence alignment is created simultaneously. The model also allows for varying rates of mutation within the sequences, making it possible to establish so-called sequence motifs. Results The data created by Rose are suitable for the evaluation of methods in multiple sequence alignment computation and the prediction of phylogenetic relationships. It can also be useful when teaching courses in or developing models of sequence evolution and in the study of evolutionary processes. Availability Rose is available on the Bielefeld Bioinformatics WebServer under the following URL: http://bibiserv.TechFak.Uni-Bielefeld.DE/rose/ The source code is available upon request. Contact folker@TechFak.Uni-Bielefeld.DE
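The evolutionary process Rose models can be caricatured in a few lines: walk a tree from a root sequence, mutating along each branch, and collect the leaves. The sketch below handles substitutions only and uses an invented nested-dict tree encoding; real Rose also models insertions, deletions and site-specific mutation rates, and logs the true alignment.

```python
import random

def evolve(seq, n_subs, alphabet="ACGT", rng=random):
    # Apply n_subs random substitutions to a copy of seq.
    s = list(seq)
    for _ in range(n_subs):
        i = rng.randrange(len(s))
        s[i] = rng.choice([c for c in alphabet if c != s[i]])
    return "".join(s)

def simulate(tree, root_seq, rng=random):
    # tree: {name: (n_subs_on_branch, subtree)}; leaves have subtree None.
    # Returns {leaf_name: sequence}. With no indels, the 'correct'
    # alignment is just the leaf sequences stacked column-wise.
    leaves = {}
    for name, (n_subs, subtree) in tree.items():
        seq = evolve(root_seq, n_subs, rng=rng)
        if subtree is None:
            leaves[name] = seq
        else:
            leaves.update(simulate(subtree, seq, rng=rng))
    return leaves

# A made-up three-leaf tree: A splits off early; B and C share an ancestor.
tree = {"A": (2, None), "anc": (1, {"B": (1, None), "C": (3, None)})}
family = simulate(tree, "ACGTACGTACGT")
```

Because every mutation is applied by the simulator itself, the true homology of each residue is known by construction, which is exactly what makes such data useful as a gold standard for alignment benchmarks.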

Journal ArticleDOI
TL;DR: It is shown that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA and, given a library of structure-based pairwise alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments.
Abstract: MOTIVATION In order to increase the accuracy of multiple sequence alignments, we designed a new strategy for optimizing multiple sequence alignments by genetic algorithm. We named it COFFEE (Consistency based Objective Function For alignmEnt Evaluation). The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences. RESULTS We show that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA. The COFFEE function is tested on 11 test cases made of structural alignments extracted from 3D_ali. These alignments are compared to those produced using five alternative methods. Results indicate that COFFEE outperforms the other methods when the level of identity between the sequences is low. Accuracy is evaluated by comparison with the structural alignments used as references. We also show that the COFFEE score can be used as a reliability index on multiple sequence alignments. Finally, we show that given a library of structure-based pairwise sequence alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments. The main advantage of COFFEE is its flexibility. With COFFEE, any method suitable for making pairwise alignments can be extended to making multiple alignments. AVAILABILITY The package is available along with the test cases through the WWW: http://www.ebi.ac.uk/cedric CONTACT cedric.notredame@ebi.ac.uk
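The consistency idea behind COFFEE can be sketched directly: project every column of the multiple alignment into residue pairs and count what fraction also occurs in the pairwise-alignment library. This is an unweighted toy version; the real COFFEE score weights each pair by the quality of the pairwise alignment it came from.

```python
def aligned_pairs(msa):
    # Yield (i, pos_i, j, pos_j) for every pair of residues sharing a
    # column of the gapped alignment; positions are ungapped indices.
    counters = [0] * len(msa)
    for col in range(len(msa[0])):
        residues = []
        for i, row in enumerate(msa):
            if row[col] != "-":
                residues.append((i, counters[i]))
                counters[i] += 1
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                yield residues[a] + residues[b]

def coffee_score(msa, library):
    # Fraction of the MSA's aligned residue pairs that the pairwise
    # library also aligns (unweighted consistency).
    pairs = list(aligned_pairs(msa))
    if not pairs:
        return 0.0
    return sum(p in library for p in pairs) / len(pairs)
```

A genetic algorithm such as SAGA can then treat this score as the fitness function to maximize, which is how any pairwise aligner's output can steer a multiple alignment.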

Journal ArticleDOI
TL;DR: GeneTree is a program for comparing gene and species trees using reconciled trees and can compute the cost of embedding a gene tree within a species tree, visually display the location and number of gene duplications and losses, and search for optimal species trees.
Abstract: Summary: GeneTree is a program for comparing gene and species trees using reconciled trees. The program can compute the cost of embedding a gene tree within a species tree, visually display the location and number of gene duplications and losses, and search for optimal species trees. Availability: The program is free and is available at {{http://taxonomy.zoology.gla.ac.uk/rod/genetree/gene-tree.html}}. Contact: r.page@bio.gla.ac.uk

Journal ArticleDOI
TL;DR: The first prototype of a system for the automatic annotation of protein function is presented, triggered by collections of abstracts related to a given protein, and it is able to extract biological information directly from scientific literature, i.e. MEDLINE abstracts.
Abstract: MOTIVATION Annotation of the biological function of different protein sequences is a time-consuming process currently performed by human experts. Genome analysis tools encounter great difficulty in performing this task. Database curators, developers of genome analysis tools and biologists in general could benefit from access to tools able to suggest functional annotations and facilitate access to functional information. APPROACH We present here the first prototype of a system for the automatic annotation of protein function. The system is triggered by collections of abstracts related to a given protein, and it is able to extract biological information directly from scientific literature, i.e. MEDLINE abstracts. Relevant keywords are selected by their relative accumulation in comparison with a domain-specific background distribution. Simultaneously, the most representative sentences and MEDLINE abstracts are selected and presented to the end-user. Evolutionary information is considered as a predominant characteristic in the domain of protein function. Our system consequently extracts domain-specific information from the analysis of a set of protein families. RESULTS The system has been tested with different protein families, of which three examples are discussed in detail here: 'ataxia-telangiectasia associated protein', 'ran GTPase' and 'carbonic anhydrase'. We found generally good correlation between the amount of information provided to the system and the quality of the annotations. Finally, the current limitations and future developments of the system are discussed. AVAILABILITY The current system can be considered as a prototype system. As such, it can be accessed as a server at http://columba.ebi.ac.uk:8765/andrade/abx. The system accepts text related to the protein or proteins to be evaluated (optimally, the result of a MEDLINE search by keyword) and the results are returned in the form of Web pages for keywords, sentences and abstracts. SUPPLEMENTARY INFORMATION Web pages containing full information on the examples mentioned in the text are available at: http://www.cnb.uam.es/~cnbprot/keywords/ CONTACT valencia@cnb.uam.es

Journal ArticleDOI
TL;DR: The program General Codon Usage Analysis (GCUA) has been developed for analysing codon and amino acid usage patterns; it is free for academic use, and commercial users should contact the author.
Abstract: UNLABELLED The program General Codon Usage Analysis (GCUA) has been developed for analysing codon and amino acid usage patterns. AVAILABILITY ftp://ftp.nhm.ac.uk/pub/gcua. Freely available for academic use; commercial users should contact the author. CONTACT J.McInerney@nhm.ac.uk

Journal ArticleDOI
TL;DR: The E(N:K) framework enables the generation of families of genetic models that incorporate the effects of genotype-by-environment (G x E) interactions and epistasis and the structure of the QU-GENE simulation software is explained and demonstrated.
Abstract: Classical quantitative genetics theory makes a number of simplifying assumptions in order to develop mathematical expressions that describe the mean and variation (genetic and phenotypic) within and among populations, and to predict how these are expected to change under the influence of external forces. These assumptions are often necessary to render the development of many aspects of the theory mathematically tractable. The availability of high-speed computers today provides opportunity for the use of computer simulation methodology to investigate the implications of relaxing many of the assumptions that are commonly made. QU-GENE (QUantitative-GENEtics) was developed as a flexible computer simulation platform for the quantitative analysis of genetic models. Three features of the QU-GENE software that contribute to its flexibility are (i) the core E(N:K) genetic model, where E is the number of types of environment, N is the number of genes, K indicates the level of epistasis and the parentheses indicate that different N:K genetic models can be nested within types of environments, (ii) the use of a two-stage architecture that separates the definition of the genetic model and genotype-environment system from the detail of the individual simulation experiments and (iii) the use of a series of interactive graphical windows that monitor the progress of the simulation experiments. The E(N:K) framework enables the generation of families of genetic models that incorporate the effects of genotype-by-environment (G x E) interactions and epistasis. By the design of appropriate application modules, many different simulation experiments can be conducted for any genotype-environment system. The structure of the QU-GENE simulation software is explained and demonstrated by way of two examples. 
The first concentrates on some aspects of the influence of G x E interactions on response to selection in plant breeding, and the second considers the influence of multiple-peak epistasis on the evolution of a four-gene epistatic network. QU-GENE is available over the Internet at http://pig.ag.uq.edu.au/qu-gene/. Contact: m.cooper@mailbox.uq.edu.au

Journal ArticleDOI
TL;DR: A linear discriminant approach that predicts this completeness of the protein coding region of a sequence by estimating the probability of each ATG being the initiation codon by exploiting the criterion that only a single prediction is allowed for each sequence.
Abstract: MOTIVATION In cDNA sequencing projects, it is vital to know whether the protein coding region of a sequence is complete, or whether errors have occurred during library construction. Here we present a linear discriminant approach that predicts this completeness by estimating the probability of each ATG being the initiation codon. RESULTS Because of the current shortage of full-length cDNA data on which to base this work, tests were performed on a non-redundant set of 660 initiation codon-containing DNA sequences that had been conceptually spliced into mRNA/cDNA. We also used an edited set of the same sequences that only contained the region following the initiation codon as a negative control. Using the criterion that only a single prediction is allowed for each sequence, a cut-off was selected at which discrimination of both positive and negative sets was equal. At this cut-off, 67% of each set could be correctly distinguished, with the correct ATG codon also being identified in the positive set. Reliability could be increased further by raising the cut-off or including homologues, the relative merits of which are discussed. AVAILABILITY The prediction program, called ATGpr, and other data are available at http://www.hri.co.jp/atgpr CONTACT swintech@hri.co.jp

Journal ArticleDOI
TL;DR: Improved performance in MWM structure prediction was achieved in two ways, and new ways of calculating base pair likelihoods have been developed, and accuracy was improved by developing techniques for filtering out spurious base pairs predicted by the MWM program.
Abstract: Motivation Recently, we described a Maximum Weighted Matching (MWM) method for RNA structure prediction. The MWM method is capable of detecting pseudoknots and other tertiary base-pairing interactions in a computationally efficient manner (Cary and Stormo, Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pp. 75-80, 1995). Here we report on the results of our efforts to improve the MWM method's predictive accuracy, and show how the method can be extended to detect base interactions formerly inaccessible to automated RNA modeling techniques. Results Improved performance in MWM structure prediction was achieved in two ways. First, new ways of calculating base pair likelihoods have been developed. These allow experimental data and combined statistical and thermodynamic information to be used by the program. Second, accuracy was improved by developing techniques for filtering out spurious base pairs predicted by the MWM program. We also demonstrate here a means by which the MWM folding method may be used to detect the presence of base triples in RNAs. Availability http://www.cshl.org/mzhanglab/tabaska/jaxpage.html Contact tabaska@cshl.org

Journal ArticleDOI
TL;DR: The LIGAND chemical database is an extension of previous studies: a flat-file representation of 3303 enzymes and 2976 enzymatic reactions in a machine-parsable chemical equation format, providing the linkage between chemical and biological databases.
Abstract: LIGAND is a composite database comprising three sections: ENZYME for the information of enzyme molecules and enzymatic reactions, COMPOUND for the information of metabolites and other chemical compounds, and REACTION for the collection of substrate-product relations. The current release includes 3390 enzymes, 5645 compounds and 5207 reactions. The database is indispensable for the reconstruction of metabolic pathways in the completely sequenced organisms. The LIGAND database can be accessed through the WWW (http://www.genome.ad.jp/dbget/ligand.html) or may be downloaded by anonymous FTP (ftp://kegg.genome.ad.jp/molecules/ligand/).
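As a sketch of what a machine-parsable chemical equation format enables, here is a toy parser for a LIGAND-style reaction line. The syntax shown is a simplification assumed for illustration; real REACTION entries can carry stoichiometric coefficients and other identifiers.

```python
import re

def parse_reaction(equation):
    """Split a LIGAND-style reaction equation into substrate and product
    compound-ID lists. Assumes a simplified 'A + B <=> C + D' syntax
    with C-number compound identifiers."""
    left, right = equation.split("<=>")
    def compounds(side):
        return re.findall(r"C\d{5}", side)  # compound IDs like C00031
    return compounds(left), compounds(right)

# Illustrative hexokinase-style equation (glucose + ATP <=> G6P + ADP):
subs, prods = parse_reaction("C00031 + C00002 <=> C00092 + C00008")
```

Chaining such substrate-product relations is the basis for reconstructing metabolic pathways from the database.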

Journal ArticleDOI
TL;DR: An algorithm, the 'Bayes block aligner', bypasses the requirement of a fixed set of parameter settings and instead returns the Bayesian posterior probability for the number of gaps and for the scoring matrices in any series of interest.
Abstract: The selection of a scoring matrix and gap penalty parameters continues to be an important problem in sequence alignment. We describe here an algorithm, the 'Bayes block aligner', which bypasses this requirement. Instead of requiring a fixed set of parameter settings, this algorithm returns the Bayesian posterior probability for the number of gaps and for the scoring matrices in any series of interest. Furthermore, instead of returning the single best alignment for the chosen parameter settings, this algorithm returns the posterior distribution of all alignments considering the full range of gapping and scoring matrices selected, weighing each in proportion to its probability based on the data. We compared the Bayes aligner with the popular Smith-Waterman algorithm with parameter settings from the literature which had been optimized for the identification of structural neighbors, and found that the Bayes aligner correctly identified more structural neighbors. In a detailed examination of the alignment of a pair of kinase sequences and a pair of GTPase sequences, we illustrate the algorithm's potential to identify subsequences that are conserved to different degrees. In addition, this example shows that the Bayes aligner returns an alignment-free assessment of the distance between a pair of sequences.
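The core Bayesian move, weighting each parameter setting in proportion to its probability given the data rather than committing to one setting, can be sketched as simple posterior normalisation. The setting names and log-likelihood values below are invented stand-ins; the Bayes block aligner obtains its true per-setting marginal likelihoods by summing over alignments.

```python
import math

def posterior(log_likes, log_prior=None):
    """Turn per-setting log-likelihoods (plus an optional log-prior)
    into normalised posterior weights, via the log-sum-exp trick."""
    if log_prior is None:
        log_prior = [0.0] * len(log_likes)   # uniform prior over settings
    logs = [ll + lp for ll, lp in zip(log_likes, log_prior)]
    m = max(logs)                            # subtract max for stability
    ws = [math.exp(v - m) for v in logs]
    z = sum(ws)
    return [w / z for w in ws]

# Hypothetical (matrix, gap penalty) settings with invented log-likelihoods:
settings = ["BLOSUM62/gap-11", "BLOSUM50/gap-8", "PAM250/gap-10"]
post = posterior([-40.0, -42.0, -45.0])
weights = dict(zip(settings, post))
```

Quantities of interest (e.g. alignment distance) are then averaged under these weights instead of being read off a single best-scoring alignment.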

Journal ArticleDOI
TL;DR: The marriage of high-throughput nucleotide sequencing with computational methods for the analysis of nucleotide and protein sequences has ushered in a new era of molecular biology, and the growing wealth of information within the sequence databases provides a foundation for the biology of the 21st Century.
Abstract: The marriage of high-throughput nucleotide sequencing with computational methods for the analysis of nucleotide and protein sequences has ushered in a new era of molecular biology. Entire genomes are deposited into the sequence DBs at a growing rate. Typically, investigators can use computational sequence analysis to assign functions to the majority of the open reading frames in genome sequences. That analysis can identify a surprisingly large fraction of the genes within the organism. That fraction is increasing over time as the sequence databases contain a larger fraction of all functional domains. The growing wealth of information within the sequence databases provides a foundation for the biology of the 21st Century. We will mine these data for decades to come, developing complex and incredibly accurate cellular models that can predict the behavior of living systems by integrating across the functions of their molecular parts. Or will we? Although the preceding scenario is the likely one, we would be irresponsible not to consider another possible outcome: an explosion of incorrect annotations within the sequence databases. Each new sequence deposited in the public databases has been annotated with respect to those same databases. Functional annotations are propagated repeatedly from one sequence to the next, to the next, with no record made of the source of a given annotation, leading to a potential transitive catastrophe of erroneous annotations. Investigators who later attempt to separate the wheat from the chaff will discover that they cannot simply retreat to the safety of experimentally annotated sequences by ignoring the computationally annotated sequences, because the public DBs do not explicitly distinguish the two sets. In fact, the public sequence DBs keep virtually no tracking information about the methods used to annotate their data. Can we rule this possibility out on any objective grounds? No.
We have no reliable data regarding either the current rate of errors (incorrect functional annotations) within the public DBs or the rate of change of that error rate (we do not even know whether it is increasing or decreasing each year). Many years of research have led to the development of detailed statistical models for sequence-similarity searching algorithms such as FASTA and the BLAST family of programs. Researchers employ these algorithms to identify the functions of novel sequences in two phases. In phase I, they identify homologs of a novel sequence. In phase II, they infer the function of the novel sequence with …

Journal ArticleDOI
TL;DR: The problem of defining curriculum is approached with a bias that bioinformatics is not simply a proper subset of biology or computer science, but has a growing and independent base of scientific tenets that requires specific training not appropriate for either biology or computer science alone.
Abstract: There seems to be wide agreement within both industry and academia that there are not enough scientists adequately trained in bioinformatics or computational biology. This sentiment stems principally from the difficulties in finding employees, graduate students and post-docs with appropriate skills for joining research and/or development teams in this field. The recent drain of academics into industry threatens to reduce our ability to provide the training needed to meet the demand of the job markets. An obvious question is 'What is the proper curriculum for bioinformatics professionals?' At first, the idea of defining a curriculum for bioinformatics may seem premature. The very definition of bioinformatics is still a matter of some debate. Although some interpret it narrowly as the information science techniques needed to support genome analysis, many have begun to use it synonymously with 'computational molecular biology' or even all of 'computational biology'. For this discussion of curriculum, bioinformatics addresses problems related to the storage, retrieval and analysis of information about biological structure, sequence and function. There are currently two models for training. In the first model, post-doctoral fellows with core training in a technical field (such as computer science) or in a subdiscipline of biology receive speciality training in computational biology in order to become a 'computer scientist who specializes in biology' or a 'biologist who specializes in computer science'. While a valuable strategy, the post-doctoral model suffers because it is an expensive way (both in time and resources) to train individuals — learning the 'other' field is in many cases like going back to graduate school. In the second model, therefore, graduate students are trained primarily in bioinformatics or computational biology, without a preliminary training in one of the contributing disciplines.
The curriculum for these students must provide them with a skill set that is long-lived and endures beyond the current fads of what is considered 'hot'. These students will not have traditional biological science or technical training to fall back on, and so it is critical that we provide the next generation with skills to solve industrial and academic problems that we cannot anticipate. I approach the problem of defining curriculum with a bias that bioinformatics is not simply a proper subset of biology or computer science, but has a growing and independent base of scientific tenets that requires specific training not appropriate for either biology or computer science alone. An appropriate academic curriculum for the …

Journal ArticleDOI
TL;DR: The nearest-neighbour criterion has been used to estimate the predictive accuracy of the classification based on the selected features, and it was found that the classification according to the first nearest neighbour is correct for 80% of the test samples.
Abstract: MOTIVATION Most of the existing methods for genetic sequence classification are based on a computer search for homologies in nucleotide or amino acid sequences. The standard sequence alignment programs scale very poorly as the number of sequences increases or the degree of sequence identity is <30%. Some new computationally inexpensive methods based on nucleotide or amino acid compositional analysis have been proposed, but prediction results are still unsatisfactory and depend on the features chosen to represent the sequences. RESULTS In this paper, a feature selection method based on the Gamma (or near-neighbour) test is proposed. If there is a continuous or smooth map from feature space to the classification target values, the Gamma test gives an estimate for the mean-squared error of the classification, despite the fact that one has no a priori knowledge of the smooth mapping. We can search a large space of possible feature combinations for the combination which gives the smallest estimated mean-squared error using a genetic algorithm. The method was used for feature selection and classification of the large subunits of rRNA according to RDP (Ribosomal Database Project) phylogenetic classes. The sequences were represented by their dinucleotide frequency distributions. The nearest-neighbour criterion has been used to estimate the predictive accuracy of the classification based on the selected features. For the examples discussed, we found that the classification according to the first nearest neighbour is correct for 80% of the test samples. If we consider the set of the 10 nearest neighbours, then 94% of the test samples are classified correctly. AVAILABILITY The principal novel component of this method is the Gamma test and this can be downloaded compiled for Unix Sun 4, Windows 95 and MS-DOS from http://www.cs.cf.ac.uk/ec/ CONTACT s.margetts@cs.cf.ac.uk
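The representation and classifier described above can be sketched directly: each sequence becomes a 16-dimensional dinucleotide-frequency vector, and a query takes the label of its nearest training vector. The Gamma-test feature selection step is omitted here, and the tiny training sequences are invented for illustration.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dinucleotides

def dinuc_freqs(seq):
    """16-dimensional dinucleotide frequency vector of a DNA sequence."""
    counts = {d: 0 for d in DINUCS}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DINUCS]

def nearest_label(query, training):
    """1-NN classification; training is a list of (sequence, label)."""
    q = dinuc_freqs(query)
    def dist(seq):
        v = dinuc_freqs(seq)
        return sum((a - b) ** 2 for a, b in zip(q, v))  # squared Euclidean
    return min(training, key=lambda t: dist(t[0]))[1]

train = [("ATATATATAT", "AT-rich"), ("GCGCGCGCGC", "GC-rich")]
label = nearest_label("ATATATGATA", train)
```

Extending `nearest_label` to a vote over the 10 nearest neighbours gives the second accuracy figure quoted in the abstract.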

Journal ArticleDOI
TL;DR: A new 3D substructure matching algorithm based on geometric hashing techniques drastically reduces the complexity of recognition and finds smaller similarities than previous methods.
Abstract: MOTIVATION Most biological actions of proteins depend on some typical parts of their three-dimensional structure, called 3D motifs. It is desirable to find automatically common geometric substructures between proteins to discover similarities in new structures or to model precisely a particular motif. Most algorithms for structural comparison of proteins deal with large (fold) similarities. Here, we focus on small but precise similarities. RESULTS We propose a new 3D substructure matching algorithm based on geometric hashing techniques. The key feature of the method is the introduction of a 3D reference frame attached to each residue. This allows us to reduce drastically the complexity of the recognition. Our experimental results confirm the validity of the approach and allow us to find smaller similarities than previous methods. AVAILABILITY The program uses commercial libraries and thus cannot be completely freely distributed. It can be found at ftp://www.inria.fr in the directory epidaure/Outgoing/xpennec/Prospect, but it requires a key to be run, available by request to xavier.pennec@sophia.inria.fr CONTACT Xavier.Pennec@sophia.inria.fr; Nicholas.Ayache@sophia.inria.fr
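The residue-attached reference frame at the heart of the method can be sketched as a Gram-Schmidt construction over backbone atom positions. The choice of N, CA and C atoms and the coordinates below are illustrative assumptions, not the paper's exact construction.

```python
# Build an orthonormal frame (origin at CA) from three backbone atoms,
# so that nearby geometry can be expressed in residue-local coordinates,
# which is what shrinks the geometric-hashing search space.

def sub(a, b): return [x - y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]
def norm(a):
    n = dot(a, a) ** 0.5
    return [x / n for x in a]

def residue_frame(n, ca, c):
    """Orthonormal frame built from N, CA, C positions (invented choice)."""
    x = norm(sub(n, ca))                                 # axis 1: CA -> N
    v = sub(c, ca)
    y = norm(sub(v, [dot(v, x) * xi for xi in x]))       # Gram-Schmidt vs x
    z = cross(x, y)                                      # right-handed axis 3
    return x, y, z

x, y, z = residue_frame([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.3, 1.2, 0.0])
```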

Journal ArticleDOI
TL;DR: An EXCEL template has been developed for the calculation of enzyme kinetic parameters by non-linear regression techniques and is accurate, inexpensive, as well as easy to use and modify.
Abstract: Motivation An EXCEL template has been developed for the calculation of enzyme kinetic parameters by non-linear regression techniques. The tool is accurate, inexpensive, as well as easy to use and modify. Availability The program is available from http://www.ebi.ac.uk/biocat/biocat.html Contact agustin.hernandez@bio.kuleuven.ac.be
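Outside a spreadsheet, the same task can be sketched in plain code: fit the Michaelis-Menten equation v = Vmax*S/(Km+S) to rate data by least squares. The coarse-to-fine grid search below stands in for the template's non-linear solver and is an assumption of this sketch; the data points are synthetic, generated from Vmax=10, Km=2.

```python
def mm(s, vmax, km):
    """Michaelis-Menten rate equation."""
    return vmax * s / (km + s)

def fit_mm(S, V, vmax0=(0.1, 50.0), km0=(0.1, 50.0), rounds=30):
    """Coarse-to-fine grid search minimising the sum of squared errors."""
    (vlo, vhi), (klo, khi) = vmax0, km0
    for _ in range(rounds):
        grid = [(vlo + (vhi - vlo) * i / 10, klo + (khi - klo) * j / 10)
                for i in range(11) for j in range(11)]
        vmax, km = min(grid, key=lambda p: sum(
            (v - mm(s, p[0], p[1])) ** 2 for s, v in zip(S, V)))
        dv, dk = (vhi - vlo) / 10, (khi - klo) / 10
        vlo, vhi = vmax - dv, vmax + dv        # zoom in around the best cell
        klo, khi = max(km - dk, 1e-9), km + dk # keep Km positive
    return vmax, km

S = [0.5, 1, 2, 4, 8, 16]                      # substrate concentrations
V = [mm(s, 10.0, 2.0) for s in S]              # noise-free synthetic rates
vmax, km = fit_mm(S, V)
```

With noise-free data the search recovers the generating parameters; with real rate measurements the same objective simply returns the least-squares estimates.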