Showing papers in "Bioinformatics in 2000"


Journal ArticleDOI
TL;DR: The PSIPRED protein structure prediction server allows users to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction both textually via e-mail and graphically via the web.
Abstract: The PSIPRED protein structure prediction server allows users to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction both textually via e-mail and graphically via the web. The user may select one of three prediction methods to apply to their sequence: PSIPRED, a highly accurate secondary structure prediction method; MEMSAT 2, a new version of a widely used transmembrane topology prediction method; or GenTHREADER, a sequence profile based fold recognition method.

3,381 citations


Journal ArticleDOI
TL;DR: Artemis is a DNA sequence visualization and annotation tool that allows the results of any analysis or sets of analyses to be viewed in the context of the sequence and its six-frame translation.
Abstract: Summary: Artemis is a DNA sequence visualization and annotation tool that allows the results of any analysis or sets of analyses to be viewed in the context of the sequence and its six-frame translation. Artemis is especially useful in analysing the compact genomes of bacteria, archaea and lower eukaryotes, and will cope with sequences of any size from small genes to whole genomes. It is implemented in Java, and can be run on any suitable platform. Sequences and annotation can be read and written directly in EMBL, GenBank and GFF format. Availability: Artemis is available under the GNU General Public License from http://www.sanger.ac.uk/Software/Artemis

3,080 citations
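
The six-frame translation mentioned in the Artemis abstract can be illustrated with a minimal Python sketch. This is not Artemis code (Artemis is implemented in Java); the function names `translate` and `six_frame_translation` are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G or T is rendered as `X`.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks a stop codon.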


Journal ArticleDOI
TL;DR: A new method uses support vector machines (SVMs) to classify tissue samples from DNA microarray expression data and to flag mis-labeled or questionable tissue results; other machine learning methods perform comparably to the SVM on many of the datasets analysed.
Abstract: Motivation: DNA microarray experiments, generating thousands of gene expression measurements, are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples, and an exploration of the data for mis-labeled or questionable tissue results. Results: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97 802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets. Availability: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm.

2,464 citations


Journal ArticleDOI
TL;DR: A unified overview is provided of methods currently in wide use for assessing the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, to information theoretic measures such as relative entropy and mutual information.
Abstract: We provide a unified overview of methods that are currently widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, to information theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadvantages of each approach. For classification tasks, we derive new learning algorithms for the design of prediction systems by directly optimising the correlation coefficient. We observe and prove several results relating sensitivity and specificity of optimal systems. While the principles are general, we illustrate the applicability on specific problems such as protein secondary structure and signal peptide prediction.

1,972 citations
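
As an illustration of the accuracy measures surveyed in this entry, the sketch below computes sensitivity, specificity and the Matthews correlation coefficient from the four cells of a binary confusion matrix. The function name `confusion_metrics` is hypothetical, and the definitions used here (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)) are the common textbook ones, which may differ from the conventions adopted in the paper itself.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, Matthews correlation coefficient).

    Ratios with a zero denominator are reported as 0.0 by assumption.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc
```

The correlation coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction), which makes it less sensitive to class imbalance than a raw percentage.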


Journal ArticleDOI
TL;DR: The purpose of this article is to provide a brief history of the development and application of computer algorithms for the analysis and prediction of DNA binding sites.
Abstract: The purpose of this article is to provide a brief history of the development and application of computer algorithms for the analysis and prediction of DNA binding sites. This problem can be conveniently divided into two subproblems. The first is, given a collection of known binding sites, develop a representation of those sites that can be used to search new sequences and reliably predict where additional binding sites occur. The second is, given a set of sequences known to contain binding sites for a common factor, but not knowing where the sites are, discover the location of the sites in each sequence and a representation for the specificity of the protein.

1,556 citations


Journal ArticleDOI
TL;DR: Recombination Detection Program is a program that applies a pairwise scanning approach to the detection of recombination amongst a group of aligned DNA sequences.
Abstract: Recombination Detection Program (RDP) is a program that applies a pairwise scanning approach to the detection of recombination amongst a group of aligned DNA sequences. The software runs under Windows95 and combines highly automated screening of large numbers of sequences with a highly interactive interface for examining the results of the analyses.

1,400 citations


Journal ArticleDOI
TL;DR: It is concluded that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting and bioengineering.
Abstract: Advances in molecular biological, analytical, and computational technologies are enabling us to systematically investigate the complex molecular processes underlying biological systems. In particular, using high-throughput gene expression assays, we are able to measure the output of the gene regulatory network. We aim here to review datamining and modeling approaches for conceptualizing and unraveling the functional relationships implicit in these datasets. Clustering of co-expression profiles allows us to infer shared regulatory inputs and functional pathways. We discuss various aspects of clustering, ranging from distance measures to clustering algorithms and multiple-cluster memberships. More advanced analysis aims to infer causal connections between genes directly, i.e., who is regulating whom and how. We discuss several approaches to the problem of reverse engineering of genetic networks, from discrete Boolean networks, to continuous linear and non-linear models. We conclude that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting, and bioengineering.

1,010 citations


Journal ArticleDOI
TL;DR: Sister-scanning is a method that, unlike previously available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of those variations and distinguishes misleading signals.
Abstract: Motivation: To devise a method that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals. Results: We have developed a method, which we call ‘sister-scanning’, for assessing phylogenetic and compositional signals in the various patterns of identity that occur between four nucleotide sequences. A Monte Carlo randomization is done for all columns (positions) within a window and Z-scores are obtained for four real sequences or three real sequences with an outlier that is also randomized. The usefulness of the approach is demonstrated using tobamovirus and luteovirus sequences. Contradictory phylogenetic signals were distinguished in both datasets, as were regions of sequence that contained no clear signal or potentially misleading signals related to compositional similarities. In the tobamovirus dataset, contradictory phylogenetic signals were separated by coding sequences up to a kilobase long that contained no clear signal. Our re-analysis of this dataset using sister-scanning also yielded the first evidence known to us of an interspecies recombination site within a viral RNA-dependent RNA polymerase gene together with evidence of an unusual pattern of conservation in the three codon positions. Availability: A program package, SiScan, for use under MS-DOS can be downloaded from http://life.anu.edu.au/ with test data and instructions.

975 citations


Journal ArticleDOI
TL;DR: VISTA is a program for visualizing global DNA sequence alignments of arbitrary length that has a clean output, allowing for easy identification of similarity, and is easily configurable, enabling the visualization of alignments at different levels of resolution.
Abstract: VISTA is a program for visualizing global DNA sequence alignments of arbitrary length. It has a clean output, allowing for easy identification of similarity, and is easily configurable, enabling the visualization of alignments of various lengths at different levels of resolution. It is currently available on the web, thus allowing for easy access by all researchers. Availability: VISTA server is available on the web at http://www-gsd.lbl.gov/vista. The source code is available upon request.

956 citations


Journal ArticleDOI
TL;DR: DaliLite is a program for pairwise structure comparison and for structure database searching and a web interface is provided to view the results, multiple alignments and 3D superimpositions of structures.
Abstract: Summary: DaliLite is a program for pairwise structure comparison and for structure database searching. It is a standalone version of the search engine of the popular Dali server. A web interface is provided to view the results, multiple alignments and 3D superimpositions of structures. Availability: DaliLite has been ported to the Linux and Irix operating systems and can be compiled in many other UNIX operating systems. It is found at http://www.embl-ebi.ac.uk/dali/DaliLite.

939 citations


Journal ArticleDOI
TL;DR: A software package is described that provides a graphical summary of linkage disequilibrium in human genetic data, allows for the analysis of family data and is well suited to the analysis of dense genetic maps.
Abstract: Summary: We describe a software package that provides a graphical summary of linkage disequilibrium in human genetic data. It allows for the analysis of family data and is well suited to the analysis of dense genetic maps. Availability: http://www.well.ox.ac.uk/asthma/GOLD Contact: goncalo@well.ox.ac.uk Precise estimates of the location of complex disease genes should permit their identification through positional cloning, even when understanding of the underlying biochemical pathways is limited (Collins, 1992). Public and private genome projects are investing a great deal of effort in the identification of polymorphic sites in the human population. These efforts are cataloguing increasing numbers of single-nucleotide polymorphisms (SNPs) which are well suited to automated high-throughput analysis. A dense genetic map of the human genome should be provided by SNPs in the near future. Traditional linkage analysis, based on allele sharing between relatives, identifies broad chromosomal regions that are likely to contain disease genes. However, the resolution of these methods is limited by the number of recombination events in typical pedigrees and impractical for positional cloning efforts in complex disease. Fine-mapping within the broad regions identified by allele-sharing methods is a major challenge. Gene mapping strategies based on linkage disequilibrium are expected to have much greater resolution, and should be able to capitalize on dense SNP maps as they become available (Risch and Merikangas, 1996). As ancestral haplotypes propagate through a population, their physical length is reduced by recombination events. Recombination events between markers separated by very short distances are very rare. Individuals inheriting a disease mutation from a common, but possibly distant, ancestor are expected to share a region of the ancestral haplotype in which the mutation originated. Markers within this shared haplotype are non-randomly associated

Journal ArticleDOI
TL;DR: A WWW server for the on-line prediction of the biological activity spectra of substances has been constructed and a WWW interface for the PASS software is developed.
Abstract: The concept of the biological activity spectrum was introduced to describe the properties of biologically active substances. The PASS (prediction of activity spectra for substances) software product, which predicts more than 300 pharmacological effects and biochemical mechanisms on the basis of the structural formula of a substance, may be efficiently used to find new targets (mechanisms) for some ligands and, conversely, to reveal new ligands for some biological targets. We have developed a WWW interface for the PASS software. A WWW server for the on-line prediction of the biological activity spectra of substances has been constructed.

Journal ArticleDOI
TL;DR: Algorithms are proposed for inferring genetic network architecture from time series data of gene expression patterns under three models (Boolean networks with noise, qualitative networks and S-systems), motivated by the observation that the plain Boolean network is not sufficient as a model of a genetic network.
Abstract: Motivation: Inferring genetic network architecture from time series data of gene expression patterns is an important topic in bioinformatics. Although inference algorithms based on the Boolean network were proposed, the Boolean network was not sufficient as a model of a genetic network. Results: First, a Boolean network model with noise is proposed, together with an inference algorithm for it. Next, a qualitative network model is proposed, in which regulation rules are represented as qualitative rules and embedded in the network structure. Algorithms are also presented for inferring qualitative relations from time series data. Then, an algorithm for inferring S-systems (synergistic and saturable systems) from time series data is presented, where S-systems are based on a particular kind of nonlinear differential equation and have been applied to the analysis of various biological systems. Theoretical results are shown for Boolean networks with noises and simple qualitative networks. Computational results are shown for Boolean networks with noises and S-systems, where real data are not used because the proposed models are still conceptual and the quantity and quality of currently available data are not enough for the application of the proposed methods.

Journal ArticleDOI
TL;DR: The program provides a maximum likelihood estimate of the rate and also the associated date of the most recent common ancestor of the sequences, under a model which assumes a constant rate of substitution (molecular clock) but which accommodates the dates of isolation.
Abstract: Motivation: TipDate is a program that will use sequences that have been isolated at different dates to estimate their rate of molecular evolution. The program provides a maximum likelihood estimate of the rate and also the associated date of the most recent common ancestor of the sequences, under a model which assumes a constant rate of substitution (molecular clock) but which accommodates the dates of isolation. Confidence intervals for these parameters are also estimated. Results: The approach was applied to a sample of 17 dengue virus serotype 4 sequences, isolated at dates ranging from 1956 to 1994. The rate of substitution for this serotype was estimated to be 7.91 × 10⁻⁴ substitutions per site per year (95% confidence intervals of 6.07 × 10⁻⁴, 9.86 × 10⁻⁴). This is compatible with a date of 1922 (95% confidence intervals of 1900-1936) for the most recent common ancestor of these sequences. Availability: TipDate can be obtained by WWW from http://evolve.zoo.ox.ac.uk/software. The package includes the source code, manual and example files. Both UNIX and Apple Macintosh versions are available from the same

Journal ArticleDOI
TL;DR: RRTree is a user-friendly program for comparing substitution rates between lineages of protein or DNA sequences, relative to an outgroup, through relative rate tests.
Abstract: Summary: RRTree is a user-friendly program for comparing substitution rates between lineages of protein or DNA sequences, relative to an outgroup, through relative rate tests. Genetic diversity is taken into account through use of several sequences, and phylogenetic relations are integrated by topological weighting. Availability: The ANSI C source code of RRTree, and compiled versions for Macintosh, MS-DOS/Windows, SUN Solaris, and CGI, are freely available at http://pbil.univ-lyon1.fr/software/rrtree.html Contact: marc.robinson@ens-lyon.fr.

Journal ArticleDOI
TL;DR: A scanning algorithm for detecting noncoding RNA genes in genome sequences is developed, using a fully probabilistic version of the Zuker minimum-energy folding algorithm; the authors conclude that although a distinct, stable secondary structure is undoubtedly important in most noncoding RNAs, the stability of most noncoding RNA secondary structures is not sufficiently different from the predicted stability of a random sequence to be useful as a general genefinding approach.
Abstract: Motivation: Several results in the literature suggest that biologically interesting RNAs have secondary structures that are more stable than expected by chance. Based on these observations, we developed a scanning algorithm for detecting noncoding RNA genes in genome sequences, using a fully probabilistic version of the Zuker minimum-energy folding algorithm. Results: Preliminary results were encouraging, but certain anomalies led us to do a carefully controlled investigation of this class of methods. Ultimately, our results argue that for the probabilistic model there is indeed a statistical effect, but it comes mostly from local base-composition bias and not from RNA secondary structure. For the thermodynamic implementation (which evaluates statistical significance by doing Monte Carlo shuffling in fixed-length sequence windows, thus eliminating the base-composition effect) the signals for noncoding RNAs are still usually indistinguishable from noise, especially when certain statistical artifacts resulting from local base-composition inhomogeneity are taken into account. We conclude that although a distinct, stable secondary structure is undoubtedly important in most noncoding RNAs, the stability of most noncoding RNA secondary structures is not sufficiently different from the predicted stability of a random sequence to be useful as a general genefinding approach.

Journal ArticleDOI
TL;DR: TeXshade is the first TeX-based alignment shading software featuring, in addition to standard identity and similarity shading, special modes for the display of functional aspects such as charge, hydropathy or solvent accessibility.
Abstract: Motivation: Typesetting, shading and labeling of nucleotide and peptide alignments using standard word processing or graphics software is time consuming. Available automatic sequence shading programs usually do not allow manual application of additional shadings or labels. Hence, a flexible alignment shading package was designed for both calculated and manual shading, using the macro language of the scientific typesetting software LaTeX2e. Results: TeXshade is the first TeX-based alignment shading software featuring, in addition to standard identity and similarity shading, special modes for the display of functional aspects such as charge, hydropathy or solvent accessibility. A plenitude of commands for manual shading, graphical labels, re-arrangements of the sequence order, numbering, legends etc. is implemented. Further, TeXshade allows the inclusion and display of secondary structure predictions in the DSSP-, STRIDE- and PHD-format. Availability: From http://homepages.uni-tuebingen.de/beitz/tse.html (macro package and on-line documentation)

Journal ArticleDOI
TL;DR: This program compares sequence sets to a reference sequence, tallies G→A hypermutations, and presents the results in various tables and graphs, which include dinucleotide context, summaries of all observed nucleotide changes, and stop codons introduced by hypermutation.
Abstract: Summary: This program compares sequence sets to a reference sequence, tallies G→A hypermutations, and presents the results in various tables and graphs, which include dinucleotide context, summaries of all observed nucleotide changes, and stop codons introduced by hypermutation. Availability: www.hiv.lanl.gov/HYPERMUT/hypermut.html
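
The tallying step described in this entry can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the program's actual code: the function name `tally_g_to_a` is hypothetical, the two sequences are assumed to be pre-aligned to equal length, and the dinucleotide context is taken (by assumption) from the reference base immediately downstream of the mutated G.

```python
from collections import Counter


def tally_g_to_a(reference, query):
    """Count G->A changes in an aligned query, grouped by dinucleotide context.

    Returns (total, contexts), where contexts maps the reference dinucleotide
    starting at the mutated G (for example "GA") to a count. A G at the last
    position has no downstream base and is reported with "-".
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    reference, query = reference.upper(), query.upper()
    contexts = Counter()
    for i, (ref_base, query_base) in enumerate(zip(reference, query)):
        if ref_base == "G" and query_base == "A":
            downstream = reference[i + 1] if i + 1 < len(reference) else "-"
            contexts["G" + downstream] += 1
    return sum(contexts.values()), dict(contexts)
```

For instance, `tally_g_to_a("GGATGCGA", "AGATACAA")` reports three G→A changes, one each in the GG, GC and GA contexts.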

Journal ArticleDOI
TL;DR: InterPro is a new integrated documentation resource for protein families, domains and functional sites, developed initially as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects.
Abstract: MOTIVATION: InterPro is a new integrated documentation resource for protein families, domains and functional sites, developed initially as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. RESULTS: Merged annotations from PRINTS, PROSITE and Pfam form the InterPro core. Each combined InterPro entry includes functional descriptions and literature references, and links are made back to the relevant parent database(s), allowing users to see at a glance whether a particular family or domain has associated patterns, profiles, fingerprints, etc. Merged and individual entries (i.e. those that have no counterpart in the companion resources) are assigned unique accession numbers. Release 1.2 of InterPro (June 2000) contains over 3000 entries, representing families, domains, repeats and sites of post-translational modification (PTMs) encoded by 6581 different regular expressions, profiles, fingerprints and Hidden Markov Models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1000000 hits from 264333 different proteins out of 384572 in SWISS-PROT and TrEMBL).

Journal ArticleDOI
TL;DR: A fast implementation of the Smith-Waterman sequence-alignment algorithm using Single-Instruction, Multiple-Data (SIMD) technology is presented, based on the MultiMedia eXtensions (MMX) and Streaming SIMD Extensions (SSE) technology that is embedded in Intel's latest microprocessors.
Abstract: Motivation: Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence-search algorithm is that of Smith–Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. Results: A fast implementation of the Smith–Waterman sequence-alignment algorithm using Single-Instruction, Multiple-Data (SIMD) technology is presented. This implementation is based on the MultiMedia eXtensions (MMX) and Streaming SIMD Extensions (SSE) technology that is embedded in Intel’s latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith–Waterman implementation on the same hardware was achieved by an optimized 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500 MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date. Availability: Online searches with the software are available at http://dna.uio.no/search/
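
For readers unfamiliar with the "cell updates" counted in this entry, the sketch below shows the scalar Smith-Waterman recurrence that the SIMD implementation accelerates. It is a plain Python illustration, not the paper's MMX/SSE code: the function name `smith_waterman_score`, the match/mismatch scores and the linear gap penalty are all assumptions chosen for brevity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty.

    Only two rows of the dynamic programming matrix are kept, since the
    score (not the alignment itself) is all that is returned.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            # One "cell update": the best of restarting, extending the
            # diagonal, or opening a gap in either sequence.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Each inner-loop iteration is one cell update, so aligning sequences of lengths m and n costs m × n updates; the SIMD approach described above computes several such cells per instruction.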

Journal ArticleDOI
TL;DR: An optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences is provided and it is demonstrated that color spectrograms can visually provide significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function.
Abstract: Motivation: Frequency-domain analysis of biomolecular sequences is hindered by their representation as strings of characters. If numerical values are assigned to each of these characters, then the resulting numerical sequences are readily amenable to digital signal processing. Results: We introduce new computational and visual tools for biomolecular sequences analysis. In particular, we provide an optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences. We also show that the phase of a properly defined Fourier transform is a powerful predictor of the reading frame of protein coding regions. Resulting color maps help in visually identifying not only the existence of protein coding areas for both DNA strands, but also the coding direction and the reading frame for each of the exons. Furthermore, we demonstrate that color spectrograms can visually provide, in the form of local ‘texture’, significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function. Availability: All software for techniques described in this paper is available from the author upon request.
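
A minimal sketch of the kind of baseline Fourier measure that this entry sets out to improve upon: each base is mapped to a 0/1 indicator sequence, and the spectral power at frequency 1/3 (the period-3 signal typical of coding regions) is summed over the four bases. The function name `period3_power` is hypothetical, and the equal weighting of the four bases is an assumption; the optimized numerical assignments proposed in the paper are not reproduced here.

```python
import cmath


def period3_power(seq):
    """Sum over A, C, G, T of |DFT coefficient at frequency 1/3|^2.

    Each base contributes the Fourier coefficient of its indicator
    sequence (1 where the base occurs, 0 elsewhere).
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        coefficient = sum(cmath.exp(-2j * cmath.pi * n / 3)
                          for n, char in enumerate(seq) if char == base)
        total += abs(coefficient) ** 2
    return total
```

A perfectly periodic sequence such as `"ATG" * 10` scores 300 (up to floating-point rounding), whereas a homopolymer of length 30 scores essentially zero; sliding this measure along a genome in fixed-length windows gives a simple coding-potential profile.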

Journal ArticleDOI
TL;DR: The IMpRH database is the official database for submission of new results and queries and not only permits the sharing of public data but also semi-private and private data.
Abstract: Summary: The INRA-Minnesota Porcine Radiation Hybrid (IMpRH) Server provides both a mapping tool (IMpRH mapping tool) and a database (IMpRH database) of officially submitted results. The mapping tool permits the mapping of a new marker relative to markers previously mapped on the IMpRH panel. The IMpRH database is the official database for submission of new results and queries. The database not only permits the sharing of public data but also semi-private and private data. Availability: http://imprh.toulouse.inra.fr

Journal ArticleDOI
TL;DR: MaskerAid is a software enhancement to RepeatMasker that increases the speed of masking more than 30-fold at the most sensitive setting, relieving a costly bottleneck in large-scale analyses.
Abstract: Summary: Identifying and masking repetitive elements is usually the first step when analyzing vertebrate genomic sequence. Current repeat identification software is sensitive but slow, creating a costly bottleneck in large-scale analyses. We have developed MaskerAid, a software enhancement to RepeatMasker that increased the speed of masking more than 30-fold at the most sensitive setting. Availability: On request from the authors (see http://sapiens.wustl.edu/MaskerAid).

Journal ArticleDOI
Peter D. Karp
TL;DR: The article explores the notion of computing with function, and explains the importance of ontologies of function to bioinformatics, and presents the functional ontology developed for the EcoCyc database.
Abstract: Motivation: A number of important bioinformatics computations involve computing with function: executing computational operations whose inputs or outputs are descriptions of the functions of biomolecules. Examples include performing functional queries to sequence and pathway databases, and determining functional equality to evaluate algorithms that predict function from sequence. A prerequisite to computing with function is the existence of an ontology that provides a structured semantic encoding of function. Functional bioinformatics is an emerging subfield of bioinformatics that is concerned with developing ontologies and algorithms for computing with biological function. Results: The article explores the notion of computing with function, and explains the importance of ontologies of function to bioinformatics. The functional ontology developed for the EcoCyc database is presented. This ontology can encode a diverse array of biochemical processes, including enzymatic reactions involving small-molecule substrates and macromolecular substrates, signal-transduction processes, transport events, and mechanisms of regulation of gene expression. The ontology is validated through its use to express complex functional queries for the EcoCyc DB.

Journal ArticleDOI
TL;DR: A new algorithm for the automatic clustering of protein sequence datasets has been developed that represents all similarity relationships within the dataset in a binary matrix and can hence quickly and accurately cluster large protein datasets into families.
Abstract: Motivation: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. Results: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith-Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith-Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. Availability: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.
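
Two of the steps named in this entry, symmetrification of a binary similarity matrix and single-linkage clustering, can be sketched as follows. This is a deliberately simplified illustration and not the GeneRAGE implementation: here a one-directional hit is simply discarded, whereas the abstract describes resolving such cases with Smith-Waterman alignments, and the multi-domain detection step is omitted entirely. The function names are hypothetical.

```python
def symmetrify(hits):
    """Keep a similarity relationship only if it was found in both directions.

    Simplifying assumption: asymmetric hits are dropped rather than
    re-checked with a Smith-Waterman alignment.
    """
    n = len(hits)
    return [[1 if hits[i][j] and hits[j][i] else 0 for j in range(n)]
            for i in range(n)]


def single_linkage_clusters(matrix):
    """Group proteins into families as connected components of the matrix."""
    n = len(matrix)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if i != j and matrix[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters
```

With single linkage, one spurious relationship is enough to merge two unrelated families, which is why the matrix is corrected before clustering.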

Journal ArticleDOI
TL;DR: It is found that domain boundary predictions are surprisingly successful for sequences up to 400 residues long and that guessing domain boundaries in this way can improve the sensitivity of threading analysis.
Abstract: Motivation: The sizes of protein domains observed in the 3D-structure database follow a surprisingly narrow distribution. Structural domains are furthermore formed from a single-chain continuous segment in over 80% of instances. These observations imply that some choices of domain boundaries on an otherwise uncharacterized sequence are more likely than others, based solely on the size and segment number of predicted domains. This property might be used to guess the locations of protein domain boundaries. Results: To test this possibility we enumerate putative domain boundaries and calculate their relative likelihood under a probability model that considers only the size and segment number of predicted domains. We ask, in a cross-validated test using sequences with known 3D structure, whether the most likely guesses agree with the observed domain structure. We find that domain boundary predictions are surprisingly successful for sequences up to 400 residues long and that guessing domain boundaries in this way can improve the sensitivity of threading analysis. Availability: The DGS algorithm, for ‘Domain Guess by Size’, is available as a web service at http://www.ncbi.nlm.nih.gov/dgs. This site also provides the DGS source code.

Journal ArticleDOI
TL;DR: A complete data specification in ASN.1 that can describe information about biomolecular interactions, complexes and pathways is defined and used in the Biomolecular Interaction Network Database (BIND).
Abstract: Motivation: Proteomics is gearing up towards high-throughput methods for identifying and characterizing all of the proteins, protein domains and protein interactions in a cell and will eventually create more recorded biological information than the Human Genome Project. Each protein expressed in a cell can interact with various other proteins and molecules in the course of its function. A standard data specification is required that can describe and store this information in all its detail and allow efficient cross-platform transfer of data. A complete specification must be the basis for any database or tool for managing and analysing this information. Results: We have defined a complete data specification in ASN.1 that can describe information about biomolecular interactions, complexes and pathways. Our group is using this data specification in our database, the Biomolecular Interaction Network Database (BIND). An interaction record is based on the interaction between two objects. An object can be a protein, DNA, RNA, ligand, molecular complex or an interaction. Interaction description encompasses cellular location, experimental conditions used to observe the interaction, conserved sequence, molecular location, chemical action, kinetics, thermodynamics, and chemical state. Molecular complexes are defined as collections of more than two interactions that form a complex, with extra descriptive information such as complex topology. Pathways are defined as collections of more than two interactions that form a pathway, with additional descriptive information such as cell cycle stage. A request for proposal of a human readable flat-file format that mirrors the BIND data specification is also tendered for interested parties.
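
The record structure described in this abstract (an interaction between two objects, with pathways as collections of interactions) might be modelled along these lines. The actual BIND specification is written in ASN.1. The Python class and field names here are hypothetical and cover only a few of the listed descriptors.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ObjectKind(Enum):
    """Object types listed in the abstract."""
    PROTEIN = "protein"
    DNA = "DNA"
    RNA = "RNA"
    LIGAND = "ligand"
    COMPLEX = "molecular complex"
    INTERACTION = "interaction"


@dataclass
class BioObject:
    kind: ObjectKind
    name: str


@dataclass
class Interaction:
    """An interaction record is based on exactly two objects."""
    a: BioObject
    b: BioObject
    cellular_location: Optional[str] = None        # hypothetical field name
    experimental_conditions: Optional[str] = None  # hypothetical field name


@dataclass
class Pathway:
    """A collection of more than two interactions, per the abstract."""
    interactions: List[Interaction]
    cell_cycle_stage: Optional[str] = None

    def __post_init__(self) -> None:
        if len(self.interactions) <= 2:
            raise ValueError("a pathway needs more than two interactions")


# Placeholder objects; the names carry no biological meaning.
objs = [BioObject(ObjectKind.PROTEIN, name) for name in "ABCD"]
steps = [Interaction(objs[i], objs[i + 1]) for i in range(3)]
pathway = Pathway(steps, cell_cycle_stage="G1")
```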

Journal ArticleDOI
TL;DR: This paper compares three statistical tests for detecting significant changes of gene expression in SAGE experiments and shows that the Chi-square test has the best power and robustness.
Abstract: MOTIVATION: The Serial Analysis of Gene Expression (SAGE) technology determines the expression level of a gene by measuring the frequency of a sequence tag derived from the corresponding mRNA transcript. Several statistical tests have been developed to detect significant differences in tag frequency between two samples. However, which one of these tests has the greatest power to detect real changes remains undetermined. RESULTS: This paper compares three statistical tests for detecting significant changes of gene expression in SAGE experiments. The comparison makes use of Monte Carlo simulation that, in essence, generates "virtual" SAGE experiments. Our analysis shows that the Chi-square test has the best power and robustness. Since the POWER_SAGE program can easily run "virtual" SAGE studies with different combinations of sample size and tag frequency and determine the power for each combination, it can serve as a useful tool for planning SAGE experiments. AVAILABILITY: The POWER_SAGE software is available upon request from the authors. CONTACT: michael.man@pfizer.com
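
The Monte Carlo power estimate described here can be approximated with a short simulation. This sketch is not the POWER_SAGE program. It assumes binomial sampling of tag counts and a Pearson chi-square test on a 2x2 table without continuity correction. The function names and the sample sizes in the usage lines are illustrative assumptions.

```python
import random

CHI2_CRIT_1DF_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def chi_square_2x2(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pearson chi-square statistic for tag counts x1/n1 versus x2/n2."""
    a, b = x1, n1 - x1
    c, d = x2, n2 - x2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of a difference
    return (n1 + n2) * (a * d - b * c) ** 2 / denom


def draw_tag_count(n: int, p: float, rng: random.Random) -> int:
    """One 'virtual' SAGE library: count of a tag with frequency p in n tags."""
    return sum(1 for _ in range(n) if rng.random() < p)


def estimate_power(p1: float, p2: float, n1: int, n2: int,
                   trials: int = 200, seed: int = 0) -> float:
    """Fraction of simulated experiments in which the test rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x1 = draw_tag_count(n1, p1, rng)
        x2 = draw_tag_count(n2, p2, rng)
        if chi_square_2x2(x1, n1, x2, n2) > CHI2_CRIT_1DF_05:
            rejections += 1
    return rejections / trials


# Illustrative settings: a threefold change in tag frequency, 2000 tags each.
power = estimate_power(0.01, 0.03, 2000, 2000)
false_positive_rate = estimate_power(0.01, 0.01, 2000, 2000)
```

Varying the sample sizes and tag frequencies in the final calls gives a rough planning table of the kind the abstract mentions.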

Journal ArticleDOI
TL;DR: A one-to-one correspondence is shown between a polynomial time dynamic programming algorithm and a formal transformational grammar for RNA secondary structure with pseudoknots, which encompasses the context-free grammars and goes beyond to generate pseudoknotted structures.
Abstract: Motivation: In a previous paper, we presented a polynomial time dynamic programming algorithm for predicting optimal RNA secondary structure including pseudoknots. However, a formal grammatical representation for RNA secondary structure with pseudoknots was still lacking. Results: Here we show a one-to-one correspondence between that algorithm and a formal transformational grammar. This grammar class encompasses the context-free grammars and goes beyond to generate pseudoknotted structures. The pseudoknot grammar avoids the use of general context-sensitive rules by introducing a small number of auxiliary symbols used to reorder the strings generated by an otherwise context-free grammar. This formal representation of the residue correlations in RNA structure is important because it means we can build full probabilistic models of RNA secondary structure, including pseudoknots, and use them to optimally parse sequences in polynomial time.
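
Context-free grammars generate only nested base pairs, whereas a pseudoknot contains at least two pairs that cross. The check below illustrates that distinction. It is not the paper's grammar or parsing algorithm, and the function name `has_pseudoknot` is hypothetical.

```python
from typing import List, Tuple


def has_pseudoknot(pairs: List[Tuple[int, int]]) -> bool:
    """Return True if any two base pairs cross, i.e. i < k < j < l.

    Nested (i < k < l < j) and disjoint (i < j < k < l) pairs are what a
    context-free grammar can generate; crossing pairs are not.
    """
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a in range(len(norm)):
        i, j = norm[a]
        for b in range(a + 1, len(norm)):
            k, l = norm[b]
            if i < k < j < l:
                return True
    return False


nested = [(0, 9), (1, 8), (2, 7)]   # a simple hairpin stem
crossing = [(0, 5), (3, 8)]         # the second pair opens inside the first
nested_result = has_pseudoknot(nested)
crossing_result = has_pseudoknot(crossing)
```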

Journal ArticleDOI
TL;DR: A novel algorithm for low-complexity region detection and selective masking based on multiple-pass Smith-Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties that is sufficient for masking database query sequences without generating false positives.
Abstract: Motivation: Sensitive detection and masking of low-complexity regions in protein sequences. Filtered sequences can be used in sequence comparison without the risk of matching compositionally biased regions. The main advantage of the method over similar approaches is the selective masking of single residue types without affecting other, possibly important, regions. Results: A novel algorithm for low-complexity region detection and selective masking. The algorithm is based on multiple-pass Smith–Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties. The output of the algorithm is both the masked query sequence for further analysis, e.g. database searches, as well as the regions of low complexity. The detection of low-complexity regions is highly specific for single residue types. It is shown that this approach is sufficient for masking database query sequences without generating false positives. The algorithm is benchmarked against widely available algorithms using the 210 genes of Plasmodium falciparum chromosome 2, a dataset known to contain a large number of low-complexity regions. Availability: CAST (version 1.0) executable binaries are available to academic users free of charge under license. Web site entry point, server and additional material: http://www.ebi.ac.uk/research/cgg/services/cast/
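
With infinite gap penalties, a local alignment against a homopolymer reduces to finding the highest-scoring contiguous segment of the query, which the sketch below exploits. It is not the CAST implementation. The +2/-1 scoring, the threshold of 10 and the function names are assumptions, and only a single pass for one residue type is shown rather than multiple passes over all twenty.

```python
from typing import Tuple


def best_segment(query: str, residue: str,
                 match: int = 2, mismatch: int = -1) -> Tuple[int, int, int]:
    """Best ungapped local segment of `query` against a homopolymer.

    Returns (score, start, end) with `end` exclusive; (0, 0, 0) if none.
    """
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for pos, ch in enumerate(query):
        if run_score == 0:
            run_start = pos  # a new candidate segment begins here
        run_score += match if ch == residue else mismatch
        if run_score <= 0:
            run_score = 0    # local alignment: never carry a negative score
        elif run_score > best[0]:
            best = (run_score, run_start, pos + 1)
    return best


def mask_residue(query: str, residue: str,
                 threshold: int = 10, mask_char: str = "X") -> str:
    """Mask only `residue` inside the best segment, if it scores well enough."""
    score, start, end = best_segment(query, residue)
    if score < threshold:
        return query
    region = query[start:end].replace(residue, mask_char)
    return query[:start] + region + query[end:]


segment = best_segment("MKTNNNNNNNNLV", "N")
masked = mask_residue("NNNNQNNNN", "N")
```

In the second usage line, the glutamine inside the asparagine-rich stretch is left untouched, which mirrors the selective masking of a single residue type that the abstract emphasises.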