Showing papers in "Bioinformatics in 2000"

Open Access

Journal Article•DOI•

The PSIPRED protein structure prediction server.

Liam J. McGuffin¹, Kevin Bryson, David T. Jones²•Institutions (2)

University of Warwick¹, Brunel University London²

01 Apr 2000-Bioinformatics

TL;DR: The PSIPRED protein structure prediction server allows users to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction both textually via e-mail and graphically via the web.

Abstract: The PSIPRED protein structure prediction server allows users to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction both textually via e-mail and graphically via the web. The user may select one of three prediction methods to apply to their sequence: PSIPRED, a highly accurate secondary structure prediction method; MEMSAT 2, a new version of a widely used transmembrane topology prediction method; or GenTHREADER, a sequence profile based fold recognition method.

3,381 citations

Journal Article•DOI•

Artemis: sequence visualization and annotation.

Kim Rutherford¹, Julian Parkhill, James Crook, Terry Horsnell, Peter M. Rice, Marie-Adèle Rajandream, Bart Barrell•Institutions (1)

Wellcome Trust¹

01 Oct 2000-Bioinformatics

TL;DR: Artemis is a DNA sequence visualization and annotation tool that allows the results of any analysis or sets of analyses to be viewed in the context of the sequence and its six-frame translation.

Abstract: Summary: Artemis is a DNA sequence visualization and annotation tool that allows the results of any analysis or sets of analyses to be viewed in the context of the sequence and its six-frame translation. Artemis is especially useful in analysing the compact genomes of bacteria, archaea and lower eukaryotes, and will cope with sequences of any size from small genes to whole genomes. It is implemented in Java, and can be run on any suitable platform. Sequences and annotation can be read and written directly in EMBL, GenBank and GFF format. Availability: Artemis is available under the GNU General Public License from http://www.sanger.ac.uk/Software/Artemis
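
To illustrate what a six-frame translation involves, here is a minimal Python sketch. It is not taken from Artemis (which is implemented in Java); the function names are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT character is rendered as 'X'.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order (assumed table).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

For the toy sequence above, the forward frame 1 reads ATG GCC TAA, and the three reverse frames are read from the reverse complement TTAGGCCAT.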

3,080 citations

Journal Article•DOI•

Support vector machine classification and validation of cancer tissue samples using microarray expression data

Terrence S. Furey¹, Nello Cristianini², Nigel Duffy¹, David W. Bednarski³, Michèl Schummer³, David Haussler¹•Institutions (3)

University of California, Santa Cruz¹, University of Bristol², University of Washington³

01 Oct 2000-Bioinformatics

TL;DR: A new method uses support vector machines (SVMs) to classify tissue samples from microarray expression data and to explore the data for mis-labeled or questionable tissue results; other machine learning methods are shown to perform comparably to the SVM on many of the datasets.

Abstract: Motivation: DNA microarray experiments generating thousands of gene expression measurements are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples, and an exploration of the data for mis-labeled or questionable tissue results. Results: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97 802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets. Availability: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm.

2,464 citations

Journal Article•DOI•

Assessing the accuracy of prediction algorithms for classification: an overview

Pierre Baldi¹, Søren Brunak, Yves Chauvin, Claus A. Andersen, Henrik Nielsen•Institutions (1)

University of California, Irvine¹

01 May 2000-Bioinformatics

TL;DR: A unified overview is provided of methods that are currently widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, to information theoretic measures such as relative entropy and mutual information.

Abstract: We provide a unified overview of methods that currently are widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients, to information theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadvantages of each approach. For classification tasks, we derive new learning algorithms for the design of prediction systems by directly optimising the correlation coefficient. We observe and prove several results relating sensitivity and specificity of optimal systems. While the principles are general, we illustrate the applicability on specific problems such as protein secondary structure and signal peptide prediction.
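
As a rough illustration of the correlation coefficient for a two-class prediction task, the sketch below computes the Matthews correlation coefficient together with sensitivity and specificity from confusion-matrix counts. The function names are hypothetical, and the definitions used are the common textbook ones (specificity as TN / (TN + FP)), which may differ from the conventions adopted in the paper itself.

```python
import math


def confusion_counts(actual, predicted):
    """Count (tp, tn, fp, fn) for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative (textbook definition)."""
    return tn / (tn + fp)


def matthews_correlation(tp, tn, fp, fn):
    """Correlation between actual and predicted labels; 0.0 if undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(actual, predicted)
mcc = matthews_correlation(tp, tn, fp, fn)
```

Unlike a raw percentage of correct predictions, the correlation coefficient uses all four counts, so it stays informative when the two classes are very unequal in size.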

1,972 citations

Journal Article•DOI•

DNA binding sites: representation and discovery.

Gary D. Stormo¹•Institutions (1)

Washington University in St. Louis¹

01 Jan 2000-Bioinformatics

TL;DR: The purpose of this article is to provide a brief history of the development and application of computer algorithms for the analysis and prediction of DNA binding sites.

Abstract: The purpose of this article is to provide a brief history of the development and application of computer algorithms for the analysis and prediction of DNA binding sites. This problem can be conveniently divided into two subproblems. The first is, given a collection of known binding sites, develop a representation of those sites that can be used to search new sequences and reliably predict where additional binding sites occur. The second is, given a set of sequences known to contain binding sites for a common factor, but not knowing where the sites are, discover the location of the sites in each sequence and a representation for the specificity of the protein.
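
The first subproblem can be sketched with a log-odds weight matrix, one widely used representation of binding sites. The Python below is an illustrative sketch only: the function names, the pseudocount of 1, the uniform background of 0.25 and the toy sites are all assumptions, not details from the article.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence):
    """Score every window of the matrix width; return (start, score) pairs."""
    width = len(pwm)
    return [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]


known_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]  # toy example sites
pwm = build_pwm(known_sites)
best_start, best_score = max(scan(pwm, "GGCTATAATGC"), key=lambda hit: hit[1])
```

The second subproblem, where the site locations are unknown, needs an iterative or probabilistic search over candidate positions and is not covered by this sketch.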

1,556 citations

Journal Article•DOI•

RDP: detection of recombination amongst aligned sequences

Darren P. Martin¹, Edward P. Rybicki•Institutions (1)

University of Cape Town¹

01 Jun 2000-Bioinformatics

TL;DR: Recombination Detection Program is a program that applies a pairwise scanning approach to the detection of recombination amongst a group of aligned DNA sequences.

Abstract: Recombination Detection Program (RDP) is a program that applies a pairwise scanning approach to the detection of recombination amongst a group of aligned DNA sequences. The software runs under Windows 95 and combines highly automated screening of large numbers of sequences with a highly interactive interface for examining the results of the analyses.

1,400 citations

Journal Article•DOI•

Genetic network inference: from co-expression clustering to reverse engineering.

Patrik D'haeseleer¹, Shoudan Liang, Roland Somogyi²•Institutions (2)

University of New Mexico¹, Incyte²

01 Aug 2000-Bioinformatics

TL;DR: It is concluded that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting and bioengineering.

Abstract: Advances in molecular biological, analytical, and computational technologies are enabling us to systematically investigate the complex molecular processes underlying biological systems. In particular, using high-throughput gene expression assays, we are able to measure the output of the gene regulatory network. We aim here to review datamining and modeling approaches for conceptualizing and unraveling the functional relationships implicit in these datasets. Clustering of co-expression profiles allows us to infer shared regulatory inputs and functional pathways. We discuss various aspects of clustering, ranging from distance measures to clustering algorithms and multiple-cluster memberships. More advanced analysis aims to infer causal connections between genes directly, i.e., who is regulating whom and how. We discuss several approaches to the problem of reverse engineering of genetic networks, from discrete Boolean networks, to continuous linear and non-linear models. We conclude that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting, and bioengineering.

1,010 citations

Journal Article•DOI•

Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences.

Mark J. Gibbs¹, John S. Armstrong, Adrian J. Gibbs•Institutions (1)

Australian National University¹

01 Jul 2000-Bioinformatics

TL;DR: A method called sister-scanning is developed that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals.

Abstract: Motivation: To devise a method that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals. Results: We have developed a method, which we call ‘sister-scanning’, for assessing phylogenetic and compositional signals in the various patterns of identity that occur between four nucleotide sequences. A Monte Carlo randomization is done for all columns (positions) within a window and Z-scores are obtained for four real sequences or three real sequences with an outlier that is also randomized. The usefulness of the approach is demonstrated using tobamovirus and luteovirus sequences. Contradictory phylogenetic signals were distinguished in both datasets, as were regions of sequence that contained no clear signal or potentially misleading signals related to compositional similarities. In the tobamovirus dataset, contradictory phylogenetic signals were separated by coding sequences up to a kilobase long that contained no clear signal. Our re-analysis of this dataset using sister-scanning also yielded the first evidence known to us of an interspecies recombination site within a viral RNA-dependent RNA polymerase gene together with evidence of an unusual pattern of conservation in the three codon positions. Availability: A program package, SiScan, for use under MS-DOS can be downloaded from http://life.anu.edu.au/ with test data and instructions.

975 citations

Journal Article•DOI•

VISTA: visualizing global DNA sequence alignments of arbitrary length.

Chris Mayor¹, Michael Brudno, Jody R. Schwartz, Alexander Poliakov, Edward M. Rubin, Kelly A. Frazer, Lior Pachter, Inna Dubchak•Institutions (1)

Lawrence Berkeley National Laboratory¹

01 Nov 2000-Bioinformatics

TL;DR: VISTA is a program for visualizing global DNA sequence alignments of arbitrary length that has a clean output, allowing for easy identification of similarity, and is easily configurable, enabling the visualization of alignments at different levels of resolution.

Abstract: VISTA is a program for visualizing global DNA sequence alignments of arbitrary length. It has a clean output, allowing for easy identification of similarity, and is easily configurable, enabling the visualization of alignments of various lengths at different levels of resolution. It is currently available on the web, thus allowing for easy access by all researchers. Availability: VISTA server is available on the web at http://www-gsd.lbl.gov/vista. The source code is available upon request.

956 citations

Journal Article•DOI•

DaliLite workbench for protein structure comparison

Liisa Holm¹, Jong Park•Institutions (1)

European Bioinformatics Institute¹

01 Jun 2000-Bioinformatics

TL;DR: DaliLite is a program for pairwise structure comparison and for structure database searching and a web interface is provided to view the results, multiple alignments and 3D superimpositions of structures.

Abstract: Summary: DaliLite is a program for pairwise structure comparison and for structure database searching. It is a standalone version of the search engine of the popular Dali server. A web interface is provided to view the results, multiple alignments and 3D superimpositions of structures. Availability: DaliLite has been ported to the Linux and Irix operating systems and can be compiled in many other UNIX operating systems. It is found at http://www.embl-ebi.ac.uk/dali/DaliLite.

939 citations

Journal Article•DOI•

GOLD—Graphical Overview of Linkage Disequilibrium

Gonçalo R. Abecasis¹, William O.C.M. Cookson²•Institutions (2)

Wellcome Trust Centre for Human Genetics¹, University of Oxford²

01 Feb 2000-Bioinformatics

TL;DR: A software package is described that provides a graphical summary of linkage disequilibrium in human genetic data, allows for the analysis of family data and is well suited to the analysis of dense genetic maps.

Abstract: Summary: We describe a software package that provides a graphical summary of linkage disequilibrium in human genetic data. It allows for the analysis of family data and is well suited to the analysis of dense genetic maps. Availability: http://www.well.ox.ac.uk/asthma/GOLD Contact: goncalo@well.ox.ac.uk Precise estimates of the location of complex disease genes should permit their identification through positional cloning, even when understanding of the underlying biochemical pathways is limited (Collins, 1992). Public and private genome projects are investing a great deal of effort in the identification of polymorphic sites in the human population. These efforts are cataloguing increasing numbers of single-nucleotide polymorphisms (SNPs) which are well suited to automated high-throughput analysis. A dense genetic map of the human genome should be provided by SNPs in the near future. Traditional linkage analysis, based on allele sharing between relatives, identifies broad chromosomal regions that are likely to contain disease genes. However, the resolution of these methods is limited by the number of recombination events in typical pedigrees and impractical for positional cloning efforts in complex disease. Fine-mapping within the broad regions identified by allele-sharing methods is a major challenge. Gene mapping strategies based on linkage disequilibrium are expected to have much greater resolution, and should be able to capitalize on dense SNP maps as they become available (Risch and Merikangas, 1996). As ancestral haplotypes propagate through a population, their physical length is reduced by recombination events. Recombination events between markers separated by very short distances are very rare. Individuals inheriting a disease mutation from a common, but possibly distant, ancestor are expected to share a region of the ancestral haplotype in which the mutation originated. Markers within this shared haplotype are non-randomly associated

Journal Article•DOI•

PASS: prediction of activity spectra for biologically active substances

Alexey Lagunin, A. V. Stepanchikova, Dmitrii Filimonov, Vladimir Poroikov

01 Aug 2000-Bioinformatics

TL;DR: A WWW server for the on-line prediction of the biological activity spectra of substances has been constructed and a WWW interface for the PASS software is developed.

Abstract: The concept of the biological activity spectrum was introduced to describe the properties of biologically active substances. The PASS (prediction of activity spectra for substances) software product, which predicts more than 300 pharmacological effects and biochemical mechanisms on the basis of the structural formula of a substance, may be efficiently used to find new targets (mechanisms) for some ligands and, conversely, to reveal new ligands for some biological targets. We have developed a WWW interface for the PASS software. A WWW server for the on-line prediction of the biological activity spectra of substances has been constructed.

Journal Article•DOI•

Inferring qualitative relations in genetic networks and metabolic pathways

Tatsuya Akutsu¹, Satoru Miyano¹, Satoru Kuhara²•Institutions (2)

University of Tokyo¹, Kyushu University²

01 Aug 2000-Bioinformatics

TL;DR: Inferring genetic network architecture from time series data of gene expression patterns is an important topic in bioinformatics; because the Boolean network is not sufficient as a model of a genetic network, a Boolean network model with noise, a qualitative network model and an algorithm for inferring S-systems are proposed.

Abstract: Motivation: Inferring genetic network architecture from time series data of gene expression patterns is an important topic in bioinformatics. Although inference algorithms based on the Boolean network were proposed, the Boolean network was not sufficient as a model of a genetic network. Results: First, a Boolean network model with noise is proposed, together with an inference algorithm for it. Next, a qualitative network model is proposed, in which regulation rules are represented as qualitative rules and embedded in the network structure. Algorithms are also presented for inferring qualitative relations from time series data. Then, an algorithm for inferring S-systems (synergistic and saturable systems) from time series data is presented, where S-systems are based on a particular kind of nonlinear differential equation and have been applied to the analysis of various biological systems. Theoretical results are shown for Boolean networks with noises and simple qualitative networks. Computational results are shown for Boolean networks with noises and S-systems, where real data are not used because the proposed models are still conceptual and the quantity and quality of currently available data are not enough for the application of the proposed methods.

Journal Article•DOI•

Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies

Andrew Rambaut

01 Apr 2000-Bioinformatics

TL;DR: The program provides a maximum likelihood estimate of the rate and also the associated date of the most recent common ancestor of the sequences, under a model which assumes a constant rate of substitution (molecular clock) but which accommodates the dates of isolation.

Abstract: Motivation: TipDate is a program that will use sequences that have been isolated at different dates to estimate their rate of molecular evolution. The program provides a maximum likelihood estimate of the rate and also the associated date of the most recent common ancestor of the sequences, under a model which assumes a constant rate of substitution (molecular clock) but which accommodates the dates of isolation. Confidence intervals for these parameters are also estimated. Results: The approach was applied to a sample of 17 dengue virus serotype 4 sequences, isolated at dates ranging from 1956 to 1994. The rate of substitution for this serotype was estimated to be 7.91 × 10⁻⁴ substitutions per site per year (95% confidence intervals of 6.07 × 10⁻⁴, 9.86 × 10⁻⁴). This is compatible with a date of 1922 (95% confidence intervals of 1900-1936) for the most recent common ancestor of these sequences. Availability: TipDate can be obtained by WWW from http://evolve.zoo.ox.ac.uk/software. The package includes the source code, manual and example files. Both UNIX and Apple Macintosh versions are available from the same

Journal Article•DOI•

RRTree: Relative-Rate Tests between groups of sequences on a phylogenetic tree

Marc Robinson-Rechavi¹, Dorothée Huchon•Institutions (1)

Tel Aviv University¹

01 Mar 2000-Bioinformatics

TL;DR: RRTree is a user-friendly program for comparing substitution rates between lineages of protein or DNA sequences, relative to an outgroup, through relative rate tests.

Abstract: Summary: RRTree is a user-friendly program for comparing substitution rates between lineages of protein or DNA sequences, relative to an outgroup, through relative rate tests. Genetic diversity is taken into account through use of several sequences, and phylogenetic relations are integrated by topological weighting. Availability: The ANSI C source code of RRTree, and compiled versions for Macintosh, MS-DOS/Windows, SUN Solaris, and CGI, are freely available at http://pbil.univ-lyon1.fr/software/rrtree.html Contact: marc.robinson@ens-lyon.fr.

Journal Article•DOI•

Secondary Structure Alone Is Generally Not Statistically Significant for the Detection of Noncoding RNAs

Elena Rivas¹, Sean R. Eddy¹•Institutions (1)

Washington University in St. Louis¹

01 Jul 2000-Bioinformatics

TL;DR: A scanning algorithm for detecting noncoding RNA genes in genome sequences is developed, using a fully probabilistic version of the Zuker minimum-energy folding algorithm; the authors conclude that although a distinct, stable secondary structure is undoubtedly important in most noncoding RNAs, the stability of most noncoding RNA secondary structures is not sufficiently different from the predicted stability of a random sequence to be useful as a general genefinding approach.

Abstract: Motivation: Several results in the literature suggest that biologically interesting RNAs have secondary structures that are more stable than expected by chance. Based on these observations, we developed a scanning algorithm for detecting noncoding RNA genes in genome sequences, using a fully probabilistic version of the Zuker minimum-energy folding algorithm. Results: Preliminary results were encouraging, but certain anomalies led us to do a carefully controlled investigation of this class of methods. Ultimately, our results argue that for the probabilistic model there is indeed a statistical effect, but it comes mostly from local base-composition bias and not from RNA secondary structure. For the thermodynamic implementation (which evaluates statistical significance by doing Monte Carlo shuffling in fixed-length sequence windows, thus eliminating the base-composition effect) the signals for noncoding RNAs are still usually indistinguishable from noise, especially when certain statistical artifacts resulting from local base-composition inhomogeneity are taken into account. We conclude that although a distinct, stable secondary structure is undoubtedly important in most noncoding RNAs, the stability of most noncoding RNA secondary structures is not sufficiently different from the predicted stability of a random sequence to be useful as a general genefinding approach.

Journal Article•DOI•

TeXshade: shading and labeling of multiple sequence alignments using LaTeX2e

Eric Beitz¹•Institutions (1)

University of Tübingen¹

01 Feb 2000-Bioinformatics

TL;DR: TeXshade is the first TeX-based alignment shading software featuring, in addition to standard identity and similarity shading, special modes for the display of functional aspects such as charge, hydropathy or solvent accessibility.

Abstract: Motivation: Typesetting, shading and labeling of nucleotide and peptide alignments using standard word processing or graphics software is time consuming. Available automatic sequence shading programs usually do not allow manual application of additional shadings or labels. Hence, a flexible alignment shading package was designed for both calculated and manual shading, using the macro language of the scientific typesetting software LaTeX2e. Results: TeXshade is the first TeX-based alignment shading software featuring, in addition to standard identity and similarity shading, special modes for the display of functional aspects such as charge, hydropathy or solvent accessibility. A plenitude of commands for manual shading, graphical labels, re-arrangements of the sequence order, numbering, legends etc. is implemented. Further, TeXshade allows the inclusion and display of secondary structure predictions in the DSSP-, STRIDE- and PHD-format. Availability: From http://homepages.uni-tuebingen.de/beitz/tse.html (macro package and on-line documentation)

Journal Article•DOI•

Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation.

Patrick P. Rose¹, Bette T. Korber•Institutions (1)

Los Alamos National Laboratory¹

01 Apr 2000-Bioinformatics

TL;DR: This program compares sequence sets to a reference sequence, tallies G --> A hypermutations, and presents the results in various tables and graphs, which include dinucleotide context, summaries of all observed nucleotide changes, and stop codons introduced by hypermutation.

Abstract: Summary: This program compares sequence sets to a reference sequence, tallies G --> A hypermutations, and presents the results in various tables and graphs, which include dinucleotide context, summaries of all observed nucleotide changes, and stop codons introduced by hypermutation. Availability: www.hiv.lanl.gov/HYPERMUT/hypermut.html
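
A much simplified sketch of this kind of tally is given below. It is not the program's own algorithm: the function name is hypothetical, the two sequences are assumed to be pre-aligned and of equal length, gap columns are skipped, and the dinucleotide context is taken (as an assumption) to be the reference G plus the next reference nucleotide.

```python
from collections import Counter


def tally_g_to_a(reference, query):
    """Tally nucleotide changes and the dinucleotide context of G -> A changes."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    reference, query = reference.upper(), query.upper()
    changes = Counter()   # (reference base, query base) -> count
    context = Counter()   # reference dinucleotide starting at a mutated G -> count
    for i, (ref_base, query_base) in enumerate(zip(reference, query)):
        if ref_base == query_base or "-" in (ref_base, query_base):
            continue
        changes[(ref_base, query_base)] += 1
        # Record context only when a downstream reference base exists.
        if ref_base == "G" and query_base == "A" and i + 1 < len(reference):
            context["G" + reference[i + 1]] += 1
    return changes, context


changes, context = tally_g_to_a("GGATGACGGA", "AGATAACGAA")
```

In this toy pair the query carries three G --> A changes, one in a GG context and two in a GA context; stop-codon detection is left out of the sketch.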

Journal Article•DOI•

InterPro--an integrated documentation resource for protein families, domains and functional sites

01 Dec 2000-Bioinformatics

TL;DR: InterPro is a new integrated documentation resource for protein families, domains and functional sites, developed initially as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects.

Abstract: MOTIVATION: InterPro is a new integrated documentation resource for protein families, domains and functional sites, developed initially as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. RESULTS: Merged annotations from PRINTS, PROSITE and Pfam form the InterPro core. Each combined InterPro entry includes functional descriptions and literature references, and links are made back to the relevant parent database(s), allowing users to see at a glance whether a particular family or domain has associated patterns, profiles, fingerprints, etc. Merged and individual entries (i.e. those that have no counterpart in the companion resources) are assigned unique accession numbers. Release 1.2 of InterPro (June 2000) contains over 3000 entries, representing families, domains, repeats and sites of post-translational modification (PTMs) encoded by 6581 different regular expressions, profiles, fingerprints and Hidden Markov Models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1000000 hits from 264333 different proteins out of 384572 in SWISS-PROT and TrEMBL).

Journal Article•DOI•

Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors

Torbjørn Rognes¹, Erling Seeberg¹•Institutions (1)

University of Oslo¹

01 Aug 2000-Bioinformatics

TL;DR: A fast implementation of the Smith-Waterman sequence-alignment algorithm using Single-Instruction, Multiple-Data (SIMD) technology is presented, based on the MultiMedia eXtensions (MMX) and Streaming SIMD Extensions (SSE) technology that is embedded in Intel's latest microprocessors.

Abstract: Motivation: Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence-search algorithm is that of Smith–Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. Results: A fast implementation of the Smith–Waterman sequence-alignment algorithm using Single-Instruction, Multiple-Data (SIMD) technology is presented. This implementation is based on the MultiMedia eXtensions (MMX) and Streaming SIMD Extensions (SSE) technology that is embedded in Intel’s latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith–Waterman implementation on the same hardware was achieved by an optimized 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500 MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date. Availability: Online searches with the software are available at http://dna.uio.no/search/
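
For readers unfamiliar with the term "cell update", the sketch below is a plain scalar Smith–Waterman scoring loop in Python. It is not the SIMD implementation described in the paper: the function name and the scoring parameters (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration, whereas database searches normally use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # One "cell update": take the best of restarting (0), extending
            # the diagonal, or opening a gap in either sequence.
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


score = smith_waterman_score("GGACGTCC", "TTACGTAA")
```

Each inner-loop iteration depends only on three neighbouring cells, which is what makes it possible to compute several cells at once with SIMD instructions.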

Journal Article•DOI•

Frequency-domain analysis of biomolecular sequences

Dimitris Anastassiou¹•Institutions (1)

Columbia University¹

01 Dec 2000-Bioinformatics

TL;DR: An optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences is provided and it is demonstrated that color spectrograms can visually provide significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function.

Abstract: Motivation: Frequency-domain analysis of biomolecular sequences is hindered by their representation as strings of characters. If numerical values are assigned to each of these characters, then the resulting numerical sequences are readily amenable to digital signal processing. Results: We introduce new computational and visual tools for biomolecular sequences analysis. In particular, we provide an optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences. We also show that the phase of a properly defined Fourier transform is a powerful predictor of the reading frame of protein coding regions. Resulting color maps help in visually identifying not only the existence of protein coding areas for both DNA strands, but also the coding direction and the reading frame for each of the exons. Furthermore, we demonstrate that color spectrograms can visually provide, in the form of local ‘texture’, significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function. Availability: All software for techniques described in this paper is available from the author upon request.
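
The idea of assigning numerical values to characters can be sketched with binary indicator sequences, one per nucleotide, evaluated at the period-3 frequency that traditional Fourier analysis associates with coding regions. The Python below is a minimal illustration under those assumptions; the function name is hypothetical, and the paper's optimization procedure and phase-based frame prediction are not reproduced.

```python
import cmath


def period3_power(seq):
    """Sum over A, C, G, T of |X_base|^2 at frequency 1/3 cycles per base."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base.
        coefficient = sum(
            cmath.exp(-2j * cmath.pi * position / 3)
            for position, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(coefficient) ** 2
    return total


periodic = period3_power("ATG" * 10)   # strong period-3 structure
uniform = period3_power("A" * 30)      # no period-3 structure
```

Sliding this measure along a sequence in fixed-length windows gives a simple profile in which windows with strong period-3 content stand out.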

Journal Article•DOI•

IMpRH Server: an RH mapping server available on the Web

Denis Milan¹, Rachel J. Hawken², Cédric Cabau¹, Sophie Leroux¹, C Genêt¹, Yvette Lahbib¹, Gwenola Tosser¹, Annie Robic¹, François Hatey¹, Lee Alexander², Craig W. Beattie², Lawrence B. Schook², Martine Yerle¹, Joël Gellin¹•Institutions (2)

Institut national de la recherche agronomique¹, University of Minnesota²

01 Jun 2000-Bioinformatics

TL;DR: The IMpRH database is the official database for submission of new results and queries and not only permits the sharing of public data but also semi-private and private data.

Abstract: Summary: The INRA-Minnesota Porcine Radiation Hybrid (IMpRH) Server provides both a mapping tool (IMpRH mapping tool) and a database (IMpRH database) of officially submitted results. The mapping tool permits the mapping of a new marker relative to markers previously mapped on the IMpRH panel. The IMpRH database is the official database for submission of new results and queries. The database not only permits the sharing of public data but also semi-private and private data. Availability: http://imprh.toulouse.inra.fr

Journal Article•DOI•

MaskerAid : a performance enhancement to RepeatMasker

Joseph A. Bedell¹, Ian F Korf¹, Warren Gish¹•Institutions (1)

Washington University in St. Louis¹

01 Nov 2000-Bioinformatics

TL;DR: MaskerAid is a software enhancement to RepeatMasker that increased the speed of masking more than 30-fold at the most sensitive setting, easing the costly bottleneck that slow repeat identification creates in large-scale analyses.

Abstract: Summary: Identifying and masking repetitive elements is usually the first step when analyzing vertebrate genomic sequence. Current repeat identification software is sensitive but slow, creating a costly bottleneck in large-scale analyses. We have developed MaskerAid, a software enhancement to RepeatMasker that increased the speed of masking more than 30-fold at the most sensitive setting. Availability: On request from the authors (see http://sapiens.wustl.edu/MaskerAid).

Journal Article•DOI•

An ontology for biological function based on molecular interactions

Peter D. Karp¹•Institutions (1)

SRI International¹

01 Mar 2000-Bioinformatics

TL;DR: The article explores the notion of computing with function, and explains the importance of ontologies of function to bioinformatics, and presents the functional ontology developed for the EcoCyc database.

Abstract: Motivations: A number of important bioinformatics computations involve computing with function: executing computational operations whose inputs or outputs are descriptions of the functions of biomolecules. Examples include performing functional queries to sequence and pathway databases, and determining functional equality to evaluate algorithms that predict function from sequence. A prerequisite to computing with function is the existence of an ontology that provides a structured semantic encoding of function. Functional bioinformatics is an emerging subfield of bioinformatics that is concerned with developing ontologies and algorithms for computing with biological function. Results: The article explores the notion of computing with function, and explains the importance of ontologies of function to bioinformatics. The functional ontology developed for the EcoCyc database is presented. This ontology can encode a diverse array of biochemical processes, including enzymatic reactions involving small-molecule substrates and macromolecular substrates, signal-transduction processes, transport events, and mechanisms of regulation of gene expression. The ontology is validated through its use to express complex functional queries for the EcoCyc DB.

Journal Article•DOI•

GeneRAGE: a robust algorithm for sequence clustering and domain detection.

Anton J. Enright¹, Christos A. Ouzounis¹•Institutions (1)

European Bioinformatics Institute¹

01 May 2000-Bioinformatics

TL;DR: A new algorithm for the automatic clustering of protein sequence datasets has been developed that represents all similarity relationships within the dataset in a binary matrix and can hence quickly and accurately cluster large protein datasets into families.

Abstract: Motivation: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. Results: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith‐Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith‐Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. Availability: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.
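
The two matrix steps named in the abstract, symmetrification and single-linkage clustering, can be sketched in a few lines. This is a simplified stand-in rather than GeneRAGE itself: where the abstract describes checking asymmetric entries with Smith-Waterman alignments, the sketch simply drops any hit that was reported in one direction only, and it omits multi-domain detection entirely. The function names and the input format (a list of directed index pairs) are assumptions.

```python
def symmetrify(hits, n):
    """Build an n x n binary matrix that keeps i~j only if both directions hit.

    Assumption: one-directional hits are discarded outright instead of being
    re-examined by alignment.
    """
    raw = [[0] * n for _ in range(n)]
    for i, j in hits:
        raw[i][j] = 1
    return [
        [1 if raw[i][j] and raw[j][i] else 0 for j in range(n)]
        for i in range(n)
    ]


def single_linkage(matrix):
    """Group indices into connected components of the symmetric matrix."""
    n = len(matrix)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if matrix[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Toy input: sequences 0-1 and 1-2 hit each other; 3 -> 0 is one-directional.
toy_hits = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 0)]
families = single_linkage(symmetrify(toy_hits, 4))
```

In this toy input the one-directional hit is treated as a false positive, so sequence 3 ends up in its own family while 0, 1 and 2 are linked through sequence 1.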

Journal Article•DOI•

Domain size distributions can predict domain boundaries

Sarah J. Wheelan¹, Aron Marchler-Bauer¹, Stephen H. Bryant¹•Institutions (1)

National Institutes of Health¹

01 Jul 2000-Bioinformatics

TL;DR: It is found that domain boundary predictions are surprisingly successful for sequences up to 400 residues long and that guessing domain boundaries in this way can improve the sensitivity of threading analysis.

Abstract: Motivation: The sizes of protein domains observed in the 3D-structure database follow a surprisingly narrow distribution. Structural domains are furthermore formed from a single-chain continuous segment in over 80% of instances. These observations imply that some choices of domain boundaries on an otherwise uncharacterized sequence are more likely than others, based solely on the size and segment number of predicted domains. This property might be used to guess the locations of protein domain boundaries. Results: To test this possibility we enumerate putative domain boundaries and calculate their relative likelihood under a probability model that considers only the size and segment number of predicted domains. We ask, in a cross-validated test using sequences with known 3D structure, whether the most likely guesses agree with the observed domain structure. We find that domain boundary predictions are surprisingly successful for sequences up to 400 residues long and that guessing domain boundaries in this way can improve the sensitivity of threading analysis. Availability: The DGS algorithm, for ‘Domain Guess by Size’, is available as a web service at http://www.ncbi.nlm.nih.gov/dgs. This site also provides the DGS source code.

Journal Article•DOI•

BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways.

Gary D. Bader¹, Christopher W. V. Hogue•Institutions (1)

University of Toronto¹

01 May 2000-Bioinformatics

TL;DR: A complete data specification in ASN.1 that can describe information about biomolecular interactions, complexes and pathways is defined and used in the Biomolecular Interaction Network Database (BIND).

Abstract: Motivation: Proteomics is gearing up towards high-throughput methods for identifying and characterizing all of the proteins, protein domains and protein interactions in a cell and will eventually create more recorded biological information than the Human Genome Project. Each protein expressed in a cell can interact with various other proteins and molecules in the course of its function. A standard data specification is required that can describe and store this information in all its detail and allow efficient cross-platform transfer of data. A complete specification must be the basis for any database or tool for managing and analysing this information. Results: We have defined a complete data specification in ASN.1 that can describe information about biomolecular interactions, complexes and pathways. Our group is using this data specification in our database, the Biomolecular Interaction Network Database (BIND). An interaction record is based on the interaction between two objects. An object can be a protein, DNA, RNA, ligand, molecular complex or an interaction. Interaction description encompasses cellular location, experimental conditions used to observe the interaction, conserved sequence, molecular location, chemical action, kinetics, thermodynamics, and chemical state. Molecular complexes are defined as collections of more than two interactions that form a complex, with extra descriptive information such as complex topology. Pathways are defined as collections of more than two interactions that form a pathway, with additional descriptive information such as cell cycle stage. A request for proposal of a human readable flat-file format that mirrors the BIND data specification is also tendered for interested parties.
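
The record types named in the abstract can be pictured as plain data classes. The sketch below is a hypothetical Python analogue for illustration only: the actual BIND specification is written in ASN.1 and its field names are not given here, so every class and attribute name is an assumption, and only a few of the descriptive fields listed in the abstract are shown.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ObjectType(Enum):
    """The six object kinds listed in the abstract."""
    PROTEIN = "protein"
    DNA = "DNA"
    RNA = "RNA"
    LIGAND = "ligand"
    MOLECULAR_COMPLEX = "molecular complex"
    INTERACTION = "interaction"


@dataclass
class BioObject:
    name: str          # hypothetical label, not a BIND field name
    kind: ObjectType


@dataclass
class Interaction:
    """An interaction record is based on exactly two objects."""
    a: BioObject
    b: BioObject
    cellular_location: Optional[str] = None
    experimental_conditions: Optional[str] = None


@dataclass
class MolecularComplex:
    interactions: List[Interaction] = field(default_factory=list)
    topology: Optional[str] = None


@dataclass
class Pathway:
    interactions: List[Interaction] = field(default_factory=list)
    cell_cycle_stage: Optional[str] = None


# Placeholder objects, not real database entries.
obj_a = BioObject("example protein", ObjectType.PROTEIN)
obj_b = BioObject("example ligand", ObjectType.LIGAND)
record = Interaction(obj_a, obj_b, cellular_location="cytoplasm")
```

Modelling complexes and pathways as collections of interaction records, each with its own extra descriptive field, mirrors the layering the abstract describes.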

Journal Article•DOI•

POWER_SAGE: comparing statistical tests for SAGE experiments.

Michael Z. Man, Xuning Wang, Yixin Wang

01 Nov 2000-Bioinformatics

TL;DR: This paper compares three statistical tests for detecting significant changes of gene expression in SAGE experiments and shows that the Chi-square test has the best power and robustness.

Abstract: MOTIVATION: The Serial Analysis of Gene Expression (SAGE) technology determines the expression level of a gene by measuring the frequency of a sequence tag derived from the corresponding mRNA transcript. Several statistical tests have been developed to detect significant differences in tag frequency between two samples. However, which one of these tests has the greatest power to detect real changes remains undetermined. RESULTS: This paper compares three statistical tests for detecting significant changes of gene expression in SAGE experiments. The comparison makes use of Monte Carlo simulation that, in essence, generates "virtual" SAGE experiments. Our analysis shows that the Chi-square test has the best power and robustness. Since the POWER_SAGE program can easily run "virtual" SAGE studies with different combinations of sample size and tag frequency and determine the power for each combination, it can serve as a useful tool for planning SAGE experiments. AVAILABILITY: The POWER_SAGE software is available upon request from the authors. CONTACT: michael.man@pfizer.com
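
The idea of estimating power from "virtual" SAGE experiments can be sketched as follows. This is not the POWER_SAGE program: it covers only a Pearson chi-square test on a 2x2 table, draws tag counts as simple binomial samples, and uses a fixed 5% significance level, all of which are assumptions of this example, as are the function names.

```python
import random

CHI2_CRIT_1DF_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic for x1 tags out of n1 vs x2 out of n2."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    denom = (a + c) * (b + d) * n1 * n2
    if denom == 0:
        return 0.0  # degenerate table: no evidence of a difference
    return (n1 + n2) * (a * d - b * c) ** 2 / denom


def draw_count(n, p, rng):
    """Binomial draw: how many of n sequenced tags match the gene of interest."""
    return sum(1 for _ in range(n) if rng.random() < p)


def estimate_power(p1, p2, n1, n2, trials=200, seed=0):
    """Fraction of simulated experiments in which the test rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x1 = draw_count(n1, p1, rng)
        x2 = draw_count(n2, p2, rng)
        if chi_square_2x2(x1, n1, x2, n2) > CHI2_CRIT_1DF_05:
            rejections += 1
    return rejections / trials


power_alt = estimate_power(0.01, 0.05, 1000, 1000, trials=100)   # real change
power_null = estimate_power(0.02, 0.02, 1000, 1000, trials=100)  # no change
```

Running the same function over a grid of sample sizes and tag frequencies gives the kind of planning table the abstract mentions; under equal frequencies the rejection rate should stay near the nominal significance level.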

Journal Article•DOI•

The language of RNA: a formal grammar that includes pseudoknots.

Elena Rivas¹, Sean R. Eddy¹•Institutions (1)

Washington University in St. Louis¹

01 Apr 2000-Bioinformatics

TL;DR: A one-to-one correspondence is shown between a polynomial time dynamic programming algorithm and a formal transformational grammar for RNA secondary structure with pseudoknots, which encompasses the context-free grammars and goes beyond to generate pseudoknotted structures.

Abstract: Motivation: In a previous paper, we presented a polynomial time dynamic programming algorithm for predicting optimal RNA secondary structure including pseudoknots. However, a formal grammatical representation for RNA secondary structure with pseudoknots was still lacking. Results: Here we show a one-to-one correspondence between that algorithm and a formal transformational grammar. This grammar class encompasses the context-free grammars and goes beyond to generate pseudoknotted structures. The pseudoknot grammar avoids the use of general context-sensitive rules by introducing a small number of auxiliary symbols used to reorder the strings generated by an otherwise context-free grammar. This formal representation of the residue correlations in RNA structure is important because it means we can build full probabilistic models of RNA secondary structure, including pseudoknots, and use them to optimally parse sequences in polynomial time.

Journal Article•DOI•

CAST: an iterative algorithm for the complexity analysis of sequence tracts

Vasilis J. Promponas¹, Anton J. Enright, Sophia Tsoka, David P. Kreil, Christophe Leroy, Stavros J. Hamodrakas, Chris Sander, Christos A. Ouzounis²•Institutions (2)

National and Kapodistrian University of Athens¹, European Bioinformatics Institute²

01 Oct 2000-Bioinformatics

TL;DR: A novel algorithm for low-complexity region detection and selective masking based on multiple-pass Smith-Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties that is sufficient for masking database query sequences without generating false positives.

Abstract: Motivation: Sensitive detection and masking of low-complexity regions in protein sequences. Filtered sequences can be used in sequence comparison without the risk of matching compositionally biased regions. The main advantage of the method over similar approaches is the selective masking of single residue types without affecting other, possibly important, regions. Results: A novel algorithm for low-complexity region detection and selective masking. The algorithm is based on multiple-pass Smith–Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties. The output of the algorithm is both the masked query sequence for further analysis, e.g. database searches, as well as the regions of low complexity. The detection of low-complexity regions is highly specific for single residue types. It is shown that this approach is sufficient for masking database query sequences without generating false positives. The algorithm is benchmarked against widely available algorithms using the 210 genes of Plasmodium falciparum chromosome 2, a dataset known to contain a large number of low-complexity regions. Availability: CAST (version 1.0) executable binaries are available to academic users free of charge under license. Web site entry point, server and additional material: http://www.ebi.ac.uk/research/cgg/services/cast/
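
When gaps are disallowed and the target is a homopolymer of one residue type, a local alignment reduces to finding the highest-scoring contiguous stretch of the query scored against that residue. The sketch below uses that observation to imitate the multiple-pass, residue-selective masking the abstract describes. It is not the CAST implementation: the +1/-1 scoring, the threshold of 6 and all function names are assumptions made for illustration, standing in for whatever substitution matrix and cut-off the real program uses.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_segment(scores):
    """Maximum-sum contiguous run: returns (score, start, end_exclusive)."""
    best, best_start, best_end = 0, 0, 0
    current, current_start = 0, 0
    for i, s in enumerate(scores):
        if current <= 0:
            current, current_start = 0, i  # restart the run here
        current += s
        if current > best:
            best, best_start, best_end = current, current_start, i + 1
    return best, best_start, best_end


def mask_low_complexity(seq, threshold=6, match=1, mismatch=-1):
    """Repeatedly mask the top-scoring homopolymer-like region.

    Only residues of the biased type inside the region become 'X';
    other residues in the region are left untouched.
    """
    residues = list(seq.upper())
    regions = []
    while True:
        top_score, top_start, top_end, top_aa = 0, 0, 0, ""
        for aa in AMINO_ACIDS:
            scores = [match if r == aa else mismatch for r in residues]
            score, start, end = best_segment(scores)
            if score > top_score:
                top_score, top_start, top_end, top_aa = score, start, end, aa
        if top_score < threshold:
            break
        regions.append((top_aa, top_start, top_end, top_score))
        for i in range(top_start, top_end):
            if residues[i] == top_aa:
                residues[i] = "X"
    return "".join(residues), regions


masked, found = mask_low_complexity("MKT" + "Q" * 10 + "LVA")
```

Each pass masks at least one residue, so the loop terminates; a sequence without any biased stretch is returned unchanged with an empty region list.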
