
Showing papers in "Bioinformatics in 2005"


Journal ArticleDOI
TL;DR: Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface.
Abstract: Summary: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface. Availability: http://www.broad.mit.edu/mpg/haploview/ Contact: jcbarret@broad.mit.edu
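Haploview computes and visualizes pairwise linkage disequilibrium inside its GUI; as a minimal illustration of the underlying statistics (a Python sketch, not Haploview's own code), the two quantities it reports most prominently, D′ and r², follow directly from two-locus haplotype frequencies:

```python
# Illustrative only: pairwise LD statistics (D' and r^2) from known two-locus
# haplotype frequencies. Haploview first estimates these frequencies from
# primary genotype data before computing the statistics.
def ld_stats(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D_prime, r2) for two biallelic loci; inputs must sum to 1."""
    assert abs(p_AB + p_Ab + p_aB + p_ab - 1.0) < 1e-9
    p_A = p_AB + p_Ab                      # allele frequency at locus 1
    p_B = p_AB + p_aB                      # allele frequency at locus 2
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    D = p_AB - p_A * p_B                   # raw disequilibrium coefficient
    # D' normalizes |D| by its maximum possible value given the allele frequencies
    d_max = min(p_A * p_b, p_a * p_B) if D > 0 else min(p_A * p_B, p_a * p_b)
    d_prime = abs(D) / d_max if d_max else 0.0
    r2 = D * D / (p_A * p_a * p_B * p_b)
    return D, d_prime, r2

print(ld_stats(0.4, 0.1, 0.1, 0.4))        # strong LD: D=0.15, D'=0.6, r2=0.36
```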

13,862 citations


Journal ArticleDOI
TL;DR: Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available, is presented.
Abstract: Summary: We present here Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available. B2G joins in one application GO annotation based on similarity searches with statistical analysis and highlighted visualization on directed acyclic graphs. This tool offers a suitable platform for functional genomics research in non-model species. B2G is an intuitive and interactive desktop application that allows monitoring and comprehension of the whole annotation and analysis process. Availability: Blast2GO is freely available via Java Web Start at http://www.blast2go.de Supplementary material: http://www.blast2go.de -> Evaluation Contact: [email protected]; [email protected]

10,092 citations


Journal ArticleDOI
TL;DR: The Biological Networks Gene Ontology tool (BiNGO) is an open-source Java tool to determine which Gene Ontology terms are significantly overrepresented in a set of genes.
Abstract: Summary: The Biological Networks Gene Ontology tool (BiNGO) is an open-source Java tool to determine which Gene Ontology (GO) terms are significantly overrepresented in a set of genes. BiNGO can be used either on a list of genes, pasted as text, or interactively on subgraphs of biological networks visualized in Cytoscape. BiNGO maps the predominant functional themes of the tested gene set on the GO hierarchy, and takes advantage of Cytoscape's versatile visualization environment to produce an intuitive and customizable visual representation of the results. Availability: http://www.psb.ugent.be/cbd/papers/BiNGO/ Contact: martin.kuiper@psb.ugent.be
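The core of such an analysis is, per GO term, a one-sided overrepresentation test whose p-values are then corrected for multiple testing. The snippet below is a generic hypergeometric formulation of that test, shown purely as an illustration in Python rather than BiNGO's Java implementation:

```python
# Generic GO-term overrepresentation test (hypergeometric). An enrichment tool
# runs one such test per term and then corrects the p-values for multiple testing.
from scipy.stats import hypergeom

def enrichment_pvalue(k_hits, n_selected, K_annotated, N_background):
    """P(X >= k_hits) when n_selected genes are drawn from a background of
    N_background genes, K_annotated of which carry the GO term of interest."""
    # scipy's hypergeom parameters: M = population size, n = successes, N = draws
    return hypergeom.sf(k_hits - 1, N_background, K_annotated, n_selected)

# e.g. 12 of 200 selected genes carry a term annotating 150 of 15,000 genes
print(enrichment_pvalue(12, 200, 150, 15000))
```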

3,884 citations


Journal ArticleDOI
TL;DR: PowerMarker delivers a data-driven, integrated analysis environment (IAE) for genetic data that accelerates the analysis lifecycle and enables users to maintain data integrity throughout the process.
Abstract: Summary: PowerMarker delivers a data-driven, integrated analysis environment (IAE) for genetic data. The IAE integrates data management, analysis and visualization in a user-friendly graphical user interface. It accelerates the analysis lifecycle and enables users to maintain data integrity throughout the process. An ever-growing list of more than 50 different statistical analyses for genetic markers has been implemented in PowerMarker. Availability: www.powermarker.net Contact: powermarker@hotmail.com

3,808 citations


Journal ArticleDOI
TL;DR: This work has built a tool for the selection of the best-fit model of evolution, among a set of candidate models, for a given protein sequence alignment in order to study protein evolution and phylogenetic inference.
Abstract: Summary: Using an appropriate model of amino acid replacement is very important for the study of protein evolution and phylogenetic inference. We have built a tool for the selection of the best-fit model of evolution, among a set of candidate models, for a given protein sequence alignment. Availability: ProtTest is available under the GNU license from http://darwin.uvigo.es Contact: fabascal@uvigo.es
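Best-fit model selection of this kind typically scores each candidate model's maximized log-likelihood with an information criterion and keeps the lowest score. The sketch below uses hypothetical likelihood values and simplified parameter counts, so it illustrates the ranking step rather than ProtTest's exact accounting:

```python
# Ranking candidate amino acid substitution models by small-sample corrected AIC.
# The log-likelihoods and free-parameter counts below are hypothetical.
def aic(lnL, k):
    return 2 * k - 2 * lnL                               # lower is better

def aicc(lnL, k, n):
    return aic(lnL, k) + 2 * k * (k + 1) / (n - k - 1)   # n = number of alignment sites

candidates = {            # model name: (maximized log-likelihood, extra free parameters)
    "JTT": (-10234.7, 0),
    "JTT+G": (-10102.3, 1),
    "WAG+I+G": (-10095.8, 2),
}
n_sites = 350
scores = {m: aicc(lnL, k, n_sites) for m, (lnL, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```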

3,150 citations


Journal ArticleDOI
TL;DR: The HyPhy package is designed to provide a flexible and unified platform for carrying out likelihood-based analyses on multiple alignments of molecular sequence data, with the emphasis on studies of rates and patterns of sequence evolution.
Abstract: Summary: The HyPhy package is designed to provide a flexible and unified platform for carrying out likelihood-based analyses on multiple alignments of molecular sequence data, with the emphasis on studies of rates and patterns of sequence evolution. Availability: http://www.hyphy.org Contact: muse@stat.ncsu.edu Supplementary information: HyPhy documentation and tutorials are available at http://www.hyphy.org

2,845 citations


Journal ArticleDOI
TL;DR: ROCR is a package for evaluating and visualizing the performance of scoring classifiers in the statistical language R that features over 25 performance measures that can be freely combined to create two-dimensional performance curves.
Abstract: Summary: ROCR is a package for evaluating and visualizing the performance of scoring classifiers in the statistical language R. It features over 25 performance measures that can be freely combined to create two-dimensional performance curves. Standard methods for investigating trade-offs between specific performance measures are available within a uniform framework, including receiver operating characteristic (ROC) graphs, precision/recall plots, lift charts and cost curves. ROCR integrates tightly with R's powerful graphics capabilities, thus allowing for highly adjustable plots. Being equipped with only three commands and reasonable default values for optional parameters, ROCR combines flexibility with ease of usage. Availability:http://rocr.bioinf.mpi-sb.mpg.de. ROCR can be used under the terms of the GNU General Public License. Running within R, it is platform-independent. Contact: tobias.sing@mpi-sb.mpg.de
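ROCR itself is an R package; the NumPy sketch below only illustrates the underlying idea of sweeping the score threshold and pairing any two performance measures to form a curve (here FPR vs TPR for an ROC curve, or recall vs precision):

```python
# Threshold sweep over classifier scores, pairing two performance measures to
# form a curve (the idea behind ROCR, not its R implementation).
import numpy as np

def performance_curve(scores, labels, x_measure="fpr", y_measure="tpr"):
    order = np.argsort(-np.asarray(scores))          # descending score order
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels == 1)                      # cumulative true positives
    fp = np.cumsum(labels == 0)                      # cumulative false positives
    P, N = (labels == 1).sum(), (labels == 0).sum()
    measures = {
        "tpr": tp / P,                               # recall / sensitivity
        "fpr": fp / N,                               # false positive rate
        "prec": tp / np.maximum(tp + fp, 1),         # precision
    }
    return measures[x_measure], measures[y_measure]

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(performance_curve(scores, labels))                  # ROC points
print(performance_curve(scores, labels, "tpr", "prec"))   # precision/recall points
```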

2,838 citations


Journal ArticleDOI
TL;DR: A KO-Based Annotation System (KOBAS) is developed that can automatically annotate a set of sequences with KO terms and identify both the most frequent and the statistically significantly enriched pathways.
Abstract: Motivation: High-throughput technologies such as DNA sequencing and microarrays have created the need for automated annotation of large sets of genes, including whole genomes, and automated identification of pathways. Ontologies, such as the popular Gene Ontology (GO), provide a common controlled vocabulary for these types of automated analysis. Yet, while GO offers tremendous value, it also has certain limitations such as the lack of direct association with pathways. Results: We demonstrated the use of the KEGG Orthology (KO), part of the KEGG suite of resources, as an alternative controlled vocabulary for automated annotation and pathway identification. We developed a KO-Based Annotation System (KOBAS) that can automatically annotate a set of sequences with KO terms and identify both the most frequent and the statistically significantly enriched pathways. Results from both whole genome and microarray gene cluster annotations with KOBAS are comparable and complementary to known annotations. KOBAS is a freely available standalone Python program that can contribute significantly to genome annotation and microarray analysis. Availability: Supplementary data and the KOBAS system are available at http://genome.cbi.pku.edu.cn/download.html Contact: weilp@mail.cbi.pku.edu.cn

2,595 citations


Journal ArticleDOI
TL;DR: A method for detecting distant homologous relationships between proteins based on the generalized alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs is presented.
Abstract: Motivation: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction and evolution. Results: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSI-BLAST, HMMER and the profile--profile comparison tools PROF_SIM and COMPASS, in an all-against-all comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%. Sensitivity: When the predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSI-BLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile--profile comparison methods is attributable to the use of profile HMMs in place of simple profiles. Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7 and 3.3 times more good alignments ('balanced' score >0.3) than the next best method (COMPASS), and 1.6, 2.9 and 9.4 times more than PSI-BLAST, at the family, superfamily and fold level, respectively. Speed: HHsearch scans a query of 200 residues against 3691 domains in 33 s on an AMD64 2GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than COMPASS. Availability: HHsearch can be downloaded from http://www.protevo.eb.tuebingen.mpg.de/download/ together with up-to-date versions of SCOP and PFAM. A web server is available at http://www.protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred Contact: johannes.soeding@tuebingen.mpg.de

2,420 citations


Journal ArticleDOI
TL;DR: GMAP, a standalone program for mapping and aligning cDNA sequences to a genome with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets, demonstrates a several-fold increase in speed over existing programs.
Abstract: Motivation: We introduce gmap, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing. Results: On a set of human messenger RNAs with random mutations at a 1 and 3% rate, gmap identified all splice sites accurately in over 99.3% of the sequences, which was one-tenth the error rate of existing programs. On a large set of human expressed sequence tags, gmap provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, gmap performed comparably with GeneSeqer. In these experiments, gmap demonstrated a several-fold increase in speed over existing programs. Availability: Source code for gmap and associated programs is available at http://www.gene.com/share/gmap Contact: [email protected] Supplementary information: http://www.gene.com/share/gmap

2,058 citations


Journal ArticleDOI
TL;DR: A new version of the program MatInspector is presented that identifies TFBS in nucleotide sequences using a large library of weight matrices using a matrix family concept, optimized thresholds, and comparative analysis and produces concise results avoiding redundant and false-positive matches.
Abstract: Motivation: Promoter analysis is an essential step on the way to identify regulatory networks. A prerequisite for successful promoter analysis is the prediction of potential transcription factor binding sites (TFBS) with reasonable accuracy. The next steps in promoter analysis can be tackled only with reliable predictions, e.g. finding phylogenetically conserved patterns or identifying higher order combinations of sites in promoters of co-regulated genes. Results: We present a new version of the program MatInspector that identifies TFBS in nucleotide sequences using a large library of weight matrices. By introducing a matrix family concept, optimized thresholds, and comparative analysis, the enhanced program produces concise results avoiding redundant and false-positive matches. We describe a number of programs based on MatInspector allowing in-depth promoter analysis (DiAlignTF, FrameWorker) and targeted design of regulatory sequences (SequenceShaper). Availability: MatInspector and the other programs described here can be used online at http://www.genomatix.de/matinspector.html. Access is free after registration within certain limitations (e.g. the number of analyses of arbitrary sequences is currently limited to 20 per month). Contact: cartharius@genomatix.de Supplementary information: http://www.genomatix.de/matinspector.html
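The basic operation behind weight-matrix-based TFBS prediction is scoring every window of a sequence against a position weight matrix and reporting windows above a threshold. The toy matrix and threshold below are invented for illustration; MatInspector layers curated matrix libraries, matrix families and per-matrix optimized thresholds on top of this step:

```python
# Scanning a DNA sequence with a log-odds position weight matrix.
# The 4 bp matrix and the threshold are purely illustrative.
import math

PFM = [                                    # position frequency matrix, one dict per position
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 9, "T": 1},
    {"A": 9, "C": 0, "G": 1, "T": 0},
    {"A": 1, "C": 7, "G": 1, "T": 1},
]
BACKGROUND = {b: 0.25 for b in "ACGT"}     # uniform background model

def log_odds(pfm, pseudocount=0.5):
    lods = []
    for col in pfm:
        total = sum(col.values()) + 4 * pseudocount
        lods.append({b: math.log((col[b] + pseudocount) / total / BACKGROUND[b])
                     for b in "ACGT"})
    return lods

def scan(seq, lods, threshold=2.0):
    """Yield (start, score) for every window scoring at or above the threshold."""
    w = len(lods)
    for i in range(len(seq) - w + 1):
        score = sum(lods[j][seq[i + j]] for j in range(w))
        if score >= threshold:
            yield i, round(score, 2)

print(list(scan("TTAGACGGACT", log_odds(PFM))))
```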

Journal ArticleDOI
TL;DR: The IUPred server presents a novel algorithm for predicting such regions from amino acid sequences by estimating their total pairwise interresidue interaction energy, based on the assumption that IUP sequences do not fold due to their inability to form sufficient stabilizing interresidue interactions.
Abstract: Summary: Intrinsically unstructured/disordered proteins and domains (IUPs) lack a well-defined three-dimensional structure under native conditions. The IUPred server presents a novel algorithm for predicting such regions from amino acid sequences by estimating their total pairwise interresidue interaction energy, based on the assumption that IUP sequences do not fold due to their inability to form sufficient stabilizing interresidue interactions. Optional to the prediction are built-in parameter sets optimized for predicting short or long disordered regions and structured domains. Availability: The IUPred server is available for academic users at http://iupred.enzim.hu Contact: [email protected]

Journal ArticleDOI
TL;DR: The biomaRt package provides a tight integration of large, public or locally installed BioMart databases with data analysis in Bioconductor creating a powerful environment for biological data mining.
Abstract: Summary: biomaRt is a new Bioconductor package that integrates BioMart data resources with data analysis software in Bioconductor. It can annotate a wide range of gene or gene product identifiers (e.g. Entrez-Gene and Affymetrix probe identifiers) with information such as gene symbol, chromosomal coordinates, Gene Ontology and OMIM annotation. Furthermore biomaRt enables retrieval of genomic sequences and single nucleotide polymorphism information, which can be used in data analysis. Fast and up-to-date data retrieval is possible as the package executes direct SQL queries to the BioMart databases (e.g. Ensembl). The biomaRt package provides a tight integration of large, public or locally installed BioMart databases with data analysis in Bioconductor creating a powerful environment for biological data mining. Availability: http://www.bioconductor.org. LGPL Contact: steffen.durinck@esat.kuleuven.ac.be

Journal ArticleDOI
TL;DR: The Artemis Comparison Tool (ACT) allows an interactive visualisation of comparisons between complete genome sequences and associated annotations and so inherits powerful searching and analysis tools.
Abstract: The Artemis Comparison Tool (ACT) allows an interactive visualisation of comparisons between complete genome sequences and associated annotations. The comparison data can be generated with several different programs; BLASTN, TBLASTX or Mummer comparisons between genomic DNA sequences, or orthologue tables generated by reciprocal FASTA comparison between protein sets. It is possible to identify regions of similarity, insertions and rearrangements at any level from the whole genome to base-pair differences. ACT uses Artemis components to display the sequences and so inherits powerful searching and analysis tools. ACT is part of the Artemis distribution and is similarly open source, written in Java and can run on any Java enabled platform, including UNIX, Macintosh and Windows. Availability: ACT is freely available (under a GPL licence) for download from the Sanger Institute web site, http://www.sanger.ac.uk Contact: artemis@sanger.ac.uk

Journal ArticleDOI
TL;DR: A new method for de novo identification of repeat families via extension of consensus seeds is developed, which enables a rigorous definition of repeat boundaries, a key issue in repeat analysis.
Abstract: Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats. Webb Miller (Personal communication) Motivation: De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis. Results: Our RepeatScout algorithm is more sensitive and is orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that ∼2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence. Availability: Source code is available for download at http://www-cse.ucsd.edu/groups/bioinformatics/software.html Contact: ppevzner@cs.ucsd.edu

Journal ArticleDOI
TL;DR: This paper presents the latest release of the program RAxML-III for rapid maximum likelihood-based inference of large evolutionary trees which allows for computation of 1,000-taxon trees in less than 24 hours on a single PC processor.
Abstract: Motivation: The computation of large phylogenetic trees with statistical models such as maximum likelihood or Bayesian inference is computationally extremely intensive. It has repeatedly been demonstrated that these models are able to recover the true tree or a tree which is topologically closer to the true tree more frequently than less elaborate methods such as parsimony or neighbor joining. Due to the combinatorial and computational complexity the size of trees which can be computed on a Biologist's PC workstation within reasonable time is limited to trees containing approximately 100 taxa. Results: In this paper we present the latest release of our program RAxML-III for rapid maximum likelihood-based inference of large evolutionary trees which allows for computation of 1,000-taxon trees in less than 24 hours on a single PC processor. We compare RAxML-III to the currently fastest implementations for maximum likelihood and Bayesian inference: PHYML and MrBayes. Whereas RAxML-III performs worse than PHYML and MrBayes on synthetic data it clearly outperforms both programs on all real data alignments used in terms of speed and final likelihood values. Availability and supplementary information: RAxML-III including all alignments and final trees mentioned in this paper is freely available as open source code at http://wwwbode.cs.tum/~stamatak Contact: stamatak@cs.tum.edu

Journal ArticleDOI
TL;DR: A method is proposed for extracting more information from within-array replicate spots in microarray experiments by estimating the strength of the correlation between them that greatly improves the precision with which the genewise variances are estimated and thereby improves inference methods designed to identify differentially expressed genes.
Abstract: Motivation: Spotted arrays are often printed with probes in duplicate or triplicate, but current methods for assessing differential expression are not able to make full use of the resulting information. The usual practice is to average the duplicate or triplicate results for each probe before assessing differential expression. This results in the loss of valuable information about genewise variability. Results: A method is proposed for extracting more information from within-array replicate spots in microarray experiments by estimating the strength of the correlation between them. The method involves fitting separate linear models to the expression data for each gene but with a common value for the between-replicate correlation. The method greatly improves the precision with which the genewise variances are estimated and thereby improves inference methods designed to identify differentially expressed genes. The method may be combined with empirical Bayes methods for moderating the genewise variances between genes. The method is validated using data from a microarray experiment involving calibration and ratio control spots in conjunction with spiked-in RNA. Comparing results for calibration and ratio control spots shows that the common correlation method results in substantially better discrimination of differentially expressed genes from those which are not. The spike-in experiment also confirms that the results may be further improved by empirical Bayes smoothing of the variances when the sample size is small. Availability: The methodology is implemented in the limma software package for R, available from the CRAN repository http://www.r-project.org Contact: [email protected]

Journal ArticleDOI
TL;DR: Datamonkey is a web interface to a suite of cutting edge maximum likelihood-based tools for identification of sites subject to positive or negative selection that are implemented to run in parallel on a cluster of computers.
Abstract: Summary: Datamonkey is a web interface to a suite of cutting edge maximum likelihood-based tools for identification of sites subject to positive or negative selection. The methods range from very fast data exploration to some of the most complex models available in public domain software, and are implemented to run in parallel on a cluster of computers. Availability: http://www.datamonkey.org. In the future, we plan to expand the collection of available analytic tools, and provide a package for installation on other systems. Contact: spond@ucsd.edu

Journal ArticleDOI
TL;DR: GMD, the Golm Metabolome Database, is presented: an open access metabolome database which provides public access to custom mass spectral libraries, metabolite profiling experiments as well as additional information and tools, e.g. with regard to methods, spectral information or compounds.
Abstract: Summary: Metabolomics, in particular gas chromatography--mass spectrometry (GC--MS) based metabolite profiling of biological extracts, is rapidly becoming one of the cornerstones of functional genomics and systems biology. Metabolite profiling has profound applications in discovering the mode of action of drugs or herbicides, and in unravelling the effect of altered gene expression on metabolism and organism performance in biotechnological applications. As such the technology needs to be available to many laboratories. For this, an open exchange of information is required, like that already achieved for transcript and protein data. One of the key-steps in metabolite profiling is the unambiguous identification of metabolites in highly complex metabolite preparations from biological samples. Collections of mass spectra, which comprise frequently observed metabolites of either known or unknown exact chemical structure, represent the most effective means to pool the identification efforts currently performed in many laboratories around the world. Here we present GMD, The Golm Metabolome Database, an open access metabolome database, which should enable these processes. GMD provides public access to custom mass spectral libraries, metabolite profiling experiments as well as additional information and tools, e.g. with regard to methods, spectral information or compounds. The main goal will be the representation of an exchange platform for experimental research activities and bioinformatics to develop and improve metabolomics by multidisciplinary cooperation. Availability: http://csbdb.mpimp-golm.mpg.de/gmd.html Contact: Steinhauser@mpimp-golm.mpg.de Supplementary information: http://csbdb.mpimp-golm.mpg.de/

Journal ArticleDOI
TL;DR: This work compares several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection, and finds that LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis and the .632+ bootstrap has the lowest mean square error.
Abstract: Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection. Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases. Contact: annette.molinaro@yale.edu Supplementary Information: A complete compilation of results and R code for simulations and analyses is available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
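The practical consequence of these results is that feature selection must be repeated inside every resampling fold; selecting features once on the full dataset and then cross-validating yields an optimistically biased error estimate. A small scikit-learn sketch of the two setups on synthetic data (illustrative settings, not the authors' R code):

```python
# Honest vs leaky error estimation when features are selected from thousands
# of candidates on few samples. Synthetic data, illustrative parameters.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=60, n_features=2000, n_informative=10,
                           random_state=0)

# Correct: feature selection happens inside each of the 10 CV folds.
inside = cross_val_score(
    make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)),
    X, y, cv=10)

# Biased: selecting on the full dataset first leaks information into every fold.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=10)

print(f"selection inside CV: {inside.mean():.2f}, leaky estimate: {leaky.mean():.2f}")
```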

Journal ArticleDOI
TL;DR: The method can be applied in structural genomics studies where protein binding sites remain uncharacterized since the 86% success rate for unbound proteins appears to be only slightly lower than that of ligand-bound proteins.
Abstract: Motivation: Identifying the location of ligand binding sites on a protein is of fundamental importance for a range of applications including molecular docking, de novo drug design and structural identification and comparison of functional sites. Here, we describe a new method of ligand binding site prediction called Q-SiteFinder. It uses the interaction energy between the protein and a simple van der Waals probe to locate energetically favourable binding sites. Energetically favourable probe sites are clustered according to their spatial proximity and clusters are then ranked according to the sum of interaction energies for sites within each cluster. Results: There is at least one successful prediction in the top three predicted sites in 90% of proteins tested when using Q-SiteFinder. This success rate is higher than that of a commonly used pocket detection algorithm (Pocket-Finder) which uses geometric criteria. Additionally, Q-SiteFinder is twice as effective as Pocket-Finder in generating predicted sites that map accurately onto ligand coordinates. It also generates predicted sites with the lowest average volumes of the methods examined in this study. Unlike pocket detection, the volumes of the predicted sites appear to show relatively low dependence on protein volume and are similar in volume to the ligands they contain. Restricting the size of the pocket is important for reducing the search space required for docking and de novo drug design or site comparison. The method can be applied in structural genomics studies where protein binding sites remain uncharacterized since the 86% success rate for unbound proteins appears to be only slightly lower than that of ligand-bound proteins. Availability: Both Q-SiteFinder and Pocket-Finder have been made available online at http://www.bioinformatics.leeds.ac.uk/qsitefinder and http://www.bioinformatics.leeds.ac.uk/pocketfinder Contact: r.m.jackson@leeds.ac.uk

Journal ArticleDOI
TL;DR: A new approach that combines sequential, structural and chemical information into one graph model of proteins, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets.
Abstract: Motivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs. Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively. Availability: More information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html. Contact: borgwardt@dbs.ifi.lmu.de

Journal ArticleDOI
TL;DR: RDP2 is a Windows 95/XP program that examines nucleotide sequence alignments and attempts to identify recombinant sequences and recombination breakpoints using 10 published recombination detection methods, including GENECONV, BOOTSCAN, MAXIMUM χ2, CHIMAERA and SISTER SCANNING.
Abstract: Summary: RDP2 is a Windows 95/XP program that examines nucleotide sequence alignments and attempts to identify recombinant sequences and recombination breakpoints using 10 published recombination detection methods, including GENECONV, BOOTSCAN, MAXIMUM χ2, CHIMAERA and SISTER SCANNING. The program enables fast automated analysis of large alignments (up to 300 sequences containing 13 000 sites), and interactive exploration, management and verification of results with different recombination detection and tree drawing methods. Availability: RDP2 is available free from the RDP2 website (http://darwin.uvigo.es/rdp/rdp.html) Contact: darren@science.uct.ac.za Supplementary information: Detailed descriptions of RDP2 and the methods it implements are included in the program manual, which can be downloaded from the RDP2 website.

Journal ArticleDOI
TL;DR: The analyses provide a novel route to infer expression profiles for presumed ancestral nodes in the tissue dendrogram, whereby de novo enhancement and diminution of gene expression go hand in hand, and highlight the importance of gene suppression events.
Abstract: Motivation: Genes are often characterized dichotomously as either housekeeping or single-tissue specific. We conjectured that crucial functional information resides in genes with midrange profiles of expression. Results: To obtain such novel information genome-wide, we have determined the mRNA expression levels for one of the largest hitherto analyzed set of 62 839 probesets in 12 representative normal human tissues. Indeed, when using a newly defined graded tissue specificity index τ, valued between 0 for housekeeping genes and 1 for tissue-specific genes, genes with midrange profiles having 0.15 < τ < 0.85 comprise >50% of all expression patterns. We developed a binary classification, indicating for every gene the IB tissues in which it is overly expressed, and the 12 - IB tissues in which it shows low expression. The 85 dominant midrange patterns with IB = 2-11 were found to be bimodally distributed, and to contribute most significantly to the definition of tissue specification dendrograms. Our analyses provide a novel route to infer expression profiles for presumed ancestral nodes in the tissue dendrogram. Such definition has uncovered an unsuspected correlation, whereby de novo enhancement and diminution of gene expression go hand in hand. These findings highlight the importance of gene suppression events, with implications to the course of tissue specification in ontogeny and phylogeny. Availability: All data and analyses are publicly available at the GeneNote website, http://genecards.weizmann.ac.il/genenote/ and, GEO accession GSE803. Contact: doron.lancet@weizmann.ac.il Supplementary information: Four tables available at the above site.
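For reference, a common formulation of such a graded specificity index, 0 for a flat housekeeping profile and 1 for single-tissue expression, is sketched below; the paper's exact normalization may differ, so treat the formula as illustrative:

```python
# Graded tissue-specificity index in [0, 1]: 0 for a flat (housekeeping) profile,
# 1 for expression confined to a single tissue. Sketch of the usual formulation;
# the paper's exact normalization may differ.
import numpy as np

def tau(expression):
    x = np.asarray(expression, dtype=float)
    x_hat = x / x.max()                       # scale by the maximal tissue value
    return (1.0 - x_hat).sum() / (len(x) - 1)

print(tau([5, 5, 5, 5]))     # housekeeping-like -> 0.0
print(tau([0, 0, 0, 50]))    # single-tissue     -> 1.0
print(tau([1, 2, 10, 40]))   # midrange profile  -> ~0.89
```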

Journal ArticleDOI
TL;DR: This review surveys computational cluster validation techniques, with a particular focus on their application to post-genomic data analysis.
Abstract: Motivation: The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge---whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. Results: This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation. Availability: The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/ Contact: J.Handl@postgrad.manchester.ac.uk Supplementary information: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/
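As a concrete example of the kind of internal validation index such a battery includes, the average silhouette width can be used to compare partitions with different numbers of clusters; the snippet uses scikit-learn on synthetic data and is not the authors' software:

```python
# One internal cluster validation index: average silhouette width, used here to
# compare k-means partitions with different numbers of clusters (higher is better).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```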

Journal ArticleDOI
TL;DR: An easy-to-use, versatile and freely available graphic web server, FoldIndex© predicts if a given protein sequence is intrinsically unfolded implementing the algorithm of Uversky and co-workers, which is based on the average residue hydrophobicity and net charge of the sequence.
Abstract: Summary: An easy-to-use, versatile and freely available graphic web server, FoldIndex© is described: it predicts if a given protein sequence is intrinsically unfolded implementing the algorithm of Uversky and co-workers, which is based on the average residue hydrophobicity and net charge of the sequence. FoldIndex© has an error rate comparable to that of more sophisticated fold prediction methods. Sliding windows permit identification of large regions within a protein that possess folding propensities different from those of the whole protein. Availability: FoldIndex© can be accessed at http://bioportal.weizmann.ac.il/fldbin/findex Contact: Joel.Sussman@weizmann.ac.il Supplementary information: http://www.weizmann.ac.il/sb/faculty_pages/Sussman/papers/suppl/Prilusky_2005
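A minimal sketch of the charge-hydropathy criterion referred to here: the constants and the Kyte-Doolittle scaling below follow the published formulation as we understand it (positive values suggest a foldable sequence, negative values intrinsic disorder), but should be treated as an approximation rather than the server's exact implementation, which additionally evaluates sliding windows:

```python
# Charge-hydropathy fold index sketch (Uversky-style boundary). Constants and
# scaling are assumptions based on the published formulation, not the server code.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def fold_index(seq):
    """Positive: likely folded; negative: likely intrinsically unstructured."""
    h = sum((KD[a] + 4.5) / 9.0 for a in seq) / len(seq)   # mean hydrophobicity in [0, 1]
    q = abs(sum({"K": 1, "R": 1, "D": -1, "E": -1}.get(a, 0) for a in seq)) / len(seq)
    return 2.785 * h - q - 1.151

print(fold_index("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))     # toy example
```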

Journal ArticleDOI
TL;DR: A detailed comparison of the capabilities of 14 ontological analysis tools is presented using the following criteria: scope of the analysis, visualization capabilities, statistical model used, correction for multiple comparisons, reference microarrays available, installation issues and sources of annotation data.
Abstract: Summary: Independent of the platform and the analysis methods used, the result of a microarray experiment is, in most cases, a list of differentially expressed genes An automatic ontological analysis approach has been recently proposed to help with the biological interpretation of such results Currently, this approach is the de facto standard for the secondary analysis of high throughput experiments and a large number of tools have been developed for this purpose We present a detailed comparison of 14 such tools using the following criteria: scope of the analysis, visualization capabilities, statistical model(s) used, correction for multiple comparisons, reference microarrays available, installation issues and sources of annotation data This detailed analysis of the capabilities of these tools will help researchers choose the most appropriate tool for a given type of analysis More importantly, in spite of the fact that this type of analysis has been generally adopted, this approach has several important intrinsic drawbacks These drawbacks are associated with all tools discussed and represent conceptual limitations of the current state-of-the-art in ontological analysis We propose these as challenges for the next generation of secondary data analysis tools Contact: [email protected]

Journal ArticleDOI
TL;DR: A novel framework for small-sample inference of graphical models from gene expression data that focuses on the so-called graphical Gaussian models (GGMs) that are now frequently used to describe gene association networks and to detect conditionally dependent genes is introduced.
Abstract: Motivation: Genetic networks are often described statistically using graphical models (e.g. Bayesian networks). However, inferring the network structure offers a serious challenge in microarray analysis where the sample size is small compared to the number of considered genes. This renders many standard algorithms for graphical models inapplicable, and inferring genetic networks an 'ill-posed' inverse problem. Methods: We introduce a novel framework for small-sample inference of graphical models from gene expression data. Specifically, we focus on the so-called graphical Gaussian models (GGMs) that are now frequently used to describe gene association networks and to detect conditionally dependent genes. Our new approach is based on (1) improved (regularized) small-sample point estimates of partial correlation, (2) an exact test of edge inclusion with adaptive estimation of the degree of freedom and (3) a heuristic network search based on false discovery rate multiple testing. Steps (2) and (3) correspond to an empirical Bayes estimate of the network topology. Results: Using computer simulations, we investigate the sensitivity (power) and specificity (true negative rate) of the proposed framework to estimate GGMs from microarray data. This shows that it is possible to recover the true network topology with high accuracy even for small-sample datasets. Subsequently, we analyze gene expression data from a breast cancer tumor study and illustrate our approach by inferring a corresponding large-scale gene association network for 3883 genes. Availability: The authors have implemented the approach in the R package 'GeneTS' that is freely available from http://www.stat.uni-muenchen.de/~strimmer/genets/, from the R archive (CRAN) and from the Bioconductor website. Contact: korbinian.strimmer@lmu.de
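The quantity at the heart of a GGM is the partial correlation matrix, which can be obtained from the inverse of a (regularized) covariance matrix. The sketch below uses a plain ridge term as a stand-in for the paper's regularized small-sample estimator and omits the edge test and FDR-based network search:

```python
# Partial correlations from a regularized covariance matrix (n << p setting).
# A simple ridge term stands in for the paper's shrinkage estimator.
import numpy as np

def partial_correlations(X, ridge=0.1):
    """X is samples x genes; returns the genes x genes partial correlation matrix."""
    S = np.cov(X, rowvar=False)
    omega = np.linalg.inv(S + ridge * np.eye(S.shape[1]))   # precision matrix
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)                          # standardize off-diagonals
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))        # 20 arrays, 50 genes: fewer samples than genes
print(partial_correlations(X).shape)
```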

Journal ArticleDOI
TL;DR: A software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures is developed, the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.
Abstract: Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories, 41 cancer types and 12 normal tissue types. Results: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. Availability: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. Contact: alexander.statnikov@vanderbilt.edu

Journal ArticleDOI
TL;DR: The success rates obtained by the new predictor are all significantly higher than those by the previous predictors, which implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role in its structure and function.
Abstract: Motivation: With protein sequences entering into databanks at an explosive pace, the early determination of the family or subfamily class for a newly found enzyme molecule becomes important because this is directly related to the detailed information about which specific target it acts on, as well as to its catalytic process and biological function. Unfortunately, it is both time-consuming and costly to do so by experiments alone. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone without including any sequence-order information. Therefore, it is worthy of further investigation. Results: To incorporate the sequence-order effects into the predictor, the 'amphiphilic pseudo amino acid composition' is introduced to represent the statistical sample of a protein. The novel representation contains 20 + 2λ discrete numbers: the first 20 numbers are the components of the conventional amino acid composition; the next 2λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. Based on such a concept and formulation scheme, a new predictor is developed. It is shown by the self-consistency test, jackknife test and independent dataset tests that the success rates obtained by the new predictor are all significantly higher than those by the previous predictors. The significant enhancement in success rates also implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role in its structure and function. Contact: [email protected]
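A sketch of how such a 20 + 2λ feature vector can be assembled: the amino acid composition followed by λ pairs of sequence-order correlation factors, one per physicochemical scale. The scales, their normalization and the weight w are deliberately left as inputs (the toy values below are hypothetical), since the paper's exact parameterization is not reproduced here:

```python
# Sketch of a "20 + 2*lambda" amphiphilic pseudo amino acid composition vector.
# The hydrophobicity/hydrophilicity scales and the weight w are illustrative inputs.
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def amphiphilic_pseaac(seq, h1, h2, lam=5, w=0.05):
    """Return 20 composition features followed by lam pairs of correlation factors."""
    L = len(seq)
    comp = Counter(seq)
    taus = []
    for j in range(1, lam + 1):            # sequence-order correlation for gap j
        taus.append(sum(h1[seq[i]] * h1[seq[i + j]] for i in range(L - j)) / (L - j))
        taus.append(sum(h2[seq[i]] * h2[seq[i + j]] for i in range(L - j)) / (L - j))
    denom = 1.0 + w * sum(taus)            # composition frequencies already sum to 1
    features = [comp[a] / L / denom for a in AA]
    features += [w * t / denom for t in taus]
    return features

# toy, clearly hypothetical scales just to make the sketch runnable
h1 = {a: (i - 9.5) / 9.5 for i, a in enumerate(AA)}
h2 = {a: (9.5 - i) / 9.5 for i, a in enumerate(AA)}
vec = amphiphilic_pseaac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", h1, h2)
print(len(vec))   # 20 + 2*5 = 30
```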