Showing papers in "BMC Bioinformatics in 2008"


Journal ArticleDOI
TL;DR: The WGCNA R software package is a comprehensive collection of R functions for performing various aspects of weighted correlation network analysis that includes functions for network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software.
Abstract: Correlation networks are increasingly being used in bioinformatics applications. For example, weighted gene co-expression network analysis is a systems biology method for describing the correlation patterns among genes across microarray samples. Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network-based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets. These methods have been successfully applied in various biological contexts, e.g. cancer, mouse genetics, yeast genetics, and analysis of brain imaging data. While parts of the correlation network methodology have been described in separate publications, there is a need to provide a user-friendly, comprehensive, and consistent software implementation and an accompanying tutorial. The WGCNA R software package is a comprehensive collection of R functions for performing various aspects of weighted correlation network analysis. The package includes functions for network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software. Along with the R package we also present R software tutorials. While the methods development was motivated by gene expression data, the underlying data mining approach can be applied to a variety of different settings. The WGCNA package provides R functions for weighted correlation network analysis, e.g. co-expression network analysis of gene expression data. The R package along with its source code and additional material are freely available at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA

14,243 citations
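
As a rough illustration of the workflow this abstract describes, here is a minimal R sketch built on functions the WGCNA package documents (pickSoftThreshold, blockwiseModules, moduleEigengenes); the simulated data and the power value are assumptions for illustration, not recommendations.

    library(WGCNA)
    set.seed(1)
    # toy data: 50 microarray samples x 500 genes, standing in for real expression data
    datExpr <- matrix(rnorm(50 * 500), nrow = 50)
    sft <- pickSoftThreshold(datExpr)          # diagnostics to guide the soft threshold
    net <- blockwiseModules(datExpr,
                            power = 6,         # assumed soft-thresholding power
                            minModuleSize = 30,
                            numericLabels = TRUE)
    table(net$colors)                          # module assignments (label 0 = unassigned)
    MEs <- moduleEigengenes(datExpr, net$colors)$eigengenes  # one eigengene per module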


Journal ArticleDOI
Yang Zhang
TL;DR: The I-TASSER server has been developed to generate automated full-length 3D protein structural predictions where the benchmarked scoring system helps users to obtain quantitative assessments of the I-TASSER models.
Abstract: Prediction of 3-dimensional protein structures from amino acid sequences represents one of the most important problems in computational structural biology. The community-wide Critical Assessment of Structure Prediction (CASP) experiments have been designed to obtain an objective assessment of the state-of-the-art of the field, where I-TASSER was ranked as the best method in the server section of the recent 7th CASP experiment. Our laboratory has since then received numerous requests about the public availability of the I-TASSER algorithm and the usage of the I-TASSER predictions. An on-line version of I-TASSER has been developed at the KU Center for Bioinformatics, which has generated protein structure predictions for thousands of modeling requests from more than 35 countries. A scoring function (C-score) based on the relative clustering structural density and the consensus significance score of multiple threading templates is introduced to estimate the accuracy of the I-TASSER predictions. A large-scale benchmark test demonstrates a strong correlation between the C-score and the TM-score (a structural similarity measurement with values in [0, 1]) of the first models, with a correlation coefficient of 0.91. Using a C-score cutoff > -1.5 for the models of correct topology, both false positive and false negative rates are below 0.1. Combining C-score and protein length, the accuracy of the I-TASSER models can be predicted with an average error of 0.08 for TM-score and 2 Å for RMSD. The I-TASSER server has been developed to generate automated full-length 3D protein structural predictions where the benchmarked scoring system helps users to obtain quantitative assessments of the I-TASSER models. The output of the I-TASSER server for each query includes up to five full-length models, the confidence score, the estimated TM-score and RMSD, and the standard deviation of the estimations. The I-TASSER server is freely available to the academic community at http://zhang.bioinformatics.ku.edu/I-TASSER .

4,754 citations


Journal ArticleDOI
TL;DR: The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes that is stable, extensible, and freely available to all researchers.
Abstract: Random community genomes (metagenomes) are now commonly used to study microbes in different environments. Over the past few years, the major challenge associated with metagenomics shifted from generating to analyzing sequences. High-throughput, low-cost next-generation sequencing has provided access to metagenomics to a wide range of researchers. A high-throughput pipeline has been constructed to provide high-performance computing to all researchers interested in using metagenomics. The pipeline produces automated functional assignments of sequences in the metagenome by comparing them against both protein and nucleotide databases. Phylogenetic and functional summaries of the metagenomes are generated, and tools for comparative metagenomics are incorporated into the standard views. User access is controlled to ensure data privacy, but the collaborative environment underpinning the service provides a framework for sharing datasets between multiple users. In the metagenomics RAST, all users retain full control of their data, and everything is available for download in a variety of formats. The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes. With built-in support for multiple data sources and a back end that houses abstract data types, the metagenomics RAST is stable, extensible, and freely available to all researchers. This service has removed one of the primary bottlenecks in metagenome sequence analysis – the availability of high-performance computing for annotating the data. http://metagenomics.nmpdr.org

3,322 citations


Journal ArticleDOI
TL;DR: A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.
Abstract: Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.

2,466 citations
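
The conditional permutation scheme described above is implemented in the R package party; below is a minimal sketch with simulated data, where the toy predictors and forest settings are assumptions for illustration.

    library(party)
    set.seed(1)
    # x2 is strongly correlated with the truly relevant predictor x1
    d <- data.frame(x1 = rnorm(200))
    d$x2 <- d$x1 + rnorm(200, sd = 0.2)
    d$y  <- d$x1 + rnorm(200)
    cf <- cforest(y ~ x1 + x2, data = d,
                  control = cforest_unbiased(ntree = 200))
    varimp(cf)                      # marginal permutation importance
    varimp(cf, conditional = TRUE)  # conditional importance, as proposed in this paper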


Journal ArticleDOI
TL;DR: This work presents LOSITAN, a selection detection workbench based on a well-evaluated Fst-outlier detection method that greatly facilitates correct approximation of model parameters, provides data import and export functions, iterative contour smoothing and generation of graphics in an easy-to-use graphical user interface.
Abstract: Testing for selection is becoming one of the most important steps in the analysis of multilocus population genetics data sets. Existing applications are difficult to use, leaving many non-trivial, error-prone tasks to the user. Here we present LOSITAN, a selection detection workbench based on a well-evaluated Fst-outlier detection method. LOSITAN greatly facilitates correct approximation of model parameters (e.g., genome-wide average, neutral Fst), provides data import and export functions, iterative contour smoothing and generation of graphics in an easy-to-use graphical user interface. LOSITAN is able to use modern multi-core processor architectures by locally parallelizing fdist, reducing computation time by half on current dual-core machines, with almost linear performance gains on machines with more cores. LOSITAN makes selection detection feasible for a much wider range of users, even for large population genomic datasets, by providing both an easy-to-use interface and essential functionality to complete the whole selection detection process.

1,121 citations


Journal ArticleDOI
TL;DR: LTRharvest is a software tool for the de novo detection of full-length LTR retrotransposons in large sequence sets; its main advantages are its ability to efficiently handle large datasets from finished or unfinished genome projects, its flexibility in incorporating known sequence features into the prediction, and its availability as open source software.
Abstract: Transposable elements are abundant in eukaryotic genomes and it is believed that they have a significant impact on the evolution of gene and chromosome structure. While there are several completed eukaryotic genome projects, there are only few high-quality genome-wide annotations of transposable elements. Therefore, there is a considerable demand for computational identification of transposable elements. LTR retrotransposons, an important subclass of transposable elements, are well suited for computational identification, as they contain long terminal repeats (LTRs). We have developed a software tool, LTRharvest, for the de novo detection of full-length LTR retrotransposons in large sequence sets. LTRharvest efficiently delivers high-quality annotations based on known LTR transposon features like length, distance, and sequence motifs. A quality validation of LTRharvest against a gold standard annotation for Saccharomyces cerevisiae and Drosophila melanogaster shows a sensitivity of up to 90% and 97% and a specificity of 100% and 72%, respectively. This is comparable or slightly better than annotations from previous software tools. The main advantages of LTRharvest over previous tools are (a) its ability to efficiently handle large datasets from finished or unfinished genome projects, (b) its flexibility in incorporating known sequence features into the prediction, and (c) its availability as open source software. LTRharvest is an efficient software tool delivering high-quality annotation of LTR retrotransposons. It can, for example, process the largest human chromosome in approx. 8 minutes on a Linux PC with 4 GB of memory. Its flexibility and small space and run-time requirements make LTRharvest a very competitive candidate for future LTR retrotransposon annotation projects. Moreover, the structured design and implementation and the availability as open source provide an excellent base for incorporating novel concepts to further improve the prediction of LTR retrotransposons.

995 citations


Journal ArticleDOI
TL;DR: ElliPro is a web-tool that implements Thornton's method for identifying continuous epitopes in the protein regions protruding from the protein's globular surface and, together with a residue clustering algorithm, the MODELLER program and the Jmol viewer, allows the prediction and visualization of antibody epitopes in a given protein sequence or structure.
Abstract: Background Reliable prediction of antibody, or B-cell, epitopes remains challenging yet highly desirable for the design of vaccines and immunodiagnostics. A correlation between antigenicity, solvent accessibility, and flexibility in proteins was demonstrated. Subsequently, Thornton and colleagues proposed a method for identifying continuous epitopes in the protein regions protruding from the protein's globular surface. The aim of this work was to implement that method as a web-tool and evaluate its performance on discontinuous epitopes known from the structures of antibody-protein complexes.

988 citations


Journal ArticleDOI
TL;DR: A new feature detection algorithm, centWave, is developed for high-resolution LC/MS data sets; it collects regions of interest (partial mass traces) in the raw data and applies continuous wavelet transformation and, optionally, Gauss fitting in the chromatographic domain.
Abstract: Liquid chromatography coupled to mass spectrometry (LC/MS) is an important analytical technology for e.g. metabolomics experiments. Determining the boundaries, centres and intensities of the two-dimensional signals in the LC/MS raw data is called feature detection. For the subsequent analysis of complex samples such as plant extracts, which may contain hundreds of compounds corresponding to thousands of features, reliable feature detection is mandatory. We developed a new feature detection algorithm, centWave, for high-resolution LC/MS data sets, which collects regions of interest (partial mass traces) in the raw data and applies continuous wavelet transformation and, optionally, Gauss fitting in the chromatographic domain. We evaluated our feature detection algorithm on dilution series and mixtures of seed and leaf extracts, and estimated recall, precision and F-score of seed- and leaf-specific features in two experiments of different complexity. The new feature detection algorithm meets the requirements of current metabolomics experiments. centWave can detect close-by and partially overlapping features and has the highest overall recall and precision values compared to the other algorithms, matchedFilter (the original algorithm of XCMS) and the centroidPicker from MZmine. The centWave algorithm was integrated into the Bioconductor R-package XCMS and is available from http://www.bioconductor.org/

954 citations
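
A minimal sketch of invoking centWave through the XCMS interface of that generation; the file locations and parameter values below are assumptions for illustration.

    library(xcms)
    # assumed directory of centroided high-resolution LC/MS files
    files <- list.files("mzdata", pattern = "mzML$", full.names = TRUE)
    xset <- xcmsSet(files, method = "centWave",
                    ppm = 25,              # allowed m/z deviation when collecting ROIs
                    peakwidth = c(5, 20),  # expected chromatographic peak width range (s)
                    snthresh = 10)         # signal-to-noise cutoff for reported features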


Journal ArticleDOI
TL;DR: The Bayesian modelling methods introduced in this article represent an array of enhanced tools for learning the genetic structure of populations designed to meet the increasing need for analyzing large-scale population genetics data.
Abstract: During the most recent decade many Bayesian statistical models and software for answering questions related to the genetic structure underlying population samples have appeared in the scientific literature. Most of these methods utilize molecular markers for the inferences, while some are also capable of handling DNA sequence data. In a number of earlier works, we have introduced an array of statistical methods for population genetic inference that are implemented in the software BAPS. However, the complexity of biological problems related to genetic structure analysis keeps increasing such that in many cases the current methods may provide either inappropriate or insufficient solutions. We discuss the necessity of enhancing the statistical approaches to face the challenges posed by the ever-increasing amounts of molecular data generated by scientists over a wide range of research areas and introduce an array of new statistical tools implemented in the most recent version of BAPS. With these methods it is possible, e.g., to fit genetic mixture models using user-specified numbers of clusters and to estimate levels of admixture under a genetic linkage model. Also, alleles representing a different ancestry compared to the average observed genomic positions can be tracked for the sampled individuals, and a priori specified hypotheses about genetic population structure can be directly compared using Bayes' theorem. In general, we have improved further the computational characteristics of the algorithms behind the methods implemented in BAPS facilitating the analyses of large and complex datasets. In particular, analysis of a single dataset can now be spread over multiple computers using a script interface to the software. The Bayesian modelling methods introduced in this article represent an array of enhanced tools for learning the genetic structure of populations. Their implementations in the BAPS software are designed to meet the increasing need for analyzing large-scale population genetics data. The software is freely downloadable for Windows, Linux and Mac OS X systems at http://web.abo.fi/fak/mnf//mate/jc/software/baps.html .

818 citations


Journal ArticleDOI
TL;DR: BatchPrimer3 is a comprehensive web primer design program to develop different types of primers in a high-throughput manner and has been designed using the Primer3 core program and validated in several laboratories.
Abstract: Microsatellite (simple sequence repeat – SSR) and single nucleotide polymorphism (SNP) markers are two types of important genetic markers useful in genetic mapping and genotyping. Often, large-scale genomic research projects require high-throughput computer-assisted primer design. Numerous such web-based or stand-alone programs for PCR primer design are available but vary in quality and functionality. In particular, most programs lack batch primer design capability. Such a high-throughput software tool for designing SSR flanking primers and SNP genotyping primers is in increasing demand. A new web primer design program, BatchPrimer3, was developed based on Primer3. BatchPrimer3 adopts the Primer3 core program as its major primer design engine to choose the best primer pairs. A new score-based primer picking module is incorporated into BatchPrimer3 and used to pick position-restricted primers. BatchPrimer3 v1.0 implements several types of primer designs including generic primers, SSR primers together with SSR detection, and SNP genotyping primers (including single-base extension primers, allele-specific primers, and tetra-primers for tetra-primer ARMS PCR), as well as DNA sequencing primers. DNA sequences in FASTA format can be batch read into the program. Basic information about the input sequences, useful as a reference when setting primer design parameters, can be obtained by pre-analysis of the sequences. The input sequences can be pre-processed and masked to exclude and/or include specific regions, or to set targets for different primer design purposes as in Primer3Web and Primer3Plus. A tab-delimited or Excel-formatted primer output also greatly facilitates the subsequent primer-ordering process. Thousands of primers, including wheat conserved intron-flanking primers, wheat genome-specific SNP genotyping primers, and Brachypodium SSR flanking primers in several genome projects have been designed using the program and validated in several laboratories. BatchPrimer3 is a comprehensive web primer design program to develop different types of primers in a high-throughput manner. Additional methods of primer design can be easily integrated into future versions of BatchPrimer3. The program with source code and thousands of PCR and sequencing primers designed for wheat and Brachypodium are accessible at http://wheat.pw.usda.gov/demos/BatchPrimer3/ .

757 citations


Journal ArticleDOI
TL;DR: This work presents OpenMS, a software framework for rapid application development in mass spectrometry designed to be portable, easy-to-use and robust while offering a rich functionality ranging from basic data structures to sophisticated algorithms for data analysis.
Abstract: Mass spectrometry is an essential analytical technique for high-throughput analysis in proteomics and metabolomics. The development of new separation techniques, precise mass analyzers and experimental protocols is a very active field of research. This leads to more complex experimental setups yielding ever increasing amounts of data. Consequently, analysis of the data is currently often the bottleneck for experimental studies. Although software tools for many data analysis tasks are available today, they are often hard to combine with each other or not flexible enough to allow for rapid prototyping of a new analysis workflow. We present OpenMS, a software framework for rapid application development in mass spectrometry. OpenMS has been designed to be portable, easy-to-use and robust while offering a rich functionality ranging from basic data structures to sophisticated algorithms for data analysis. This has already been demonstrated in several studies. OpenMS is available under the GNU Lesser General Public License (LGPL) from the project website at http://www.openms.de .

Journal ArticleDOI
TL;DR: Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.
Abstract: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.
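
A toy R sketch in the spirit of this comparison, using the standard e1071 and randomForest implementations; the simulated "expression matrix" and the single train/test split are assumptions, whereas the paper's evaluation uses 22 real datasets and cross-validation.

    library(e1071)
    library(randomForest)
    set.seed(1)
    # toy stand-in for microarray data: 60 samples x 200 genes, two classes
    x <- matrix(rnorm(60 * 200), nrow = 60)
    y <- factor(rep(c("A", "B"), each = 30))
    x[y == "B", 1:5] <- x[y == "B", 1:5] + 1     # a handful of informative genes
    idx <- sample(60, 40)                        # train on 40 samples, test on 20
    svm.fit <- svm(x[idx, ], y[idx])
    rf.fit  <- randomForest(x[idx, ], y[idx])
    mean(predict(svm.fit, x[-idx, ]) == y[-idx]) # SVM test accuracy
    mean(predict(rf.fit,  x[-idx, ]) == y[-idx]) # random forest test accuracy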

Journal ArticleDOI
TL;DR: PredGPI is a prediction method that, by coupling a Hidden Markov Model and a Support Vector Machine, is able to efficiently predict both the presence of the GPI-anchor and the position of the ω-site and is therefore a costless, rapid and accurate method for screening whole proteomes.
Abstract: Background Several eukaryotic proteins associated with the extracellular leaflet of the plasma membrane carry a glycosylphosphatidylinositol (GPI) anchor, which is linked to the C-terminal residue after a proteolytic cleavage occurring at the so-called ω-site. Computational methods were developed to discriminate proteins that undergo this post-translational modification starting from their amino acid sequences. However, more accurate methods are needed for a reliable annotation of whole proteomes.

Journal ArticleDOI
TL;DR: A new option of the MAFFT alignment program, X-INS-i is developed, which builds a multiple alignment with an iterative method incorporating structural information through two components: pairwise structural alignments by an external pairwise alignment method such as SCARNA or LaRA and a new objective function, Four-way Consistency, derived from the base-pairing probability of every sub-aligned group at every multiple alignment stage.
Abstract: Structural alignment of RNAs is becoming important, since the discovery of functional non-coding RNAs (ncRNAs). Recent studies, mainly based on various approximations of the Sankoff algorithm, have resulted in considerable improvement in the accuracy of pairwise structural alignment. In contrast, for the cases with more than two sequences, the practical merit of structural alignment remains unclear as compared to traditional sequence-based methods, although the importance of multiple structural alignment is widely recognized. We took a different approach from a straightforward extension of the Sankoff algorithm to the multiple alignments from the viewpoints of accuracy and time complexity. As a new option of the MAFFT alignment program, we developed a multiple RNA alignment framework, X-INS-i, which builds a multiple alignment with an iterative method incorporating structural information through two components: (1) pairwise structural alignments by an external pairwise alignment method such as SCARNA or LaRA and (2) a new objective function, Four-way Consistency, derived from the base-pairing probability of every sub-aligned group at every multiple alignment stage. The BRAliBASE benchmark showed that X-INS-i outperforms other methods currently available in the sum-of-pairs score (SPS) criterion. As a basis for predicting common secondary structure, the accuracy of the present method is comparable to or rather higher than those of the current leading methods such as RNA Sampler. The X-INS-i framework can be used for building a multiple RNA alignment from any combination of algorithms for pairwise RNA alignment and base-pairing probability. The source code is available at the webpage found in the Availability and requirements section.

Journal ArticleDOI
TL;DR: The accuracy of RNAalifold predictions can be improved substantially by introducing a different, more rational handling of alignment gaps, and by replacing the rather simplistic model of covariance scoring with more sophisticated RIBOSUM-like scoring matrices.
Abstract: Background: The prediction of a consensus structure for a set of related RNAs is an important first step for subsequent analyses. RNAalifold, which computes the minimum energy structure that is simultaneously formed by a set of aligned sequences, is one of the oldest and most widely used tools for this task. In recent years, several alternative approaches have been advocated, pointing to several shortcomings of the original RNAalifold approach. Results: We show that the accuracy of RNAalifold predictions can be improved substantially by introducing a different, more rational handling of alignment gaps, and by replacing the rather simplistic model of covariance scoring with more sophisticated RIBOSUM-like scoring matrices. These improvements are achieved without compromising the computational efficiency of the algorithm. We show here that the new version of RNAalifold not only outperforms the old one, but also several other tools recently developed, on different datasets. Conclusion: The new version of RNAalifold not only can replace the old one for almost any application but it is also competitive with other approaches including those based on SCFGs, maximum expected accuracy, or hierarchical nearest neighbor classifiers.

Journal ArticleDOI
TL;DR: A unifying index was created to facilitate searching for redundant interaction data and to group together redundant interaction data while recording the methods used to perform this grouping.
Abstract: Background: Interaction data for a given protein may be spread across multiple databases. We set out to create a unifying index that would facilitate searching for these data and that would group together redundant interaction data while recording the methods used to perform this grouping. Results: We present a method to generate a key for a protein interaction record and a key for each participant protein. These keys may be generated by anyone using only the primary sequence of the proteins, their taxonomy identifiers and the Secure Hash Algorithm. Two interaction records will have identical keys if they refer to the same set of identical protein sequences and taxonomy identifiers. We define records with identical keys as a redundant group. Our method required that we map protein database references found in interaction records to current protein sequence records. Operations performed during this mapping are described by a mapping score that may provide valuable feedback to source interaction databases on problematic references that are malformed, deprecated, ambiguous or unfound. Keys for protein participants allow for retrieval of interaction information independent of the protein references used in the original records. Conclusion: We have applied our method to protein interaction records from BIND, BioGrid, DIP, HPRD, IntAct, MINT, MPact, MPPI and OPHID. The resulting interaction reference index is provided in PSI-MITAB 2.5 format at http://irefindex.uio.no. This index may form the basis of alternative redundant groupings based on gene identifiers or near sequence identity groupings.
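
A hedged R sketch of the keying idea: hash each participant's primary sequence together with its taxonomy identifier, then derive the interaction key from the sorted participant keys so that the same set of proteins always yields the same key. The exact canonical form and encoding used by iRefIndex may differ; the function names here are illustrative only.

    library(digest)
    # key for one protein participant: SHA-1 over uppercase sequence plus taxid
    participant_key <- function(sequence, taxid) {
      digest(paste0(toupper(sequence), taxid), algo = "sha1", serialize = FALSE)
    }
    # key for an interaction record: hash of the sorted participant keys
    interaction_key <- function(keys) {
      digest(paste(sort(keys), collapse = ""), algo = "sha1", serialize = FALSE)
    }
    k1 <- participant_key("MKTAYIAKQR", "9606")   # toy sequence, human taxid
    k2 <- participant_key("MSDNQLNKK",  "9606")
    interaction_key(c(k1, k2))   # identical for any record with these two participants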

Journal ArticleDOI
TL;DR: Here, CellProfiler Analyst, open-source software for the interactive exploration and analysis of multidimensional data, particularly data from high-throughput, image-based experiments, is described.
Abstract: Image-based screens can produce hundreds of measured features for each of hundreds of millions of individual cells in a single experiment. Here, we describe CellProfiler Analyst, open-source software for the interactive exploration and analysis of multidimensional data, particularly data from high-throughput, image-based experiments. The system enables interactive data exploration for image-based screens and automated scoring of complex phenotypes that require combinations of multiple measured features per cell.

Journal ArticleDOI
TL;DR: The results show that graphics cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment, and their performance is better than any alternative available on commodity hardware platforms.
Abstract: Background Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphics cards, to develop high performance solutions for sequence alignment.
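
For reference, the dynamic-programming recurrence behind the quadratic cost mentioned above (standard Smith-Waterman with a linear gap penalty d and substitution score s) is:

    H_{i,j} = \max\{\, 0,\ H_{i-1,j-1} + s(a_i, b_j),\ H_{i-1,j} - d,\ H_{i,j-1} - d \,\}, \qquad H_{i,0} = H_{0,j} = 0

The optimal local alignment score is \max_{i,j} H_{i,j}; filling the table for sequences of lengths m and n takes O(mn) cell updates, which is the cost the authors offload onto the graphics card.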

Journal ArticleDOI
TL;DR: The package minet provides a series of tools for inferring transcriptional networks from microarray data and integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves, in order to compare the inferred network with a reference one.
Abstract: This paper presents the R/Bioconductor package minet (version 1.1.6) which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g. transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves, in order to compare the inferred network with a reference one. The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
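
A minimal sketch of the package's intended use; the example relies on the synthetic dataset that, to the best of my knowledge, ships with minet, and the estimator and discretization choices are illustrative.

    library(minet)
    data(syn.data)   # synthetic expression data bundled with the package
    data(syn.net)    # the true underlying network, for validation
    net <- minet(syn.data, method = "mrnet",
                 estimator = "mi.empirical", disc = "equalfreq")
    val <- validate(net, syn.net)   # confusion counts over edge-weight thresholds
    max(fscores(val))               # best achievable F-score against the reference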

Journal ArticleDOI
TL;DR: QuantPrime constitutes a flexible, fully automated web application for reliable primer design for use in larger qPCR experiments, as proven by experimental data.
Abstract: Medium- to large-scale expression profiling using quantitative polymerase chain reaction (qPCR) assays is becoming increasingly important in genomics research. A major bottleneck in experiment preparation is the design of specific primer pairs, where researchers have to make several informed choices, often outside their area of expertise. Using currently available primer design tools, several interactive decisions have to be made, resulting in lengthy design processes with varying qualities of the assays. Here we present QuantPrime, an intuitive and user-friendly, fully automated tool for primer pair design in small- to large-scale qPCR analyses. QuantPrime can be used online at http://www.quantprime.de/ or on a local computer after download; it offers design and specificity checking with highly customizable parameters and is ready to use with many publicly available transcriptomes of important higher eukaryotic model organisms and plant crops (currently 295 species in total), while benefiting from exon-intron border and alternative splice variant information in available genome annotations. Experimental results with the model plant Arabidopsis thaliana, the crop Hordeum vulgare and the model green alga Chlamydomonas reinhardtii show success rates of designed primer pairs exceeding 96%. QuantPrime constitutes a flexible, fully automated web application for reliable primer design for use in larger qPCR experiments, as proven by experimental data. The flexible framework is also open for simple use in other quantification applications, such as hydrolysis probe design for qPCR and oligonucleotide probe design for quantitative in situ hybridization. Future suggestions made by users can be easily implemented, thus allowing QuantPrime to be developed into a broad-range platform for the design of RNA expression assays.

Journal ArticleDOI
TL;DR: Including a dye effect such as in the methods lmbr_dye, anovaFix and anovaMix compensates for residual dye-related inconsistencies in the data and renders the results more robust against array failure.
Abstract: As an alternative to the frequently used "reference design" for two-channel microarrays, other designs have been proposed. These designs have been shown to be more profitable from a theoretical point of view (more replicates of the conditions of interest for the same number of arrays). However, the interpretation of the measurements is less straightforward and a reconstruction method is needed to convert the observed ratios into the genuine profile of interest (e.g. a time profile). The potential advantages of using these alternative designs thus largely depend on the success of the profile reconstruction. Therefore, we compared to what extent different linear models agree with each other in reconstructing expression ratios and corresponding time profiles from a complex design. On average the correlation between the estimated ratios was high, and all methods agreed with each other in predicting the same profile, especially for genes of which the expression profile showed a large variance across the different time points. Assessing the similarity in profile shape, it appears that, the more similar the underlying principles of the methods (model and input data), the more similar their results. Methods with a dye effect seemed more robust against array failure. The influence of a different normalization was not drastic and independent of the method used. Including a dye effect such as in the methods lmbr_dye, anovaFix and anovaMix compensates for residual dye-related inconsistencies in the data and renders the results more robust against array failure. Including random effects requires more parameters to be estimated and is only advised when a design is used with a sufficient number of replicates. Because of this, we believe lmbr_dye, anovaFix and anovaMix are most appropriate for practical use.

Journal ArticleDOI
TL;DR: A corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts, which is also a good resource for the linguistic analysis of scientific and clinical texts.
Abstract: Detecting uncertain and negative assertions is essential in most BioMedical Text Mining tasks where, in general, the aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus). The corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist – also responsible for setting up the annotation guidelines – who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences that were considered for annotation, and over 10% of them actually contain one (or more) linguistic annotations suggesting negation or uncertainty. Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.

Journal ArticleDOI
TL;DR: This paper reports on the PhyloNet software package, a suite of tools for analyzing reticulate evolutionary relationships, or evolutionary networks, which are rooted, directed, acyclic graphs leaf-labeled by a set of taxa.
Abstract: Phylogenies, i.e., the evolutionary histories of groups of taxa, play a major role in representing the interrelationships among biological entities. Many software tools for reconstructing and evaluating such phylogenies have been proposed, almost all of which assume the underlying evolutionary history to be a tree. While trees give a satisfactory first-order approximation for many families of organisms, other families exhibit evolutionary mechanisms that cannot be represented by trees. Processes such as horizontal gene transfer (HGT), hybrid speciation, and interspecific recombination, collectively referred to as reticulate evolutionary events, result in networks, rather than trees, of relationships. Various software tools have been recently developed to analyze reticulate evolutionary relationships, which include SplitsTree4, LatTrans, EEEP, HorizStory, and T-REX. In this paper, we report on the PhyloNet software package, which is a suite of tools for analyzing reticulate evolutionary relationships, or evolutionary networks, which are rooted, directed, acyclic graphs, leaf-labeled by a set of taxa. These tools can be classified into four categories: (1) evolutionary network representation: reading/writing evolutionary networks in a newly devised compact form; (2) evolutionary network characterization: analyzing evolutionary networks in terms of three basic building blocks – trees, clusters, and tripartitions; (3) evolutionary network comparison: comparing two evolutionary networks in terms of topological dissimilarities, as well as fitness to sequence evolution under a maximum parsimony criterion; and (4) evolutionary network reconstruction: reconstructing an evolutionary network from a species tree and a set of gene trees. The software package, PhyloNet, offers an array of utilities to allow for efficient and accurate analysis of evolutionary networks. The software package will help significantly in analyzing large data sets, as well as in studying the performance of evolutionary network reconstruction methods. Further, the software package supports the proposed eNewick format for compact representation of evolutionary networks, a feature that allows for efficient interoperability of evolutionary network software tools. Currently, all utilities in PhyloNet are invoked on the command line.

Journal ArticleDOI
TL;DR: The results provide practical guidance for choosing appropriate fold-change (FC) and P-value cutoffs when selecting a given number of DEGs, and recommend the use of FC-ranking plus a non-stringent P-value cutoff as a straightforward and baseline practice in order to generate more reproducible DEG lists.
Abstract: Background Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists.

Journal ArticleDOI
TL;DR: A new type of semantic annotation, event annotation, has been completed as an addition to the existing annotations in the GENIA corpus, and is expected to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.
Abstract: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflect biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

Journal ArticleDOI
TL;DR: This study presents the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets and reveals that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets.
Abstract: The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. We present the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at http://algorithmics.molgen.mpg.de/Supplements/CompCancer/ .
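
A toy R illustration of the headline comparison, finite mixture of Gaussians (via Mclust) versus k-means, scored with the adjusted Rand index against known labels; the simulated two-cluster data is an assumption for illustration.

    library(mclust)   # provides Mclust and adjustedRandIndex
    set.seed(1)
    # two well-separated toy "subtypes" in two dimensions
    x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))
    truth <- rep(1:2, each = 50)
    km <- kmeans(x, centers = 2)
    mc <- Mclust(x, G = 2)                        # finite mixture of Gaussians
    adjustedRandIndex(km$cluster, truth)          # recovery by k-means
    adjustedRandIndex(mc$classification, truth)   # recovery by the mixture model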

Journal ArticleDOI
TL;DR: A unifying algorithm for simultaneous estimation of both local FDR and tail area-based FDR is presented that can be applied to a diverse range of test statistics, including p-values, correlations, z- and t-scores.
Abstract: False discovery rate (FDR) methods play an important role in analyzing high-dimensional data. There are two types of FDR, tail area-based FDR and local FDR, as well as numerous statistical algorithms for estimating or controlling FDR. These differ in terms of underlying test statistics and procedures employed for statistical learning. A unifying algorithm for simultaneous estimation of both local FDR and tail area-based FDR is presented that can be applied to a diverse range of test statistics, including p-values, correlations, z- and t-scores. This approach is semiparametric and is based on a modified Grenander density estimator. For test statistics other than p-values it allows for empirical null modeling, so that dependencies among tests can be taken into account. The inference of the underlying model employs truncated maximum-likelihood estimation, with the cut-off point chosen according to the false non-discovery rate. The proposed procedure generalizes a number of more specialized algorithms and thus offers a common framework for FDR estimation consistent across test statistics and types of FDR. In a comparative study the unified approach performs on par with the best competing yet more specialized alternatives. The algorithm is implemented in R in the "fdrtool" package, available under the GNU GPL from http://strimmerlab.org/software/fdrtool/ and from the R package archive CRAN.
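
A minimal sketch of the fdrtool interface described above; the simulated p-values are an assumption for illustration.

    library(fdrtool)
    set.seed(1)
    # 900 null p-values plus 100 drawn near zero, mimicking true discoveries
    p <- c(runif(900), rbeta(100, 0.1, 1))
    res <- fdrtool(p, statistic = "pvalue", plot = FALSE)
    head(res$qval)   # tail area-based FDR (q-values)
    head(res$lfdr)   # local FDR for the same tests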

Journal ArticleDOI
TL;DR: The RMAP tool can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping; the results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in the 3' ends of longer reads and by appropriately using the base-call quality scores.
Abstract: Background Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25–50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.

Journal ArticleDOI
TL;DR: This work has developed and made publicly available a database, TiGER, which summarizes and provides large-scale data sets for tissue-specific gene expression and regulation in a variety of human tissues.
Abstract: Understanding how genes are expressed and regulated in different tissues is a fundamental and challenging question. However, most currently available biological databases do not focus on tissue-specific gene regulation. The recent development of computational methods for tissue-specific combinatorial gene regulation, based on transcription factor binding sites, enables us to perform a large-scale analysis of tissue-specific gene regulation in human tissues. The results are stored in a web database called TiGER (Tissue-specific Gene Expression and Regulation). The database contains three types of data including tissue-specific gene expression profiles, combinatorial gene regulations, and cis-regulatory module (CRM) detections. At present the database contains expression profiles for 19,526 UniGene genes, combinatorial regulations for 7,341 transcription factor pairs and 6,232 putative CRMs for 2,130 RefSeq genes. We have developed and made publicly available a database, TiGER, which summarizes and provides large-scale data sets for tissue-specific gene expression and regulation in a variety of human tissues. This resource is available at [1].

Journal ArticleDOI
TL;DR: The design and content of SeqAn are described; SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components that provide a sound basis for algorithm testing and development and greatly simplify the rapid development of new bioinformatics tools.
Abstract: Background The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use.