
Showing papers in "Bioinformatics in 2013"


Journal ArticleDOI
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitations and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning 550 million 2 × 76 bp paired-end reads per hour to the human genome on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
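
The seed-finding step is easy to picture: repeatedly find the longest prefix of the remaining read that occurs anywhere in the genome, record it as a seed, and restart from the first unmapped base. Below is a minimal Python sketch of this sequential maximum mappable prefix search over a naive suffix array; STAR's actual implementation (genome-scale uncompressed suffix arrays, seed clustering and stitching across junctions) is far more elaborate, and all names and data here are illustrative.

```python
# Minimal sketch of sequential maximum mappable prefix (MMP) search.
import bisect

def build_suffix_array(genome):
    # Naive construction; fine for a toy genome, not for 3 Gb.
    return sorted(range(len(genome)), key=lambda i: genome[i:])

def lcp_len(a, b):
    # Length of the longest common prefix of two strings.
    n, i = min(len(a), len(b)), 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def max_mappable_prefix(read, suffixes):
    # In sorted suffix order, the suffix sharing the longest prefix with
    # `read` is adjacent to the read's insertion point.
    pos = bisect.bisect_left(suffixes, read)
    return max(lcp_len(read, suffixes[j])
               for j in (pos - 1, pos) if 0 <= j < len(suffixes))

def seed_read(read, genome):
    # Sequential search: map the longest possible prefix, then restart from
    # the first unmapped base (mismatches/junctions end one seed, start the next).
    suffixes = [genome[i:] for i in build_suffix_array(genome)]
    seeds, i = [], 0
    while i < len(read):
        m = max_mappable_prefix(read[i:], suffixes)
        if m == 0:
            i += 1            # base absent from the genome: skip it
            continue
        seeds.append((i, m))  # (offset in read, seed length)
        i += m
    return seeds

genome = "ACGTACGTTTGACGGACGTTT"
print(seed_read("ACGTTTGACGTTT", genome))  # two seeds, as for a spliced read
```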

30,684 citations


Journal ArticleDOI
TL;DR: A range of new simulation algorithms and features developed during the past 4 years are presented, leading up to the GROMACS 4.5 software package, which provides extremely high performance and cost efficiency for high-throughput as well as massively parallel simulations.
Abstract: Motivation: Molecular simulation has historically been a low-throughput technique, but faster computers and increasing amounts of genomic and structural data are changing this by enabling large-scale automated simulation of, for instance, many conformers or mutants of biomolecules with or without a range of ligands. At the same time, advances in performance and scaling now make it possible to model complex biomolecular interaction and function in a manner directly testable by experiment. These applications share a need for fast and efficient software that can be deployed on massive scale in clusters, web servers, distributed computing or cloud resources. Results: Here, we present a range of new simulation algorithms and features developed during the past 4 years, leading up to the GROMACS 4.5 software package. The software now automatically handles wide classes of biomolecules, such as proteins, nucleic acids and lipids, and comes with all commonly used force fields for these molecules built-in. GROMACS supports several implicit solvent models, as well as new free-energy algorithms, and the software now uses multithreading for efficient parallelization even on low-end systems, including Windows-based workstations. Together with hand-tuned assembly kernels and state-of-the-art parallelization, this provides extremely high performance and cost efficiency for high-throughput as well as massively parallel simulations. Availability: GROMACS is open source and free software available from http://www.gromacs.org. Contact: erik.lindahl@scilifelab.se Supplementary information: Supplementary data are available at Bioinformatics online.

6,029 citations


Journal ArticleDOI
TL;DR: This tool improves on leading assembly comparison software with new ideas and quality metrics, and can evaluate assemblies both with and without a reference genome.
Abstract: Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST—a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with and without a reference genome. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website.
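
Among the quality metrics such a tool reports, N50 is the most familiar; a minimal sketch of how it is computed (illustrative code, not QUAST's implementation):

```python
def n50(contig_lengths):
    # N50: the largest contig length L such that contigs of length >= L
    # together cover at least half of the total assembly length.
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Half of 300 is 150; 100 + 80 = 180 reaches it, so N50 = 80.
print(n50([100, 80, 60, 40, 20]))
```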

5,757 citations


Journal ArticleDOI
TL;DR: Infernal builds probabilistic profiles of the sequence and secondary structure of an RNA family called covariance models (CMs) from structurally annotated multiple sequence alignments given as input, and introduces a new filter pipeline for RNA homology search based on accelerated profile hidden Markov model (HMM) methods and HMM-banded CM alignment methods.
Abstract: Summary: Infernal builds probabilistic profiles of the sequence and secondary structure of an RNA family called covariance models (CMs) from structurally annotated multiple sequence alignments given as input. Infernal uses CMs to search for new family members in sequence databases and to create potentially large multiple sequence alignments. Version 1.1 of Infernal introduces a new filter pipeline for RNA homology search based on accelerated profile hidden Markov model (HMM) methods and HMM-banded CM alignment methods. This enables ~100-fold acceleration over the previous version and ~10 000-fold acceleration over exhaustive non-filtered CM searches. Availability: Source code, documentation and the benchmark are downloadable from http://infernal.janelia.org. Infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Documentation includes a user's guide with a tutorial, a discussion of file formats and user options and additional details on methods implemented in the software. Contact: nawrockie@janelia.hhmi.org

2,013 citations


Journal ArticleDOI
TL;DR: The Poisson tree processes (PTP) model is introduced to infer putative species boundaries on a given phylogenetic input tree and yields more accurate results than de novo species delimitation methods.
Abstract: Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets. Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales to large datasets because it relies on the parallel implementations of the EPA and RAxML, making it possible to delimit species in high-throughput sequencing data. Availability and implementation: The code is freely available at www.
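
The core intuition of the model is that branching events accumulate at different rates within and between species, so branch lengths fall into two exponential rate classes. A toy likelihood-ratio sketch of that comparison is below; PTP itself searches over partitions of the input tree rather than taking the split as given, so this illustrates only the underlying distributional assumption, and the data are invented.

```python
# One exponential rate class for all branch lengths vs. two classes
# (between- vs within-species), compared by likelihood ratio.
import math

def exp_loglik(xs):
    # Log-likelihood of an exponential sample at its MLE rate (n / sum).
    rate = len(xs) / sum(xs)
    return sum(math.log(rate) - rate * x for x in xs)

def two_class_lrt(between, within):
    l_one = exp_loglik(between + within)              # single rate class
    l_two = exp_loglik(between) + exp_loglik(within)  # separate rate classes
    return 2 * (l_two - l_one)                        # ~ chi-square, 1 df

between = [0.12, 0.09, 0.15, 0.11]      # long branches between species
within = [0.010, 0.020, 0.015, 0.008]   # short branches within species
print(round(two_class_lrt(between, within), 2))  # large value favors two classes
```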

1,868 citations


Journal ArticleDOI
TL;DR: Although built as a stand-alone program, Pathview may seamlessly integrate with pathway and functional analysis tools for large-scale and fully automated analysis pipelines.
Abstract: Summary: Pathview is a novel tool set for pathway-based data integration and visualization. It maps and renders user data on relevant pathway graphs. Users only need to supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps and integrates user data onto the pathway and renders pathway graphs with the mapped data. Although built as a stand-alone program, Pathview may seamlessly integrate with pathway and functional analysis tools for large-scale and fully automated analysis pipelines. Availability: The package is freely available under the GPLv3 license through Bioconductor and R-Forge. It is available at http://bioconductor.org/packages/release/bioc/html/pathview.html and at http://

1,313 citations


Journal ArticleDOI
TL;DR: A novel model-based intra-array normalization strategy for 450 k data, called BMIQ (Beta MIxture Quantile dilation), to adjust the beta-values of type2 design probes into a statistical distribution characteristic of type1 probes is proposed.
Abstract: Motivation: The Illumina Infinium 450 k DNA Methylation Beadchip is a prime candidate technology for Epigenome-Wide Association Studies (EWAS). However, a difficulty associated with these beadarrays is that probes come in two different designs, characterized by widely different DNA methylation distributions and dynamic range, which may bias downstream analyses. A key statistical issue is therefore how best to adjust for the two different probe designs. Results: Here we propose a novel model-based intra-array normalization strategy for 450 k data, called BMIQ (Beta MIxture Quantile dilation), to adjust the beta-values of type2 design probes into a statistical distribution characteristic of type1 probes. The strategy involves application of a three-state beta-mixture model to assign probes to methylation states, subsequent transformation of probabilities into quantiles and finally a methylation-dependent dilation transformation to preserve the monotonicity and continuity of the data. We validate our method on cell-line data, fresh frozen and paraffin-embedded tumour tissue samples and demonstrate that BMIQ compares favourably with two competing methods. Specifically, we show that BMIQ improves the robustness of the normalization procedure, reduces the technical variation and bias of type2 probe values and successfully eliminates the type1 enrichment bias caused by the lower dynamic range of type2 probes. BMIQ will be useful as a preprocessing step for any study using the Illumina Infinium 450 k platform. Availability: BMIQ is freely available from http://code.google.com/p/bmiq/. Contact: a.teschendorff@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
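
The heart of the dilation step is quantile mapping between two fitted beta distributions: a type2 value is converted to its quantile under the type2 component and read back off the type1 component at the same quantile. A minimal sketch follows, with beta parameters simply assumed rather than fitted by the three-state mixture EM that BMIQ actually performs:

```python
# Quantile mapping between two beta distributions for one methylation state.
from scipy.stats import beta

a2, b2 = 2.0, 18.0   # hypothetical type2 unmethylated component
a1, b1 = 2.0, 30.0   # hypothetical type1 unmethylated component

def bmiq_like_transform(x):
    q = beta.cdf(x, a2, b2)      # quantile of x under the type2 component
    return beta.ppf(q, a1, b1)   # value at the same quantile under type1

for x in (0.05, 0.10, 0.20):
    print(x, "->", round(float(bmiq_like_transform(x)), 4))
```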

1,257 citations


Journal ArticleDOI
TL;DR: This work describes mapDamage 2.0, a user-friendly package that extends the original features of mapDamage by incorporating a statistical model of DNA damage, providing estimates of key features of aDNA molecules (the average length of overhangs, nick frequency, and cytosine deamination rates in both double-stranded regions and overhangs) and rescaling base quality scores according to their probability of being damaged.
Abstract: Motivation: Ancient DNA (aDNA) molecules in fossilized bones and teeth, coprolites, sediments, mummified specimens and museum collections represent fantastic sources of information for evolutionary biologists, revealing the agents of past epidemics and the dynamics of past populations. However, the analysis of aDNA generally faces two major issues. Firstly, sequences consist of a mixture of endogenous and various exogenous backgrounds, mostly microbial. Secondly, high nucleotide misincorporation rates can be observed as a result of severe post-mortem DNA damage. Such misincorporation patterns are instrumental to authenticate ancient sequences versus modern contaminants. We recently developed the user-friendly mapDamage package that identifies such patterns from next-generation sequencing (NGS) sequence datasets. The absence of formal statistical modeling of the DNA damage process, however, precluded rigorous quantitative comparisons across samples. Results: Here, we describe mapDamage 2.0, which extends the original features of mapDamage by incorporating a statistical model of DNA damage. Assuming that damage events depend only on sequencing position and post-mortem deamination, our Bayesian statistical framework provides estimates of four key features of aDNA molecules: the average length of overhangs (λ), nick frequency (ν), and cytosine deamination rates in both double-stranded regions (δd) and overhangs (δs). Our model enables rescaling base quality scores according to their probability of being damaged. mapDamage 2.0 handles NGS datasets with ease and is compatible with a wide range of DNA library protocols. Availability: mapDamage 2.0 is available at XXXXX as a Python package and documentation is maintained at the Centre for GeoGenetics website (geogenetics.ku.dk/publications/mapdamage/).
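
The positional structure of those parameters is easy to picture: a base near the 5' end sits in a single-stranded overhang with geometrically decaying probability, so the expected C→T rate blends the single- and double-stranded deamination rates. A sketch under a Briggs-style model of this kind, with both the functional form of the decay and the parameter values assumed purely for illustration:

```python
# Expected C->T misincorporation rate by distance from the 5' end.
lam = 0.3        # lambda: geometric parameter of the overhang-length distribution
delta_s = 0.60   # delta_s: deamination rate in single-stranded overhangs
delta_d = 0.02   # delta_d: deamination rate in double-stranded regions

def c_to_t_rate(i):
    # i: 0-based position from the 5' end. The probability that the position
    # still lies inside a single-stranded overhang decays geometrically.
    p_single = lam ** (i + 1)
    return p_single * delta_s + (1 - p_single) * delta_d

for i in range(6):
    print(i, round(c_to_t_rate(i), 4))   # damage signal fades with distance
```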

1,054 citations


Journal ArticleDOI
TL;DR: EBSeq is developed, using the merits of empirical Bayesian methods, for identifying DE isoforms in an RNA-seq experiment comparing two or more biological conditions, and proves to be a robust approach for identifying DE genes.
Abstract: Motivation: Messenger RNA expression is important in normal development and differentiation, as well as in manifestation of disease. RNA-seq experiments allow for the identification of differentially expressed (DE) genes and their corresponding isoforms on a genome-wide scale. However, statistical methods are required to ensure that accurate identifications are made. A number of methods exist for identifying DE genes, but far fewer are available for identifying DE isoforms. When isoform DE is of interest, investigators often apply gene-level (count-based) methods directly to estimates of isoform counts. Doing so is not recommended. In short, estimating isoform expression is relatively straightforward for some groups of isoforms, but more challenging for others. This results in estimation uncertainty that varies across isoform groups. Count-based methods were not designed to accommodate this varying uncertainty, and consequently, application of them for isoform inference results in reduced power for some classes of isoforms and increased false discoveries for others. Results: Taking advantage of the merits of empirical Bayesian methods, we have developed EBSeq for identifying DE isoforms in an RNA-seq experiment comparing two or more biological conditions. Results demonstrate substantially improved power and performance of EBSeq for identifying DE isoforms. EBSeq also proves to be a robust approach for identifying DE genes. Availability and implementation: An R package containing examples and sample datasets is available at http://www.biostat.wisc.edu/~kendzior/EBSEQ/.

1,048 citations


Journal ArticleDOI
TL;DR: A new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error is described.
Abstract: Motivation: Second-generation sequencing technologies produce high coverage of the genome by short reads at a very low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this paper we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms very large numbers of paired-end reads into a much smaller number of longer "super-reads." The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced "mazurka"). Results: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. Availability: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. Contact: Aleksey Zimin, alekseyz@ipst.umd.edu
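
The super-read idea reduces to a simple rule: extend each read base by base for as long as the k-mer spectrum offers exactly one continuation, and stop at any fork or dead end. A minimal in-memory sketch with toy data follows; MaSuRCA builds and queries its k-mer structures far more efficiently, so this shows only the principle.

```python
# Greedy unique-extension of reads, the essence of super-read construction.
from collections import Counter

K = 5

def kmer_table(reads):
    t = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            t[r[i:i + K]] += 1
    return t

def extend_right(read, table):
    # Extend while exactly one k-mer continuation exists in the table.
    s = read
    while True:
        suffix = s[-(K - 1):]
        nexts = [b for b in "ACGT" if table.get(suffix + b, 0) > 0]
        if len(nexts) != 1:
            break                  # stop at a fork or a dead end
        s += nexts[0]
        if len(s) > 200:           # safety stop for toy data (cycles)
            break
    return s

reads = ["ACGTACG", "GTACGGA", "ACGGATT"]
table = kmer_table(reads)
print(extend_right("ACGTACG", table))   # -> ACGTACGGATT
```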

1,032 citations


Journal ArticleDOI
TL;DR: The CluePedia Cytoscape plugin is a search tool for new markers potentially associated with pathways; genes, proteins and miRNAs can be connected based on in silico and/or experimental information and integrated into a ClueGO network of terms/pathways.
Abstract: Summary: The CluePedia Cytoscape plugin is a search tool for new markers potentially associated with pathways. CluePedia calculates linear and non-linear statistical dependencies from experimental data. Genes, proteins and miRNAs can be connected based on in silico and/or experimental information and integrated into a ClueGO network of terms/pathways. Interrelations within each pathway can be investigated, and new potential associations may be revealed through gene/protein/miRNA enrichments. A pathway-like visualization can be created using the Cerebral plugin layout. Combining all these features is essential for data interpretation and the generation of new hypotheses. The CluePedia Cytoscape plugin is user-friendly and has an expressive and intuitive visualization. Availability: http://www.ici.upmc.fr/cluepedia/ and via the Cytoscape plugin manager. The user manual is available at the CluePedia website. Contact: bernhard.mlecnik@crc.jussieu.fr or jerome.galon@crc.jussieu.fr Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Two new methods, one based on binding-specific substructure comparison (TM-SITE) and another on sequence profile alignment (S-SITE), are developed for complementary binding site predictions, demonstrating a new robust approach to protein-ligand binding site recognition that is ready for genome-wide structure-based function annotations.
Abstract: Motivation: Identification of protein–ligand binding sites is critical to protein function annotation and drug discovery. However, there is no method that could generate optimal binding site predictions for different protein types. Combination of complementary predictions is probably the most reliable solution to the problem. Results: We develop two new methods, one based on binding-specific substructure comparison (TM-SITE) and another on sequence profile alignment (S-SITE), for complementary binding site predictions. The methods are tested on a set of 500 non-redundant proteins harboring 814 natural, drug-like and metal ion molecules. Starting from low-resolution protein structure predictions, the methods successfully recognize >51% of binding residues with average Matthews correlation coefficient (MCC) significantly higher (with P-value <10^-9 in Student's t-test) than other state-of-the-art methods, including COFACTOR, FINDSITE and ConCavity. When combining TM-SITE and S-SITE with other structure-based programs, a consensus approach (COACH) can increase MCC by 15% over the best individual predictions. COACH was examined in the recent community-wide CAMEO experiment and consistently ranked as the best method in the last 22 individual datasets with an Area Under the Curve score 22.5% higher than the second best method. These data demonstrate a new robust approach to protein–ligand binding site recognition, which is ready for genome-wide structure-based function annotations.

Journal ArticleDOI
TL;DR: TANGO is a coherent framework allowing biologists to perform the complete analysis process of 3D fluorescence images by combining two environments, ImageJ and R, and provides an intuitive user interface with the means to precisely build a segmentation procedure and set up analyses without programming skills.
Abstract: Motivation: The cell nucleus is a highly organized cellular organelle that contains the genetic material. The study of nuclear architecture has become an important field of cellular biology. Extracting quantitative data from 3D fluorescence imaging helps understand the functions of different nuclear compartments. However, such approaches are limited by the requirement for processing and analyzing large sets of images. Results: Here, we describe Tools for Analysis of Nuclear Genome Organization (TANGO), an image analysis tool dedicated to the study of nuclear architecture. TANGO is a coherent framework allowing biologists to perform the complete analysis process of 3D fluorescence images by combining two environments: ImageJ (http://imagej.nih.gov/ij/) for image processing and quantitative analysis and R (http://cran.r-project.org) for statistical processing of measurement results. It includes an intuitive user interface providing the means to precisely build a segmentation procedure and set up analyses, without requiring programming skills. TANGO is a versatile tool able to process large sets of images, allowing quantitative study of nuclear organization. Availability: TANGO is composed of two programs: (i) an ImageJ plug-in and (ii) a package (rtango) for R. They are both free and open source, available (http://biophysique.mnhn.fr/tango) for Linux, Microsoft Windows and Macintosh OSX. Distribution is under the GPL v.2 licence. Contact: thomas.boudier@snv.jussieu.fr Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A tool for DNA/DNA sequence comparison that is built on the HMMER framework, which applies probabilistic inference methods based on hidden Markov models to the problem of homology search, called nhmmer, enables improved detection of remote DNA homologs.
Abstract: Summary: Sequence database searches are an essential part of molecular biology, providing information about the function and evolutionary history of proteins, RNA molecules and DNA sequence elements. We present a tool for DNA/DNA sequence comparison that is built on the HMMER framework, which applies probabilistic inference methods based on hidden Markov models to the problem of homology search. This tool, called nhmmer, enables improved detection of remote DNA homologs, and has been used in combination with Dfam and RepeatMasker to improve annotation of transposable elements in the human genome. Availability: nhmmer is a part of the new HMMER3.1 release. Source code and documentation can be downloaded from http://hmmer.org. HMMER3.1 is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Contact: wheelert@janelia.hhmi.org

Journal ArticleDOI
TL;DR: RepeatExplorer is a collection of software tools for characterization of repetitive elements that is accessible via a web interface and uses a graph-based sequence clustering algorithm to facilitate de novo repeat identification without the need for reference databases of known elements.
Abstract: Motivation: Repetitive DNA makes up large portions of plant and animal nuclear genomes, yet it remains the least characterized genome component in most species studied so far. Although the recent availability of high throughput sequencing data provides necessary resources for in-depth investigation of genomic repeats, its utility is hampered by the lack of specialized bioinformatics tools and appropriate computational resources that would enable large-scale repeat analysis to be run by biologically oriented researchers. Results: Here we present RepeatExplorer, a collection of software tools for characterization of repetitive elements, which is accessible via a web interface. A key component of the server is the computational pipeline employing a graph-based sequence clustering algorithm to facilitate de novo repeat identification without the need for reference databases of known elements. Since the algorithm uses short sequences randomly sampled from the genome as input, it is ideal for analyzing next-generation sequence reads. Additional tools are provided to aid in classification of identified repeats, investigate phylogenetic relationships of retroelements and perform comparative analysis of repeat composition between multiple species. The server can analyze several million sequence reads, which typically results in identification of most high and medium copy repeats in higher plant genomes. Implementation and availability: RepeatExplorer was implemented within the Galaxy environment and set up on a public server at http://repeatexplorer.umbr.cas.cz/. Source code and instructions for local installation are available at http://w3lamc.umbr.cas.cz/lamc/resources.php.

Journal ArticleDOI
TL;DR: The lDDT is a superposition-free score that evaluates local distance differences of all atoms in a model, including validation of stereochemical plausibility, which makes it a robust tool for the automated assessment of structure prediction servers without manual intervention.
Abstract: The assessment of protein structure prediction techniques requires objective criteria to measure the similarity between a computational model and the experimentally determined reference structure. Conventional similarity measures based on a global superposition of Cα atoms are strongly influenced by domain motions and do not assess the accuracy of local atomic details in the model. The local Distance Difference Test (lDDT) is a superposition-free score that evaluates local distance differences of all atoms in a model, including validation of stereochemical plausibility. The reference can be a single structure or an ensemble of equivalent structures. We demonstrate that lDDT is well suited to assess local model quality, even in the presence of domain movements, while maintaining good correlation with global measures. These properties make lDDT a robust tool for the automated assessment of structure prediction servers without manual intervention. Availability and implementation: Source code, binaries for Linux and MacOSX, and an interactive web server are available at http://swissmodel.expasy.org/lddt Contact: torsten.schwede@unibas.ch
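
The score is straightforward to compute on matched coordinates: for every atom pair within an inclusion radius in the reference, check whether the model preserves that pair distance within each of four tolerance thresholds, and average the preserved fractions. A toy Cα-only sketch follows; the published score uses all atoms, excludes within-residue pairs and adds stereochemistry checks, so this is a simplified illustration.

```python
# Simplified lDDT on matched coordinate arrays (n_atoms x 3).
import numpy as np

def lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_mod = np.linalg.norm(model[:, None] - model[None, :], axis=-1)
    # Only pairs within the inclusion radius in the *reference* are scored,
    # which is what makes the score superposition-free.
    mask = (d_ref < radius) & ~np.eye(len(ref), dtype=bool)
    diffs = np.abs(d_ref - d_mod)[mask]
    return float(np.mean([(diffs < t).mean() for t in thresholds]))

ref = np.array([[0.0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0]])
model = ref + np.array([[0, 0, 0], [0.3, 0, 0], [0, 0.8, 0], [1.2, 0, 0]])
print(round(lddt(ref, model), 3))   # fraction of preserved local distances
```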

Journal ArticleDOI
TL;DR: This work proposes to extract microRNA-cancer associations by text mining and store them in a database called miRCancer, which documents 878 relationships between 236 microRNAs and 79 human cancers through the processing of >26 000 published articles.
Abstract: Motivation: Research interests in microRNAs have increased rapidly in the past decade. Many studies have shown that microRNAs have close relationships with various human cancers, and they potentially could be used as cancer indicators in diagnosis or as suppressors for treatment purposes. There are several databases that contain microRNA–cancer associations predicted by computational methods but few from empirical results. Despite the fact that abundant experiments investigating microRNA expressions in cancer cells have been carried out, the results have remained scattered in the literature. We propose to extract microRNA–cancer associations by text mining and store them in a database called miRCancer. Results: The text mining is based on 75 rules we have constructed, which represent the common sentence structures typically used to state microRNA expressions in cancers. The microRNA–cancer association database, miRCancer, is updated regularly by running the text mining algorithm against PubMed. All miRNA–cancer associations are confirmed manually after automatic extraction. miRCancer currently documents 878 relationships between 236 microRNAs and 79 human cancers through the processing of >26 000 published articles. Availability: miRCancer is freely available on the web at http://mircan
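
Each of the 75 rules encodes a common way of stating a microRNA expression change in a cancer. A single rule of that kind can be sketched as a regular expression; the exact pattern below is an assumption for illustration, not one of the published rules.

```python
# One hypothetical extraction rule in the spirit of miRCancer's patterns.
import re

RULE = re.compile(
    r"(miR-\d+[a-z]?)\s+(?:is|was)\s+(up-?regulated|down-?regulated)\s+in\s+"
    r"([\w\s]+?cancer)",
    re.IGNORECASE,
)

sentence = "We found that miR-21 is upregulated in human breast cancer tissues."
m = RULE.search(sentence)
if m:
    mirna, regulation, cancer = m.groups()
    print(mirna, regulation, cancer)   # miR-21 upregulated human breast cancer
```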

Journal ArticleDOI
TL;DR: The assumption that similar diseases tend to be associated with functionally similar lncRNAs is proposed, and the method of Laplacian Regularized Least Squares for LncRNA-Disease Association (LRLSLDA) is developed in the semi-supervised learning framework; LRLSLDA could be an effective and important biological tool for biomedical research.
Abstract: Motivation: More and more evidence indicates that long non-coding RNAs (lncRNAs) play critical roles in many important biological processes. Therefore, mutations and dysregulations of these lncRNAs would contribute to the development of various complex diseases. Developing powerful computational models for identifying potential disease-related lncRNAs would benefit biomarker identification and drug discovery for human disease diagnosis, treatment, prognosis and prevention. Results: In this article, we propose the assumption that similar diseases tend to be associated with functionally similar lncRNAs. We then develop the method of Laplacian Regularized Least Squares for LncRNA–Disease Association (LRLSLDA) in the semi-supervised learning framework. Although known disease–lncRNA associations in the database are rare, LRLSLDA still obtained an AUC of 0.7760 in leave-one-out cross validation, significantly improving on the performance of previous methods. We also illustrate that the performance of LRLSLDA is not sensitive (indeed robust) to parameter selection and that it obtains reliable performance in all the test classes. Plenty of potential disease–lncRNA associations were publicly released and some of them have been confirmed by recent results in biological experiments. It is anticipated that LRLSLDA could be an effective and important biological tool for biomedical research. Availability: The code of LRLSLDA is freely available at http://asdcd.
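
The Laplacian regularized least squares step has a closed form: minimizing ||F - Y||^2 + lambda * tr(F'LF) over a score matrix F gives F = (I + lambda*L)^-1 Y, where L is the normalized Laplacian of a similarity matrix. A toy sketch of one side of the computation follows (LRLSLDA combines classifiers in both the lncRNA and disease spaces; the data below are invented):

```python
# Closed-form Laplacian regularized least squares on a toy similarity matrix.
import numpy as np

def lrls_scores(W, Y, lam=1.0):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    return np.linalg.solve(np.eye(len(W)) + lam * L, Y)

# 4 lncRNAs x 2 diseases; Y holds known associations.
W = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.0, 0.1, 0.7, 1.0]])
Y = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
# Similar lncRNAs inherit association scores from their neighbors.
print(lrls_scores(W, Y).round(3))
```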

Journal ArticleDOI
TL;DR: This article introduces the first machine learning approach to disease name normalization (DNorm), using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH and OMIM; the method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data.
Abstract: Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text—the task of disease name normalization (DNorm)—compared with other normalization tasks in biomedical text mining research. Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval. Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
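
The model scores a mention-name pair as a bilinear form m'Wc over TF-IDF vectors, and pairwise learning to rank adjusts W whenever a wrong concept outscores the right one. A heavily simplified SGD sketch follows; the toy vectors, dimensions and omitted margin/regularization are all assumptions, since DNorm's actual training runs over TF-IDF vectors of a large lexicon.

```python
# Simplified pairwise learning-to-rank for a bilinear similarity score.
import numpy as np

V = 6
W = np.eye(V)            # initialized to identity, i.e. plain vector overlap

def score(m, c):
    return m @ W @ c     # bilinear mention/concept-name similarity

def rank_update(m, c_pos, c_neg, lr=0.1):
    # If the wrong concept name outscores the right one, move W toward the
    # correct pair and away from the incorrect one.
    global W
    if score(m, c_pos) <= score(m, c_neg):
        W += lr * (np.outer(m, c_pos) - np.outer(m, c_neg))

mention = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])   # toy TF-IDF-like vectors
right = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.5])     # correct concept name
wrong = np.array([0.8, 0.0, 0.0, 1.0, 0.0, 0.0])     # lexically similar impostor

for _ in range(5):
    rank_update(mention, right, wrong)
print(score(mention, right) > score(mention, wrong))  # True after a few updates
```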

Journal ArticleDOI
TL;DR: Assessment of datasets from The Cancer Genome Atlas demonstrated that OncodriveCLUST selected cancer genes that were nevertheless missed by methods based on frequency and functional impact criteria, stressing the benefit of combining approaches based on complementary principles to identify driver mutations.
Abstract: Motivation: Gain-of-function mutations often cluster in specific protein regions, a signal that those mutations provide an adaptive advantage to cancer cells and consequently are positively selected during clonal evolution of tumours. We sought to determine the overall extent of this feature in cancer and the possibility to use this feature to identify drivers. Results: We have developed OncodriveCLUST, a method to identify genes with a significant bias towards mutation clustering within the protein sequence. This method constructs the background model by assessing coding-silent mutations, which are assumed not to be under positive selection and thus may reflect the baseline tendency of somatic mutations to be clustered. OncodriveCLUST analysis of the Catalogue of Somatic Mutations in Cancer retrieved a list of genes enriched by the Cancer Gene Census, prioritizing those with dominant phenotypes but also highlighting some recessive cancer genes, which showed wider but still delimited mutation clusters. Assessment of datasets from The Cancer Genome Atlas demonstrated that OncodriveCLUST selected cancer genes that were nevertheless missed by methods based on frequency and functional impact criteria. This stressed the benefit of combining approaches based on complementary principles to identify driver mutations. We propose OncodriveCLUST as an effective tool for that purpose. Availability: OncodriveCLUST has been implemented as a Python script and is freely available from http://bg.upf.edu/oncodriveclust
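
The underlying question is simply whether mutations pile up at particular residues more often than chance allows. A toy version of that test under a uniform background is sketched below; OncodriveCLUST's real background model is built from coding-silent mutations, so the uniformity assumption here is a deliberate simplification.

```python
# Probability of seeing k or more of n mutations at one residue if
# mutations landed uniformly along the protein.
from scipy.stats import binom

def clustering_pvalue(hits_at_pos, total_mutations, protein_length):
    p = 1.0 / protein_length                             # uniform background
    return binom.sf(hits_at_pos - 1, total_mutations, p)  # P(X >= hits)

# 12 of 40 observed mutations hitting the same residue of a 500-aa protein:
print(clustering_pvalue(12, 40, 500))   # astronomically small => clustered
```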

Journal ArticleDOI
TL;DR: A freely available, open-source Python package called protein in python (propy) is presented for calculating the widely used structural and physicochemical features of proteins and peptides from amino acid sequence; it can also easily compute descriptors based on user-defined properties, which are automatically available from the AAindex database.
Abstract: Summary: Sequence-derived structural and physicochemical features have been frequently used for analysing and predicting structural, functional, expression and interaction profiles of proteins and peptides. To facilitate extensive studies of proteins and peptides, we developed a freely available, open source python package called protein in python (propy) for calculating the widely used structural and physicochemical features of proteins and peptides from amino acid sequence. It computes five feature groups composed of 13 features, including amino acid composition, dipeptide composition, tripeptide composition, normalized Moreau–Broto autocorrelation, Moran autocorrelation, Geary autocorrelation, sequence-order-coupling number, quasi-sequence-order descriptors, composition, transition and distribution of various structural and physicochemical properties and two types of pseudo amino acid composition (PseAAC) descriptors. These features could be generally regarded as different Chou's PseAAC modes. In addition, it can also easily compute the previous descriptors based on user-defined properties, which are automatically available from the AAindex database. Availability: The python package, propy, is freely available via http://code.google.com/p/protpy/downloads/list, and it runs on Linux and MS-Windows.
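
The two simplest of these descriptor groups, amino acid and dipeptide composition, can be sketched in a few lines (a standalone illustration, not propy's actual API):

```python
# Amino acid and dipeptide composition descriptors from a raw sequence.
from collections import Counter
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in AA}            # 20 features

def dipeptide_composition(seq):
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return {a + b: pairs[a + b] / total
            for a, b in product(AA, repeat=2)}              # 400 features

seq = "MKVLAATLLLLSAAS"   # hypothetical peptide
comp = aa_composition(seq)
print(round(comp["A"], 3), round(comp["L"], 3))
```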

Journal ArticleDOI
TL;DR: NextGenMap is reported, a fast and accurate read mapper that aligns reads reliably to a reference genome even when the sequence difference between target and reference genome is large, i.e. for a highly polymorphic genome.
Abstract: Summary: When choosing a read mapper, one faces a trade-off between speed and the ability to map reads in highly polymorphic regions. Here, we report NextGenMap, a fast and accurate read mapper that mitigates this dilemma. NextGenMap aligns reads reliably to a reference genome even when the sequence difference between target and reference genome is large, i.e. for a highly polymorphic genome. At the same time, NextGenMap outperforms current mapping methods with respect to runtime and the number of correctly mapped reads. NextGenMap efficiently uses the available hardware by exploiting multi-core CPUs as well as graphics cards (GPUs), if available. In addition, NextGenMap automatically handles any read data independent of read length and sequencing technology. Availability: NextGenMap source code and documentation are available at: http://cibiv.github.io/NextGenMap/ Contact: fritz.sedlazeck@univie.ac.at Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software is described, demonstrating how recent improvements in sequencing technology, especially increasing read lengths and paired-end reads from longer fragments, have a major impact on assembly contiguity.
Abstract: White spruce (Picea glauca) is a dominant conifer of the boreal forests of North America, and providing genomics resources for this commercially valuable tree will help improve forest management and conservation efforts. Sequencing and assembling the large and highly repetitive spruce genome, however, pushes the boundaries of current technology. Here, we describe a whole-genome shotgun sequencing strategy using two Illumina sequencing platforms and an assembly approach using the ABySS software. We report a 20.8 gigabase draft genome in 4.9 million scaffolds, with a scaffold N50 of 20 356 bp. We demonstrate how recent improvements in sequencing technology, especially increasing read lengths and paired-end reads from longer fragments, have a major impact on assembly contiguity. We also note that scalable bioinformatics tools are instrumental in providing rapid draft assemblies. Availability: The Picea glauca genome sequencing and assembly data are available through NCBI (Accession#: ALWZ0100000000 PID: PRJNA83435). http://www.ncbi.nlm.nih.gov/bioproject/83435. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A statistical framework for single-cell gene expression data from microfluidic arrays is presented, with a model accounting for the fact that genes at the single-cell level can be on (and a continuous expression measure is recorded) or dichotomously off (and the recorded expression is zero).
Abstract: Motivation: Cell populations are never truly homogeneous; individual cells exist in biochemical states that define functional differences between them. New technology based on microfluidic arrays combined with multiplexed quantitative polymerase chain reactions now enables high-throughput single-cell gene expression measurement, allowing assessment of cellular heterogeneity. However, few analytic tools have been developed specifically for the statistical and analytical challenges of single-cell quantitative polymerase chain reaction data. Results: We present a statistical framework for the exploration, quality control and analysis of single-cell gene expression data from microfluidic arrays. We assess accuracy and within-sample heterogeneity of single-cell expression and develop quality control criteria to filter unreliable cell measurements. We propose a statistical model accounting for the fact that genes at the single-cell level can be on (and a continuous expression measure is recorded) or dichotomously off (and the recorded expression is zero). Based on this model, we derive a combined likelihood ratio test for differential expression that incorporates both the discrete and continuous components. Using an experiment that examines treatment-specific changes in expression, we show that this combined test is more powerful than either the continuous or dichotomous component in isolation, or a t-test on the zero-inflated data. Although developed for measurements from a specific platform (Fluidigm), these tools are generalizable to other multi-parametric measures over large numbers of events. Availability: All results presented here were obtained using the SingleCellAssay R package available on GitHub (http://github.com/RGLab/SingleCellAssay). Contact: rgottard@fhcrc.org Supplementary information: Supplementary data are available at Bioinformatics online.
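
The combined test is additive: a binomial likelihood-ratio component on expression frequency (the on/off part) plus a likelihood-ratio component on the nonzero values, with the statistics and degrees of freedom summed. A toy sketch assuming a normal continuous part follows; the package models the continuous component more carefully, and the data below are simulated.

```python
# Toy two-part (hurdle-style) likelihood ratio test for zero-inflated data.
import numpy as np
from scipy import stats

def binom_ll(k, n):
    # Binomial log-likelihood at the MLE p = k/n (0 when k is 0 or n).
    if k in (0, n):
        return 0.0
    p = k / n
    return k * np.log(p) + (n - k) * np.log(1 - p)

def norm_ll(v):
    # Normal log-likelihood at the MLE mean/std.
    return float(np.sum(stats.norm.logpdf(v, v.mean(), v.std() + 1e-9)))

def combined_lrt(x, y):
    kx, ky, nx, ny = (x > 0).sum(), (y > 0).sum(), len(x), len(y)
    lr_disc = 2 * (binom_ll(kx, nx) + binom_ll(ky, ny)
                   - binom_ll(kx + ky, nx + ny))
    xc, yc = x[x > 0], y[y > 0]
    lr_cont = 2 * (norm_ll(xc) + norm_ll(yc) - norm_ll(np.concatenate([xc, yc])))
    stat = lr_disc + lr_cont
    # 1 df (frequency) + 2 df (mean, variance) in this toy parameterization.
    return stat, stats.chi2.sf(stat, df=3)

rng = np.random.default_rng(1)
a = np.where(rng.random(50) < 0.3, 0.0, rng.normal(5.0, 1.0, 50))
b = np.where(rng.random(50) < 0.6, 0.0, rng.normal(6.0, 1.0, 50))
print(combined_lrt(a, b))   # combined statistic and p-value
```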

Journal ArticleDOI
TL;DR: A read simulator, PBSIM, is developed that captures characteristic features of PacBio reads using either a model-based or sampling-based method; tests suggest that a continuous long read coverage depth of at least 15, in combination with a circular consensus sequencing coverage depth of at least 30, achieved extensive assembly results.
Abstract: Motivation: PacBio sequencers produce two types of characteristic reads (continuous long reads, which are long with a high error rate, and circular consensus sequencing reads, which are short with a low error rate), both of which could be useful for de novo assembly of genomes. Currently, there is no available simulator that targets the specific generation of PacBio libraries. Results: Our analysis of 13 PacBio datasets showed characteristic features of PacBio reads (e.g. the read length of PacBio reads follows a log-normal distribution). We have developed a read simulator, PBSIM, that captures these features using either a model-based or sampling-based method. Using PBSIM, we conducted several hybrid error correction and assembly tests for PacBio reads, suggesting that a continuous long read coverage depth of at least 15, in combination with a circular consensus sequencing coverage depth of at least 30, achieved extensive assembly results. Availability: PBSIM is freely available from the web under the GNU GPL v2 license (http://code.google.com/p/pbsim/). Contact: mhamada@k.u-tokyo.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
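
One such characteristic feature, the log-normal read-length distribution, is trivial to sample; the parameters below are assumptions for illustration rather than PBSIM's fitted model values:

```python
# Sampling continuous-long-read lengths from a log-normal distribution.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 8.0, 0.6                 # parameters of log(read length), assumed
lengths = rng.lognormal(mean=mu, sigma=sigma, size=5).astype(int)
print(lengths)                       # a few kb each, with a heavy right tail
```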

Journal ArticleDOI
TL;DR: A simple procedure called neighbor-based interaction-profile inferring (NII) is presented and integrated into the existing BLM method to handle the new candidate problem and demonstrates the effectiveness of the NII strategy and shows the great potential of BLM-NII for prediction of compound-protein interactions.
Abstract: Motivation: In silico methods provide efficient ways to predict possible interactions between drugs and targets. A supervised learning approach, the bipartite local model (BLM), has recently been shown to be effective in predicting drug–target interactions. However, for drug-candidate compounds or target-candidate proteins that currently have no known interactions available, its purely 'local' model cannot be learned, and hence BLM may fail to make correct predictions for such new candidates. Results: We present a simple procedure called neighbor-based interaction-profile inferring (NII) and integrate it into the existing BLM method to handle the new candidate problem. Specifically, the inferred interaction profile is treated as label information and is used for model learning of new candidates. This functionality is particularly important in practice to find targets for new drug-candidate compounds and identify targeting drugs for new target-candidate proteins. Consistent good performance of the new BLM–NII approach has been observed in the experiment for the prediction of interactions between drugs and four categories of target proteins. Especially for nuclear receptors, BLM–NII achieves the most significant improvement as this dataset contains many drugs/targets with no interactions in the cross-validation. This demonstrates the effectiveness of the NII strategy and also shows the great potential of BLM–NII for prediction of compound–protein interactions. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
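
The inferring step itself is one line of linear algebra: a new candidate's provisional interaction profile is the similarity-weighted average of its neighbors' known profiles, which then serves as label information for training its local model. A toy numpy sketch with invented data:

```python
# Neighbor-based interaction-profile inferring (NII), sketched.
import numpy as np

def infer_profile(sim_to_known, Y_known):
    w = sim_to_known / sim_to_known.sum()   # normalized neighbor weights
    return w @ Y_known                      # soft interaction labels

Y_known = np.array([[1, 0, 0, 1],          # 3 known drugs x 4 targets
                    [1, 0, 1, 0],
                    [0, 0, 0, 1]], dtype=float)
sim = np.array([0.7, 0.2, 0.1])            # new drug's similarity to each known drug
print(infer_profile(sim, Y_known))         # [0.9 0.  0.2 0.8]
```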

Journal ArticleDOI
TL;DR: The results suggest that C9ORF72 is likely to regulate membrane traffic in conjunction with Rab-GTPase switches, and it is proposed to name the gene and its product DENN-like 72 (DENNL72).
Abstract: Motivation: Fronto-temporal dementia (FTD) and amyotrophic lateral sclerosis (ALS, also called motor neuron disease, MND) are severe neurodegenerative diseases that show considerable overlap at the clinical and cellular level. The most common single mutation in families with FTD or ALS has recently been mapped to a non-coding repeat expansion in the uncharacterized gene C9ORF72. Although a plausible mechanism for disease is that aberrant C9ORF72 mRNA poisons splicing, it is important to determine the cellular function of C9ORF72, about which nothing is known. Results: Sensitive homology searches showed that C9ORF72 is a full-length distant homologue of proteins related to Differentially Expressed in Normal and Neoplasia (DENN), which is a GDP/GTP exchange factor (GEF) that activates Rab-GTPases. Our results suggest that C9ORF72 is likely to regulate membrane traffic in conjunction with Rab-GTPase switches, and we propose to name the gene and its product DENN-like 72 (DENNL72). Supplementary information: Supplementary data are available at Bioinformatics online. Contact: tim.levine@ucl.ac.uk

Journal ArticleDOI
TL;DR: This work presents a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed, user-defined amount of memory and disk space, and is the first approach able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space.
Abstract: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require that a large data structure resides in memory. Such a structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed, user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours.
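
The trade-off works through two passes over the data: k-mers are first streamed to disk partitions by hash value, then each partition is loaded and counted in a hash table on its own. A minimal sketch follows; for display the per-partition counts are merged at the end, whereas real DSK emits and frees each partition's table to keep memory fixed.

```python
# Two-pass, disk-partitioned k-mer counting in the spirit of DSK.
import os
import tempfile
from collections import Counter

K, NPART = 5, 4

def dsk_like_count(reads):
    tmp = tempfile.mkdtemp()
    parts = [open(os.path.join(tmp, f"p{i}"), "w") for i in range(NPART)]
    for r in reads:                               # pass 1: partition to disk
        for i in range(len(r) - K + 1):
            kmer = r[i:i + K]
            # hash() only needs to be consistent within one run
            parts[hash(kmer) % NPART].write(kmer + "\n")
    for f in parts:
        f.close()
    counts = Counter()
    for i in range(NPART):                        # pass 2: one partition at a time
        with open(os.path.join(tmp, f"p{i}")) as f:
            counts.update(line.strip() for line in f)
    return counts

reads = ["ACGTACGTAC", "CGTACGTACG"]
print(dsk_like_count(reads).most_common(3))
```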

Journal ArticleDOI
TL;DR: An ultrafast DNA sequence aligner (Isaac Genome Alignment Software) that takes advantage of high-memory hardware, together with a variant caller (Isaac Variant Caller), is demonstrated to be four to five times faster than BWA + GATK on equivalent hardware.
Abstract: Summary: An ultrafast DNA sequence aligner (Isaac Genome Alignment Software) that takes advantage of high-memory hardware (48 GB) and a variant caller (Isaac Variant Caller) have been developed. We demonstrate that our combined pipeline (Isaac) is four to five times faster than BWA + GATK on equivalent hardware, with comparable accuracy as measured by trio conflict rates and sensitivity. We further show that Isaac is effective in the detection of disease-causing variants and can easily/economically be run on commodity hardware. Availability: Isaac has an open source license and can be obtained at

Journal ArticleDOI
TL;DR: This article uses the k-mer spectrum approach and introduces three correction techniques in a multistage workflow (two-sided conservative correction, one-sided aggressive correction and voting-based refinement), showing that Musket is consistently one of the top-performing correctors for Illumina short-read data.
Abstract: Motivation: The imperfect sequence data produced by next-generation sequencing technologies have motivated the development of a number of short-read error correctors in recent years. The majority of methods focus on the correction of substitution errors, which are the dominant error source in data produced by Illumina sequencing technology. Existing tools either score high in terms of recall or precision but not consistently high in terms of both measures. Results: In this paper, we present Musket, an efficient multistage k-mer-based corrector for Illumina short-read data. We employ the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top-performing correctors. In addition, Musket is multithreaded using a master-slave model and demonstrates superior parallel scalability compared to all other evaluated correctors as well as a highly competitive overall execution time. Availability: Musket is available at http://musket.sourceforge.net. Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de Supplementary information: Supplementary data are available at Bioinformatics online.
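
The k-mer spectrum idea behind all three stages is the same: a k-mer whose count clears a coverage threshold is trusted, and correction tries single-base substitutions that make the k-mers covering a position trusted. A minimal one-error sketch follows; Musket's real pipeline layers the conservative two-sided and aggressive one-sided stages (plus voting) on this principle, so this is an illustration only.

```python
# Toy k-mer-spectrum error correction for a single substitution error.
from collections import Counter

K, MIN_COV = 5, 3

def kmer_counts(reads):
    c = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            c[r[i:i + K]] += 1
    return c

def correct_one_error(read, counts):
    for i in range(len(read) - K + 1):
        if counts[read[i:i + K]] >= MIN_COV:
            continue                               # k-mer is trusted
        for j in range(i, i + K):                  # positions under this k-mer
            for b in "ACGT":
                if b == read[j]:
                    continue
                cand = read[:j] + b + read[j + 1:]
                # All k-mers covering the changed position must become trusted.
                window = [cand[p:p + K]
                          for p in range(max(0, j - K + 1),
                                         min(j, len(cand) - K) + 1)]
                if all(counts[w] >= MIN_COV for w in window):
                    return cand
        return read                                # no fix found
    return read

reads = ["ACGTACGTT"] * 4 + ["ACGAACGTT"]          # one read with a T->A error
counts = kmer_counts(reads)
print(correct_one_error("ACGAACGTT", counts))      # -> ACGTACGTT
```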