Showing papers in "Bioinformatics in 2015"


Journal ArticleDOI
TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

15,744 citations


Journal ArticleDOI
TL;DR: The authors propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content, and implement the assessment procedure in open-source software, BUSCO, with sets of Benchmarking Universal Single-Copy Orthologs.
Abstract: Motivation: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50. Results: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO. Availability and implementation: Software implemented in Python and datasets available for download from http://busco.ezlab.org. Contact: evgeny.zdobnov@unige.ch Supplementary information: Supplementary data are available at Bioinformatics online.
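
The N50 statistic that the abstract cites as a typical technical measure can be illustrated with a short sketch. This is only an illustration of the standard definition; the function name `n50` and the example lengths are hypothetical and are not taken from BUSCO or its software.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in base pairs.
    print(n50([2, 3, 4, 5, 6]))
```

N50 describes contiguity only. It says nothing about whether expected genes are present, which is the gap the completeness measure in the abstract is meant to address.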

7,747 citations


Journal ArticleDOI
TL;DR: MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner and generated a three-time larger assembly, with longer contig N50 and average contig length.
Abstract: Summary: MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a single computing node with and without a GPU, respectively. MEGAHIT assembles the data as a whole, i.e., no pre-processing like partitioning and normalization was needed. When compared with previous methods (Chikhi and Rizk, 2012; Howe, et al., 2014) on assembling the soil data, MEGAHIT generated a 3-time larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, giving a 4-fold improvement. Availability: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under GPLv3 license. Contact: rb@l3-bioinfo.com, twlam@cs.hku.hk

3,634 citations


Journal ArticleDOI
TL;DR: Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes, is introduced, making construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results.
Abstract: Summary: A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors. Availability and implementation: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary Contact: roary@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

3,147 citations


Journal ArticleDOI
TL;DR: The upgraded GSDS 2.0 offers a newly designed interface, support for more types of annotation features and formats, and an integrated visual editor for editing the generated figure; a user-specified phylogenetic tree can be added to facilitate further evolutionary analysis.
Abstract: Summary: Visualizing genes’ structure and annotated features helps biologists to investigate their function and evolution intuitively. The Gene Structure Display Server (GSDS) has been widely used by more than 60 000 users since its first publication in 2007. Here, we report the upgraded GSDS 2.0 with a newly designed interface, support for more types of annotation features and formats, as well as an integrated visual editor for editing the generated figure. Moreover, a user-specified phylogenetic tree can be added to facilitate further evolutionary analysis. The full source code is also available for downloading. Availability and implementation: Web server and source code are freely available at http://gsds.cbi.pku.edu.cn. Contact: gaog@mail.cbi.pku.edu.cn or gsds@mail.cbi.pku.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.

2,517 citations


Journal ArticleDOI
TL;DR: ChIPseeker is an R package for annotating ChIP-seq data analysis and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions.
Abstract: Summary: ChIPseeker is an R package for annotating ChIP-seq data analysis. It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 15 000 bed file information from GEO database. These datasets can be downloaded and compared with the user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation. Availability and implementation: ChIPseeker is released under Artistic-2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/pack

2,130 citations


Journal ArticleDOI
TL;DR: A web server to predict the functional effect of single or multiple amino acid substitutions, insertions and deletions using the prediction tool PROVEAN, which provides rapid analysis of protein variants from any organisms, and also supports high-throughput analysis for human and mouse variants at both the genomic and protein levels.
Abstract: Summary: We present a web server to predict the functional effect of single or multiple amino acid substitutions, insertions and deletions using the prediction tool PROVEAN. The server provides rapid analysis of protein variants from any organisms, and also supports high-throughput analysis for human and mouse variants at both the genomic and protein levels. Availability and implementation: The web server is freely available and open to all users with no login requirements at http://provean.jcvi.org. Contact: achan@jcvi.org Supplementary information: Supplementary data are available at Bioinformatics online.

1,886 citations


Journal ArticleDOI
TL;DR: Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections that presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.
Abstract: Summary: Although de novo assembly graphs contain assembled contigs (nodes), the connections between those contigs (edges) are difficult for users to access. Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone. Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage. Contact: rrwick@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.

1,379 citations


Journal ArticleDOI
TL;DR: NMRFAM-SPARKY has been repackaged with current versions of Python and Tcl/Tk, which support new tools for NMR peak simulation and graphical assignment determination, and greatly accelerate protein side chain assignments.
Abstract: Summary: SPARKY (Goddard and Kneller, SPARKY 3) remains the most popular software program for NMR data analysis, despite the fact that development of the package by its originators ceased in 2001. We have taken over the development of this package and describe NMRFAM-SPARKY, which implements new functions reflecting advances in the biomolecular NMR field. NMRFAM-SPARKY has been repackaged with current versions of Python and Tcl/Tk, which support new tools for NMR peak simulation and graphical assignment determination. These tools, along with chemical shift predictions from the PACSY database, greatly accelerate protein side chain assignments. NMRFAM-SPARKY supports automated data format interconversion for interfacing with a variety of web servers including PECAN, PINE, TALOS-N, CS-Rosetta, SHIFTX2 and PONDEROSA-C/S. Availability and implementation: The software package, along with binary and source codes, if desired, can be downloaded freely from http://pine.nmrfam.wisc.edu/download_packages.html. Instruction manuals and video tutorials can be found at http://www.nmrfam.wisc.edu/nmrfam-sparky-distribution.htm. Contact: whlee@nmrfam.wisc.edu or markley@nmrfam.wisc.edu Supplementary information: Supplementary data are available at Bioinformatics online.

1,365 citations


Journal ArticleDOI
TL;DR: LDlink is a web-based collection of bioinformatic modules that query single nucleotide polymorphisms in population groups of interest to generate haplotype tables and interactive plots that are tailored for investigators interested in mapping common and uncommon disease susceptibility loci.
Abstract: Summary: Assessing linkage disequilibrium (LD) across ancestral populations is a powerful approach for investigating population-specific genetic structure as well as functionally mapping regions of disease susceptibility. Here, we present LDlink, a web-based collection of bioinformatic modules that query single nucleotide polymorphisms (SNPs) in population groups of interest to generate haplotype tables and interactive plots. Modules are designed with an emphasis on ease of use, query flexibility, and interactive visualization of results. Phase 3 haplotype data from the 1000 Genomes Project are referenced for calculating pairwise metrics of LD, searching for proxies in high LD, and enumerating all observed haplotypes. LDlink is tailored for investigators interested in mapping common and uncommon disease susceptibility loci by focusing on output linking correlated alleles and highlighting putative functional variants. Availability and implementation: LDlink is a free and publicly available web tool which can be accessed at http://analysistools.nci.nih.gov/LDlink/. Contact: mitchell.machiela@nih.gov
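
The pairwise LD metrics mentioned in the abstract can be illustrated with the standard textbook definitions of D, D′ and r² for two biallelic SNPs. The sketch below assumes phased haplotypes coded as 0/1 allele pairs; the function name `pairwise_ld` and the input layout are hypothetical and do not reflect LDlink's own code or interface.

```python
def pairwise_ld(haplotypes):
    """Compute (D, D', r^2) for two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    one per phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r_squared = d * d / denom
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical haplotypes: two SNPs in perfect LD.
    print(pairwise_ld([(1, 1), (1, 1), (0, 0), (0, 0)]))
```

D′ reaches 1 whenever at most three of the four possible haplotypes are observed, while r² reaches 1 only when the two SNPs are perfect proxies for each other, which is why both are usually reported together.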

1,308 citations


Journal ArticleDOI
TL;DR: Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time, and is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability.
Abstract: Sambamba is a high-performance robust tool and library for working with SAM, BAM and CRAM sequence alignment files; the most common file formats for aligned next generation sequencing data. Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time. Sambamba is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability. AVAILABILITY AND IMPLEMENTATION: Sambamba is free and open source software, available under a GPLv2 license. Sambamba can be downloaded and installed from http://www.open-bio.org/wiki/Sambamba. Sambamba v0.5.0 was released with doi:10.5281/zenodo.13200. CONTACT: j.c.p.prins@umcutrecht.nl.

Journal ArticleDOI
TL;DR: Qualimap 2 represents a next step in the QC analysis of HTS data, along with comprehensive single-sample analysis of alignment data, and includes new modes that allow simultaneous processing and comparison of multiple samples.
Abstract: Motivation: Detection of random errors and systematic biases is a crucial step of a robust pipeline for processing high-throughput sequencing (HTS) data. Bioinformatics software tools capable of performing this task are available, either for general analysis of HTS data or targeted to a specific sequencing technology. However, most of the existing QC instruments only allow processing of one sample at a time. Results: Qualimap 2 represents a next step in the QC analysis of HTS data. Along with comprehensive single-sample analysis of alignment data, it includes new modes that allow simultaneous processing and comparison of multiple samples. As with the first version, the new features are available via both graphical and command line interface. Additionally, it includes a large number of improvements proposed by the user community. Availability and implementation: The implementation of the software along with documentation is freely available at http://www.qualimap.org. Contact: meyer@mpiib-berlin.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The first dedicated PRS software, PRSice (‘precise’), is presented for calculating, applying, evaluating and plotting the results of PRS; the work illustrates the importance of identifying the best-fit PRS and estimates a P-value significance threshold for high-resolution PRS studies.
Abstract: Summary: A polygenic risk score (PRS) is a sum of trait-associated alleles across many genetic loci, typically weighted by effect sizes estimated from a genome-wide association study. The application of PRS has grown in recent years as their utility for detecting shared genetic aetiology among traits has become appreciated; PRS can also be used to establish the presence of a genetic signal in underpowered studies, to infer the genetic architecture of a trait, for screening in clinical trials, and can act as a biomarker for a phenotype. Here we present the first dedicated PRS software, PRSice (‘precise’), for calculating, applying, evaluating and plotting the results of PRS. PRSice can calculate PRS at a large number of thresholds (“high resolution”) to provide the best-fit PRS, as well as provide results calculated at broad P-value thresholds, can thin Single Nucleotide Polymorphisms (SNPs) according to linkage disequilibrium and P-value or use all SNPs, handles genotyped and imputed data, can calculate and incorporate ancestry-informative variables, and can apply PRS across multiple traits in a single run. We exemplify the use of PRSice via application to data on schizophrenia, major depressive disorder and smoking, illustrate the importance of identifying the best-fit PRS and estimate a P-value significance threshold for high-resolution PRS studies. Availability and implementation: PRSice is written in R, including wrappers for bash data management scripts and PLINK-1.9 to minimize computational time. PRSice runs as a command-line program with a variety of user-options, and is freely available for download from http://PRSice.info Contact: jack.euesden@kcl.ac.uk or paul.oreilly@kcl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
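
The definition of a PRS given in the abstract, a sum of trait-associated alleles weighted by GWAS effect sizes and evaluated at several P-value thresholds, can be sketched as follows. This is a minimal illustration under stated assumptions, not PRSice's implementation: it omits LD-based thinning, imputation handling and ancestry covariates, and the function names, dictionary layout and SNP identifiers are hypothetical.

```python
def polygenic_score(dosages, effects, p_values, p_threshold=1.0):
    """Weighted allele sum for one individual.

    dosages:  dict SNP id -> count of the effect allele (0, 1 or 2)
    effects:  dict SNP id -> effect size estimated in a GWAS
    p_values: dict SNP id -> GWAS association P-value
    Only SNPs with P-value <= p_threshold contribute to the score.
    """
    score = 0.0
    for snp, beta in effects.items():
        if snp in dosages and p_values[snp] <= p_threshold:
            score += beta * dosages[snp]
    return score


def scores_across_thresholds(dosages, effects, p_values, thresholds):
    """Evaluate the score at each P-value threshold (the 'high resolution' idea)."""
    return {
        t: polygenic_score(dosages, effects, p_values, t)
        for t in thresholds
    }


if __name__ == "__main__":
    # Hypothetical SNPs and values, for illustration only.
    dosages = {"rs1": 2, "rs2": 1, "rs3": 0}
    effects = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
    p_values = {"rs1": 0.001, "rs2": 0.04, "rs3": 0.5}
    print(scores_across_thresholds(dosages, effects, p_values, [0.01, 0.05, 1.0]))
```

In a real analysis the threshold giving the best-fit score would be chosen by testing each score against the target phenotype, a step this sketch leaves out.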

Journal ArticleDOI
TL;DR: The results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforces its reliability for use in functional annotation.
Abstract: Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (~7 times shorter hit list before expansion), faster (~6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforces its reliability for use in functional annotation. Availability and implementation: Web access and file download from UniProt website at http://www.uniprot.org/uniref and ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. BLAST searches against UniRef are available at http://www.uniprot.org/blast/

Journal ArticleDOI
Tal Galili
TL;DR: dendextend is an R package for creating and comparing visually appealing tree diagrams that provides utility functions for manipulating dendrogram objects as well as several advanced methods for comparing trees to one another (both statistically and visually).
Abstract: Summary: dendextend is an R package for creating and comparing visually appealing tree diagrams. dendextend provides utility functions for manipulating dendrogram objects (their color, shape and content) as well as several advanced methods for comparing trees to one another (both statistically and visually). As such, dendextend offers a flexible framework for enhancing R's rich ecosystem of packages for performing hierarchical clustering of items. Availability and implementation: The dendextend R package (including detailed introductory vignettes) is available under the GPL-2 Open Source license and is freely available to download from CRAN at: (http://cran.r-project.org/package=dendextend) Contact: Tal.Galili@math.tau.ac.il

Journal ArticleDOI
TL;DR: The results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches, and is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets.
Abstract: Motivation: The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S rRNA sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used for broad studies that involve a large number of different samples. However, in comparison to shotgun sequencing approaches, insights into the functional capabilities of the community get lost when restricting the analysis to taxonomic assignment of 16S rRNA data. Results: Tax4Fun is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets. We evaluated Tax4Fun on a range of paired metagenome/16S rRNA datasets to assess its performance. Our results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches. Availability and implementation: Tax4Fun is an open-source R package and applicable to output as obtained from the SILVAngs web server or the application of QIIME with a SILVA database extension. Tax4Fun is freely available for download at http://tax4fun.gobics.de/. Contact: kasshau@gwdg.de Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The GOplot package provides a deeper insight into omics data and allows scientists to generate insightful plots with only a few lines of code to easily communicate the findings.
Abstract: Despite the plethora of methods available for the functional analysis of omics data, obtaining a comprehensive yet detailed understanding of the results remains challenging. This is mainly due to the lack of publicly available tools for the visualization of this type of information. Here we present an R package called GOplot, based on ggplot2, for enhanced graphical representation. Our package takes the output of any general enrichment analysis and generates plots at different levels of detail: from a general overview to identify the most enriched categories (bar plot, bubble plot) to a more detailed view displaying different types of information for molecules in a given set of categories (circle plot, chord plot, cluster plot). The package provides a deeper insight into omics data and allows scientists to generate insightful plots with only a few lines of code to easily communicate the findings. AVAILABILITY AND IMPLEMENTATION: The R package GOplot is available via CRAN-The Comprehensive R Archive Network: http://cran.r-project.org/web/packages/GOplot. The shiny web application of the Venn diagram can be found at: https://wwalter.shinyapps.io/Venn/. A detailed manual of the package with sample figures can be found at https://wencke.github.io/ CONTACT: fscabo@cnic.es or mricote@cnic.es.

Journal ArticleDOI
TL;DR: Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, use read pairing information during contig generation and includes several improvements to read trimming, resulting in substantially improved assemblies that recover a more complete set of reference genes than previous methods.
Abstract: MOTIVATION: Open-source bacterial genome assembly remains inaccessible to many biologists because of its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data. RESULTS: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, use read pairing information during contig generation and includes several improvements to read trimming. Together, these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods. AVAILABILITY: A5-miseq is licensed under the GPL open-source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from http://sourceforge.net/projects/ngopt. CONTACT: aaron.darling@uts.edu.au SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This work demonstrates large reductions in error frequencies, especially for high-error-rate reads, by three independent means: filtering reads according to their expected number of errors, assembling overlapping read pairs and by exploiting unique sequence abundances to perform error correction.
Abstract: Motivation: Next-generation sequencing produces vast amounts of data with errors that are difficult to distinguish from true biological variation when coverage is low. Results: We demonstrate large reductions in error frequencies, especially for high-error-rate reads, by three independent means: (i) filtering reads according to their expected number of errors, (ii) assembling overlapping read pairs and (iii) for amplicon reads, by exploiting unique sequence abundances to perform error correction. We also show that most published paired read assemblers calculate incorrect posterior quality scores. Availability and implementation: These methods are implemented in the USEARCH package. Binaries are freely available at http://drive5.com/usearch. Contact: robert@drive5.com Supplementary information: Supplementary data are available at Bioinformatics online.
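
The first of the three means, filtering reads by their expected number of errors, can be illustrated with the standard Phred relationship, in which a quality score Q corresponds to an error probability of 10^(-Q/10) and the expected error count is the sum of these probabilities over the read. The sketch below is not the USEARCH implementation; the function names, the Phred+33 encoding and the default threshold of 1.0 are assumptions made for illustration.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities for one read.

    quality_string: FASTQ quality line, assumed Phred+33 encoded.
    Each character encodes Q = ord(char) - phred_offset, and the
    corresponding error probability is 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string)


def passes_filter(quality_string, max_expected_errors=1.0, phred_offset=33):
    """Keep a read only if its expected error count is within the limit.

    The limit of 1.0 is an illustrative default, not a value from the paper.
    """
    return expected_errors(quality_string, phred_offset) <= max_expected_errors


if __name__ == "__main__":
    # Hypothetical quality strings: 'I' encodes Q40, '+' encodes Q10.
    for qual in ("IIIIIIIIII", "++++++++++++"):
        print(qual, expected_errors(qual), passes_filter(qual))
```

Because the expected error count sums probabilities rather than averaging quality scores, a few very low-quality bases weigh heavily against a read even when its mean quality looks acceptable.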

Journal ArticleDOI
TL;DR: CRISPRdirect is a simple and functional web server for selecting rational CRISPR/Cas targets from an input sequence that incorporates the genomic sequences of human, mouse, rat, marmoset, pig, chicken, frog, zebrafish, Ciona, fruit fly, silkworm, Caenorhabditis elegans, Arabidopsis, rice, Sorghum and budding yeast.
Abstract: Summary: CRISPRdirect is a simple and functional web server for selecting rational CRISPR/Cas targets from an input sequence. The CRISPR/Cas system is a promising technique for genome engineering which allows target-specific cleavage of genomic DNA guided by Cas9 nuclease in complex with a guide RNA (gRNA), that complementarily binds to a ~20 nt targeted sequence. The target sequence requirements are twofold. First, the 5′-NGG protospacer adjacent motif (PAM) sequence must be located adjacent to the target sequence. Second, the target sequence should be specific within the entire genome in order to avoid off-target editing. CRISPRdirect enables users to easily select rational target sequences with minimized off-target sites by performing exhaustive searches against genomic sequences. The server currently incorporates the genomic sequences of human, mouse, rat, marmoset, pig, chicken, frog, zebrafish, Ciona, fruit fly, silkworm, Caenorhabditis elegans, Arabidopsis, rice, Sorghum and budding yeast. Availability: Freely available at http://crispr.dbcls.jp/. Contact: y-naito@dbcls.rois.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
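
The first of the two target requirements, a 20 nt sequence immediately followed by an NGG PAM, can be sketched as a simple scan of both strands. This is an illustration only and is not CRISPRdirect's code: it assumes a fixed 20 nt target length with the PAM on the 3′ side, performs no genome-wide specificity search (the second requirement), and the function names are hypothetical.

```python
import re

# A lookahead is used so that overlapping candidate sites are all reported.
_TARGET = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidate_targets(sequence):
    """List (strand, start, target, pam) for every 20 nt + NGG site.

    For the '-' strand, start is the 0-based offset within the
    reverse-complemented sequence, not within the input sequence.
    """
    seq = sequence.upper()
    hits = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in _TARGET.finditer(strand_seq):
            hits.append((strand, match.start(), match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    # Hypothetical input: one site on the forward strand.
    for hit in find_candidate_targets("A" * 20 + "TGG"):
        print(hit)
```

Candidates found this way would still need the exhaustive off-target search against a genome that the abstract describes before any of them could be considered specific.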

Journal ArticleDOI
TL;DR: A new version of ASTRAL is presented, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets the authors examined, and has substantially better accuracy under some conditions.
Abstract: MOTIVATION: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed 'bipartitions'. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent. RESULTS: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL's running time is [Formula: see text], and ASTRAL-II's running time is [Formula: see text], where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space. AVAILABILITY AND IMPLEMENTATION: ASTRAL-II is available in open source at https://github.com/smirarab/ASTRAL and datasets used are available at http://www.cs.utexas.edu/~phylo/datasets/astral2/. CONTACT: smirarab@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: DANN uses the same feature set and training data as CADD to train a deep neural network (DNN), which can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
Abstract: Summary: Annotating genetic variants, especially non-coding variants, for the purpose of identifying pathogenic variants remains a challenge. Combined annotation-dependent depletion (CADD) is an algorithm designed to annotate both coding and non-coding variants, and has been shown to outperform other annotation algorithms. CADD trains a linear kernel support vector machine (SVM) to differentiate evolutionarily derived, likely benign, alleles from simulated, likely deleterious, variants. However, SVMs cannot capture non-linear relationships among the features, which can limit performance. To address this issue, we have developed DANN. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features. We exploit Compute Unified Device Architecture-compatible graphics processing units and deep learning techniques such as dropout and momentum training to accelerate the DNN training. DANN achieves about a 19% relative reduction in the error rate and about a 14% relative increase in the area under the curve (AUC) metric over CADD’s SVM methodology. Availability and implementation: All data and source code are available at https://cbcl.ics.uci.edu/public_data/DANN/. Contact: xhx@ics.uci.edu

Journal ArticleDOI
TL;DR: A software package called illustrator of biological sequences (IBS) that can be used for representing the organization of either protein or nucleotide sequences in a convenient, efficient and precise manner.
Abstract: Summary: Biological sequence diagrams are fundamental for visualizing various functional elements in protein or nucleotide sequences that enable a summarization and presentation of existing information as well as means of intuitive new discoveries. Here, we present a software package called illustrator of biological sequences (IBS) that can be used for representing the organization of either protein or nucleotide sequences in a convenient, efficient and precise manner. Multiple options are provided in IBS, and biological sequences can be manipulated, recolored or rescaled in a user-defined mode. Also, the final representational artwork can be directly exported into a publication-quality figure. Availability and implementation: The standalone package of IBS was implemented in JAVA, while the online service was implemented in HTML5 and JavaScript. Both the standalone package and online service are freely available at http://ibs.biocuckoo.org. Contact: renjian.sysu@gmail.com or xueyu@hust.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This work describes DISOPRED3, which extends its predecessor with new modules to predict IDRs and protein-binding sites within them and shows that this predictor generates precise assignments of disordered protein binding regions and that it compares well with other publicly available tools.
Abstract: Motivation: A sizeable fraction of eukaryotic proteins contain intrinsically disordered regions (IDRs), which act in unfolded states or by undergoing transitions between structured and unstructured conformations. Over time, sequence-based classifiers of IDRs have become fairly accurate and currently a major challenge is linking IDRs to their biological roles from the molecular to the systems level. Results: We describe DISOPRED3, which extends its predecessor with new modules to predict IDRs and protein-binding sites within them. Based on recent CASP evaluation results, DISOPRED3 can be regarded as state of the art in the identification of IDRs, and our self-assessment shows that it significantly improves over DISOPRED2 because its predictions are more specific across the board and more sensitive to IDRs longer than 20 amino acids. Predicted IDRs are annotated as protein binding through a novel SVM-based classifier, which uses profile data and additional sequence-derived features. Based on benchmarking experiments with full cross-validation, we show that this predictor generates precise assignments of disordered protein binding regions and that it compares well with other publicly available tools. Availability and implementation: http://bioinf.cs.ucl.ac.uk/disopred Contact: d.t.jones@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: DOSE is an R package providing semantic similarity computations among DO terms and genes, which allows biologists to explore the similarities of diseases and of gene functions from a disease perspective, to verify disease relevance in a biological experiment and to identify unexpected disease associations.
Abstract: Summary: Disease ontology (DO) annotates human genes in the context of disease. DO is an important annotation for translating molecular findings from high-throughput data to clinical relevance. DOSE is an R package providing semantic similarity computations among DO terms and genes, which allows biologists to explore the similarities of diseases and of gene functions from a disease perspective. Enrichment analyses, including the hypergeometric model and gene set enrichment analysis, are also implemented to support discovering disease associations of high-throughput biological data. This allows biologists to verify disease relevance in a biological experiment and identify unexpected disease associations. Comparison among gene clusters is also supported. Availability and implementation: DOSE is released under Artistic-2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/DOSE.html). Supplementary information: Supplementary data are available at Bioinformatics online. Contact: gcyu@connect.hku.hk or tqyhe@jnu.edu.cn.
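The hypergeometric model mentioned in the abstract can be sketched in a few lines. This is a generic over-representation test written in Python for illustration, not DOSE's R interface; the function name `hypergeom_enrichment_pvalue` and the gene counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, selected, overlap):
    """Upper-tail probability P(X >= overlap).

    population: number of background genes
    annotated:  background genes annotated with the disease term
    selected:   genes in the list being tested
    overlap:    selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(population, selected)  # all ways to draw the gene list
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 3 annotated, 3 selected, all 3 overlap.
p_all_three = hypergeom_enrichment_pvalue(10, 3, 3, 3)
# With an observed overlap of zero the tail covers every outcome.
p_at_least_zero = hypergeom_enrichment_pvalue(10, 3, 3, 0)
print(p_all_three, p_at_least_zero)
```

A small p-value indicates that the gene list contains more genes annotated with the term than expected by chance; in practice the p-values across many terms would also need multiple-testing correction.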

Journal ArticleDOI
TL;DR: In this paper, the authors proposed the use of diffusion maps to deal with the problem of defining differentiation trajectories, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space.
Abstract: Motivation: Single-cell technologies have recently gained popularity in cellular differentiation studies owing to their ability to resolve potential heterogeneities in cell populations. Analyzing such high-dimensional single-cell data has its own statistical and computational challenges. Popular multivariate approaches are based on data normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we would not expect clear clusters to be present but instead expect the cells to follow continuous branching lineages. Results: Here, we propose the use of diffusion maps to deal with the problem of defining differentiation trajectories. We adapt this method to single-cell data by adequate choice of kernel width and inclusion of uncertainties or missing measurement values, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space. We expect this output to reflect cell differentiation trajectories, where the data originates from intrinsic diffusion-like dynamics. Starting from a pluripotent stage, cells move smoothly within the transcriptional landscape towards more differentiated states with some stochasticity along their path. We demonstrate the robustness of our method with respect to extrinsic noise (e.g. measurement noise) and sampling density heterogeneities on simulated toy data as well as two single-cell quantitative polymerase chain reaction datasets (i.e. mouse haematopoietic stem cells and mouse embryonic stem cells) and an RNA-Seq dataset of human pre-implantation embryos. We show that diffusion maps perform considerably better than Principal Component Analysis and are advantageous over other techniques for non-linear dimension reduction such as t-distributed Stochastic Neighbour Embedding for preserving the global structures and pseudotemporal ordering of cells. Availability and implementation: The Matlab implementation of diffusion maps for single-cell data is available at https://www.helmholtz-muenchen.de/icb/single-cell-diffusion-map. Contact: fbuettner.phys@gmail.com, fabian.theis@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online.
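The basic diffusion-map construction the abstract builds on can be sketched as follows. This is a textbook version with a fixed Gaussian kernel width, written in Python rather than the authors' Matlab; it does not include their kernel-width selection or their handling of uncertain and missing values. The function name `diffusion_map`, the value of `sigma` and the simulated cells are assumptions made for illustration.

```python
import numpy as np


def diffusion_map(data, sigma, n_components=2):
    """Return the leading non-trivial eigenvalues and diffusion coordinates."""
    data = np.asarray(data, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    sq = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq / (2.0 * sigma ** 2))
    degree = kernel.sum(axis=1)
    # Symmetric normalisation; it shares eigenvalues with the
    # row-stochastic transition matrix D^-1 K.
    sym = kernel / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of the transition matrix.
    right = eigvecs / np.sqrt(degree)[:, None]
    # Drop the trivial constant eigenvector (eigenvalue 1).
    return eigvals[1:n_components + 1], right[:, 1:n_components + 1]


# Simulated cells lying along one straight trajectory in a 3-gene space.
pseudotime = np.linspace(0.0, 1.0, 15)
cells = np.outer(pseudotime, [1.0, 2.0, 2.0])
eigenvalues, coords = diffusion_map(cells, sigma=0.5, n_components=2)

# Sorting by the first diffusion component gives a pseudotemporal ordering
# (up to direction, since the sign of an eigenvector is arbitrary).
ordering = np.argsort(coords[:, 0])
print(ordering)
```

On this unbranched toy trajectory the first diffusion component varies monotonically along the path, so sorting by it recovers the simulated order or its reverse.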

Journal ArticleDOI
TL;DR: A suite of utilities, Change-O, which provides tools for advanced analyses of large-scale Ig repertoire sequencing data, and enables the seamless integration of multiple analyses into a single workflow.
Abstract: Advances in high-throughput sequencing technologies now allow for large-scale characterization of B cell immunoglobulin (Ig) repertoires. The high germline and somatic diversity of the Ig repertoire presents challenges for biologically meaningful analysis, which requires specialized computational methods. We have developed a suite of utilities, Change-O, which provides tools for advanced analyses of large-scale Ig repertoire sequencing data. Change-O includes tools for determining the complete set of Ig variable region gene segment alleles carried by an individual (including novel alleles), partitioning of Ig sequences into clonal populations, creating lineage trees, inferring somatic hypermutation targeting models, measuring repertoire diversity, quantifying selection pressure, and calculating sequence chemical properties. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow. Availability and implementation: Change-O is freely available for non-commercial use and may be downloaded from http://clip.med.yale.edu/changeo. Contact: steven.kleinstein@yale.edu.

Journal ArticleDOI
TL;DR: Cytoscape.js is an open-source JavaScript-based graph library that can be used to render interactive graphs in a web browser and in a headless manner, useful for graph operations on a server, such as Node.js.
Abstract: Summary: Cytoscape.js is an open-source JavaScript-based graph library. Its most common use case is as a visualization software component, so it can be used to render interactive graphs in a web browser. It also can be used in a headless manner, useful for graph operations on a server, such as Node.js. Availability and implementation: Cytoscape.js is implemented in JavaScript. Documentation, downloads and source code are available at http://js.cytoscape.org. Contact: gary.bader@utoronto.ca

Journal ArticleDOI
TL;DR: This work proposes an integrative approach, named FATHMM-MKL, to predict the functional consequences of both coding and non-coding sequence variants, which utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source.
Abstract: Motivation: Technological advances have enabled the identification of an increasingly large spectrum of single nucleotide variants within the human genome, many of which may be associated with monogenic disease or complex traits. Here, we propose an integrative approach, named FATHMM-MKL, to predict the functional consequences of both coding and non-coding sequence variants. Our method utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source. Results: We show that our method outperforms current state-of-the-art algorithms, CADD and GWAVA, when predicting the functional consequences of non-coding variants. In addition, FATHMM-MKL is comparable to the best of these algorithms when predicting the impact of coding variants. The method includes a confidence measure to rank order predictions.

Journal ArticleDOI
TL;DR: A novel algorithm named SNN-Cliq is described that clusters single-cell transcriptomes using the concept of shared nearest neighbor that shows advantages in handling high-dimensional data.
Abstract: Motivation: The recent advance of single-cell technologies has brought new insights into complex biological phenomena. In particular, genome-wide single-cell measurements such as transcriptome sequencing enable the characterization of cellular composition as well as functional variation in homogenic cell populations. An important step in single-cell transcriptome analysis is to group cells that belong to the same cell types based on gene expression patterns. The corresponding computational problem is to cluster a noisy high-dimensional dataset with substantially fewer objects (cells) than the number of variables (genes). Results: In this article, we describe a novel algorithm named shared nearest neighbor (SNN)-Cliq that clusters single-cell transcriptomes. SNN-Cliq utilizes the concept of shared nearest neighbor that shows advantages in handling high-dimensional data. When evaluated on a variety of synthetic and real experimental datasets, SNN-Cliq outperformed the state-of-the-art methods tested. More importantly, the clustering results of SNN-Cliq reflect the cell types or origins with high accuracy. Availability and implementation: The algorithm is implemented in MATLAB and Python. The source code can be downloaded at http://bioinfo.uncc.edu/SNNCliq.
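The shared nearest neighbor idea can be sketched in its simplest form: two cells are considered similar when their k-nearest-neighbor lists overlap. The code below counts shared neighbors only; it is a simplified illustration and does not reproduce SNN-Cliq's own similarity weighting or its clique-finding step. The function names and the toy coordinates are hypothetical.

```python
import numpy as np


def knn_lists(data, k):
    """Indices of the k nearest neighbours of each cell (the cell itself excluded)."""
    data = np.asarray(data, dtype=float)
    sq = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)  # a cell is not its own neighbour
    return np.argsort(sq, axis=1)[:, :k]


def shared_neighbor_counts(data, k):
    """Matrix whose (i, j) entry is the number of neighbours cells i and j share."""
    neighbors = knn_lists(data, k)
    n = neighbors.shape[0]
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], neighbors] = 1  # member[i, j] = 1 if j is a neighbour of i
    return member @ member.T


# Two well-separated toy groups of four cells each.
group_a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
group_b = group_a + 100.0
cells = np.vstack([group_a, group_b])
counts = shared_neighbor_counts(cells, k=2)
print(counts)
```

Cells from different groups share no neighbors here, so a graph built from the non-zero entries splits into the two groups. Because the similarity depends on neighbor overlap rather than on raw distances, it is less sensitive to the distance concentration seen in high-dimensional data.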