
Showing papers in "Bioinformatics in 2017"


Journal ArticleDOI
TL;DR: UpSetR is an open source R package that employs a scalable matrix‐based visualization to show intersections of sets, their size, and other properties, and is released under the MIT License.
Abstract: Motivation Venn and Euler diagrams are a popular yet inadequate solution for quantitative visualization of set intersections. A scalable alternative to Venn and Euler diagrams for visualizing intersecting sets and their properties is needed. Results We developed UpSetR, an open source R package that employs a scalable matrix-based visualization to show intersections of sets, their size, and other properties. Availability and implementation UpSetR is available at https://github.com/hms-dbmi/UpSetR/ and released under the MIT License. A Shiny app is available at https://gehlenborglab.shinyapps.io/upsetr/ . Contact nils@hms.harvard.edu. Supplementary information Supplementary data are available at Bioinformatics online.
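As a rough illustration of what an UpSet-style matrix summarizes, the Python sketch below (not UpSetR itself, which is an R package; the example sets are hypothetical) computes the exclusive intersection sizes that form the bars of such a plot:

```python
from itertools import combinations

# Hypothetical example sets; UpSetR itself is an R package, so this only
# sketches the quantities an UpSet-style plot displays.
sets = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 5, 6, 7},
}
names = list(sets)

# For every combination of sets, count the elements belonging to exactly that
# combination (the "exclusive" intersections shown as bars in an UpSet plot).
for r in range(1, len(names) + 1):
    for combo in combinations(names, r):
        inside = set.intersection(*(sets[n] for n in combo))
        outside = set().union(*(sets[n] for n in names if n not in combo))
        exclusive = inside - outside
        if exclusive:
            print("&".join(combo), len(exclusive))
```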

1,760 citations


Journal ArticleDOI
TL;DR: The Trainable Weka Segmentation (TWS), a machine learning tool that leverages a limited number of manual annotations in order to train a classifier and segment the remaining data automatically, is introduced.
Abstract: Summary State-of-the-art light and electron microscopes are capable of acquiring large image datasets, but quantitatively evaluating the data often involves manually annotating structures of interest. This process is time-consuming and often a major bottleneck in the evaluation pipeline. To overcome this problem, we have introduced the Trainable Weka Segmentation (TWS), a machine learning tool that leverages a limited number of manual annotations in order to train a classifier and segment the remaining data automatically. In addition, TWS can provide unsupervised segmentation learning schemes (clustering) and can be customized to employ user-designed image features or classifiers. Availability and implementation TWS is distributed as open-source software as part of the Fiji image processing distribution of ImageJ at http://imagej.net/Trainable_Weka_Segmentation . Contact ignacio.arganda@ehu.eus. Supplementary information Supplementary data are available at Bioinformatics online.
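The core idea, training a pixel classifier from a handful of annotations and letting it segment the rest, can be sketched in Python as below; TWS itself runs inside Fiji with Weka classifiers and richer image features, so the feature names and data here are hypothetical stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical per-pixel feature vectors (e.g. intensity, local mean, gradient
# magnitude) for the small set of manually annotated pixels.
annotated_features = rng.normal(size=(200, 3))
annotated_labels = (annotated_features[:, 0] > 0).astype(int)  # foreground/background

# Train a classifier on the annotations, as TWS does with Weka learners.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(annotated_features, annotated_labels)

# Segment the remaining (unannotated) pixels automatically.
remaining_features = rng.normal(size=(10_000, 3))
segmentation = clf.predict(remaining_features)
print(segmentation[:10])
```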

1,416 citations


Journal ArticleDOI
TL;DR: The R/Bioconductor package scater is developed to facilitate rigorous pre‐processing, quality control, normalization and visualization of scRNA‐seq data and provides a convenient, flexible workflow to process raw sequencing reads into a high‐quality expression dataset ready for downstream analysis.
Abstract: Single-cell RNA sequencing (scRNA-seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre-processing, quality control (QC) and normalization. We have developed the R/Bioconductor package scater to facilitate rigorous pre-processing, quality control, normalization and visualization of scRNA-seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single-cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development. The open-source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater. Contact: davis@ebi.ac.uk. Supplementary data are available at Bioinformatics online.
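Two of the basic per-cell QC metrics such pre-processing computes, library size and number of detected genes with simple outlier flagging, can be sketched in Python as follows; scater itself is an R/Bioconductor package and the count matrix here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
counts = rng.poisson(0.3, size=(1000, 50))     # hypothetical genes x cells count matrix

total_counts = counts.sum(axis=0)              # library size per cell
genes_detected = (counts > 0).sum(axis=0)      # number of expressed genes per cell

# Flag cells more than 3 median absolute deviations below the median library size.
mad = np.median(np.abs(total_counts - np.median(total_counts)))
low_quality = total_counts < np.median(total_counts) - 3 * mad
print(int(low_quality.sum()), "cells flagged out of", counts.shape[1])
```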

1,093 citations


Journal ArticleDOI
TL;DR: GenomeScope is an open‐source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate and repeat content from unprocessed short reads, which are essential for studying genome evolution.
Abstract: Summary: GenomeScope is an open-source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate, and repeat content from unprocessed short reads. These features are essential for studying genome evolution, and help to choose parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets with a wide range in genome sizes, heterozygosity levels, and error rates. Availability and Implementation: http://genomescope.org , https://github.com/schatzlab/genomescope.git. Contact: mschatz@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
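GenomeScope fits a full mixture model to the k-mer histogram; the basic relationship it builds on, genome size roughly equal to total k-mer occurrences divided by k-mer coverage, can be sketched as follows (the histogram below is hypothetical):

```python
import numpy as np

# Hypothetical unimodal k-mer histogram: counts[i] = number of distinct k-mers
# seen coverage[i] times. Real data also contain error and repeat components,
# which GenomeScope models explicitly.
coverage = np.arange(1, 61)
counts = 1e6 * np.exp(-0.5 * ((coverage - 25) / 4) ** 2)

total_kmer_occurrences = np.sum(coverage * counts)
kmer_coverage = coverage[np.argmax(counts)]        # homozygous peak position
genome_size = total_kmer_occurrences / kmer_coverage
print(f"estimated genome size: {genome_size / 1e6:.1f} Mb")
```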

968 citations


Journal ArticleDOI
TL;DR: This work extended MISA, a computational tool assisting the development of microsatellite markers, reimplemented it as a web-based application, improved compound microsatellite detection and added the possibility to display and export MISA results in GFF3 format for downstream analysis.
Abstract: Motivation Microsatellites are a widely-used marker system in plant genetics and forensics. The development of reliable microsatellite markers from resequencing data is challenging. Results We extended MISA, a computational tool assisting the development of microsatellite markers, and reimplemented it as a web-based application. We improved compound microsatellite detection and added the possibility to display and export MISA results in GFF3 format for downstream analysis. Availability and Implementation MISA-web can be accessed under http://misaweb.ipk-gatersleben.de/. The website provides tutorials, usage notes as well as download links to the source code. Contact scholz@ipk-gatersleben.de.

964 citations


Journal ArticleDOI
TL;DR: The proposed LD Hub database and accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies.
Abstract: Motivation: LD score regression is a reliable and efficient method of using genome-wide association study (GWAS) summary-level results data to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes. Because the method relies on summary-level results data, LD score regression is computationally tractable even for very large sample sizes. However, publicly available GWAS summary-level data are typically stored in different databases and have different formats, making it difficult to apply LD score regression to estimate genetic correlations across many different traits simultaneously. Results: In this manuscript, we describe LD Hub - a centralized database of summary-level GWAS results for 173 diseases/traits from different publicly available resources/consortia and a web interface that automates the LD score regression analysis pipeline. To demonstrate functionality and validate our software, we replicated previously reported LD score regression analyses of 49 traits/diseases using LD Hub, and estimated SNP heritability and the genetic correlation across the different phenotypes. We also present new results obtained by uploading a recent atopic dermatitis GWAS meta-analysis to examine the genetic correlation between the condition and other potentially related traits. In response to the growing availability of publicly accessible GWAS summary-level results data, our database and the accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies.
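The regression LD Hub automates can be illustrated with a minimal sketch: under the LD score regression model E[chi^2_j] is approximately N*h2*l_j/M + 1, so regressing observed association statistics on per-SNP LD scores gives a slope from which SNP heritability is estimated. The values below are simulated, not real GWAS data:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, h2 = 100_000, 50_000, 0.3                     # SNPs, sample size, true heritability
ld_scores = rng.gamma(shape=2.0, scale=50.0, size=M)

# Simulate chi-square statistics following the LD score regression model.
chisq = 1.0 + N * h2 * ld_scores / M + rng.normal(scale=0.5, size=M)

slope, intercept = np.polyfit(ld_scores, chisq, deg=1)
h2_hat = slope * M / N
print(f"estimated SNP heritability: {h2_hat:.2f} (intercept {intercept:.2f})")
```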

854 citations


Journal ArticleDOI
TL;DR: This work presents a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information, outperforming current state‐of‐the‐art algorithms, including those relying on homology information.
Abstract: Motivation The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only. Results Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism identifying protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves a good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information. Availability and implementation The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php. Contact jjalma@dtu.dk.
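A minimal numpy sketch of the attention pooling step described above, using random placeholder values; the trained DeepLoc model itself is only available through the web server and code links given in the abstract:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 120, 32                          # protein length, hidden state size
hidden = rng.normal(size=(L, d))        # per-residue states from a (hypothetical) recurrent encoder
query = rng.normal(size=d)              # learned attention query vector

scores = hidden @ query                 # one relevance score per residue
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax: which regions matter for localization

context = weights @ hidden              # weighted sequence summary fed to the classifier
print(context.shape)                    # (32,)
```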

767 citations


Journal ArticleDOI
TL;DR: The multi-rate PTP is introduced, an improved method that alleviates the theoretical and technical shortcomings of PTP and consistently yields more accurate delimitations with respect to the taxonomy (i.e., identifies more taxonomic species, infers species numbers closer to the taxonomy).
Abstract: Motivation: In recent years, molecular species delimitation has become a routine approach for quantifying and classifying biodiversity. Barcoding methods are of particular importance in large-scale surveys as they promote fast species discovery and biodiversity estimates. Among those, distance-based methods are the most common choice as they scale well with large datasets; however, they are sensitive to similarity threshold parameters and they ignore evolutionary relationships. The recently introduced "Poisson Tree Processes" (PTP) method is a phylogeny-aware approach that does not rely on such thresholds. Yet, two weaknesses of PTP impact its accuracy and practicality when applied to large datasets: it does not account for divergent intraspecific variation and is slow for a large number of sequences. Results: We introduce the multi-rate PTP (mPTP), an improved method that alleviates the theoretical and technical shortcomings of PTP. It incorporates different levels of intraspecific genetic diversity deriving from differences in either the evolutionary history or sampling of each species. Results on empirical data suggest that mPTP is superior to PTP and popular distance-based methods as it consistently yields more accurate delimitations with respect to the taxonomy (i.e., identifies more taxonomic species, infers species numbers closer to the taxonomy). Moreover, mPTP does not require any similarity threshold as input. The novel dynamic programming algorithm attains a speedup of at least five orders of magnitude compared to PTP, allowing it to delimit species in large (meta-) barcoding data. In addition, Markov Chain Monte Carlo sampling provides a comprehensive evaluation of the inferred delimitation in just a few seconds for millions of steps, independently of tree size. Availability and Implementation: mPTP is implemented in C and is available for download at http://github.com/Pas-Kapli/mptp under the GNU Affero 3 license. A web-service is available at http://mptp.h-its.org. Contact: paschalia.kapli@h-its.org or alexandros.stamatakis@h-its.org or tomas.flouri@h-its.org. Supplementary information: Supplementary data are available at Bioinformatics online.

535 citations


Journal ArticleDOI
TL;DR: Ggseqlogo as discussed by the authors is an R package built on the ggplot2 package that offers native illustration of publication-ready DNA, RNA and protein sequence logos in a highly customizable fashion with features including multi-logo plots, qualitative and quantitative colour schemes, annotation of logos and integration with other plots.
Abstract: Summary Sequence logos have become a crucial visualization method for studying underlying sequence patterns in the genome. Despite this, there remains a scarcity of software packages that provide the versatility often required for such visualizations. ggseqlogo is an R package built on the ggplot2 package that aims to address this issue. ggseqlogo offers native illustration of publication-ready DNA, RNA and protein sequence logos in a highly customizable fashion with features including multi-logo plots, qualitative and quantitative colour schemes, annotation of logos and integration with other plots. The package is intuitive to use and seamlessly integrates into R analysis pipelines. Availability and implementation ggseqlogo is released under the GNU licence and is freely available via CRAN-The Comprehensive R Archive Network https://cran.r-project.org/web/packages/ggseqlogo. A detailed tutorial can be found at https://omarwagih.github.io/ggseqlogo. Contact wagih@ebi.ac.uk.
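What a DNA sequence logo encodes can be sketched briefly: letter heights are the per-position information content scaled by base frequencies. The Python example below uses hypothetical sequences; ggseqlogo itself handles the plotting in R:

```python
import numpy as np

seqs = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "TATACT"]   # hypothetical aligned sites
alphabet = "ACGT"

counts = np.zeros((len(seqs[0]), len(alphabet)))
for s in seqs:
    for i, base in enumerate(s):
        counts[i, alphabet.index(base)] += 1

probs = counts / counts.sum(axis=1, keepdims=True)
p = np.clip(probs, 1e-12, 1.0)                   # avoid log(0); zero-probability terms vanish anyway
info = 2.0 + (probs * np.log2(p)).sum(axis=1)    # bits per position (max 2 for DNA)
print(np.round(info, 2))                         # letter height at position i = info[i] * probs[i]
```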

520 citations


Journal ArticleDOI
TL;DR: A significantly updated and improved version of the Bioconductor package ChAMP, which can be used to analyze EPIC and 450k data and many enhanced functionalities have been added, including correction for cell‐type heterogeneity, network analysis and a series of interactive graphical user interfaces.
Abstract: Summary: The Illumina Infinium HumanMethylationEPIC BeadChip is the new platform for high-throughput DNA methylation analysis, effectively doubling the coverage compared to the older 450K array. Here we present a significantly updated and improved version of the Bioconductor package ChAMP, which can be used to analyze EPIC and 450K data. Many enhanced functionalities have been added, including correction for cell-type heterogeneity, network analysis and a series of interactive graphical user interfaces. Availability and implementation: ChAMP is a BioC package available from https://bioconductor.org/packages/release/bioc/html/ChAMP.html. Contact: a.teschendorff@ucl.ac.uk or s.beck@ucl.ac.uk or a.feber@ucl.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
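The starting quantity of any 450K/EPIC analysis, the per-probe beta value, is simply the methylated fraction of the total signal. A short sketch with hypothetical intensities follows; ChAMP itself is an R/Bioconductor pipeline covering normalization, cell-type correction and much more:

```python
import numpy as np

rng = np.random.default_rng(3)
methylated = rng.gamma(2.0, 500.0, size=10)      # hypothetical probe intensities
unmethylated = rng.gamma(2.0, 500.0, size=10)

offset = 100.0                                   # stabilizes ratios at low total intensity
beta = methylated / (methylated + unmethylated + offset)
print(np.round(beta, 2))                         # ~0 = unmethylated, ~1 = fully methylated
```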

480 citations


Journal ArticleDOI
TL;DR: This work shows that a completely generic method based on deep learning and statistical word embeddings [called long short‐term memory network‐conditional random field (LSTM‐CRF)] outperforms state‐of‐the‐art entity‐specific NER tools, and often by a large margin.
Abstract: Motivation Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. Results We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. Availability and implementation The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . Contact habibima@informatik.hu-berlin.de.

Journal ArticleDOI
TL;DR: OhmNet, a hierarchy‐aware unsupervised node feature learning approach for multi‐layer networks, is presented and it is demonstrated that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue.
Abstract: Motivation Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine. Results Here, we present OhmNet , a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding-based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems. Availability and implementation Source code and datasets are available at http://snap.stanford.edu/ohmnet . Contact jure@cs.stanford.edu.

Journal ArticleDOI
TL;DR: FQC is software that facilitates quality control of FASTQ files by carrying out a QC protocol using FastQC, parsing results, and aggregating quality metrics into an interactive dashboard designed to richly summarize individual sequencing runs.
Abstract: Summary FQC is software that facilitates quality control of FASTQ files by carrying out a QC protocol using FastQC, parsing results, and aggregating quality metrics into an interactive dashboard designed to richly summarize individual sequencing runs. The dashboard groups samples in dropdowns for navigation among the data sets, utilizes human-readable configuration files to manipulate the pages and tabs, and is extensible with CSV data. Availability and implementation FQC is implemented in Python 3 and Javascript, and is maintained under an MIT license. Documentation and source code is available at: https://github.com/pnnl/fqc . Contact joseph.brown@pnnl.gov.

Journal ArticleDOI
TL;DR: KaryoploteR as mentioned in this paper is an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them, which allows the creation of highly customizable plots from arbitrary data with complete freedom on data positioning and representation.
Abstract: Motivation Data visualization is a crucial tool for data exploration, analysis and interpretation. For the visualization of genomic data, a tool to create customizable non-circular plots of whole genomes from any species has been lacking. Results We have developed karyoploteR, an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them. The plot creation process is inspired by R base graphics: a main function creates a karyoplot with no data, and multiple additional functions, including custom functions written by the end user, add data and other graphical elements. This approach allows the creation of highly customizable plots from arbitrary data with complete freedom on data positioning and representation. Availability and implementation karyoploteR is released under the Artistic-2.0 License. Source code and documentation are freely available through Bioconductor (http://www.bioconductor.org/packages/karyoploteR) and at the examples and tutorial page at https://bernatgel.github.io/karyoploter_tutorial. Contact bgel@igtp.cat.

Journal ArticleDOI
TL;DR: A web application is implemented that uses key functions of R‐package SynergyFinder, and provides not only the flexibility of using multiple synergy scoring models, but also a user‐friendly interface for visualizing the drug combination landscapes in an interactive manner.
Abstract: Summary Rational design of drug combinations has become a promising strategy to tackle the drug sensitivity and resistance problem in cancer treatment. To systematically evaluate the pre-clinical significance of pairwise drug combinations, functional screening assays that probe combination effects in a dose-response matrix assay are commonly used. To facilitate the analysis of such drug combination experiments, we implemented a web application that uses key functions of R-package SynergyFinder, and provides not only the flexibility of using multiple synergy scoring models, but also a user-friendly interface for visualizing the drug combination landscapes in an interactive manner. Availability and implementation The SynergyFinder web application is freely accessible at https://synergyfinder.fimm.fi ; The R-package and its source-code are freely available at http://bioconductor.org/packages/release/bioc/html/synergyfinder.html . Contact jing.tang@helsinki.fi.
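One of the synergy models such tools score is Bliss independence, where the expected combination effect is Ea + Eb - Ea*Eb and the excess over it suggests synergy. Below is a minimal sketch with hypothetical fractional-inhibition values; SynergyFinder itself offers several scoring models and interactive landscape visualizations:

```python
import numpy as np

effect_a = np.array([0.10, 0.30, 0.55])          # drug A alone at three doses (hypothetical)
effect_b = np.array([0.15, 0.40, 0.60])          # drug B alone at three doses (hypothetical)
observed = np.array([[0.25, 0.50, 0.70],         # observed combination response matrix
                     [0.45, 0.68, 0.82],
                     [0.62, 0.80, 0.93]])

# Bliss independence: expected combined effect if the drugs act independently.
expected = effect_a[:, None] + effect_b[None, :] - effect_a[:, None] * effect_b[None, :]
bliss_excess = observed - expected               # >0 suggests synergy, <0 antagonism
print(np.round(bliss_excess, 2))
```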

Journal ArticleDOI
TL;DR: This work presents a novel knowledge‐based approach that uses state‐of‐the‐art convolutional neural networks, where the algorithm is learned by examples, and demonstrates superior performance to two other competitive algorithmic strategies.
Abstract: Motivation An important step in structure-based drug design consists in the prediction of druggable binding sites. Several algorithms for detecting binding cavities, those likely to bind to a small drug compound, have been developed over the years by clever exploitation of geometric, chemical and evolutionary features of the protein. Results Here we present a novel knowledge-based approach that uses state-of-the-art convolutional neural networks, where the algorithm is learned by examples. In total, 7622 proteins from the scPDB database of binding sites have been evaluated using both a distance and a volumetric overlap approach. Our machine-learning based method demonstrates superior performance to two other competitive algorithmic strategies. Availability and implementation DeepSite is freely available at www.playmolecule.org. Users can submit either a PDB ID or PDB file for pocket detection to our NVIDIA GPU-equipped servers through a WebGL graphical interface. Contact gianni.defabritiis@upf.edu. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Deorowicz et al. as discussed by the authors introduced KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases, which is shown on a few real problems.
Abstract: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm, together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. The program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc. Contact: sebastian.deorowicz@polsl.pl. Supplementary data are available at Bioinformatics online.
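What is being counted can be shown in a few lines of Python; KMC3 itself is a highly optimized disk-based C++ counter designed for datasets far beyond what an in-memory dictionary can hold, so this is purely a conceptual sketch with a hypothetical read:

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ACGTACGTGACG", k=3)        # hypothetical read
print(counts.most_common(3))
```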

Journal ArticleDOI
TL;DR: A novel in silico framework for phylogeny and classification of prokaryotic viruses is presented, in line with the principles of phylogenetic systematics, and using a large reference dataset of officially classified viruses.
Abstract: Motivation Bacterial and archaeal viruses are crucial for global biogeochemical cycles and might well be game-changing therapeutic agents in the fight against multi-resistant pathogens. Nevertheless, it is still unclear how to best use genome sequence data for a fast, universal and accurate taxonomic classification of such viruses. Results We here present a novel in silico framework for phylogeny and classification of prokaryotic viruses, in line with the principles of phylogenetic systematics, and using a large reference dataset of officially classified viruses. The resulting trees revealed a high agreement with the classification. Except for low resolution at the family level, the majority of taxa was well supported as monophyletic. Clusters obtained with distance thresholds chosen for maximizing taxonomic agreement appeared phylogenetically reasonable, too. Analysis of an expanded dataset, containing >4000 genomes from public databases, revealed a large number of novel species, genera, subfamilies and families. Availability and implementation The selected methods are available as the easy-to-use web service 'VICTOR' at https://victor.dsmz.de. Contact jan.meier-kolthoff@dsmz.de. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The model returns a predicted solubility and an indication of the features which deviate most from average values, and the utility of these additional features is demonstrated with the example of thioredoxin.
Abstract: Motivation Protein solubility is an important property in industrial and therapeutic applications. Prediction is a challenge, despite a growing understanding of the relevant physicochemical properties. Results Protein-Sol is a web server for predicting protein solubility. Using available data for Escherichia coli protein solubility in a cell-free expression system, 35 sequence-based properties are calculated. Feature weights are determined from separation of low and high solubility subsets. The model returns a predicted solubility and an indication of the features which deviate most from average values. Two other properties are profiled in windowed calculation along the sequence: fold propensity and net segment charge. The utility of these additional features is demonstrated with the example of thioredoxin. Availability and implementation The Protein-Sol webserver is available at http://protein-sol.manchester.ac.uk. Contact jim.warwicker@manchester.ac.uk.
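To illustrate the flavour of a sequence-property-based solubility score, the sketch below computes two simple composition features and combines them with made-up weights; the real Protein-Sol model uses 35 properties with weights fitted to E. coli solubility data, so nothing here reproduces its actual scoring:

```python
from collections import Counter

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"           # hypothetical protein sequence
comp = Counter(seq)
n = len(seq)

frac_charged = sum(comp[a] for a in "DEKR") / n      # charged residue fraction
frac_hydrophobic = sum(comp[a] for a in "AILMFWV") / n

# Purely hypothetical weights; Protein-Sol learns its weights from the
# separation of low- and high-solubility subsets.
score = 0.5 + 1.0 * frac_charged - 1.0 * frac_hydrophobic
print(round(score, 2))
```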

Journal ArticleDOI
TL;DR: The application of LSTM‐BRNN to the prediction of protein structural properties makes the most significant improvement for residues with the most long‐range contacts over a previous window‐based, deep‐learning method SPIDER2.
Abstract: Motivation: The accuracy of predicting protein local and global structural properties such as secondary structure and solvent accessible surface area has been stagnant for many years because of the challenge of accounting for non-local interactions between amino acid residues that are close in three-dimensional structural space but far from each other in their sequence positions. All existing machine-learning techniques relied on a sliding window of 10–20 amino acid residues to capture some 'short to intermediate' non-local interactions. Here, we employed Long Short-Term Memory (LSTM) Bidirectional Recurrent Neural Networks (BRNNs) which are capable of capturing long-range interactions without using a window. Results: We showed that the application of LSTM-BRNN to the prediction of protein structural properties makes the most significant improvement for residues with the most long-range contacts (|i-j| > 19) over a previous window-based, deep-learning method SPIDER2. Capturing long-range interactions allows the accuracy of three-state secondary structure prediction to reach 84% and the correlation coefficient between predicted and actual solvent accessible surface areas to reach 0.80, plus a reduction of 5%, 10%, 5% and 10% in the mean absolute error for backbone ϕ, ψ, θ and τ angles, respectively, from SPIDER2. More significantly, 27% of 182724 40-residue models directly constructed from predicted Cα atom-based θ and τ have similar structures to their corresponding native structures (6 Å RMSD or less), which is 3% better than models built by ϕ and ψ angles. We expect the method to be useful for assisting protein structure and function prediction.

Journal ArticleDOI
TL;DR: ViPTree serves as a platform to visually investigate genomic alignments and automatically annotated gene functions for the uploaded viral genomes, thus providing virus researchers the first choice for classifying and understanding newly sequenced viral genomes.
Abstract: Summary ViPTree is a web server provided through GenomeNet to generate viral proteomic trees for classification of viruses based on genome-wide similarities. Users can upload viral genomes sequenced either by genomics or metagenomics. ViPTree generates proteomic trees for the uploaded genomes together with flexibly selected reference viral genomes. ViPTree also serves as a platform to visually investigate genomic alignments and automatically annotated gene functions for the uploaded viral genomes, thus providing virus researchers the first choice for classifying and understanding newly sequenced viral genomes. Availability and implementation ViPTree is freely available at: http://www.genome.jp/viptree . Contact goto@kuicr.kyoto-u.ac.jp. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Cas-Analyzer is an online, JavaScript-based tool for NGS data analysis that runs entirely in a client-side web browser on-the-fly, so there is no need to upload very large NGS datasets to a server, a time-consuming step in genome editing analysis.
Abstract: Genome editing with programmable nucleases has been widely adopted in research and medicine. Next generation sequencing (NGS) platforms are now widely used for measuring the frequencies of mutations induced by CRISPR-Cas9 and other programmable nucleases. Here, we present an online tool, Cas-Analyzer, a JavaScript-based implementation for NGS data analysis. Because Cas-Analyzer runs entirely on-the-fly in a client-side web browser, there is no need to upload very large NGS datasets to a server, a time-consuming step in genome editing analysis. Currently, Cas-Analyzer supports various programmable nucleases, including single nucleases and paired nucleases. Availability and implementation Free access at http://www.rgenome.net/cas-analyzer/. Contact: sangsubae@hanyang.ac.kr or jskim01@snu.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This work presents FlashPCA2, a tool that can perform partial PCA on 1 million individuals faster than competing approaches, while requiring substantially less memory.
Abstract: Motivation Principal component analysis (PCA) is a crucial step in quality control of genomic data and a common approach for understanding population genetic structure. With the advent of large genotyping studies involving hundreds of thousands of individuals, standard approaches are no longer feasible. However, when the full decomposition is not required, substantial computational savings can be made. Results We present FlashPCA2, a tool that can perform partial PCA on 1 million individuals faster than competing approaches, while requiring substantially less memory. Availability and implementation https://github.com/gabraham/flashpca . Contact gad.abraham@unimelb.edu.au. Supplementary information Supplementary data are available at Bioinformatics online.
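The key idea, computing only the leading components ("partial" PCA) rather than a full decomposition, can be sketched with a randomized SVD in Python; FlashPCA2 itself is a standalone C++ tool, and the genotype matrix below is a tiny hypothetical stand-in:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(4)
genotypes = rng.integers(0, 3, size=(500, 2000)).astype(float)  # individuals x SNPs (0/1/2)

X = genotypes - genotypes.mean(axis=0)            # centre each SNP
U, s, Vt = randomized_svd(X, n_components=10, random_state=0)   # only the top 10 components
pcs = U * s                                       # principal component scores per individual
print(pcs.shape)                                  # (500, 10)
```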

Journal ArticleDOI
TL;DR: A predictor, called iRSpot-EL, is developed by fusing different modes of pseudo K-tuple nucleotide composition and mode of dinucleotide-based auto-cross covariance into an ensemble classifier of clustering approach, which remarkably outperforms its existing counterparts.
Abstract: Motivation: Coexisting in a DNA system, meiosis and recombination are two indispensable aspects of cell reproduction and growth. With the avalanche of genome sequences emerging in the postgenomic age, it is an urgent challenge to acquire the information of DNA recombination spots because it can timely provide very useful insights into the mechanism of meiotic recombination and the process of genome evolution. Results: To address such a challenge, we have developed a predictor, called iRSpot-EL, by fusing different modes of pseudo K-tuple nucleotide composition and mode of dinucleotide-based auto-cross covariance into an ensemble classifier of clustering approach. Five-fold cross-validation tests on a widely used benchmark dataset have indicated that the new predictor remarkably outperforms its existing counterparts. Particularly, far beyond their reach, the new predictor can be easily used to conduct genome-wide analysis, and the results obtained are quite consistent with the experimental map. Availability and Implementation: For the convenience of most experimental scientists, a user-friendly web server for iRSpot-EL has been established at http://bioinformatics.hitsz.edu.cn/iRSpot-EL/, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.

Journal ArticleDOI
TL;DR: The software Lep-MAP3, capable of mapping high-throughput whole genome sequencing datasets, obtains very good performance already on 5x sequencing coverage and outperforms the fastest available software on simulated data on accuracy and often on speed.
Abstract: Motivation Accurate and dense linkage maps are useful in family-based linkage and association studies, quantitative trait locus mapping, analysis of genome synteny and other genomic data analyses. Moreover, linkage mapping is one of the best ways to detect errors in de novo genome assemblies, as well as to orient and place assembly contigs within chromosomes. A small mapping cross of tens of individuals will detect many errors where distant parts of the genome are erroneously joined together. With more individuals and markers, even more local errors can be detected and more contigs can be oriented. However, the tools that are currently available for constructing linkage maps are not well suited for large, possibly low-coverage, whole genome sequencing datasets. Results Here we present Lep-MAP3, a linkage mapping software capable of mapping high-throughput whole genome sequencing datasets. Such data allows cost-efficient genotyping of millions of single nucleotide polymorphisms (SNPs) for thousands of individual samples, enabling, among other analyses, comprehensive validation and refinement of de novo genome assemblies. The algorithms of Lep-MAP3 can analyse low-coverage datasets and reduce data filtering and curation on any data. This yields more markers in the final maps with less manual work, even on problematic datasets. We demonstrate that Lep-MAP3 obtains very good performance already at 5x sequencing coverage and outperforms the fastest available software on simulated data in accuracy and often in speed. We also construct de novo linkage maps from 7-12x whole-genome data on the red postman butterfly (Heliconius erato) with almost 3 million markers. Availability and implementation Lep-MAP3 is available with the source code under the GNU General Public License from http://sourceforge.net/projects/lep-map3. Contact pasi.rastas@helsinki.fi. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The novel and efficient algorithm SCODE is developed to infer regulatory networks from single‐cell RNA‐Seq during differentiation, based on ordinary differential equations, and it is confirmed that SCODE can reconstruct observed expression dynamics.
Abstract: Motivation The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation. Results In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with use of a DNaseI-footprint based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses. Availability and implementation The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE. Contact hirotaka.matsumoto@riken.jp. Supplementary information Supplementary data are available at Bioinformatics online.
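To convey the general idea of ODE-based inference along pseudo-time, the sketch below regresses finite-difference derivatives of expression on expression to obtain an interaction matrix A in dx/dt = Ax; SCODE itself uses a more efficient low-dimensional reformulation, and all data here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)
genes, cells = 5, 200
t = np.sort(rng.uniform(0, 1, size=cells))            # pseudo-time of each cell
X = rng.normal(size=(genes, cells))                   # hypothetical expression matrix (genes x cells)

dX = np.gradient(X, t, axis=1)                        # approximate dx/dt along pseudo-time
B, *_ = np.linalg.lstsq(X.T, dX.T, rcond=None)        # least-squares fit of dX ~ A X
A = B.T
print(np.round(A, 2))                                 # A[i, j]: inferred effect of gene j on gene i
```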

Journal ArticleDOI
TL;DR: PHYLOViZ 2.0 is presented, an extension of the PHYLOViZ tool, a platform-independent Java tool that allows phylogenetic inference and data visualization for large datasets of sequence-based typing methods, including Single Nucleotide Polymorphism (SNP) and whole genome/core genome Multilocus Sequence Typing (wg/cgMLST) analysis.
Abstract: Summary: High-throughput sequencing provides a cost-effective means of generating high-resolution data for hundreds or even thousands of strains, and is rapidly superseding methodologies based on a few genomic loci. The wealth of genomic data deposited on public databases such as the Sequence Read Archive/European Nucleotide Archive provides a powerful resource for evolutionary analysis and epidemiological surveillance. However, many of the analysis tools currently available do not scale well to these large datasets, nor provide the means to fully integrate ancillary data. Here we present PHYLOViZ 2.0, an extension of the PHYLOViZ tool, a platform-independent Java tool that allows phylogenetic inference and data visualization for large datasets of sequence-based typing methods, including Single Nucleotide Polymorphism (SNP) and whole genome/core genome Multilocus Sequence Typing (wg/cgMLST) analysis. PHYLOViZ 2.0 incorporates new data analysis algorithms and new visualization modules, as well as the capability of saving projects for subsequent work or for dissemination of results. Availability and Implementation: http://www.phyloviz.net/ (licensed under GPLv3). Contact: cvaz@inesc-id.pt. Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The annotatr Bioconductor package is developed to flexibly and quickly summarize and plot annotations of genomic regions, giving a better understanding of the genomic context of the regions.
Abstract: Motivation Analysis of next-generation sequencing data often results in a list of genomic regions. These may include differentially methylated CpGs/regions, transcription factor binding sites, interacting chromatin regions, or GWAS-associated SNPs, among others. A common analysis step is to annotate such genomic regions to genomic annotations (promoters, exons, enhancers, etc.). Existing tools are limited by a lack of annotation sources and flexible options, the time it takes to annotate regions, an artificial one-to-one region-to-annotation mapping, a lack of visualization options to easily summarize data, or some combination thereof. Results We developed the annotatr Bioconductor package to flexibly and quickly summarize and plot annotations of genomic regions. The annotatr package reports all intersections of regions and annotations, giving a better understanding of the genomic context of the regions. A variety of graphics functions are implemented to easily plot numerical or categorical data associated with the regions across the annotations, and across annotation intersections, providing insight into how characteristics of the regions differ across the annotations. We demonstrate that annotatr is up to 27× faster than comparable R packages. Overall, annotatr enables a richer biological interpretation of experiments. Availability and implementation http://bioconductor.org/packages/annotatr/ and https://github.com/rcavalcante/annotatr. Contact rcavalca@umich.edu. Supplementary information Supplementary data are available at Bioinformatics online.
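The one-to-many intersection behaviour described above, where every overlapping annotation is reported rather than a single best hit, can be sketched as follows; annotatr is an R/Bioconductor package, the intervals here are hypothetical, and a real implementation would use interval trees rather than nested loops:

```python
regions = [("chr1", 100, 250), ("chr1", 900, 1200)]          # hypothetical query regions
annotations = [
    ("chr1", 0, 200, "promoter"),
    ("chr1", 150, 600, "exon"),
    ("chr1", 1000, 2000, "intron"),
]

for chrom, start, end in regions:
    hits = [name for (achrom, astart, aend, name) in annotations
            if achrom == chrom and start < aend and astart < end]
    print(f"{chrom}:{start}-{end}", hits or ["intergenic"])   # all overlaps reported
```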

Journal ArticleDOI
TL;DR: The Actinobacteriophage Database (PhagesDB) is a comprehensive, interactive, database-backed website that collects and shares information related to the discovery, characterization and genomics of viruses that infect actinobacterial hosts.
Abstract: The Actinobacteriophage Database (PhagesDB) is a comprehensive, interactive, database-backed website that collects and shares information related to the discovery, characterization and genomics of viruses that infect actinobacterial hosts. To date, more than 8000 bacteriophages, including over 1600 with sequenced genomes, have been entered into the database. PhagesDB plays a crucial role in organizing the discoveries of phage biologists around the world, including students in the SEA-PHAGES program, and has been cited in over 50 peer-reviewed articles. Availability and implementation http://phagesdb.org/. Contact gfh@pitt.edu.

Journal ArticleDOI
TL;DR: The McPAS‐TCR database currently contains more than 5000 sequences of TCRs associated with various pathologic conditions and their respective antigens in humans and in mice and provides interesting insights on pathology‐associated TCR sequences.
Abstract: Motivation While growing numbers of T cell receptor (TCR) repertoires are being mapped by high-throughput sequencing, existing methods do not allow for computationally connecting a given TCR sequence to its target antigen, or relating it to a specific pathology. As an alternative, a manually-curated database can relate TCR sequences with their cognate antigens and associated pathologies based on published experimental data. Results We present McPAS-TCR, a manually curated database of TCR sequences associated with various pathologies and antigens based on published literature. Our database currently contains more than 5000 sequences of TCRs associated with various pathologic conditions (including pathogen infections, cancer and autoimmunity) and their respective antigens in humans and in mice. A web-based tool allows for searching the database based on different criteria, and for finding annotated sequences from the database in users' data. The McPAS-TCR website assembles information from a large number of studies that is very hard to dissect otherwise. Initial analyses of the data provide interesting insights on pathology-associated TCR sequences. Availability and implementation Free access at http://friedmanlab.weizmann.ac.il/McPAS-TCR/ . Contact nir.friedman@weizmann.ac.il.