Showing papers in "Bioinformatics in 2016"


Journal ArticleDOI
TL;DR: The power of ComplexHeatmap is demonstrated to easily reveal patterns and correlations among multiple sources of information with four real-world datasets.
Abstract: Summary: Parallel heatmaps with carefully designed annotation graphics are powerful for efficient visualization of patterns and relationships among high dimensional genomic data. Here we present the ComplexHeatmap package that provides rich functionalities for customizing heatmaps, arranging multiple parallel heatmaps and including user-defined annotation graphics. We demonstrate the power of ComplexHeatmap to easily reveal patterns and correlations among multiple sources of information with four real-world datasets. Availability and Implementation: The ComplexHeatmap package and documentation are freely available from the Bioconductor project: http://www.bioconductor.org/packages/devel/bioc/html/ComplexHeatmap.html. Contact: m.schlesner@dkfz.de Supplementary information: Supplementary data are available at Bioinformatics online.

4,733 citations


Journal ArticleDOI
TL;DR: This work presents MultiQC, a tool to create a single report visualising output from multiple tools across many samples, enabling global trends and biases to be quickly identified.
Abstract: Motivation: Fast and accurate quality control is essential for studies involving next-generation sequencing data. Whilst numerous tools exist to quantify QC metrics, there is no common approach to ...

3,688 citations


Journal ArticleDOI
TL;DR: The JSpeciesWS service indicates whether two genomes share genomic identities above or below the species embracing thresholds, and serves as a fast way to allocate unknown genomes in the frame of the hitherto sequenced species.
Abstract: This research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 311975. This publication reflects the views only of the author, and the European Union cannot be held responsible for any use which may be made of the information contained therein.

1,653 citations


Journal ArticleDOI
TL;DR: An expanded binning algorithm, MaxBin 2.0, is presented; it recovers genomes from co-assembly of a collection of metagenomic datasets and is highly accurate in recovering individual genomes on simulated datasets.
Abstract: Summary: The recovery of genomes from metagenomic datasets is a critical step to defining the functional roles of the underlying uncultivated populations. We previously developed MaxBin, an automated binning approach for high-throughput recovery of microbial genomes from metagenomes. Here we present an expanded binning algorithm, MaxBin 2.0, which recovers genomes from co-assembly of a collection of metagenomic datasets. Tests on simulated datasets revealed that MaxBin 2.0 is highly accurate in recovering individual genomes, and the application of MaxBin 2.0 to several metagenomes from environmental samples demonstrated that it could achieve two complementary goals: recovering more bacterial genomes compared to binning a single sample as well as comparing the microbial community composition between different sampling environments. Availability and implementation: MaxBin 2.0 is freely available at http://sourceforge.net/projects/maxbin/ under BSD license. Supplementary information: Supplementary data are available at Bioinformatics online.

1,265 citations


Journal ArticleDOI
TL;DR: Manta is optimized for rapid germline and somatic analysis, calling structural variants, medium-sized indels and large insertions on standard compute hardware in less than a tenth of the time that comparable methods require to identify only subsets of these variant types.
Abstract: We describe Manta, a method to discover structural variants and indels from next generation sequencing data. Manta is optimized for rapid germline and somatic analysis, calling structural variants, medium-sized indels and large insertions on standard compute hardware in less than a tenth of the time that comparable methods require to identify only subsets of these variant types: for example NA12878 at 50× genomic coverage is analyzed in less than 20 min. Manta can discover and score variants based on supporting paired and split-read evidence, with scoring models optimized for germline analysis of diploid individuals and somatic analysis of tumor-normal sample pairs. Call quality is similar to or better than comparable methods, as determined by pedigree consistency of germline calls and comparison of somatic calls to COSMIC database variants. Manta consistently assembles a higher fraction of its calls to base-pair resolution, allowing for improved downstream annotation and analysis of clinical significance. We provide Manta as a community resource to facilitate practical and routine structural variant analysis in clinical and research sequencing scenarios. Availability and implementation: Manta is released under the open-source GPLv3 license. Source code, documentation and Linux binaries are available from https://github.com/Illumina/manta. Contact: csaunders@illumina.com Supplementary information: Supplementary data are available at Bioinformatics online.

1,224 citations


Journal ArticleDOI
Heng Li
TL;DR: A new mapper, minimap, and a de novo assembler, miniasm, are presented for efficiently mapping and assembling SMRT and ONT reads without an error correction stage.
Abstract: Motivation: Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore technologies (ONT) produce reads over 10 kb in length, which have enabled high-quality genome assembly at an affordable cost. However, at present, long reads have an error rate as high as 10–15%. Complex and computationally intensive pipelines are required to assemble such reads. Results: We present a new mapper, minimap and a de novo assembler, miniasm, for efficiently mapping and assembling SMRT and ONT reads without an error correction stage. They can often assemble a sequencing run of bacterial data into a single contig in a few minutes, and assemble 45-fold Caenorhabditis elegans data in 9 min, orders of magnitude faster than the existing pipelines, though the consensus sequence error rate is as high as raw reads. We also introduce a pairwise read mapping format and a graphical fragment assembly format, and demonstrate the interoperability between ours and current tools. Availability and implementation: https://github.com/lh3/minimap and https://github.com/lh3/miniasm Contact: hengli@broadinstitute.org Supplementary information: Supplementary data are available at Bioinformatics online.

1,060 citations


Journal ArticleDOI
TL;DR: It is shown that prediction methods based on alignments that include insertions and deletions have significantly higher performance than methods trained on peptides of single lengths and that the NetMHC-4.0 method can learn the length profile of different MHC molecules.
Abstract: Motivation: Many biological processes are guided by receptor interactions with linear ligands of variable length. One such receptor is the MHC class I molecule. The length preferences vary depending on the MHC allele, but are generally limited to peptides of length 8–11 amino acids. On this relatively simple system, we developed a sequence alignment method based on artificial neural networks that allows insertions and deletions in the alignment. Results: We show that prediction methods based on alignments that include insertions and deletions have significantly higher performance than methods trained on peptides of single lengths. Also, we illustrate how the location of deletions can aid the interpretation of the modes of binding of the peptide-MHC, as in the case of long peptides bulging out of the MHC groove or protruding at either terminus. Finally, we demonstrate that the method can learn the length profile of different MHC molecules, and quantify the reduction of the experimental effort required to identify potential epitopes using our prediction algorithm. Availability and implementation: The NetMHC-4.0 method for the prediction of peptide-MHC class I binding affinity using gapped sequence alignment is publicly available at: http://www.cbs.dtu.dk/services/NetMHC-4.0. Contact: mniel@cbs.dtu.dk Supplementary information: Supplementary data are available at Bioinformatics online.

826 citations


Journal ArticleDOI
TL;DR: BRAKER1 is presented, a pipeline for unsupervised RNA-Seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS; BRAKER1 was observed to be more accurate than MAKER2 when using RNA-Seq as the sole source for training and prediction.
Abstract: Motivation: Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a work flow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. Complementary strengths of GeneMark-ET and AUGUSTUS provided motivation for designing a new combined tool for automatic gene prediction. Results: We present BRAKER1, a pipeline for unsupervised RNA-Seq-based genome annotation that combines the advantages of GeneMark-ET and AUGUSTUS. As input, BRAKER1 requires a genome assembly file and a file in bam-format with spliced alignments of RNA-Seq reads to the genome. First, GeneMark-ET performs iterative training and generates initial gene structures. Second, AUGUSTUS uses predicted genes for training and then integrates RNA-Seq read information into final gene predictions. In our experiments, we observed that BRAKER1 was more accurate than MAKER2 when it is using RNA-Seq as sole source for training and prediction. BRAKER1 does not require pre-trained parameters or a separate expert-prepared training step. Availability and implementation: BRAKER1 is available for download at http://bioinf.uni-greifswald.de/bioinf/braker/ and http://exon.gatech.edu/GeneMark/ Contact: katharina.hoff@uni-greifswald.de or borodovsky@gatech.edu Supplementary information: Supplementary data are available at Bioinformatics online.

809 citations


Journal ArticleDOI
TL;DR: The MorphoLibJ library proposes a large collection of generic tools based on MM to process binary and grey-level 2D and 3D images, integrated into user-friendly plugins.
Abstract: Motivation: Mathematical morphology (MM) provides many powerful operators for processing 2D and 3D images. However, most MM plugins currently implemented for the popular ImageJ/Fiji platform are limited to the processing of 2D images. Results: The MorphoLibJ library proposes a large collection of generic tools based on MM to process binary and grey-level 2D and 3D images, integrated into user-friendly plugins. We illustrate how MorphoLibJ can facilitate the exploitation of 3D images of plant tissues. Availability and Implementation: MorphoLibJ is freely available at http://imagej.net/MorphoLibJ

796 citations


Journal ArticleDOI
TL;DR: PhenoScanner is a curated database of publicly available results from large-scale genetic association studies that aims to facilitate ‘phenome scans’, the cross-referencing of genetic variants with many phenotypes, to help aid understanding of disease pathways and biology.
Abstract: This work was supported by the UK Medical Research Council [G66840, G0800270], Pfizer [G73632], British Heart Foundation [SP/09/002], UK National Institute for Health Research Cambridge Biomedical Research Centre, European Research Council [268834], and European Commission Framework Programme 7 [HEALTH-F2-2012-279233].

766 citations


Journal ArticleDOI
TL;DR: PROtein binDIng enerGY prediction (PRODIGY) is presented, a web server that predicts the binding affinity of protein-protein complexes from their 3D structure, based on intermolecular contacts and properties derived from the non-interface surface.
Abstract: Gaining insights into the structural determinants of protein-protein interactions holds the key for a deeper understanding of biological functions, diseases and development of therapeutics. An important aspect of this is the ability to accurately predict the binding strength for a given protein-protein complex. Here we present PROtein binDIng enerGY prediction (PRODIGY), a web server to predict the binding affinity of protein-protein complexes from their 3D structure. The PRODIGY server implements our simple but highly effective predictive model based on intermolecular contacts and properties derived from non-interface surface. AVAILABILITY AND IMPLEMENTATION: PRODIGY is freely available at: http://milou.science.uu.nl/services/PRODIGY CONTACT: a.m.j.j.bonvin@uu.nl, a.vangone@uu.nl.

Journal ArticleDOI
TL;DR: PHYLUCE is an efficient and easy-to-install software package that preprocesses data from targeted enrichment of conserved and ultraconserved genomic elements across hundreds of taxa and thousands of enriched loci.
Abstract: Summary: Targeted enrichment of conserved and ultraconserved genomic elements allows universal collection of phylogenomic data from hundreds of species at multiple time scales (< 5 Ma to > 300 Ma). Prior to downstream inference, data from these types of targeted enrichment studies must undergo preprocessing to assemble contigs from sequence data; identify targeted, enriched loci from the off-target background data; align enriched contigs representing conserved loci to one another; and prepare and manipulate these alignments for subsequent phylogenomic inference. PHYLUCE is an efficient and easy-to-install software package that accomplishes these tasks across hundreds of taxa and thousands of enriched loci. Availability and Implementation: PHYLUCE is written for Python 2.7. PHYLUCE is supported on OSX and Linux (RedHat/CentOS) operating systems. PHYLUCE source code is distributed under a BSD-style license from https://www.github.com/faircloth-lab/phyluce/. PHYLUCE is also available as a package (https://binstar.org/faircloth-lab/phyluce) for the Anaconda Python distribution that installs all dependencies, and users can request a PHYLUCE instance on iPlant Atmosphere (tag: phyluce). The software manual and a tutorial are available from http://phyluce.readthedocs.org/en/latest/ and test data are available from doi: 10.6084/m9.figshare.1284521. Contact: brant@faircloth-lab.org Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The missMethyl package is an R package with a suite of tools for performing normalization, removal of unwanted variation in differential methylation analysis, differential variability testing and gene set analysis for the Illumina HumanMethylation450 BeadChip.
Abstract: DNA methylation is one of the most commonly studied epigenetic modifications due to its role in both disease and development. The Illumina HumanMethylation450 BeadChip is a cost-effective way to profile >450 000 CpGs across the human genome, making it a popular platform for profiling DNA methylation. Here we introduce missMethyl, an R package with a suite of tools for performing normalization, removal of unwanted variation in differential methylation analysis, differential variability testing and gene set analysis for the 450K array. Availability and implementation: missMethyl is an R package available from the Bioconductor project at www.bioconductor.org. Contact: alicia.oshlack@mcri.edu.au Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Calypso is an easy‐to‐use online software suite that allows non‐expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets and has a focus on multivariate statistical approaches that can identify complex environment‐microbiome associations.
Abstract: Calypso is an easy-to-use online software suite that allows non-expert users to mine, interpret and compare taxonomic information from metagenomic or 16S rDNA datasets. Calypso has a focus on multivariate statistical approaches that can identify complex environment-microbiome associations. The software enables quantitative visualizations, statistical testing, multivariate analysis, supervised learning, factor analysis, multivariable regression, network analysis and diversity estimates. Comprehensive help pages, tutorials and videos are provided via a wiki page.

Journal ArticleDOI
TL;DR: Modifications to the minfi package required to support the HumanMethylationEPIC ('EPIC') array from Illumina are described. The single-sample Noob (ssNoob) method is introduced, a normalization procedure suitable for incremental preprocessing of individual methylation arrays, and the authors conclude that this method should be used when integrating data from multiple generations of Infinium methylation arrays.
Abstract: Summary: The minfi package is widely used for analyzing Illumina DNA methylation array data. Here we describe modifications to the minfi package required to support the HumanMethylationEPIC ('EPIC') array from Illumina. We discuss methods for the joint analysis and normalization of data from the HumanMethylation450 ('450k') and EPIC platforms. We introduce the single-sample Noob (ssNoob) method, a normalization procedure suitable for incremental preprocessing of individual methylation arrays and conclude that this method should be used when integrating data from multiple generations of Infinium methylation arrays. We show how to use reference 450k datasets to estimate cell type composition of samples on EPIC arrays. The cumulative effect of these updates is to ensure that minfi provides the tools to best integrate existing and forthcoming Illumina methylation array data. Availability and implementation: The minfi package version 1.19.12 or higher is available for all platforms from the Bioconductor project. Contact: khansen@jhsph.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: FINEMAP is introduced, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm, and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies and emerging sequencing projects.
Abstract: Motivation: The goal of fine-mapping in genomic regions associated with complex diseases and traits is to identify causal variants that point to molecular mechanisms behind the associations. Recent fine-mapping methods using summary data from genome-wide association studies rely on exhaustive search through all possible causal configurations, which is computationally expensive. Results: We introduce FINEMAP, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm. We show that FINEMAP produces accurate results in a fraction of processing time of existing approaches and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies and emerging sequencing projects. Availability and implementation: FINEMAP v1.0 is freely available for Mac OS X and Linux at http://www.christianbenner.com. Contact: christian.benner@helsinki.fi or matti.pirinen@helsinki.fi

Journal ArticleDOI
TL;DR: This work exemplarily applies destiny, an efficient R implementation of the diffusion map algorithm, to a recent time-resolved mass cytometry dataset of cellular reprogramming and presents an efficient nearest-neighbour approximation.
Abstract: Diffusion maps are a spectral method for non-linear dimension reduction and have recently been adapted for the visualization of single-cell expression data. Here we present destiny, an efficient R implementation of the diffusion map algorithm. Our package includes a single-cell specific noise model allowing for missing and censored values. In contrast to previous implementations, we further present an efficient nearest-neighbour approximation that allows for the processing of hundreds of thousands of cells and a functionality for projecting new data on existing diffusion maps. We exemplarily apply destiny to a recent time-resolved mass cytometry dataset of cellular reprogramming. Availability and implementation: destiny is an open-source R/Bioconductor package (bioconductor.org/packages/destiny), also available at www.helmholtz-muenchen.de/icb/destiny. A detailed vignette describing functions and workflows is provided with the package. Contact: carsten.marr@helmholtz-muenchen.de or f.buettner@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: BCFtools/RoH, an extension to the BCFtools software package, is presented and evaluated; it detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model, and is shown to have higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity.
Abstract: Summary: Runs of homozygosity (RoHs) are genomic stretches of a diploid genome that show identical alleles on both chromosomes. Longer RoHs are unlikely to have arisen by chance but are likely to denote autozygosity, whereby both copies of the genome descend from the same recent ancestor. Early tools to detect RoH used genotype array data, but substantially more information is available from sequencing data. Here, we present and evaluate BCFtools/RoH, an extension to the BCFtools software package, that detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model. By applying it to simulated data and real data from the 1000 Genomes Project we estimate its accuracy and show that it has higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity. Availability and implementation: BCFtools/RoH and its associated binary/source files are freely available from https://github.com/samtools/BCFtools. Contact: vn2@sanger.ac.uk or pd3@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
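
To make the definition of a run of homozygosity concrete, here is a minimal sketch of a naive baseline that simply scans for maximal stretches of consecutive homozygous sites. It is not the hidden Markov model used by BCFtools/RoH, which also accounts for sequencing error and allele frequencies; the function name `naive_roh`, the input layout and the `min_length` threshold are illustrative assumptions.

```python
from typing import List, Tuple


def naive_roh(sites: List[Tuple[int, bool]], min_length: int) -> List[Tuple[int, int]]:
    """Return (start, end) positions of maximal runs of homozygous sites.

    `sites` is a position-sorted list of (position, is_homozygous) pairs for
    one chromosome (hypothetical input layout). A run is reported only if it
    spans at least `min_length` base pairs. Any heterozygous site ends the
    current run, so a single genotyping error splits a true run; handling
    that is what a probabilistic model is for.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # first position of the run currently being extended
    last = None   # most recent homozygous position in that run
    for pos, is_hom in sites:
        if is_hom:
            if start is None:
                start = pos
            last = pos
        else:
            if start is not None and last - start + 1 >= min_length:
                runs.append((start, last))
            start = None
    # Close a run that extends to the final site.
    if start is not None and last - start + 1 >= min_length:
        runs.append((start, last))
    return runs


example_sites = [(100, True), (200, True), (300, True),
                 (400, False), (500, True), (600, True)]
long_runs = naive_roh(example_sites, min_length=150)
```

In this toy input the stretch from 100 to 300 passes the 150 bp threshold, while the stretch from 500 to 600 does not.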

Journal ArticleDOI
TL;DR: This paper presents Combenefit, a new free software tool that enables the visualization, analysis and quantification of drug combination effects in terms of synergy and/or antagonism, and provides laboratory scientists with an easy and systematic way to analyze their data.
Abstract: Motivation: Many drug combinations are routinely assessed to identify synergistic interactions in the attempt to develop novel treatment strategies. Appropriate software is required to analyze the results of these studies. Results: We present Combenefit, a new free software tool that enables the visualization, analysis and quantification of drug combination effects in terms of synergy and/or antagonism. Data from combination assays can be processed using classical synergy models (Loewe, Bliss, HSA), as single experiments or in batch for high-throughput screens. This user-friendly tool provides laboratory scientists with an easy and systematic way to analyze their data. The companion package provides bioinformaticians with critical implementations of routines enabling the processing of combination data. Availability and Implementation: Combenefit is provided as a Matlab package but also as standalone software for Windows (http://sourceforge.net/projects/combenefit/). Contact: Giovanni.DiVeroli@cruk.cam.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
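
Two of the classical reference models named above have simple closed forms, sketched here for a single dose pair. This is an illustration of the standard textbook definitions, not Combenefit's implementation; the function names are hypothetical, and effects are assumed to be fractional (0 = no effect, 1 = full effect). The Loewe model is omitted because it needs fitted single-agent dose-response curves.

```python
def bliss_expected(effect_a: float, effect_b: float) -> float:
    """Expected combined effect if the two drugs act independently (Bliss)."""
    return effect_a + effect_b - effect_a * effect_b


def hsa_expected(effect_a: float, effect_b: float) -> float:
    """Expected combined effect under Highest Single Agent: the better drug alone."""
    return max(effect_a, effect_b)


def excess_over_reference(observed: float, expected: float) -> float:
    """Positive values suggest synergy, negative values antagonism."""
    return observed - expected


# Toy numbers: each drug alone inhibits 50%, the combination inhibits 90%.
bliss_ref = bliss_expected(0.5, 0.5)
bliss_excess = excess_over_reference(0.9, bliss_ref)
hsa_excess = excess_over_reference(0.9, hsa_expected(0.5, 0.5))
```

With these toy numbers the Bliss reference is 0.75, so the observed 0.9 exceeds it by about 0.15, and exceeds the HSA reference of 0.5 by about 0.4. The same data can therefore look more or less synergistic depending on the reference model chosen.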

Journal ArticleDOI
TL;DR: The results demonstrate that hybridSPAdes generates accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing; the first complete assembly of a genome from single cells using SMRT reads is also presented.
Abstract: Motivation: Recent advances in single molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long and inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, the inexpensive short-read technologies produce accurate but fragmented assemblies. Thus, a hybrid approach that assembles long reads (with low coverage) and short reads has the potential to generate high-quality assemblies at reduced cost. Results: We describe the hybridSPAdes algorithm for assembling short and long reads and benchmark it on a variety of bacterial assembly projects. Our results demonstrate that hybridSPAdes generates accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing. We further present the first complete assembly of a genome from single cells using SMRT reads. Availability and implementation: hybridSPAdes is implemented in C++ as a part of SPAdes genome assembler and is publicly available at http://bioinf.spbau.ru/en/spades Contact: d.antipov@spbu.ru Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: MetaQUAST is presented, a modification of QUAST, the state-of-the-art tool for genome assembly evaluation based on alignment of contigs to a reference; it addresses features of metagenome datasets such as unknown species content, by detecting and downloading reference sequences.
Abstract: Summary: During the past years we have witnessed the rapid development of new metagenome assembly methods. Although there are many benchmark utilities designed for single-genome assemblies, there is no well-recognized evaluation and comparison tool for metagenomic-specific analogues. In this article, we present MetaQUAST, a modification of QUAST, the state-of-the-art tool for genome assembly evaluation based on alignment of contigs to a reference. MetaQUAST addresses features of metagenome datasets such as (i) unknown species content, by detecting and downloading reference sequences, (ii) huge diversity, by giving comprehensive reports for multiple genomes and (iii) presence of highly related species, by detecting chimeric contigs. We demonstrate MetaQUAST performance by comparing several leading assemblers on one simulated and two real datasets. Availability and implementation: http://bioinf.spbau.ru/metaquast. Contact: aleksey.gurevich@spbu.ru Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The plasmidSPAdes algorithm and software tool for assembling plasmids from whole genome sequencing data is presented and its performance is benchmarked on a diverse set of bacterial genomes.
Abstract: Motivation: Plasmids are stably maintained extra-chromosomal genetic elements that replicate independently from the host cell's chromosomes. Although plasmids harbor biomedically important genes (such as genes involved in virulence and antibiotics resistance), there is a shortage of specialized software tools for extracting and assembling plasmid data from whole genome sequencing projects. Results: We present the plasmidSPAdes algorithm and software tool for assembling plasmids from whole genome sequencing data and benchmark its performance on a diverse set of bacterial genomes. Availability and implementation: plasmidSPAdes is publicly available at http://spades.bioinf.spbau.ru/plasmidSPAdes/ Contact: d.antipov@spbu.ru Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A new neural network architecture is introduced that uses MIL to simultaneously classify and segment microscopy images with populations of cells and it is shown that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps.
Abstract: Motivation: High-content screening (HCS) technologies have enabled large scale imaging experiments for studying cell biology and for drug screening. These systems produce hundreds of thousands of microscopy images per day and their utility depends on automated image analysis. Recently, deep learning approaches that learn feature representations directly from pixel intensity values have dominated object recognition challenges. These tasks typically have a single centered object per image and existing models are not directly applicable to microscopy datasets. Here we develop an approach that combines deep convolutional neural networks (CNNs) with multiple instance learning (MIL) in order to classify and segment microscopy images using only whole image level annotations. Results: We introduce a new neural network architecture that uses MIL to simultaneously classify and segment microscopy images with populations of cells. We base our approach on the similarity between the aggregation function used in MIL and pooling layers used in CNNs. To facilitate aggregating across large numbers of instances in CNN feature maps we present the Noisy-AND pooling function, a new MIL operator that is robust to outliers. Combining CNNs with MIL enables training CNNs using whole microscopy images with image level labels. We show that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps. Availability and implementation: Torch7 implementation available upon request. Contact: oren.kraus@mail.utoronto.ca

Journal ArticleDOI
TL;DR: A new feature of the MAFFT multiple alignment program is presented for suppressing over-alignment (aligning unrelated segments); it utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair.
Abstract: Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment is recently becoming greater, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Locus Overlap Analysis (LOLA) provides easy and automatable enrichment analysis for genomic region sets, thus facilitating the interpretation of functional genomics and epigenomics data.
Abstract: Summary: Genomic datasets are often interpreted in the context of large-scale reference databases. One approach is to identify significantly overlapping gene sets, which works well for gene-centric data. However, many types of high-throughput data are based on genomic regions. Locus Overlap Analysis (LOLA) provides easy and automatable enrichment analysis for genomic region sets, thus facilitating the interpretation of functional genomics and epigenomics data. Availability and Implementation: R package available in Bioconductor and on the following website: http://lola.computational-epigenetics.org. Contact: nsheffield@cemm.oeaw.ac.at or cbock@cemm.oeaw.ac.at

Journal ArticleDOI
TL;DR: RVTESTS is developed, which implements a broad set of rare variant association statistics and supports the analysis of autosomal and X-linked variants for both unrelated and related individuals and provides useful companion features for annotating sequence variants, integrating bioinformatics databases, performing data quality control and sample selection.
Abstract: Motivation: Next-generation sequencing technologies have enabled the large-scale assessment of the impact of rare and low-frequency genetic variants for complex human diseases. Gene-level association tests are often performed to analyze rare variants, where multiple rare variants in a gene region are analyzed jointly. Applying gene-level association tests to analyze sequence data often requires integrating multiple heterogeneous sources of information (e.g. annotations, functional prediction scores, allele frequencies, genotypes and phenotypes) to determine the optimal analysis unit and prioritize causal variants. Given the complexity and scale of current sequence datasets and bioinformatics databases, there is a compelling need for more efficient software tools to facilitate these analyses. To answer this challenge, we developed RVTESTS, which implements a broad set of rare variant association statistics and supports the analysis of autosomal and X-linked variants for both unrelated and related individuals. RVTESTS also provides useful companion features for annotating sequence variants, integrating bioinformatics databases, performing data quality control and sample selection. We illustrate the advantages of RVTESTS in functionality and efficiency using the 1000 Genomes Project data. Availability and implementation: RVTESTS is available on Linux, MacOS and Windows. Source code and executable files can be obtained at https://github.com/zhanxw/rvtests Contact: zhanxw@gmail.com; goncalo@umich.edu; dajiang.liu@outlook.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: KAT is highlighted for its ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k‐mers present in both input reads and the assemblies.
Abstract: Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. Availability and implementation: KAT is available under the GPLv3 license at: https://github.com/TGAC/KAT. Contact: bernardo.clavijo@earlham.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
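The pairwise k-mer comparison described above can be pictured with a small sketch: count k-mers in the reads and in the assembly, then tabulate read k-mers by how many times each one appears in the assembly. This is a simplified illustration, not KAT's implementation; the function names are hypothetical, and the sketch ignores reverse complements and quality filtering.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-length substring across an iterable of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compare_kmers(reads, assembly, k):
    """Group distinct read k-mers by their copy number in the assembly.

    Returns {assembly_copy_number: Counter({read_frequency: n_distinct_kmers})}.
    """
    read_counts = count_kmers(reads, k)
    assembly_counts = count_kmers(assembly, k)
    table = {}
    for kmer, read_freq in read_counts.items():
        copies = assembly_counts.get(kmer, 0)
        table.setdefault(copies, Counter())[read_freq] += 1
    return table


reads = ["ACGTAC", "ACGTAC", "TTTT"]
assembly = ["ACGTAC"]
table = compare_kmers(reads, assembly, k=4)
for copies in sorted(table):
    print(copies, dict(table[copies]))
```

In a table like this, read k-mers that are absent from the assembly (copy number 0) at low read frequency are typically sequencing errors, whereas high-frequency k-mers missing from the assembly point to content the assembler left out.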

Journal ArticleDOI
TL;DR: MetaCycle, an R package that incorporates ARSER, JTK_CYCLE and Lomb-Scargle to conveniently evaluate periodicity in time-series data, is presented; its two functions, meta2d and meta3d, are designed to analyze two-dimensional and three-dimensional time-series datasets, respectively.
Abstract: Detecting periodicity in large-scale data remains a challenge. While efforts have been made to identify best-of-breed algorithms, relatively little research has gone into integrating these methods in a generalizable method. Here, we present MetaCycle, an R package that incorporates ARSER, JTK_CYCLE and Lomb-Scargle to conveniently evaluate periodicity in time-series data. MetaCycle has two functions, meta2d and meta3d, designed to analyze two-dimensional and three-dimensional time-series datasets, respectively. Meta2d implements N-version programming concepts using a suite of algorithms and integrating their results. Availability and implementation: MetaCycle package is available on the CRAN repository (https://cran.r-project.org/web/packages/MetaCycle/index.html) and GitHub (https://github.com/gangwug/MetaCycle). Contact: hogenesch@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A deep learning method (abbreviated as D-GEX) is presented to infer the expression of target genes from the expression of landmark genes, and shows that deep learning achieves lower error than LR in 99.97% of the target genes.
Abstract: Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ∼1000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression (LR), limiting its accuracy since it does not capture complex nonlinear relationships between expressions of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based Gene Expression Omnibus dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms LR with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than LR in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2921 expression profiles. Deep learning still outperforms LR with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes. Availability and implementation: D-GEX is available at https://github.com/uci-cbcl/D-GEX. Contact: xhx@ics.uci.edu Supplementary information: Supplementary data are available at Bioinformatics online.
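The two summary figures quoted above (relative improvement in mean absolute error, and the share of target genes with lower error) can be computed along the following lines. This is a minimal sketch on toy numbers: the function names are hypothetical, and defining relative improvement as (baseline MAE minus model MAE) divided by baseline MAE is an assumption, since the abstract does not spell out the formula.

```python
def per_gene_mae(truth, pred):
    """Mean absolute error for each gene (column) across samples (rows)."""
    n_samples = len(truth)
    n_genes = len(truth[0])
    return [
        sum(abs(truth[s][g] - pred[s][g]) for s in range(n_samples)) / n_samples
        for g in range(n_genes)
    ]


def compare_models(truth, baseline_pred, model_pred):
    """Return (relative improvement in gene-averaged MAE, fraction of genes improved)."""
    baseline = per_gene_mae(truth, baseline_pred)
    model = per_gene_mae(truth, model_pred)
    overall_baseline = sum(baseline) / len(baseline)
    overall_model = sum(model) / len(model)
    # Assumed definition: error reduction expressed as a share of the baseline error.
    relative_improvement = (overall_baseline - overall_model) / overall_baseline
    fraction_better = sum(m < b for m, b in zip(model, baseline)) / len(baseline)
    return relative_improvement, fraction_better


# Toy data: 2 samples (rows) x 2 target genes (columns).
truth = [[1.0, 2.0], [3.0, 4.0]]
baseline_pred = [[2.0, 2.0], [4.0, 5.0]]
model_pred = [[1.5, 2.0], [3.5, 5.0]]
relative_improvement, fraction_better = compare_models(truth, baseline_pred, model_pred)
print(relative_improvement, fraction_better)
```

Reporting both numbers is useful because a large average improvement can be driven by a few genes, while the gene-wise fraction shows how broadly the improvement holds.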

Journal ArticleDOI
TL;DR: Rigorous cross-validation tests have indicated that iEnhancer-2L holds very high potential to become a useful tool for genome analysis, and is the first computational predictor ever established for identifying not only enhancers, but also their strength.
Abstract: Motivation: Enhancers are short regulatory DNA elements. They can be bound by proteins (activators) to activate transcription of a gene, and hence play a critical role in promoting gene transcription in eukaryotes. With the avalanche of DNA sequences generated in the post-genomic age, it is a challenging task to develop computational methods for the timely identification of enhancers from extremely complicated DNA sequences. Although some efforts have been made in this regard, they were limited to identifying whether or not a query DNA element is an enhancer. According to the distinct levels of biological activities and regulatory effects on target genes, however, enhancers should be further classified into strong and weak ones. Results: In view of this, a two-layer predictor called 'iEnhancer-2L' was proposed by formulating DNA elements with the 'pseudo k-tuple nucleotide composition', into which the six DNA local parameters were incorporated. To the best of our knowledge, it is the first computational predictor ever established for identifying not only enhancers, but also their strength. Rigorous cross-validation tests have indicated that iEnhancer-2L holds very high potential to become a useful tool for genome analysis. Availability and implementation: For the convenience of most experimental scientists, a web server for the two-layer predictor was established at http://bioinformatics.hitsz.edu.cn/iEnhancer-2L/, by which users can easily get their desired results without the need to go through the mathematical details. Contact: bliu@gordonlifescience.org, bliu@insun.hit.edu.cn, xlan@stanford.edu, kcchou@gordonlifescience.org Supplementary information: Supplementary data are available at Bioinformatics online.