
Showing papers in "GigaScience in 2015"


Journal Article
TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
Abstract: Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

7,038 citations
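
The "bit-level parallelism" named in the PLINK abstract above can be illustrated with a minimal sketch. This is not PLINK's code: the two-bit-per-genotype packing, the 64-bit word size, and the function names (`pack_genotypes`, `count_codes`) are assumptions made for illustration only. The idea is that one mask-and-popcount pass tallies many packed genotypes at once instead of looping over samples.

```python
# Illustrative sketch only; the encoding and helper names are hypothetical.
MASK_LOW = 0x5555555555555555  # low bit of each 2-bit field in a 64-bit word


def pack_genotypes(codes):
    """Pack up to 32 two-bit codes (0..3) into one integer, first code in the lowest bits."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0b11) << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_codes(word, n):
    """Return counts of codes 0b00, 0b01, 0b10, 0b11 among the first n fields (n <= 32)."""
    low = word & MASK_LOW            # low bit of every field
    high = (word >> 1) & MASK_LOW    # high bit of every field, shifted into the low slot
    both = popcount(low & high)          # fields equal to 0b11
    only_low = popcount(low) - both      # fields equal to 0b01
    only_high = popcount(high) - both    # fields equal to 0b10
    neither = n - both - only_low - only_high  # fields equal to 0b00
    return neither, only_low, only_high, both


codes = [0, 1, 2, 3, 2, 2]
print(count_codes(pack_genotypes(codes), len(codes)))
```

Three popcounts replace a per-sample loop here; a compiled implementation would typically use a hardware popcount instruction on each word.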


Journal Article
TL;DR: A k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads, which has an accuracy higher than or comparable to existing methods, including the only other method (SEECER), and is more time and memory efficient.
Abstract: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/ .

359 citations


Journal Article
TL;DR: The output from SmileFinder can be used to plot percentile values to look for population diversity and divergence patterns that may suggest past actions of positive selection along chromosome maps, and to compare lists of suspected candidate genes under random gene sets to test for the overrepresentation of these patterns among gene categories.
Abstract: Background: Adaptive alleles may rise in frequency as a consequence of positive selection, creating a pattern of decreased variation in the neighboring loci, known as a selective sweep. When the region containing this pattern is compared to another population with no history of selection, a rise in variance of allele frequencies between populations is observed. One challenge presented by large genome-wide datasets is differentiating patterns that are remnants of natural selection from those expected to arise at random and/or as a consequence of selectively neutral demographic forces acting in the population.

349 citations


Journal Article
TL;DR: This work presents LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that uses the sequence properties of nanopore sequence data and other error-containing sequence data to scaffold high-quality genome assemblies without the need for read alignment or base correction.
Abstract: Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes. This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.

195 citations


Journal Article
Anna Kopf, Mesude Bicak, Renzo Kottmann, +166 more authors (77 institutions)
TL;DR: This commentary outlines the establishment, function and aims of the Ocean Sampling Day Consortium and describes the vision for a sustainable study of marine microbial communities and their embedded functional traits.
Abstract: Ocean Sampling Day was initiated by the EU-funded Micro B3 (Marine Microbial Biodiversity, Bioinformatics, Biotechnology) project to obtain a snapshot of the marine microbial biodiversity and function of the world’s oceans. It is a simultaneous global mega-sequencing campaign aiming to generate the largest standardized microbial data set in a single day. This will be achievable only through the coordinated efforts of an Ocean Sampling Day Consortium, supportive partnerships and networks between sites. This commentary outlines the establishment, function and aims of the Consortium and describes our vision for a sustainable study of marine microbial communities and their embedded functional traits.

173 citations


Journal Article
TL;DR: Evidence of the utility of amplicon sequencing is offered by demonstrating that the current versions of MinION™ technology can accurately identify and differentiate both viral and bacterial species present within biological samples.
Abstract: The MinION™ nanopore sequencer was recently released to a community of alpha-testers for evaluation using a variety of sequencing applications. Recent reports have tested the ability of the MinION™ to act as a whole genome sequencer and have demonstrated that nanopore sequencing has tremendous potential utility. However, the current nanopore technology still has limitations with respect to error-rate, and this is problematic when attempting to assemble whole genomes without secondary rounds of sequencing to correct errors. In this study, we tested the ability of the MinION™ nanopore sequencer to accurately identify and differentiate bacterial and viral samples via directed sequencing of characteristic genes shared broadly across a target clade. Using a 6 hour sequencing run time, sufficient data were generated to identify an E. coli sample down to the species level from 16S rDNA amplicons. Three poxviruses (cowpox, vaccinia-MVA, and vaccinia-Lister) were identified and differentiated down to the strain level, despite over 98% identity between the vaccinia strains. The ability to differentiate strains by amplicon sequencing on the MinION™ was accomplished despite an observed per-base error rate of approximately 30%. While nanopore sequencing, using the MinION™ platform from Oxford Nanopore in particular, continues to mature into a commercially available technology, practical uses are sought for the current versions of the technology. This study offers evidence of the utility of amplicon sequencing by demonstrating that the current versions of MinION™ technology can accurately identify and differentiate both viral and bacterial species present within biological samples via amplicon sequencing.

157 citations


Journal Article
TL;DR: The integration of the BLAST+ tool suite into Galaxy has the goal of making common BLAST tasks easy and advanced tasks possible.
Abstract: The NCBI BLAST suite has become ubiquitous in modern molecular biology and is used for small tasks such as checking capillary sequencing results of single PCR products, genome annotation or even larger scale pan-genome analyses. For early adopters of the Galaxy web-based biomedical data analysis platform, integrating BLAST into Galaxy was a natural step for sequence comparison workflows. The command line NCBI BLAST+ tool suite was wrapped for use within Galaxy. Appropriate datatypes were defined as needed. The integration of the BLAST+ tool suite into Galaxy has the goal of making common BLAST tasks easy and advanced tasks possible. This project is an informal international collaborative effort, and is deployed and used on Galaxy servers worldwide. Several examples of applications are described here.

152 citations


Journal Article
TL;DR: These findings provide a comprehensive comparison of variations detection performance using SCRS amplified by different WGA methods, and will guide researchers to determine which WGA method is best suited to individual experimental needs at single-cell level.
Abstract: Single-cell resequencing (SCRS) provides many biomedical advances in variations detection at the single-cell level, but it currently relies on whole genome amplification (WGA). Three methods are commonly used for WGA: multiple displacement amplification (MDA), degenerate-oligonucleotide-primed PCR (DOP-PCR) and multiple annealing and looping-based amplification cycles (MALBAC). However, a comprehensive comparison of variations detection performance between these WGA methods has not yet been performed. We systematically compared the advantages and disadvantages of different WGA methods, focusing particularly on variations detection. Low-coverage whole-genome sequencing revealed that DOP-PCR had the highest duplication ratio, but an even read distribution and the best reproducibility and accuracy for detection of copy-number variations (CNVs). However, MDA had significantly higher genome recovery sensitivity (~84 %) than DOP-PCR (~6 %) and MALBAC (~52 %) at high sequencing depth. MALBAC and MDA had comparable single-nucleotide variations detection efficiency, false-positive ratio, and allele drop-out ratio. We further demonstrated that SCRS data amplified by either MDA or MALBAC from a gastric cancer cell line could accurately detect gastric cancer CNVs with comparable sensitivity and specificity, including amplifications of 12p11.22 (KRAS) and 9p24.1 (JAK2, CD274, and PDCD1LG2). Our findings provide a comprehensive comparison of variations detection performance using SCRS amplified by different WGA methods. It will guide researchers to determine which WGA method is best suited to individual experimental needs at single-cell level.

143 citations


Journal Article
TL;DR: It is demonstrated that standard phylogenetic markers are capable of recovering sequences from a broad diversity of eukaryotes, in addition to prokaryotes by 16S, and the COI and 18S eDNA markers are the best proxies for aboveground biodiversity based on the high correlation between the pairwise beta diversities of these markers and those obtained using traditional methods.
Abstract: There is an increasing demand for rapid biodiversity assessment tools that have a broad taxonomic coverage. Here we evaluate a suite of environmental DNA (eDNA) markers coupled with next generation sequencing (NGS) that span the tree of life, comparing them with traditional biodiversity monitoring tools within ten 20×20 meter plots along a 700 meter elevational gradient. From six eDNA datasets (one from each of 16S, 18S, ITS, trnL and two from COI) we identified sequences from 109 NCBI taxonomy-defined phyla or equivalent, ranging from 31 to 60 for a given eDNA marker. Estimates of alpha and gamma diversity were sensitive to the number of sequence reads, whereas beta diversity estimates were less sensitive. The average within-plot beta diversity was lower than between plots for all markers. The soil beta diversity of COI and 18S markers showed the strongest response to the elevational variation of the eDNA markers (COI: r=0.49, p<0.001; 18S: r=0.48, p<0.001). Furthermore pairwise beta diversities for these two markers were strongly correlated with those calculated from traditional vegetation and invertebrate biodiversity measures. Using a soil-based eDNA approach, we demonstrate that standard phylogenetic markers are capable of recovering sequences from a broad diversity of eukaryotes, in addition to prokaryotes by 16S. The COI and 18S eDNA markers are the best proxies for aboveground biodiversity based on the high correlation between the pairwise beta diversities of these markers and those obtained using traditional methods.

136 citations


Journal Article
TL;DR: RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators, providing an important new resource enabling high throughput studies in blueberry.
Abstract: Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable the breeding of berry varieties with enhanced health benefits. Toward this end, we annotated a previously sequenced draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was up-regulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. Homology-based annotation using Blast2GO and InterPro assigned Gene Ontology terms to around 15,000 genes. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators. Analysis of RNA-seq alignments identified developmentally regulated alternative splicing, promoter use, and 3′ end formation. We report genome sequence, gene models, functional annotations, and RNA-Seq expression data that provide an important new resource enabling high throughput studies in blueberry.

125 citations


Journal Article
TL;DR: The largest challenge of this task was the heterogeneity of the data, formats, and metadata, which made a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data.
Abstract: Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

Journal Article
TL;DR: The past, present and future of genomic and systems biology visualization is discussed, and the latest libraries and programming languages that enable more effective, efficient and faster approaches for visualizing biological concepts are focused on.
Abstract: “A picture is worth a thousand words.” This widely used adage sums up in a few words the notion that a successful visual representation of a concept should enable easy and rapid absorption of large amounts of information. Although, in general, the notion of capturing complex ideas using images is very appealing, would 1000 words be enough to describe the unknown in a research field such as the life sciences? Life sciences is one of the biggest generators of enormous datasets, mainly as a result of recent and rapid technological advances; their complexity can make these datasets incomprehensible without effective visualization methods. Here we discuss the past, present and future of genomic and systems biology visualization. We briefly comment on many visualization and analysis tools and the purposes that they serve. We focus on the latest libraries and programming languages that enable more effective, efficient and faster approaches for visualizing biological concepts, and also comment on the future human-computer interaction trends that would enable further enhancement of visualization.

Journal Article
TL;DR: This work proposes bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable to solve the problems associated with sharing software.
Abstract: Software is now both central and essential to modern biology, yet lack of availability, difficult installations, and complex user interfaces make software hard to obtain and use. Containerisation, as exemplified by the Docker platform, has the potential to solve the problems associated with sharing software. We propose bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable.

Journal Article
TL;DR: The quality of the P. vitticeps assembly is comparable or superior to that of other published squamate genomes, and the annotated P. vitticeps genome can be accessed through a genome browser available at https://genomics.canberra.edu.au.
Abstract: The lizards of the family Agamidae are one of the most prominent elements of the Australian reptile fauna. Here, we present a genomic resource built on the basis of a wild-caught male ZZ central bearded dragon Pogona vitticeps. The genomic sequence for P. vitticeps, generated on the Illumina HiSeq 2000 platform, comprised 317 Gbp (179X raw read depth) from 13 insert libraries ranging from 250 bp to 40 kbp. After filtering for low-quality and duplicated reads, 146 Gbp of data (83X) was available for assembly. Exceptionally high levels of heterozygosity (0.85 % of single nucleotide polymorphisms plus sequence insertions or deletions) complicated assembly; nevertheless, 96.4 % of reads mapped back to the assembled scaffolds, indicating that the assembly included most of the sequenced genome. Length of the assembly was 1.8 Gbp in 545,310 scaffolds (69,852 longer than 300 bp), the longest being 14.68 Mbp. N50 was 2.29 Mbp. Genes were annotated on the basis of de novo prediction, similarity to the green anole Anolis carolinensis, Gallus gallus and Homo sapiens proteins, and P. vitticeps transcriptome sequence assemblies, to yield 19,406 protein-coding genes in the assembly, 63 % of which had intact open reading frames. Our assembly captured 99 % (246 of 248) of core CEGMA genes, with 93 % (231) being complete. The quality of the P. vitticeps assembly is comparable or superior to that of other published squamate genomes, and the annotated P. vitticeps genome can be accessed through a genome browser available at https://genomics.canberra.edu.au .
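
The N50 statistic reported in the abstract above can be computed as in the following minimal sketch, which uses the common definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name `n50` and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: total length is 30, and the two longest scaffolds (10 + 6) pass the halfway mark.
print(n50([2, 3, 4, 5, 6, 10]))
```

Because N50 depends only on the length distribution, it measures contiguity rather than correctness, which is why the abstract also reports read mapping rates and CEGMA gene recovery.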

Journal Article
TL;DR: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date and the sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.
Abstract: Background: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. Findings: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. Conclusions: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.

Journal Article
TL;DR: The ability to use machine learning to predict dysbiosis from microbiome community interaction data provides a potentially powerful tool for understanding the links between the human microbiome and human health, pointing to potential microbiome-based diagnostics and therapeutic interventions.
Abstract: Humans live in constant and vital symbiosis with a closely linked bacterial ecosystem called the microbiome, which influences many aspects of human health. When this microbial ecosystem becomes disrupted, the health of the human host can suffer; a condition called dysbiosis. However, the community compositions of human microbiomes also vary dramatically from individual to individual, and over time, making it difficult to uncover the underlying mechanisms linking the microbiome to human health. We propose that a microbiome’s interaction with its human host is not necessarily dependent upon the presence or absence of particular bacterial species, but instead is dependent on its community metabolome; an emergent property of the microbiome. Using data from a previously published, longitudinal study of microbiome populations of the human gut, we extrapolated information about microbiome community enzyme profiles and metabolome models. Using machine learning techniques, we demonstrated that the aggregate predicted community enzyme function profiles and modeled metabolomes of a microbiome are more predictive of dysbiosis than either observed microbiome community composition or predicted enzyme function profiles. Specific enzyme functions and metabolites predictive of dysbiosis provide insights into the molecular mechanisms of microbiome–host interactions. The ability to use machine learning to predict dysbiosis from microbiome community interaction data provides a potentially powerful tool for understanding the links between the human microbiome and human health, pointing to potential microbiome-based diagnostics and therapeutic interventions.

Journal Article
TL;DR: This work assembled the metagenome and showed that it reconstructed most genes involved in the methane metabolism, a key pathway involving methanogenesis performed by methanogenic Archaea, indicating that there is sufficient sequencing coverage for most downstream analyses.
Abstract: Background: The production of biogas takes place under anaerobic conditions and involves microbial decomposition of organic matter. Most of the participating microbes are still unknown and non-cultivable. Accordingly, shotgun metagenome sequencing currently is the method of choice to obtain insights into community composition and the genetic repertoire. Findings: Here, we report on the deeply sequenced metagenome and metatranscriptome of a complex biogas-producing microbial community from an agricultural production-scale biogas plant. We assembled the metagenome and, as an example application, show that we reconstructed most genes involved in the methane metabolism, a key pathway involving methanogenesis performed by methanogenic Archaea. This result indicates that there is sufficient sequencing coverage for most downstream analyses. Conclusions: Sequenced at least one order of magnitude deeper than previous studies, our metagenome data will enable new insights into community composition and the genetic potential of important community members. Moreover, mapping of transcripts to reconstructed genome sequences will enable the identification of active metabolic pathways in target organisms.

Journal Article
TL;DR: The draft danshen genome will provide a valuable resource for the investigation of novel bioactive compounds in this Chinese herb, and will help elucidate the biosynthetic pathways of important secondary metabolites, thereby advancing the investigation of novel drugs from this plant.
Abstract: Danshen (Salvia miltiorrhiza Bunge), also known as Chinese red sage, is a member of Lamiaceae family. It is valued in traditional Chinese medicine, primarily for the treatment of cardiovascular and cerebrovascular diseases. Because of its pharmacological potential, ongoing research aims to identify novel bioactive compounds in danshen, and their biosynthetic pathways. To date, only expressed sequence tag (EST) and RNA-seq data for this herbal plant are available to the public. We therefore propose that the construction of a reference genome for danshen will help elucidate the biosynthetic pathways of important secondary metabolites, thereby advancing the investigation of novel drugs from this plant. We assembled the highly heterozygous danshen genome with the help of 395 × raw read coverage using Illumina technologies and about 10 × raw read coverage by using single molecular sequencing technology. The final draft genome is approximately 641 Mb, with a contig N50 size of 82.8 kb and a scaffold N50 size of 1.2 Mb. Further analyses predicted 34,598 protein-coding genes and 1,644 unique gene families in the danshen genome. The draft danshen genome will provide a valuable resource for the investigation of novel bioactive compounds in this Chinese herb.

Journal Article
TL;DR: This work describes how optical mapping has been used in practice to produce high quality vertebrate genome assemblies and details the efforts undertaken by the Genome Reference Consortium (GRC), which maintains the reference genomes for human, mouse, zebrafish and chicken, and uses different optical mapping platforms for genome curation.
Abstract: Optical mapping is a technology that gathers long-range information on genome sequences similar to ordered restriction digest maps. Because it is not subject to cloning, amplification, hybridisation or sequencing bias, it is ideally suited to the improvement of fragmented genome assemblies that can no longer be improved by classical methods. In addition, its low cost and rapid turnaround make it equally useful during the scaffolding process of de novo assembly from high throughput sequencing reads. We describe how optical mapping has been used in practice to produce high quality vertebrate genome assemblies. In particular, we detail the efforts undertaken by the Genome Reference Consortium (GRC), which maintains the reference genomes for human, mouse, zebrafish and chicken, and uses different optical mapping platforms for genome curation.

Journal Article
TL;DR: This review describes several outstanding problems in brain functional connectomics with the goal of engaging researchers from a broad spectrum of data sciences to help solve these problems.
Abstract: Estimating the functional interactions between brain regions and mapping those connections to corresponding inter-individual differences in cognitive, behavioral and psychiatric domains are central pursuits for understanding the human connectome. The number and complexity of functional interactions within the connectome and the large amounts of data required to study them position functional connectivity research as a “big data” problem. Maximizing the degree to which knowledge about human brain function can be extracted from the connectome will require developing a new generation of neuroimaging analysis algorithms and tools. This review describes several outstanding problems in brain functional connectomics with the goal of engaging researchers from a broad spectrum of data sciences to help solve these problems. Additionally, it provides information about open science resources consisting of raw and preprocessed data to help interested researchers get started.

Journal ArticleDOI
TL;DR: The single chromosome assembly of Bacteroides fragilis strain BE1 was achieved using only modest amounts of data, publicly available software and commodity computing hardware.
Abstract: Background: Second and third generation sequencing technologies have revolutionised bacterial genomics. Short-read Illumina reads result in cheap but fragmented assemblies, whereas longer reads are more expensive but result in more complete genomes. The Oxford Nanopore MinION device is a revolutionary mobile sequencer that can produce thousands of long, single molecule reads.

Journal ArticleDOI
TL;DR: A new high throughput platform to prepare single-cell RNA on a nanoliter scale based on a customized microwell chip provides a transcriptome characterization of HeLa S3 cells at the single cell level, and is a demonstration of the power of single cell RNA-seq analysis of virally infected cells and cancers.
Abstract: Viral infection causes multiple forms of human cancer, and HPV infection is the primary factor in cervical carcinomas. Recent single-cell RNA-seq studies highlight the tumor heterogeneity present in most cancers, but virally induced tumors have not been studied. HeLa is a well characterized HPV+ cervical cancer cell line. We developed a new high throughput platform to prepare single-cell RNA on a nanoliter scale based on a customized microwell chip. Using this method, we successfully amplified full-length transcripts of 669 single HeLa S3 cells and 40 of them were randomly selected to perform single-cell RNA sequencing. Based on these data, we obtained a comprehensive understanding of the heterogeneity of HeLa S3 cells in gene expression, alternative splicing and fusions. Furthermore, we identified a high diversity of HPV-18 expression and splicing at the single-cell level. By co-expression analysis we identified 283 E6, E7 co-regulated genes, including CDC25, PCNA, PLK4, BUB1B and IRF1 known to interact with HPV viral proteins. Our results reveal the heterogeneity of a virus-infected cell line. It not only provides a transcriptome characterization of HeLa S3 cells at the single cell level, but is a demonstration of the power of single cell RNA-seq analysis of virally infected cells and cancers.

Journal ArticleDOI
TL;DR: High-quality 3D imaging MS datasets from different biological systems at several labs were acquired, supplied with overview images and scripts demonstrating how to read them, and deposited into MetaboLights, an open repository for metabolomics data.
Abstract: Background: Three-dimensional (3D) imaging mass spectrometry (MS) is an analytical chemistry technique for the 3D molecular analysis of a tissue specimen, entire organ, or microbial colonies on an agar plate. 3D-imaging MS has unique advantages over existing 3D imaging techniques, offers novel perspectives for understanding the spatial organization of biological processes, and has growing potential to be introduced into routine use in both biology and medicine. Owing to the sheer quantity of data generated, the visualization, analysis, and interpretation of 3D imaging MS data remain a significant challenge. Bioinformatics research in this field is hampered by the lack of publicly available benchmark datasets needed to evaluate and compare algorithms.

Journal ArticleDOI
TL;DR: The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants.
Abstract: Optical mapping has been widely used to improve de novo plant genome assemblies, including rice, maize, Medicago, Amborella, tomato and wheat, with more genomes in the pipeline. Optical mapping provides long-range information of the genome and can more easily identify large structural variations. The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants. Direct use of optical mapping to study population-level genetic diversity is currently limited to microbial strain typing and human diversity studies. Nonetheless, optical mapping shows great promise in the study of plant trait development, domestication and polyploid evolution. Here we review the current applications and future prospects of optical mapping in the field of plant comparative genomics.

Journal ArticleDOI
TL;DR: This entry is a correction notice for the article with DOI 10.1186/2047-217X-3-22.
Abstract: [This corrects the article DOI: 10.1186/2047-217X-3-22.]

Journal ArticleDOI
TL;DR: The large number of whole-genome sequences generated as a result of the Canadian Cattle Genome Project will contribute to ongoing work aiming to catalogue the variation that exists in cattle as well as efforts to improve traits through genotype-guided selection.
Abstract: Background: The Canadian Cattle Genome Project is a large-scale international project that aims to develop genomics-based tools to enhance the efficiency and sustainability of beef and dairy production. Obtaining DNA sequence information is an important part of achieving this goal as it facilitates efforts to associate specific DNA differences with phenotypic variation. These associations can be used to guide breeding decisions and provide valuable insight into the molecular basis of traits. Findings: We describe a dataset of 379 whole-genome sequences, taken primarily from key historic Bos taurus animals, along with the analyses that were performed to assess data quality. The sequenced animals represent ten populations relevant to beef or dairy production. Animal information (name, breed, population), sequence data metrics (mapping rate, depth, concordance), and sequence repository identifiers (NCBI BioProject and BioSample IDs) are provided to enable others to access and exploit this sequence information. Conclusions: The large number of whole-genome sequences generated as a result of this project will contribute to ongoing work aiming to catalogue the variation that exists in cattle as well as efforts to improve traits through genotype-guided selection. Studies of gene function, population structure, and sequence evolution are also likely to benefit from the availability of this resource.

Journal ArticleDOI
TL;DR: A nationwide trial of BRCA testing in Latin American women with breast cancer was initiated; applying this strategy on a larger scale could improve cancer care for minority and underserved populations.
Abstract: Germline mutations in the BRCA1 and BRCA2 genes account for 20–25 % of inherited breast cancers and about 10 % of all breast cancer cases. Detection of BRCA mutation carriers can lead to therapeutic interventions such as mastectomy, oophorectomy, hormonal prevention therapy, improved screening, and targeted therapies such as PARP-inhibition. We estimate that African Americans and Hispanics are 4–5 times less likely to receive BRCA screening, despite having similar mutation frequencies as non-Jewish Caucasians, who have higher breast cancer mortality. To begin addressing this health disparity, we initiated a nationwide trial of BRCA testing of Latin American women with breast cancer. Patients were recruited through community organizations, clinics, public events, and by mail and Internet. Subjects completed the consent process and questionnaire, and provided a saliva sample by mail or in person. DNA from 120 subjects was used to sequence the entirety of BRCA1 and BRCA2 coding regions and splice sites, and validate pathogenic mutations, with a total material cost of $85/subject. Subjects ranged in age from 23 to 81 years (mean age, 51 years), 6 % had bilateral disease, 57 % were ER/PR+, 23 % HER2+, and 17 % had triple-negative disease. A total of seven different predicted deleterious mutations were identified, one newly described and the rest rare. In addition, four variants of unknown effect were found. Application of this strategy on a larger scale could lead to improved cancer care of minority and underserved populations.

Journal ArticleDOI
TL;DR: An automated workflow covering data submission, compression, storage and public access is proposed to allow easy data retrieval and inter-study communication in high-throughput sequencing-based metabarcoding studies.
Abstract: High-throughput sequencing-based metabarcoding studies produce vast amounts of ecological data, but a lack of consensus on standardization of metadata and how to refer to the species recovered severely hampers reanalysis and comparisons among studies. Here we propose an automated workflow covering data submission, compression, storage and public access to allow easy data retrieval and inter-study communication. Such standardized and readily accessible datasets facilitate data management, taxonomic comparisons and compilation of global metastudies.

Journal ArticleDOI
TL;DR: Only by sharing experiments, data, metadata, derived data and analysis workflows will neuroimaging establish itself as a true data science.
Abstract: The ability to replicate an entire experiment is crucial to the scientific method. With the development of more and more complex paradigms, and the variety of analysis techniques available, fMRI studies are becoming harder to reproduce.

Journal ArticleDOI
TL;DR: VirAmp is a multi-step viral genome assembly pipeline that combines existing tools and techniques, presents them to end users via a web-enabled Galaxy interface, and automates the currently recommended best practices in a single, easy-to-use interface.
Abstract: Advances in next-generation sequencing make it possible to obtain high-coverage sequence data for large numbers of viral strains in a short time. However, since most bioinformatics tools are developed for command line use, the selection and accessibility of computational tools for genome assembly and variation analysis limit the ability of individual labs to perform further bioinformatics analysis. We have developed a multi-step viral genome assembly pipeline named VirAmp, which combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze, and interpret high-coverage viral sequencing data with an ease and efficiency that was not possible previously. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy-to-use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). VirAmp provides a user-friendly interface and a complete pipeline for viral genome analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon web service account. A fully functional demonstration instance of our system can be found at http://viramp.com/. We also maintain detailed documentation on each tool and methodology at http://docs.viramp.com.