Showing papers in "GigaScience in 2015"

Open Access

Journal Article•DOI•

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, Carson C. Chow¹, Laurent C. A. M. Tellier², Shashaank Vattikuti¹, Shaun Purcell³, James J. Lee⁴•Institutions (4)

National Institutes of Health¹, University of Copenhagen², Icahn School of Medicine at Mount Sinai³, University of Minnesota⁴

25 Feb 2015-GigaScience

TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Abstract: Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
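
The bit-level parallelism mentioned in this abstract can be illustrated with a small sketch. The code below is not taken from PLINK. It assumes a hypothetical layout in which every genotype is a 2-bit code packed into one integer, and the names `pack_genotypes` and `count_codes` are invented for illustration. The point is that all four codes are counted with a handful of mask and popcount operations on the whole word, rather than by inspecting genotypes one at a time.

```python
def pack_genotypes(codes):
    """Pack 2-bit codes (0..3) into one integer, first genotype in the lowest bits.

    The 2-bit layout is an assumption made for this sketch.
    """
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0b11) << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_codes(word, n):
    """Count how many of the n packed 2-bit fields hold each code 0..3."""
    low_mask = (4 ** n - 1) // 3        # 0b0101...01: low bit of every field
    lo = word & low_mask                # low bit of each field
    hi = (word >> 1) & low_mask         # high bit of each field, shifted down
    n11 = popcount(lo & hi)             # both bits set      -> code 3
    n01 = popcount(lo & ~hi)            # only low bit set   -> code 1
    n10 = popcount(hi & ~lo)            # only high bit set  -> code 2
    n00 = n - n11 - n01 - n10           # remaining fields   -> code 0
    return {0: n00, 1: n01, 2: n10, 3: n11}


codes = [0, 1, 2, 3, 3, 2, 2, 0]
print(count_codes(pack_genotypes(codes), len(codes)))
```

In a compiled language the same idea maps onto fixed-width machine words and a hardware popcount instruction, which is where most of the speed-up would come from.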

7,038 citations

Journal Article•DOI•

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

Li Song¹, Liliana Florea²•Institutions (2)

Johns Hopkins University¹, Johns Hopkins University School of Medicine²

19 Oct 2015-GigaScience

TL;DR: Rcorrector is a k-mer based method for correcting random sequencing errors in Illumina RNA-seq reads. Its accuracy is higher than or comparable to that of existing methods, including SEECER, the only other method designed for RNA-seq reads, and it is more time and memory efficient.

Abstract: Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/ .
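
The contrast this abstract draws between a global and a local k-mer threshold can be sketched as follows. This is not Rcorrector's algorithm: the abstract does not give its threshold formula, so the local rule used here (a fixed fraction of the highest k-mer count seen in the same read) is an assumption chosen only to show why a single cutoff misbehaves when expression levels differ. A plain counter also stands in for the De Bruijn graph, and all function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def weak_positions_global(read, counts, k, threshold):
    """Start positions whose k-mer count falls below one dataset-wide cutoff."""
    return [i for i, km in enumerate(kmers(read, k)) if counts[km] < threshold]


def weak_positions_local(read, counts, k, fraction=0.5):
    """Start positions whose k-mer count is low relative to this read.

    Assumed rule for illustration: cutoff = fraction * max count in the read.
    """
    kms = kmers(read, k)
    if not kms:
        return []
    cutoff = fraction * max(counts[km] for km in kms)
    return [i for i, km in enumerate(kms) if counts[km] < cutoff]


# Toy data: a highly expressed transcript (10 reads), one read of it with a
# single substitution, and a lowly expressed transcript (2 reads).
reads = ["ACGTACGGTC"] * 10 + ["ACGTTCGGTC"] + ["TTGACCAGTA"] * 2
counts = count_kmers(reads, k=4)

low_read = "TTGACCAGTA"
print(weak_positions_global(low_read, counts, 4, threshold=3))  # flags every k-mer
print(weak_positions_local(low_read, counts, 4))                # flags none
print(weak_positions_local("ACGTTCGGTC", counts, 4))            # flags the error region
```

With the global cutoff, every k-mer of the lowly expressed transcript looks untrusted even though the reads are error-free, while the read-relative cutoff leaves them alone and still isolates the k-mers covering the substitution.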

359 citations

Journal Article•DOI•

SmileFinder: A Resampling-Based Approach to Evaluate Signatures of Selection from Genome-Wide Sets of Matching Allele Frequency Data in Two or More Diploid Populations

Wilfried Guiblet¹, Kai Zhao², Stephen J. O'Brien³,⁴, Steven E. Massey⁵, Alfred L. Roca², Taras K. Oleksyk¹•Institutions (5)

University of Puerto Rico at Mayagüez¹, University of Illinois at Urbana–Champaign², Nova Southeastern University³, Saint Petersburg State University⁴, University of Puerto Rico⁵

14 Jan 2015-GigaScience

TL;DR: The output from SmileFinder can be used to plot percentile values along chromosome maps to look for patterns of population diversity and divergence that may suggest past positive selection, and to compare lists of suspected candidate genes against random gene sets to test whether these patterns are overrepresented among gene categories.

Abstract: Background: Adaptive alleles may rise in frequency as a consequence of positive selection, creating a pattern of decreased variation in the neighboring loci, known as a selective sweep. When the region containing this pattern is compared to another population with no history of selection, a rise in variance of allele frequencies between populations is observed. One challenge presented by large genome-wide datasets is differentiating patterns that are remnants of natural selection from those expected to arise at random and/or as a consequence of selectively neutral demographic forces acting in the population.

349 citations

Journal Article•DOI•

LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads

René L. Warren¹, Chen Yang¹, Benjamin P. Vandervalk¹, Bahar Behsaz¹, Albert Lagman¹, Steven J. M. Jones¹, Inanc Birol¹•Institutions (1)

BC Cancer Agency¹

04 Aug 2015-GigaScience

TL;DR: This paper presents LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that uses the sequence properties of nanopore and other error-containing sequence data to scaffold high-quality genome assemblies without the need for read alignment or base correction.

Abstract: Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes. This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts.
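
The abstract does not spell out how alignment-free scaffolding works, so the following is only a simplified sketch of one way a "long interval k-mer" approach could operate, not the LINKS implementation. It assumes that pairs of k-mers a fixed distance apart are extracted from each long read, and that a pair whose two k-mers occur in different contigs counts as evidence that those contigs are neighbours. Reverse complements, read errors, and gap-size estimation are ignored, and all names are hypothetical.

```python
from collections import Counter


def kmer_pairs(read, k, distance, step=1):
    """Pairs of k-mers from one read whose start positions are `distance` apart."""
    last_start = len(read) - distance - k
    return [
        (read[i:i + k], read[i + distance:i + distance + k])
        for i in range(0, last_start + 1, step)
    ]


def index_contigs(contigs, k):
    """Map each k-mer found in exactly one contig to that contig's name."""
    index = {}
    ambiguous = set()
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            if km in index and index[km] != name:
                ambiguous.add(km)       # shared between contigs: not informative
            index.setdefault(km, name)
    for km in ambiguous:
        del index[km]
    return index


def link_counts(reads, contigs, k, distance):
    """Count k-mer pairs that connect two different contigs, without alignment."""
    index = index_contigs(contigs, k)
    links = Counter()
    for read in reads:
        for left, right in kmer_pairs(read, k, distance):
            a, b = index.get(left), index.get(right)
            if a is not None and b is not None and a != b:
                links[(a, b)] += 1
    return links


contigs = {"A": "ACGTTGCAAG", "B": "GGATCCTTAC"}
long_read = "ACGTTGCAAG" + "TTTT" + "GGATCCTTAC"   # spans A, a short gap, then B
print(link_counts([long_read], contigs, k=5, distance=12))
```

A scaffolder built on this idea would then join contig pairs supported by enough links, which is why exact k-mer lookups can replace read alignment in this kind of approach.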

195 citations

Journal Article•DOI•

The ocean sampling day consortium

Anna Kopf¹, Anna Kopf², Mesude Bicak³, Renzo Kottmann² +166 more•Institutions (77)

19 Jun 2015-GigaScience

TL;DR: This commentary outlines the establishment, function and aims of the Ocean Sampling Day Consortium and describes the vision for a sustainable study of marine microbial communities and their embedded functional traits.

Abstract: Ocean Sampling Day was initiated by the EU-funded Micro B3 (Marine Microbial Biodiversity, Bioinformatics, Biotechnology) project to obtain a snapshot of the marine microbial biodiversity and function of the world’s oceans. It is a simultaneous global mega-sequencing campaign aiming to generate the largest standardized microbial data set in a single day. This will be achievable only through the coordinated efforts of an Ocean Sampling Day Consortium, supportive partnerships and networks between sites. This commentary outlines the establishment, function and aims of the Consortium and describes our vision for a sustainable study of marine microbial communities and their embedded functional traits.

173 citations

Journal Article•DOI•

Bacterial and viral identification and differentiation by amplicon sequencing on the MinION nanopore sequencer

Andy Kilianski¹, Jamie L. Haas, Elizabeth J Corriveau¹, Alvin T. Liem¹, Kristen L. Willis²,¹, Dana R Kadavy, C. Nicole Rosenzweig¹, Samuel S. Minot•Institutions (2)

Edgewood Chemical Biological Center¹, Defense Threat Reduction Agency²

26 Mar 2015-GigaScience

TL;DR: This study demonstrates that the current versions of MinION™ technology can accurately identify and differentiate both viral and bacterial species present within biological samples via amplicon sequencing.

Abstract: The MinION™ nanopore sequencer was recently released to a community of alpha-testers for evaluation using a variety of sequencing applications. Recent reports have tested the ability of the MinION™ to act as a whole genome sequencer and have demonstrated that nanopore sequencing has tremendous potential utility. However, the current nanopore technology still has limitations with respect to error-rate, and this is problematic when attempting to assemble whole genomes without secondary rounds of sequencing to correct errors. In this study, we tested the ability of the MinION™ nanopore sequencer to accurately identify and differentiate bacterial and viral samples via directed sequencing of characteristic genes shared broadly across a target clade. Using a 6 hour sequencing run time, sufficient data were generated to identify an E. coli sample down to the species level from 16S rDNA amplicons. Three poxviruses (cowpox, vaccinia-MVA, and vaccinia-Lister) were identified and differentiated down to the strain level, despite over 98% identity between the vaccinia strains. The ability to differentiate strains by amplicon sequencing on the MinION™ was accomplished despite an observed per-base error rate of approximately 30%. While nanopore sequencing, using the MinION™ platform from Oxford Nanopore in particular, continues to mature into a commercially available technology, practical uses are sought for the current versions of the technology. This study offers evidence of the utility of amplicon sequencing by demonstrating that the current versions of MinION™ technology can accurately identify and differentiate both viral and bacterial species present within biological samples via amplicon sequencing.

157 citations

Journal Article•DOI•

NCBI BLAST+ integrated into Galaxy.

Peter J. A. Cock¹, John Chilton², Björn Grüning³, James E. Johnson², Nicola Soranzo•Institutions (3)

James Hutton Institute¹, University of Minnesota², University of Freiburg³

25 Aug 2015-GigaScience

TL;DR: The integration of the BLAST+ tool suite into Galaxy has the goal of making common BLAST tasks easy and advanced tasks possible.

Abstract: The NCBI BLAST suite has become ubiquitous in modern molecular biology and is used for small tasks such as checking capillary sequencing results of single PCR products, genome annotation or even larger scale pan-genome analyses. For early adopters of the Galaxy web-based biomedical data analysis platform, integrating BLAST into Galaxy was a natural step for sequence comparison workflows. The command line NCBI BLAST+ tool suite was wrapped for use within Galaxy. Appropriate datatypes were defined as needed. The integration of the BLAST+ tool suite into Galaxy has the goal of making common BLAST tasks easy and advanced tasks possible. This project is an informal international collaborative effort, and is deployed and used on Galaxy servers worldwide. Several examples of applications are described here.

152 citations

Journal Article•DOI•

Comparison of variations detection between whole-genome amplification methods used in single-cell resequencing

Yong Hou, Kui Wu, Xulian Shi¹, Fuqiang Li, Luting Song, Hanjie Wu, Michael Dean, Guibo Li, Shirley Tsang, Runze Jiang, Xiaolong Zhang², Bo Li, Geng Liu, Niharika Bedekar³, Na Lu¹, Guoyun Xie, Liang Han, Liao Chang, Ting Wang⁴, Jianghao Chen⁴, Yingrui Li, Xiuqing Zhang, Huanming Yang⁵,⁶, Xun Xu, Ling Wang⁴, Jun Wang⁷,⁶•Institutions (7)

Southeast University¹, Chinese Academy of Sciences², Stanford University³, Fourth Military Medical University⁴, Zhejiang University⁵, King Abdulaziz University⁶, University of Copenhagen⁷

06 Aug 2015-GigaScience

TL;DR: These findings provide a comprehensive comparison of variations detection performance using SCRS amplified by different WGA methods, and will guide researchers to determine which WGA method is best suited to individual experimental needs at single-cell level.

Abstract: Single-cell resequencing (SCRS) provides many biomedical advances in variations detection at the single-cell level, but it currently relies on whole genome amplification (WGA). Three methods are commonly used for WGA: multiple displacement amplification (MDA), degenerate-oligonucleotide-primed PCR (DOP-PCR) and multiple annealing and looping-based amplification cycles (MALBAC). However, a comprehensive comparison of variations detection performance between these WGA methods has not yet been performed. We systematically compared the advantages and disadvantages of different WGA methods, focusing particularly on variations detection. Low-coverage whole-genome sequencing revealed that DOP-PCR had the highest duplication ratio, but an even read distribution and the best reproducibility and accuracy for detection of copy-number variations (CNVs). However, MDA had significantly higher genome recovery sensitivity (~84 %) than DOP-PCR (~6 %) and MALBAC (~52 %) at high sequencing depth. MALBAC and MDA had comparable single-nucleotide variations detection efficiency, false-positive ratio, and allele drop-out ratio. We further demonstrated that SCRS data amplified by either MDA or MALBAC from a gastric cancer cell line could accurately detect gastric cancer CNVs with comparable sensitivity and specificity, including amplifications of 12p11.22 (KRAS) and 9p24.1 (JAK2, CD274, and PDCD1LG2). Our findings provide a comprehensive comparison of variations detection performance using SCRS amplified by different WGA methods. It will guide researchers to determine which WGA method is best suited to individual experimental needs at single-cell level.

143 citations

Journal Article•DOI•

Evaluating a multigene environmental DNA approach for biodiversity assessment.

Alexei J. Drummond¹, Richard D. Newcomb²,¹, Thomas R. Buckley³,¹, Dong Xie¹, Andrew Dopheide²,¹, Benjamin C. M. Potter¹, Joseph Heled¹, Howard A. Ross¹, Leah Tooman¹,², Stefanie Grosser³,¹, Duckchul Park³, Nicholas J. Demetras⁴, Mark I. Stevens⁵,⁶, James C. Russell¹, Sandra H. Anderson¹, Anna Carter⁷,¹, Nicola J. Nelson¹,⁷•Institutions (7)

University of Auckland¹, Plant & Food Research², Landcare Research³, University of Waikato⁴, South Australian Museum⁵, University of South Australia⁶, Victoria University of Wellington⁷

06 Oct 2015-GigaScience

TL;DR: It is demonstrated that standard phylogenetic markers are capable of recovering sequences from a broad diversity of eukaryotes, in addition to prokaryotes by 16S, and the COI and 18S eDNA markers are the best proxies for aboveground biodiversity based on the high correlation between the pairwise beta diversities of these markers and those obtained using traditional methods.

Abstract: There is an increasing demand for rapid biodiversity assessment tools that have a broad taxonomic coverage. Here we evaluate a suite of environmental DNA (eDNA) markers coupled with next generation sequencing (NGS) that span the tree of life, comparing them with traditional biodiversity monitoring tools within ten 20×20 meter plots along a 700 meter elevational gradient. From six eDNA datasets (one from each of 16S, 18S, ITS, trnL and two from COI) we identified sequences from 109 NCBI taxonomy-defined phyla or equivalent, ranging from 31 to 60 for a given eDNA marker. Estimates of alpha and gamma diversity were sensitive to the number of sequence reads, whereas beta diversity estimates were less sensitive. The average within-plot beta diversity was lower than between plots for all markers. The soil beta diversity of COI and 18S markers showed the strongest response to the elevational variation of the eDNA markers (COI: r=0.49, p<0.001; 18S: r=0.48, p<0.001). Furthermore pairwise beta diversities for these two markers were strongly correlated with those calculated from traditional vegetation and invertebrate biodiversity measures. Using a soil-based eDNA approach, we demonstrate that standard phylogenetic markers are capable of recovering sequences from a broad diversity of eukaryotes, in addition to prokaryotes by 16S. The COI and 18S eDNA markers are the best proxies for aboveground biodiversity based on the high correlation between the pairwise beta diversities of these markers and those obtained using traditional methods.

136 citations

Journal Article•DOI•

RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing

Vikas Gupta¹,², April D. Estrada², Ivory C. Blakley², Robert W. Reid², Ketan Patel², Mason D. Meyer², Stig U. Andersen¹, Allan Brown³, Mary Ann Lila³, Ann E. Loraine²•Institutions (3)

Aarhus University¹, University of North Carolina at Charlotte², North Carolina State University³

13 Feb 2015-GigaScience

TL;DR: RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators, providing an important new resource enabling high throughput studies in blueberry.

Abstract: Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable the breeding of berry varieties with enhanced health benefits. Toward this end, we annotated a previously sequenced draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was up-regulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. Homology-based annotation using Blast2GO and InterPro assigned Gene Ontology terms to around 15,000 genes. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators. Analysis of RNA-seq alignments identified developmentally regulated alternative splicing, promoter use, and 3′ end formation. We report genome sequence, gene models, functional annotations, and RNA-Seq expression data that provide an important new resource enabling high throughput studies in blueberry.

125 citations

Journal Article•DOI•

Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse

Patricia A. Soranno¹, Edward G. Bissell¹, Kendra Spence Cheruvelil¹, Samuel T. Christel², Sarah M. Collins¹, C. Emi Fergus¹, Christopher T. Filstrup³, Jean-François Lapierre¹, Noah R. Lottig², Samantha K. Oliver², Caren E. Scott¹, Nicole J. Smith¹, Scott B. Stopyak¹, Shuai Yuan⁴, Mary T. Bremigan¹, John A. Downing³, Corinna Gries², Emily Norton Henry⁵, Nicholas K. Skaff¹, Emily H. Stanley², Craig A. Stow⁶, Pang-Ning Tan¹, Tyler Wagner⁷, Katherine E. Webster⁴•Institutions (7)

Michigan State University¹, University of Wisconsin-Madison², Iowa State University³, Trinity College, Dublin⁴, Oregon State University⁵, National Oceanic and Atmospheric Administration⁶, Pennsylvania State University⁷

01 Jul 2015-GigaScience

TL;DR: The authors describe building LAGOS, an integrated database of lake ecosystems, using procedures that make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata.

Abstract: Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

Journal Article•DOI•

Visualizing genome and systems biology: technologies, tools, implementation techniques and trends, past, present and future

Georgios A. Pavlopoulos¹, Dimitris Malliarakis¹, Nikolas Papanikolaou¹, Theodosis Theodosiou¹, Anton J. Enright², Ioannis Iliopoulos¹•Institutions (2)

University of Crete¹, European Bioinformatics Institute²

25 Aug 2015-GigaScience

TL;DR: This review discusses the past, present and future of genomic and systems biology visualization, focusing on the latest libraries and programming languages that enable more effective, efficient and faster approaches for visualizing biological concepts.

Abstract: “A picture is worth a thousand words.” This widely used adage sums up in a few words the notion that a successful visual representation of a concept should enable easy and rapid absorption of large amounts of information. Although, in general, the notion of capturing complex ideas using images is very appealing, would 1000 words be enough to describe the unknown in a research field such as the life sciences? Life sciences is one of the biggest generators of enormous datasets, mainly as a result of recent and rapid technological advances; their complexity can make these datasets incomprehensible without effective visualization methods. Here we discuss the past, present and future of genomic and systems biology visualization. We briefly comment on many visualization and analysis tools and the purposes that they serve. We focus on the latest libraries and programming languages that enable more effective, efficient and faster approaches for visualizing biological concepts, and also comment on the future human-computer interaction trends that would enhance visualization further.

Journal Article•DOI•

Bioboxes: standardised containers for interchangeable bioinformatics software.

Peter Belmann¹, Johannes Dröge, Andreas Bremges¹, Alice C. McHardy, Alexander Sczyrba¹, Michael D. Barton²•Institutions (2)

Bielefeld University¹, Joint Genome Institute²

15 Oct 2015-GigaScience

TL;DR: This work proposes bioboxes: containers with standardised interfaces that make bioinformatics software interchangeable, building on the potential of containerisation to solve the problems associated with sharing software.

Abstract: Software is now both central and essential to modern biology, yet lack of availability, difficult installations, and complex user interfaces make software hard to obtain and use. Containerisation, as exemplified by the Docker platform, has the potential to solve the problems associated with sharing software. We propose bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable.

Journal Article•DOI•

High-coverage sequencing and annotated assembly of the genome of the Australian dragon lizard Pogona vitticeps

Arthur Georges¹, Qiye Li², Jinmin Lian, Denis O’Meally¹, Janine E. Deakin¹, Zongji Wang³, Pei Zhang, Matthew K. Fujita⁴, Hardip R. Patel⁵, Clare E. Holleley¹, Yang Zhou, Xiuwen Zhang¹, Kazumi Matsubara¹, Paul D. Waters⁶, Jennifer A. Marshall Graves¹,⁷, Stephen D. Sarre¹, Guojie Zhang²•Institutions (7)

University of Canberra¹, University of Copenhagen², South China University of Technology³, University of Texas at Arlington⁴, Australian National University⁵, University of New South Wales⁶, La Trobe University⁷

28 Sep 2015-GigaScience

TL;DR: The quality of the P. vitticeps assembly is comparable or superior to that of other published squamate genomes, and the annotated P. vitticeps genome can be accessed through a genome browser available at https://genomics.canberra.edu.au.

Abstract: The lizards of the family Agamidae are one of the most prominent elements of the Australian reptile fauna. Here, we present a genomic resource built on the basis of a wild-caught male ZZ central bearded dragon Pogona vitticeps. The genomic sequence for P. vitticeps, generated on the Illumina HiSeq 2000 platform, comprised 317 Gbp (179X raw read depth) from 13 insert libraries ranging from 250 bp to 40 kbp. After filtering for low-quality and duplicated reads, 146 Gbp of data (83X) was available for assembly. Exceptionally high levels of heterozygosity (0.85 % of single nucleotide polymorphisms plus sequence insertions or deletions) complicated assembly; nevertheless, 96.4 % of reads mapped back to the assembled scaffolds, indicating that the assembly included most of the sequenced genome. Length of the assembly was 1.8 Gbp in 545,310 scaffolds (69,852 longer than 300 bp), the longest being 14.68 Mbp. N50 was 2.29 Mbp. Genes were annotated on the basis of de novo prediction, similarity to the green anole Anolis carolinensis, Gallus gallus and Homo sapiens proteins, and P. vitticeps transcriptome sequence assemblies, to yield 19,406 protein-coding genes in the assembly, 63 % of which had intact open reading frames. Our assembly captured 99 % (246 of 248) of core CEGMA genes, with 93 % (231) being complete. The quality of the P. vitticeps assembly is comparable or superior to that of other published squamate genomes, and the annotated P. vitticeps genome can be accessed through a genome browser available at https://genomics.canberra.edu.au .
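
Two of the figures in this abstract can be reproduced with simple arithmetic, sketched below. N50 is used here in its usual sense: the length L such that scaffolds of length at least L together contain at least half of the assembled bases. The helper `implied_genome_size` just divides sequenced bases by read depth. Both function names and the scaffold lengths in the example are made up for illustration; only the 317 Gbp / 179X and 146 Gbp / 83X figures come from the abstract.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("n50 of an empty assembly is undefined")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def implied_genome_size(total_bases, depth):
    """Genome size implied by a sequencing yield and its stated read depth."""
    return total_bases / depth


# Made-up scaffold lengths: total 30, and the two longest (10 + 8) pass half.
print(n50([10, 8, 6, 4, 2]))

# Figures from the abstract: both depths imply a genome of roughly 1.76-1.77 Gbp,
# in line with the reported 1.8 Gbp assembly length.
print(implied_genome_size(317e9, 179) / 1e9)
print(implied_genome_size(146e9, 83) / 1e9)
```

Because N50 depends only on the sorted lengths, the very large number of short scaffolds reported in the abstract has little effect on it, which is why it is usually quoted alongside the scaffold count.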

Journal Article•DOI•

Phylogenomic analyses data of the avian phylogenomics project.

Erich D. Jarvis¹, Siavash Mirarab², Andre J. Aberer³, Bo Li⁴,⁵, Peter Houde⁶, Cai Li⁵, Simon Y. W. Ho⁷, Brant C. Faircloth⁸,⁹, Benoit Nabholz¹⁰, Jason T. Howard¹, Alexander Suh¹¹, Claudia C. Weber¹¹, Rute R. da Fonseca⁵, Alonzo Alfaro-Núñez⁵, Nitish Narula¹²,⁶, Liang Liu¹³, Dave Burt¹⁴, Hans Ellegren¹¹, Scott V. Edwards¹⁵, Alexandros Stamatakis¹⁶,³, David P. Mindell¹⁷, Joel Cracraft¹⁸, Edward L. Braun¹⁹, Tandy Warnow², Wang Jun, M. Thomas P. Gilbert⁵,²⁰, Guojie Zhang⁵•Institutions (20)

12 Feb 2015-GigaScience

TL;DR: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date and the sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.

Abstract: Background: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. Findings: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. Conclusions: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.

Journal Article•DOI•

Metabolome of human gut microbiome is predictive of host dysbiosis

Peter E. Larsen¹, Yang Dai¹•Institutions (1)

University of Illinois at Chicago¹

14 Sep 2015-GigaScience

TL;DR: The ability to use machine learning to predict dysbiosis from microbiome community interaction data provides a potentially powerful tool for understanding the links between the human microbiome and human health, pointing to potential microbiome-based diagnostics and therapeutic interventions.

Abstract: Humans live in constant and vital symbiosis with a closely linked bacterial ecosystem called the microbiome, which influences many aspects of human health. When this microbial ecosystem becomes disrupted, the health of the human host can suffer, a condition called dysbiosis. However, the community compositions of human microbiomes also vary dramatically from individual to individual, and over time, making it difficult to uncover the underlying mechanisms linking the microbiome to human health. We propose that a microbiome’s interaction with its human host is not necessarily dependent upon the presence or absence of particular bacterial species, but instead is dependent on its community metabolome, an emergent property of the microbiome. Using data from a previously published, longitudinal study of microbiome populations of the human gut, we extrapolated information about microbiome community enzyme profiles and metabolome models. Using machine learning techniques, we demonstrated that the aggregate predicted community enzyme function profiles and modeled metabolomes of a microbiome are more predictive of dysbiosis than either observed microbiome community composition or predicted enzyme function profiles. Specific enzyme functions and metabolites predictive of dysbiosis provide insights into the molecular mechanisms of microbiome–host interactions. The ability to use machine learning to predict dysbiosis from microbiome community interaction data provides a potentially powerful tool for understanding the links between the human microbiome and human health, pointing to potential microbiome-based diagnostics and therapeutic interventions.

Journal Article•DOI•

Deeply sequenced metagenome and metatranscriptome of a biogas-producing microbial community from an agricultural production-scale biogas plant

Andreas Bremges¹, Irena Maus¹, Peter Belmann¹, Felix Gregor Eikmeyer¹, Anika Winkler¹, Andreas Albersmeier¹, Alfred Pühler¹, Andreas Schlüter¹, Alexander Sczyrba¹•Institutions (1)

Bielefeld University¹

30 Jul 2015-GigaScience

TL;DR: This work assembled the metagenome of a biogas-producing microbial community and showed that the assembly reconstructs most genes involved in methane metabolism, a key pathway involving methanogenesis performed by methanogenic Archaea, indicating that there is sufficient sequencing coverage for most downstream analyses.

Abstract: Background: The production of biogas takes place under anaerobic conditions and involves microbial decomposition of organic matter. Most of the participating microbes are still unknown and non-cultivable. Accordingly, shotgun metagenome sequencing currently is the method of choice to obtain insights into community composition and the genetic repertoire. Findings: Here, we report on the deeply sequenced metagenome and metatranscriptome of a complex biogas-producing microbial community from an agricultural production-scale biogas plant. We assembled the metagenome and, as an example application, show that we reconstructed most genes involved in the methane metabolism, a key pathway involving methanogenesis performed by methanogenic Archaea. This result indicates that there is sufficient sequencing coverage for most downstream analyses. Conclusions: Sequenced at least one order of magnitude deeper than previous studies, our metagenome data will enable new insights into community composition and the genetic potential of important community members. Moreover, mapping of transcripts to reconstructed genome sequences will enable the identification of active metabolic pathways in target organisms.

Journal Article•DOI•

Hybrid de novo genome assembly of the Chinese herbal plant danshen (Salvia miltiorrhiza Bunge)

Guanghui Zhang¹, Yang Tian¹,², Jing Zhang³, Li-Ping Shu⁴, Sheng-Chao Yang¹, Wen Wang⁵, Jun Sheng¹, Yang Dong⁴, Wei Chen¹•Institutions (5)

Yunnan Agricultural University¹, Jilin University², Huazhong University of Science and Technology³, Kunming University of Science and Technology⁴, Kunming Institute of Zoology⁵

14 Dec 2015-GigaScience

TL;DR: The draft danshen genome will provide a valuable resource for the investigation of novel bioactive compounds in this Chinese herb, and will help elucidate the biosynthetic pathways of important secondary metabolites, thereby advancing the investigation of novel drugs from this plant.

Abstract: Danshen (Salvia miltiorrhiza Bunge), also known as Chinese red sage, is a member of the Lamiaceae family. It is valued in traditional Chinese medicine, primarily for the treatment of cardiovascular and cerebrovascular diseases. Because of its pharmacological potential, ongoing research aims to identify novel bioactive compounds in danshen, and their biosynthetic pathways. To date, only expressed sequence tag (EST) and RNA-seq data for this herbal plant are available to the public. We therefore propose that the construction of a reference genome for danshen will help elucidate the biosynthetic pathways of important secondary metabolites, thereby advancing the investigation of novel drugs from this plant. We assembled the highly heterozygous danshen genome with the help of 395× raw read coverage using Illumina technologies and about 10× raw read coverage using single-molecule sequencing technology. The final draft genome is approximately 641 Mb, with a contig N50 size of 82.8 kb and a scaffold N50 size of 1.2 Mb. Further analyses predicted 34,598 protein-coding genes and 1,644 unique gene families in the danshen genome. The draft danshen genome will provide a valuable resource for the investigation of novel bioactive compounds in this Chinese herb.

Journal Article•DOI•

Using optical mapping data for the improvement of vertebrate genome assemblies

Kerstin Howe¹, Jonathan Wood¹•Institutions (1)

Wellcome Trust Sanger Institute¹

18 Mar 2015-GigaScience

TL;DR: This work describes how optical mapping has been used in practice to produce high quality vertebrate genome assemblies and details the efforts undertaken by the Genome Reference Consortium (GRC), which maintains the reference genomes for human, mouse, zebrafish and chicken, and uses different optical mapping platforms for genome curation.

Abstract: Optical mapping is a technology that gathers long-range information on genome sequences similar to ordered restriction digest maps. Because it is not subject to cloning, amplification, hybridisation or sequencing bias, it is ideally suited to the improvement of fragmented genome assemblies that can no longer be improved by classical methods. In addition, its low cost and rapid turnaround make it equally useful during the scaffolding process of de novo assembly from high throughput sequencing reads. We describe how optical mapping has been used in practice to produce high quality vertebrate genome assemblies. In particular, we detail the efforts undertaken by the Genome Reference Consortium (GRC), which maintains the reference genomes for human, mouse, zebrafish and chicken, and uses different optical mapping platforms for genome curation.

Journal Article•DOI•

Connectomics and new approaches for analyzing human brain functional connectivity.

R. Cameron Craddock¹,², Rosalia Tungaraza¹, Michael P. Milham¹,²•Institutions (2)

Nathan Kline Institute for Psychiatric Research¹, MIND Institute²

25 Mar 2015-GigaScience

TL;DR: This review describes several outstanding problems in brain functional connectomics with the goal of engaging researchers from a broad spectrum of data sciences to help solve these problems.

Abstract: Estimating the functional interactions between brain regions and mapping those connections to corresponding inter-individual differences in cognitive, behavioral and psychiatric domains are central pursuits for understanding the human connectome. The number and complexity of functional interactions within the connectome and the large amounts of data required to study them position functional connectivity research as a “big data” problem. Maximizing the degree to which knowledge about human brain function can be extracted from the connectome will require developing a new generation of neuroimaging analysis algorithms and tools. This review describes several outstanding problems in brain functional connectomics with the goal of engaging researchers from a broad spectrum of data sciences to help solve these problems. Additionally it provides information about open science resources consisting of raw and preprocessed data to help interested researchers get started.

Journal Article•DOI•

A single chromosome assembly of Bacteroides fragilis strain BE1 from Illumina and MinION nanopore sequencing data

Judith Risse¹, Marian Thomson¹, Sheila Patrick², Garry W. Blakely¹, Georgios Koutsovoulos¹, Mark Blaxter¹, Mick Watson¹•Institutions (2)

University of Edinburgh¹, Queen's University Belfast²

04 Dec 2015-GigaScience

TL;DR: The single chromosome assembly of Bacteroides fragilis strain BE1 was achieved using only modest amounts of data, publicly available software and commodity computing hardware.

Abstract: Background: Second and third generation sequencing technologies have revolutionised bacterial genomics. Short-read Illumina reads result in cheap but fragmented assemblies, whereas longer reads are more expensive but result in more complete genomes. The Oxford Nanopore MinION device is a revolutionary mobile sequencer that can produce thousands of long, single molecule reads.

Journal Article•DOI•

Full-length single-cell RNA-seq applied to a viral human cancer: applications to HPV expression and splicing analysis in HeLa S3 cells

Liang Wu, Xiaolong Zhang¹, Zhikun Zhao², Ling Wang³, Bo Li, Guibo Li⁴, Michael Dean, Qichao Yu¹, Yanhui Wang, Xinxin Lin, Weijian Rao, Zhanlong Mei, Yang Li, Runze Jiang, Huan Yang, Fuqiang Li, Guoyun Xie, Liqin Xu, Kui Wu, Jie Zhang, Jianghao Chen³, Ting Wang³, Karsten Kristiansen⁴, Xiuqing Zhang, Yingrui Li⁵, Huanming Yang⁶, Jian Wang⁶, Yong Hou⁴, Xun Xu•Institutions (6)

Chinese Academy of Sciences¹, Southeast University², Fourth Military Medical University³, University of Copenhagen⁴, University of Queensland⁵, Zhejiang University⁶

05 Nov 2015-GigaScience

TL;DR: A new high throughput platform to prepare single-cell RNA on a nanoliter scale based on a customized microwell chip provides a transcriptome characterization of HeLa S3 cells at the single cell level, and is a demonstration of the power of single cell RNA-seq analysis of virally infected cells and cancers.

Abstract: Viral infection causes multiple forms of human cancer, and HPV infection is the primary factor in cervical carcinomas. Recent single-cell RNA-seq studies highlight the tumor heterogeneity present in most cancers, but virally induced tumors have not been studied. HeLa is a well characterized HPV+ cervical cancer cell line. We developed a new high throughput platform to prepare single-cell RNA on a nanoliter scale based on a customized microwell chip. Using this method, we successfully amplified full-length transcripts of 669 single HeLa S3 cells and 40 of them were randomly selected to perform single-cell RNA sequencing. Based on these data, we obtained a comprehensive understanding of the heterogeneity of HeLa S3 cells in gene expression, alternative splicing and fusions. Furthermore, we identified a high diversity of HPV-18 expression and splicing at the single-cell level. By co-expression analysis we identified 283 E6, E7 co-regulated genes, including CDC25, PCNA, PLK4, BUB1B and IRF1 known to interact with HPV viral proteins. Our results reveal the heterogeneity of a virus-infected cell line. It not only provides a transcriptome characterization of HeLa S3 cells at the single cell level, but is a demonstration of the power of single cell RNA-seq analysis of virally infected cells and cancers.

Journal Article•DOI•

Benchmark datasets for 3D MALDI- and DESI-imaging mass spectrometry.

Janina Oetjen¹, Kirill Veselkov², Jeramie D. Watrous³, James S. McKenzie², Michael Becker, Lena Hauberg-Lotte, Jan Hendrik Kobarg, Nicole Strittmatter², Anna Mroz², Franziska Hoffmann⁴, Dennis Trede, Andrew Palmer, Stefan Schiffler, Klaus Steinhorst, Michaela Aichler, Robert D. Goldin², Orlando Guntinas-Lichius, Ferdinand von Eggeling, Herbert Thiele, Kathrin Maedler¹, Axel Walch, Peter Maass¹, Pieter C. Dorrestein⁵, Zoltan Takats², Theodore Alexandrov•Institutions (5)

University of Bremen¹, Imperial College London², University of California, San Diego³, University of Jena⁴, University of Montana⁵

04 May 2015-GigaScience

TL;DR: High-quality 3D imaging MS datasets from different biological systems at several labs were acquired, supplied with overview images and scripts demonstrating how to read them, and deposited into MetaboLights, an open repository for metabolomics data.

Abstract: Background: Three-dimensional (3D) imaging mass spectrometry (MS) is an analytical chemistry technique for the 3D molecular analysis of a tissue specimen, entire organ, or microbial colonies on an agar plate. 3D-imaging MS has unique advantages over existing 3D imaging techniques, offers novel perspectives for understanding the spatial organization of biological processes, and has growing potential to be introduced into routine use in both biology and medicine. Owing to the sheer quantity of data generated, the visualization, analysis, and interpretation of 3D imaging MS data remain a significant challenge. Bioinformatics research in this field is hampered by the lack of publicly available benchmark datasets needed to evaluate and compare algorithms.

Journal Article•DOI•

Optical mapping in plant comparative genomics

Haibao Tang¹,², Eric Lyons², Christopher D. Town³•Institutions (3)

Fujian Agriculture and Forestry University¹, University of Arizona², J. Craig Venter Institute³

10 Feb 2015-GigaScience

TL;DR: The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants.

Abstract: Optical mapping has been widely used to improve de novo plant genome assemblies, including rice, maize, Medicago, Amborella, tomato and wheat, with more genomes in the pipeline. Optical mapping provides long-range information of the genome and can more easily identify large structural variations. The ability of optical mapping to assay long single DNA molecules nicely complements short-read sequencing which is more suitable for the identification of small and short-range variants. Direct use of optical mapping to study population-level genetic diversity is currently limited to microbial strain typing and human diversity studies. Nonetheless, optical mapping shows great promise in the study of plant trait development, domestication and polyploid evolution. Here we review the current applications and future prospects of optical mapping in the field of plant comparative genomics.

Journal Article•DOI•

Erratum to: A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer [GigaScience. 2014;3:22.]

Joshua Quick¹, Aaron R. Quinlan², Nicholas J. Loman¹•Institutions (2)

University of Birmingham¹, University of Virginia²

13 Feb 2015-GigaScience

TL;DR: This erratum corrects the 2014 GigaScience article describing a reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer.

Abstract: This corrects the article DOI: 10.1186/2047-217X-3-22.

Journal Article•DOI•

A large and diverse collection of bovine genome sequences from the Canadian Cattle Genome Project

Paul Stothard¹, Xiaoping Liao¹,², Adriano S. Arantes¹, Mary A De Pauw¹, Colin Coros, Graham Plastow¹, Mehdi Sargolzaei³, J. J. Crowley¹, John A. Basarab¹, Flavio S Schenkel³, Stephen S. Moore¹,⁴, Stephen P. Miller¹,³,⁵•Institutions (5)

University of Alberta¹, Chinese Academy of Sciences², University of Guelph³, University of Queensland⁴, AgResearch⁵

26 Oct 2015-GigaScience

TL;DR: The large number of whole-genome sequences generated as a result of the Canadian Cattle Genome Project will contribute to ongoing work aiming to catalogue the variation that exists in cattle as well as efforts to improve traits through genotype-guided selection.

Abstract: Background: The Canadian Cattle Genome Project is a large-scale international project that aims to develop genomics-based tools to enhance the efficiency and sustainability of beef and dairy production. Obtaining DNA sequence information is an important part of achieving this goal as it facilitates efforts to associate specific DNA differences with phenotypic variation. These associations can be used to guide breeding decisions and provide valuable insight into the molecular basis of traits. Findings: We describe a dataset of 379 whole-genome sequences, taken primarily from key historic Bos taurus animals, along with the analyses that were performed to assess data quality. The sequenced animals represent ten populations relevant to beef or dairy production. Animal information (name, breed, population), sequence data metrics (mapping rate, depth, concordance), and sequence repository identifiers (NCBI BioProject and BioSample IDs) are provided to enable others to access and exploit this sequence information. Conclusions: The large number of whole-genome sequences generated as a result of this project will contribute to ongoing work aiming to catalogue the variation that exists in cattle as well as efforts to improve traits through genotype-guided selection. Studies of gene function, population structure, and sequence evolution are also likely to benefit from the availability of this resource.

Journal Article•DOI•

Addressing health disparities in Hispanic breast cancer: accurate and inexpensive sequencing of BRCA1 and BRCA2

Michael Dean, Joseph Boland, Meredith Yeager, Kate M. Im, Lisa Garland, Maria Rodriguez-Herrera, Mylen Perez, Jason Mitchell, David Roberson, Kristine Jones, Hyojung Lee, Rebecca Eggebeen, Julie Sawitzke¹, Sara Bass, Xijun Zhang, Vivian Robles, Celia Hollis, Claudia Barajas, Edna Rath², Candy Arentz³, Jose A. Figueroa³, Diane D. Nguyen³, Zeina Nahleh²•Institutions (3)

Leidos¹, Texas Tech University Health Sciences Center at El Paso², Texas Tech University Health Sciences Center³

04 Nov 2015-GigaScience

TL;DR: A nationwide trial of BRCA testing in Latin American women with breast cancer sequenced the BRCA1 and BRCA2 coding regions and splice sites at a total material cost of $85 per subject; applying this strategy on a larger scale could lead to improved cancer care of minority and underserved populations.

Abstract: Germline mutations in the BRCA1 and BRCA2 genes account for 20–25 % of inherited breast cancers and about 10 % of all breast cancer cases. Detection of BRCA mutation carriers can lead to therapeutic interventions such as mastectomy, oophorectomy, hormonal prevention therapy, improved screening, and targeted therapies such as PARP-inhibition. We estimate that African Americans and Hispanics are 4–5 times less likely to receive BRCA screening, despite having similar mutation frequencies as non-Jewish Caucasians, who have higher breast cancer mortality. To begin addressing this health disparity, we initiated a nationwide trial of BRCA testing of Latin American women with breast cancer. Patients were recruited through community organizations, clinics, public events, and by mail and Internet. Subjects completed the consent process and questionnaire, and provided a saliva sample by mail or in person. DNA from 120 subjects was used to sequence the entirety of BRCA1 and BRCA2 coding regions and splice sites, and validate pathogenic mutations, with a total material cost of $85/subject. Subjects ranged in age from 23 to 81 years (mean age, 51 years), 6 % had bilateral disease, 57 % were ER/PR+, 23 % HER2+, and 17 % had triple-negative disease. A total of seven different predicted deleterious mutations were identified, one newly described and the rest rare. In addition, four variants of unknown effect were found. Application of this strategy on a larger scale could lead to improved cancer care of minority and underserved populations.

Journal Article•DOI•

Standardizing metadata and taxonomic identification in metabarcoding studies

Leho Tedersoo¹, Kelly S. Ramirez, Rolf Henrik Nilsson², Aivi Kaljuvee³, Urmas Kõljalg³, Kessy Abarenkov¹•Institutions (3)

American Museum of Natural History¹, University of Gothenburg², University of Tartu³

31 Jul 2015-GigaScience

TL;DR: An automated workflow covering data submission, compression, storage and public access is proposed to allow easy data retrieval and inter-study communication in high-throughput sequencing-based metabarcoding studies.

Abstract: High-throughput sequencing-based metabarcoding studies produce vast amounts of ecological data, but a lack of consensus on standardization of metadata and how to refer to the species recovered severely hampers reanalysis and comparisons among studies. Here we propose an automated workflow covering data submission, compression, storage and public access to allow easy data retrieval and inter-study communication. Such standardized and readily accessible datasets facilitate data management, taxonomic comparisons and compilation of global metastudies.

Journal Article•DOI•

Improving functional magnetic resonance imaging reproducibility

Cyril Pernet¹, Jean-Baptiste Poline²•Institutions (2)

University of Edinburgh¹, University of California, Berkeley²

31 Mar 2015-GigaScience

TL;DR: Only by sharing experiments, data, metadata, derived data and analysis workflows will neuroimaging establish itself as a true data science.

Abstract: The ability to replicate an entire experiment is crucial to the scientific method. With the development of more and more complex paradigms, and the variety of analysis techniques available, fMRI studies are becoming harder to reproduce.

Journal Article•DOI•

VirAmp: a galaxy-based viral genome assembly pipeline.

Yinan Wan, Daniel W. Renner¹, Istvan Albert¹, Moriah L. Szpara¹•Institutions (1)

Pennsylvania State University¹

28 Apr 2015-GigaScience

TL;DR: VirAmp is a multi-step viral genome assembly pipeline that combines existing tools and techniques, presents them to end users via a web-enabled Galaxy interface, and automates the currently recommended best practices in a single, easy-to-use interface.

Abstract: Advances in next generation sequencing make it possible to obtain high-coverage sequence data for large numbers of viral strains in a short time. However, since most bioinformatics tools are developed for command line use, the selection and accessibility of computational tools for genome assembly and variation analysis limits the ability of individual labs to perform further bioinformatics analysis. We have developed a multi-step viral genome assembly pipeline named VirAmp, which combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze, and interpret high coverage viral sequencing data with an ease and efficiency that was not possible previously. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy-to-use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). VirAmp provides a user-friendly interface and a complete pipeline for viral genome analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon web service account. A fully functional demonstration instance of our system can be found at http://viramp.com/. We also maintain detailed documentation on each tool and methodology at http://docs.viramp.com.
