Showing papers in "GigaScience in 2016"


Journal ArticleDOI
TL;DR: The ginkgo genome consists mainly of LTR-RTs resulting from ancient gradual accumulation and two WGD events, which sheds light on sequencing large genomes, and opens an avenue for further genetic and evolutionary research.
Abstract: Ginkgo biloba L. (Ginkgoaceae) is one of the most distinctive plants. It possesses a suite of fascinating characteristics including a large genome, outstanding resistance/tolerance to abiotic and biotic stresses, and dioecious reproduction, making it an ideal model species for biological studies. However, the lack of a high-quality genome sequence has been an impediment to our understanding of its biology and evolution. The 10.61 Gb genome sequence containing 41,840 annotated genes was assembled in the present study. Repetitive sequences account for 76.58% of the assembled sequence, and long terminal repeat retrotransposons (LTR-RTs) are particularly prevalent. The diversity and abundance of LTR-RTs is due to their gradual accumulation and a remarkable amplification between 16 and 24 million years ago, and they contribute to the long introns and large genome. Whole genome duplication (WGD) may have occurred twice, with an ancient WGD consistent with that shown to occur in other seed plants, and a more recent event specific to ginkgo. Abundant gene clusters from tandem duplication were also evident, and enrichment of expanded gene families indicates a remarkable array of chemical and antibacterial defense pathways. The ginkgo genome consists mainly of LTR-RTs resulting from ancient gradual accumulation and two WGD events. The multiple defense mechanisms underlying the characteristic resilience of ginkgo are fostered by a remarkable enrichment in ancient duplicated and ginkgo-specific gene clusters. The present study sheds light on sequencing large genomes, and opens an avenue for further genetic and evolutionary research.

216 citations


Journal ArticleDOI
TL;DR: The ‘Biomes of Australian Soil Environments’ (BASE) project has generated a database of microbial diversity with associated metadata across extensive environmental gradients at continental scale, becoming the first Australian soil microbial diversity database.
Abstract: Microbial inhabitants of soils are important to ecosystem and planetary functions, yet there are large gaps in our knowledge of their diversity and ecology. The ‘Biomes of Australian Soil Environments’ (BASE) project has generated a database of microbial diversity with associated metadata across extensive environmental gradients at continental scale. As the characterisation of microbes rapidly expands, the BASE database provides an evolving platform for interrogating and integrating microbial diversity and function. BASE currently provides amplicon sequences and associated contextual data for over 900 sites encompassing all Australian states and territories, a wide variety of bioregions, vegetation and land-use types. Amplicons target bacteria, archaea and general and fungal-specific eukaryotes. The growing database will soon include metagenomics data. Data are provided in both raw sequence (FASTQ) and analysed OTU table formats and are accessed via the project’s data portal, which provides a user-friendly search tool to quickly identify samples of interest. Processed data can be visually interrogated and intersected with other Australian diversity and environmental data using tools developed by the ‘Atlas of Living Australia’. Developed within an open data framework, the BASE project is the first Australian soil microbial diversity database. The database will grow and link to other global efforts to explore microbial, plant, animal, and marine biodiversity. Its design and open access nature ensures that BASE will evolve as a valuable tool for documenting an often overlooked component of biodiversity and the many microbe-driven processes that are essential to sustain soil function and ecosystem services.

178 citations


Journal ArticleDOI
TL;DR: Although nanopore-based sequencing produces reads with lower per-base accuracy compared with other platforms, the MinION™ DNA sequencer is valuable for both high taxonomic resolution and microbial diversity analysis.
Abstract: The miniaturised and portable DNA sequencer MinION™ has been released to the scientific community within the framework of an early access programme to evaluate its application for a wide variety of genetic approaches. This technology has demonstrated great potential, especially in genome-wide analyses. In this study, we tested the ability of the MinION™ system to perform amplicon sequencing in order to design new approaches to study microbial diversity using nearly full-length 16S rDNA sequences. Using R7.3 chemistry, we generated more than 3.8 million events (nt) during a single sequencing run. These data were sufficient to reconstruct more than 90 % of the 16S rRNA gene sequences for 20 different species present in a mock reference community. After read mapping and 16S rRNA gene assembly, consensus sequences and 2d reads were recovered to assign taxonomic classification down to the species level. Additionally, we were able to measure the relative abundance of all the species present in a mock community and detected a biased species distribution originating from the PCR reaction using ‘universal’ primers. Although nanopore-based sequencing produces reads with lower per-base accuracy compared with other platforms, the MinION™ DNA sequencer is valuable for both high taxonomic resolution and microbial diversity analysis. Improvements in nanopore chemistry, such as minimising base-calling errors and the nucleotide bias reported here for 16S amplicon sequencing, will further deliver more reliable information that is useful for the specific detection of microbial species and strains in complex ecosystems.

168 citations


Journal ArticleDOI
TL;DR: The assembled draft genome of O. europaea will provide a valuable resource for the study of the evolution and domestication processes of this important tree, and allow determination of the genetic bases of key phenotypic traits.
Abstract: The Mediterranean olive tree (Olea europaea subsp. europaea) was one of the first trees to be domesticated and is currently of major agricultural importance in the Mediterranean region as the source of olive oil. The molecular bases underlying the phenotypic differences among domesticated cultivars, or between domesticated olive trees and their wild relatives, remain poorly understood. Both wild and cultivated olive trees have 46 chromosomes (2n). A total of 543 Gb of raw DNA sequence from whole genome shotgun sequencing, and a fosmid library containing 155,000 clones from a 1,000+ year-old olive tree (cv. Farga) were generated by Illumina sequencing using different combinations of mate-pair and pair-end libraries. Assembly gave a final genome with a scaffold N50 of 443 kb, and a total length of 1.31 Gb, which represents 95 % of the estimated genome length (1.38 Gb). In addition, the associated fungus Aureobasidium pullulans was partially sequenced. Genome annotation, assisted by RNA sequencing from leaf, root, and fruit tissues at various stages, resulted in 56,349 unique protein coding genes, suggesting recent genomic expansion. Genome completeness, as estimated using the CEGMA pipeline, reached 98.79 %. The assembled draft genome of O. europaea will provide a valuable resource for the study of the evolution and domestication processes of this important tree, and allow determination of the genetic bases of key phenotypic traits. Moreover, it will enhance breeding programs and the formation of new varieties.

155 citations


Journal ArticleDOI
TL;DR: No single strategy is sufficient for every scenario; thus it is often useful to combine approaches, and seven such strategies are described.
Abstract: When reporting research findings, scientists document the steps they followed so that others can verify and build upon the research. When those steps have been described in sufficient detail that others can retrace the steps and obtain similar results, the research is said to be reproducible. Computers play a vital role in many research disciplines and present both opportunities and challenges for reproducibility. Computers can be programmed to execute analysis tasks, and those programs can be repeated and shared with others. The deterministic nature of most computer programs means that the same analysis tasks, applied to the same data, will often produce the same outputs. However, in practice, computational findings often cannot be reproduced because of complexities in how software is packaged, installed, and executed—and because of limitations associated with how scientists document analysis steps. Many tools and techniques are available to help overcome these challenges; here we describe seven such strategies. With a broad scientific audience in mind, we describe the strengths and limitations of each approach, as well as the circumstances under which each might be applied. No single strategy is sufficient for every scenario; thus we emphasize that it is often useful to combine approaches.

135 citations


Journal ArticleDOI
TL;DR: INC-Seq reads enabled accurate species-level classification, identification of species at 0.1 % abundance and robust quantification of relative abundances, providing a cheap and effective approach for pathogen detection and microbiome profiling on the MinION system.
Abstract: Nanopore sequencing provides a rapid, cheap and portable real-time sequencing platform with the potential to revolutionize genomics. However, several applications are limited by relatively high single-read error rates (>10 %), including RNA-seq, haplotype sequencing and 16S sequencing. We developed the Intramolecular-ligated Nanopore Consensus Sequencing (INC-Seq) as a strategy for obtaining long and accurate nanopore reads, starting with low input DNA. Applying INC-Seq for 16S rRNA-based bacterial profiling generated full-length amplicon sequences with a median accuracy >97 %. INC-Seq reads enabled accurate species-level classification, identification of species at 0.1 % abundance and robust quantification of relative abundances, providing a cheap and effective approach for pathogen detection and microbiome profiling on the MinION system.

128 citations


Journal ArticleDOI
TL;DR: Using imaging, genetic and healthcare data, examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols are provided.
Abstract: Managing, processing and understanding big healthcare data is challenging, costly and demanding. Without a robust fundamental theory for representation, analysis and inference, a roadmap for uniform handling and analyzing of such complex data remains elusive. In this article, we outline various big data challenges, opportunities, modeling methods and software techniques for blending complex healthcare data, advanced analytic tools, and distributed scientific computing. Using imaging, genetic and healthcare data we provide examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols. Despite substantial advances, new innovative technologies need to be developed that enhance, scale and optimize the management and processing of large, complex and heterogeneous data. Stakeholder investments in data acquisition, research and development, computational infrastructure and education will be critical to realize the huge potential of big data, to reap the expected information benefits and to build lasting knowledge assets. Multi-faceted proprietary, open-source, and community developments will be essential to enable broad, reliable, sustainable and efficient data-driven discovery and analytics. Big data will affect every sector of the economy and their hallmark will be ‘team science’.

126 citations


Journal ArticleDOI
TL;DR: Mitochondrial metagenomics offers a promising avenue for unifying the ecological and evolutionary understanding of species diversity and makes it possible to obtain data on spatial and temporal turnover in whole-community phylogenetic and species composition, even in complex ecosystems where species-level taxonomy and biodiversity patterns are poorly known.
Abstract: ‘Mitochondrial metagenomics’ (MMG) is a methodology for shotgun sequencing of total DNA from specimen mixtures and subsequent bioinformatic extraction of mitochondrial sequences. The approach can be applied to phylogenetic analysis of taxonomically selected taxa, as an economical alternative to mitogenome sequencing from individual species, or to environmental samples of mixed specimens, such as from mass trapping of invertebrates. The routine generation of mitochondrial genome sequences has great potential both for systematics and community phylogenetics. Mapping of reads from low-coverage shotgun sequencing of environmental samples also makes it possible to obtain data on spatial and temporal turnover in whole-community phylogenetic and species composition, even in complex ecosystems where species-level taxonomy and biodiversity patterns are poorly known. In addition, read mapping can produce information on species biomass, and potentially allows quantification of within-species genetic variation. The success of MMG relies on the formation of numerous mitochondrial genome contigs, achievable with standard genome assemblers, but various challenges for the efficiency of assembly remain, particularly in the face of variable relative species abundance and intra-specific genetic variation. Nevertheless, several studies have demonstrated the power of mitogenomes from MMG for accurate phylogenetic placement, evolutionary analysis of species traits, biodiversity discovery and the establishment of species distribution patterns; it offers a promising avenue for unifying the ecological and evolutionary understanding of species diversity.

106 citations


Journal ArticleDOI
TL;DR: The assembled draft genome will provide a valuable resource for the study of essential developmental processes and genetic determination of important traits of the Chinese mitten crab, and also for investigating crustacean evolution.
Abstract: The Chinese mitten crab, Eriocheir sinensis, is one of the most studied and economically important crustaceans in China. Its transition from a swimming to a crawling method of movement during early development, anadromous migration during growth, and catadromous migration during breeding have been attractive features for research. However, knowledge of the underlying molecular mechanisms that regulate these processes is still very limited. A total of 258.8 gigabases (Gb) of raw reads from whole-genome sequencing of the crab were generated by the Illumina HiSeq2000 platform. The final genome assembly (1.12 Gb), about 67.5 % of the estimated genome size (1.66 Gb), is composed of 17,553 scaffolds (>2 kb) with an N50 of 224 kb. We identified 14,436 genes using AUGUSTUS, of which 7,549 were shown to have significant supporting evidence using the GLEAN pipeline. This gene number is much greater than that of the horseshoe crab, and the annotation completeness, as evaluated by CEGMA, reached 66.9 %. We report the first genome sequencing, assembly, and annotation of the Chinese mitten crab. The assembled draft genome will provide a valuable resource for the study of essential developmental processes and genetic determination of important traits of the Chinese mitten crab, and also for investigating crustacean evolution.

100 citations


Journal ArticleDOI
TL;DR: This work presents a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, strain typing and antibiotic resistance profile identification, and shows that the pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.
Abstract: The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This has great potential to shorten the sample-to-results time and is likely to have benefits such as rapid diagnosis of bacterial infection and identification of drug resistance. However, there are few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, strain typing and antibiotic resistance profile identification. Using four culture isolate samples, as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 min of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 h. While strain identification with multi-locus sequence typing required more than 15x coverage to generate confident assignments, our novel gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.

86 citations


Journal ArticleDOI
TL;DR: This work presents an end-to-end mass spectrometry metabolomics workflow in the widely used platform, Galaxy, and recommends that Galaxy-M workflow files are included within the supplementary information of publications, enabling metabolomics studies to achieve greater reproducibility.
Abstract: Metabolomics is increasingly recognized as an invaluable tool in the biological, medical and environmental sciences yet lags behind the methodological maturity of other omics fields. To achieve its full potential, including the integration of multiple omics modalities, the accessibility, standardization and reproducibility of computational metabolomics tools must be improved significantly. Here we present our end-to-end mass spectrometry metabolomics workflow in the widely used platform, Galaxy. Named Galaxy-M, our workflow has been developed for both direct infusion mass spectrometry (DIMS) and liquid chromatography mass spectrometry (LC-MS) metabolomics. The range of tools presented spans from processing of raw data, e.g. peak picking and alignment, through data cleansing, e.g. missing value imputation, to preparation for statistical analysis, e.g. normalization and scaling, and principal components analysis (PCA) with associated statistical evaluation. We demonstrate the ease of using these Galaxy workflows via the analysis of DIMS and LC-MS datasets, and provide PCA scores and associated statistics to help other users to ensure that they can accurately repeat the processing and analysis of these two datasets. Galaxy and data are all provided pre-installed in a virtual machine (VM) that can be downloaded from the GigaDB repository. Additionally, source code, executables and installation instructions are available from GitHub. The Galaxy platform has enabled us to produce an easily accessible and reproducible computational metabolomics workflow. More tools could be added by the community to expand its functionality. We recommend that Galaxy-M workflow files are included within the supplementary information of publications, enabling metabolomics studies to achieve greater reproducibility.

Journal ArticleDOI
TL;DR: The results suggest that FAM84B and the NOTCH pathway are involved in the progression of ESCC and may be potential diagnostic targets for ESCC susceptibility.
Abstract: Esophageal squamous cell carcinoma (ESCC) is the sixth most lethal cancer worldwide and the fourth most lethal cancer in China. Genomic characterization of tumors, particularly those of different stages, is likely to reveal additional oncogenic mechanisms. Although copy number alterations and somatic point mutations associated with the development of ESCC have been identified by array-based technologies and genome-wide studies, the genomic characterization of ESCCs from different stages of the disease has not been explored. Here, we have performed either whole-genome sequencing or whole-exome sequencing on 51 stage I and 53 stage III ESCC patients to characterize the genomic alterations that occur during the various clinical stages of ESCC, and further validated these changes in 36 atypical hyperplasia samples. Recurrent somatic amplifications at 8q were found to be enriched in stage I tumors and the deletions of 4p-q and 5q were particularly identified in stage III tumors. In particular, the FAM84B gene was amplified and overexpressed in preclinical and ESCC tumors. Knockdown of FAM84B in ESCC cell lines significantly reduced in vitro cell growth, migration and invasion. Although the cancer-associated genes TP53, PIK3CA, CDKN2A and their pathways showed no significant difference between stage I and stage III tumors, we identified and validated a prevalence of mutations in NOTCH1 and in the NOTCH pathway that indicate that they are involved in the preclinical and early stages of ESCC. Our results suggest that FAM84B and the NOTCH pathway are involved in the progression of ESCC and may be potential diagnostic targets for ESCC susceptibility.

Journal ArticleDOI
TL;DR: Chromosomer is a reference-based genome arrangement tool, which rapidly builds chromosomes from genome contigs or scaffolds using their alignments to a reference genome of a closely related species, and is a useful tool for genomic analysis of species without chromosome maps.
Abstract: As the number of sequenced genomes rapidly increases, chromosome assembly is becoming an even more crucial step of any genome study. Since de novo chromosome assemblies are confounded by repeat-mediated artifacts, reference-assisted assemblies that use comparative inference have become widely used, prompting the development of several reference-assisted assembly programs for prokaryotic and eukaryotic genomes. We developed Chromosomer – a reference-based genome arrangement tool, which rapidly builds chromosomes from genome contigs or scaffolds using their alignments to a reference genome of a closely related species. Chromosomer does not require mate-pair libraries and it offers a number of auxiliary tools that implement common operations accompanying the genome assembly process. Despite implementing a straightforward alignment-based approach, Chromosomer is a useful tool for genomic analysis of species without chromosome maps. Putative chromosome assemblies by Chromosomer can be used in comparative genomic analysis, genomic variation assessment, potential linkage group inference and other kinds of analysis involving contig or scaffold mapping to a high-quality assembly.

Journal ArticleDOI
TL;DR: This large-scale transcriptomic dataset provides a foundation for studies on how parasitic species with complex life cycles modulate their response to changes in biotic and abiotic conditions experienced inside their various hosts, which is a fundamental objective of parasitology.
Abstract: Schistocephalus solidus is a well-established model organism for studying the complex life cycle of cestodes and the mechanisms underlying host-parasite interactions. However, very few large-scale genetic resources for this species are available. We have sequenced and de novo-assembled the transcriptome of S. solidus using tissues from whole worms at three key developmental states - non-infective plerocercoid, infective plerocercoid and adult plerocercoid - to provide a resource for studying the evolution of complex life cycles and, more specifically, how parasites modulate their interactions with their hosts during development. The de novo transcriptome assembly reconstructed the coding sequence of 10,285 high-confidence unigenes from which 24,765 non-redundant transcripts were derived. 7,920 (77 %) of these unigenes were annotated with a protein name and 7,323 (71 %) were assigned at least one Gene Ontology term. Our raw transcriptome assembly (unfiltered transcripts) covers 92 % of the predicted transcriptome derived from the S. solidus draft genome assembly currently available on WormBase. It also provides new ecological information and orthology relationships to further annotate the current WormBase transcriptome and genome. This large-scale transcriptomic dataset provides a foundation for studies on how parasitic species with complex life cycles modulate their response to changes in biotic and abiotic conditions experienced inside their various hosts, which is a fundamental objective of parasitology. Furthermore, this resource will help in the validation of the S. solidus gene features that have been predicted based on genomic sequence.

Journal ArticleDOI
TL;DR: The new apple genome assembly will serve as a valuable resource for investigating complex apple traits at the genomic level, not only suitable for genome editing and gene cloning, but also for RNA-seq and whole-genome re-sequencing studies.
Abstract: Domesticated apple (Malus × domestica Borkh) is a popular temperate fruit with high nutrient levels and diverse flavors. In 2012, global apple production accounted for at least one tenth of all harvested fruits. A high-quality apple genome assembly is crucial for the selection and breeding of new cultivars. Currently, a single reference genome is available for apple, assembled from 16.9 × genome coverage short reads via Sanger and 454 sequencing technologies. Although a useful resource, this assembly covers only ~89 % of the non-repetitive portion of the genome, and has a relatively short (16.7 kb) contig N50 length. These downsides make it difficult to apply this reference in transcriptomic or whole-genome re-sequencing analyses. Here we present an improved hybrid de novo genomic assembly of apple (Golden Delicious), which was obtained from 76 Gb (~102 × genome coverage) Illumina HiSeq data and 21.7 Gb (~29 × genome coverage) PacBio data. The final draft genome is approximately 632.4 Mb, representing ~90 % of the estimated genome. The contig N50 size is 111,619 bp, representing a 7-fold improvement. Further annotation analyses predicted 53,922 protein-coding genes and 2,765 non-coding RNA genes. The new apple genome assembly will serve as a valuable resource for investigating complex apple traits at the genomic level. It is not only suitable for genome editing and gene cloning, but also for RNA-seq and whole-genome re-sequencing studies.

Journal ArticleDOI
TL;DR: This review discusses metabolomics research on rice to elucidate the overall regulation of metabolism as it relates to growth and to the mechanisms of adaptation to genetic modifications and environmental stresses such as fungal infections, submergence, and oxidative stress.
Abstract: Metabolomics is widely employed to monitor the cellular metabolic state and assess the quality of plant-derived foodstuffs because it can be used to manage datasets that include a wide range of metabolites in their analytical samples. In this review, we discuss metabolomics research on rice in order to elucidate the overall regulation of the metabolism as it is related to the growth and mechanisms of adaptation to genetic modifications and environmental stresses such as fungal infections, submergence, and oxidative stress. We also focus on phytochemical genomics studies based on a combination of metabolomics and quantitative trait locus (QTL) mapping techniques. In addition to starch, rice produces many metabolites that also serve as nutrients for human consumers. The outcomes of recent phytochemical genomics studies of diverse natural rice resources suggest there is potential for using further effective breeding strategies to improve the quality of ingredients in rice grains.

Journal ArticleDOI
TL;DR: High-depth sequencing of a male leopard gecko, Eublepharis macularius, yielded a 2.02 Gb genome assembly, close to the 2.23 Gb genome size estimated by k-mer analysis.
Abstract: Geckos are among the most species-rich reptile groups and the sister clade to all other lizards and snakes. Geckos possess a suite of distinctive characteristics, including adhesive digits, nocturnal activity, hard, calcareous eggshells, and a lack of eyelids. However, one gecko clade, the Eublepharidae, appears to be the exception to most of these ‘rules’ and lacks adhesive toe pads, has eyelids, and lays eggs with soft, leathery eggshells. These differences make eublepharids an important component of any investigation into the underlying genomic innovations contributing to the distinctive phenotypes in ‘typical’ geckos. We report high-depth genome sequencing, assembly, and annotation for a male leopard gecko, Eublepharis macularius (Eublepharidae). Illumina sequence data were generated from seven insert libraries (ranging from 170 bp to 20 kb), representing a raw sequencing depth of 136X from 303 Gb of data, reduced to 84X and 187 Gb after filtering. The assembled genome of 2.02 Gb was close to the 2.23 Gb estimated by k-mer analysis. Scaffold and contig N50 sizes of 664 and 20 kb, respectively, were comparable to the previously published Gekko japonicus genome. Repetitive elements accounted for 42 % of the genome. Gene annotation yielded 24,755 protein-coding genes, of which 93 % were functionally annotated. CEGMA and BUSCO assessment showed that our assembly captured 91 % (225 of 248) of the core eukaryotic genes, and 76 % of vertebrate universal single-copy orthologs. Assembly of the leopard gecko genome provides a valuable resource for future comparative genomic studies of geckos and other squamate reptiles.
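The abstract above, like several others in this listing, summarizes assembly quality with N50 sizes and completeness percentages. As a rough illustration of how such figures are commonly computed, the sketch below uses the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length). The function names `n50` and `percent` are hypothetical, the toy lengths are invented for illustration, and none of this is the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that brings the running sum to half the total is the N50.
        if running >= half_total:
            return length


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Toy lengths (made up, not taken from any assembly in this listing).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))

# Figures quoted in the abstract: 225 of 248 core eukaryotic genes,
# and a 2.02 Gb assembly against a 2.23 Gb k-mer estimate.
print(round(percent(225, 248)))
print(round(percent(2.02, 2.23)))
```

With the figures quoted in the abstract, 225 of 248 genes rounds to the reported 91 %, and the 2.02 Gb assembly is likewise about 91 % of the 2.23 Gb k-mer estimate.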

Journal ArticleDOI
R. Cameron Craddock, Pierre Bellec, Daniel S. Margules, B. Nolan Nichols, Jörg P. Pfannmöller, AmanPreet Badhwar, David N. Kennedy, Jean-Baptiste Poline, Roberto Toro, Ben Cipollini, Ariel Rokem, Daniel Clark, Krzysztof J. Gorgolewski, Daniel J. Clark, Samir Das, Cécile Madjar, Ayan Sengupta, Zia Mohades, Sebastien Dery, Weiran Deng, Eric Earl, Damion V. Demeter, Kate Mills, Glad Mihai, Luka Ruzic, Nicholas A. Ketz, Andrew E. Reineberg, Marianne C. Reddan, Anne-Lise Goddings, Javier Gonzalez-Castillo, Caroline Froehlich, Gil Dekel, Daniel S. Margulies, Ben D. Fulcher, Tristan Glatard, Reza Adalat, Natacha Beck, Rémi Bernard, Najmeh Khalili-Mahani, Pierre Rioux, M. Rousseau, Alan C. Evans, Yaroslav O. Halchenko, Matteo Visconti di Oleggio Castello, Raúl Hernández-Pérez, Edgar A. Morales, Laura V. Cuaya, Kaori L. Ito, Sook-Lei Liew, Hans J. Johnson, Erik Kan, Julia Anglin, Michael R. Borich, Neda Jahanshad, Paul M. Thompson, Marcel Falkiewicz, Julia M. Huntenburg, David H. O’Connor, Michael P. Milham, Ramon Fraga Pereira, Anibal Sólon Heinsfeld, Alexandre Rosa Franco, Augusto Buchweitz, Felipe Meneguzzi, Rickson C. Mesquita, Luis C. T. Herrera, Daniela Dentico, Vanessa Sochat, Julio E. Villalon-Reina, Eleftherios Garyfallidis
TL;DR: The 2015 Brainhack Proceedings cover topics including distributed collaboration, big data meta-analyses for clinical neuroimaging through ENIGMA wrapper scripts, and self-organization and brain function.
Abstract:
I1 Introduction to the 2015 Brainhack Proceedings. R. Cameron Craddock, Pierre Bellec, Daniel S. Margules, B. Nolan Nichols, Jorg P. Pfannmoller
A1 Distributed collaboration: the case for the enhancement of Brainspell’s interface. AmanPreet Badhwar, David Kennedy, Jean-Baptiste Poline, Roberto Toro
A2 Advancing open science through NiData. Ben Cipollini, Ariel Rokem
A3 Integrating the Brain Imaging Data Structure (BIDS) standard into C-PAC. Daniel Clark, Krzysztof J. Gorgolewski, R. Cameron Craddock
A4 Optimized implementations of voxel-wise degree centrality and local functional connectivity density mapping in AFNI. R. Cameron Craddock, Daniel J. Clark
A5 LORIS: DICOM anonymizer. Samir Das, Cecile Madjar, Ayan Sengupta, Zia Mohades
A6 Automatic extraction of academic collaborations in neuroimaging. Sebastien Dery
A7 NiftyView: a zero-footprint web application for viewing DICOM and NIfTI files. Weiran Deng
A8 Human Connectome Project Minimal Preprocessing Pipelines to Nipype. Eric Earl, Damion V. Demeter, Kate Mills, Glad Mihai, Luka Ruzic, Nick Ketz, Andrew Reineberg, Marianne C. Reddan, Anne-Lise Goddings, Javier Gonzalez-Castillo, Krzysztof J. Gorgolewski
A9 Generating music with resting-state fMRI data. Caroline Froehlich, Gil Dekel, Daniel S. Margulies, R. Cameron Craddock
A10 Highly comparable time-series analysis in Nitime. Ben D. Fulcher
A11 Nipype interfaces in CBRAIN. Tristan Glatard, Samir Das, Reza Adalat, Natacha Beck, Remi Bernard, Najmeh Khalili-Mahani, Pierre Rioux, Marc-Etienne Rousseau, Alan C. Evans
A12 DueCredit: automated collection of citations for software, methods, and data. Yaroslav O. Halchenko, Matteo Visconti di Oleggio Castello
A13 Open source low-cost device to register dog’s heart rate and tail movement. Raul Hernandez-Perez, Edgar A. Morales, Laura V. Cuaya
A14 Calculating the Laterality Index Using FSL for Stroke Neuroimaging Data. Kaori L. Ito, Sook-Lei Liew
A15 Wrapping FreeSurfer 6 for use in high-performance computing environments. Hans J. Johnson
A16 Facilitating big data meta-analyses for clinical neuroimaging through ENIGMA wrapper scripts. Erik Kan, Julia Anglin, Michael Borich, Neda Jahanshad, Paul Thompson, Sook-Lei Liew
A17 A cortical surface-based geodesic distance package for Python. Daniel S Margulies, Marcel Falkiewicz, Julia M Huntenburg
A18 Sharing data in the cloud. David O’Connor, Daniel J. Clark, Michael P. Milham, R. Cameron Craddock
A19 Detecting task-based fMRI compliance using plan abandonment techniques. Ramon Fraga Pereira, Anibal Solon Heinsfeld, Alexandre Rosa Franco, Augusto Buchweitz, Felipe Meneguzzi
A20 Self-organization and brain function. Jorg P. Pfannmoller, Rickson Mesquita, Luis C.T. Herrera, Daniela Dentico
A21 The Neuroimaging Data Model (NIDM) API. Vanessa Sochat, B Nolan Nichols
A22 NeuroView: a customizable browser-base utility. Anibal Solon Heinsfeld, Alexandre Rosa Franco, Augusto Buchweitz, Felipe Meneguzzi
A23 DIPY: Brain tissue classification. Julio E. Villalon-Reina, Eleftherios Garyfallidis

Journal ArticleDOI
TL;DR: This work presents AGOUTI (Annotated Genome Optimization Using Transcriptome Information), a tool that uses RNA sequencing data to simultaneously combine contigs into scaffolds and fragmented gene models into single models and shows that it is highly accurate and achieves greater accuracy and contiguity when compared with other existing methods.
Abstract: Genomes sequenced using short-read, next-generation sequencing technologies can have many errors and may be fragmented into thousands of small contigs. These incomplete and fragmented assemblies lead to errors in gene identification, such that single genes spread across multiple contigs are annotated as separate gene models. Such biases can confound inferences about the number and identity of genes within species, as well as gene gain and loss between species. We present AGOUTI (Annotated Genome Optimization Using Transcriptome Information), a tool that uses RNA sequencing data to simultaneously combine contigs into scaffolds and fragmented gene models into single models. We show that AGOUTI improves both the contiguity of genome assemblies and the accuracy of gene annotation, providing updated versions of each as output. Running AGOUTI on both simulated and real datasets, we show that it is highly accurate and that it achieves greater accuracy and contiguity when compared with other existing methods. AGOUTI is a powerful and effective scaffolder and, unlike most scaffolders, is expected to be more effective in larger genomes because of the commensurate increase in intron length. AGOUTI is able to scaffold thousands of contigs while simultaneously reducing the number of gene models by hundreds or thousands. The software is available free of charge under the MIT license.

Journal ArticleDOI
TL;DR: RES-Scanner, as a software package written in the Perl programming language, provides a comprehensive solution that addresses read mapping, homozygous genotype calling, de novo RNA-editing site identification and annotation for any species with matching RNA-seq and DNA-seq data.
Abstract: High-throughput sequencing (HTS) provides a powerful solution for the genome-wide identification of RNA-editing sites. However, it remains a great challenge to distinguish RNA-editing sites from genetic variants and technical artifacts caused by sequencing or read-mapping errors. Here we present RES-Scanner, a flexible and efficient software package that detects and annotates RNA-editing sites using matching RNA-seq and DNA-seq data from the same individuals or samples. RES-Scanner allows the use of both raw HTS reads and pre-aligned reads in BAM format as inputs. When inputs are HTS reads, RES-Scanner can invoke the BWA mapper to align reads to the reference genome automatically. To rigorously identify potential false positives resulting from genetic variants, we have equipped RES-Scanner with sophisticated statistical models to infer the reliability of homozygous genotypes called from DNA-seq data. These models are applicable to samples from either single individuals or a pool of multiple individuals if the ploidy information is known. In addition, RES-Scanner implements statistical tests to distinguish genuine RNA-editing sites from sequencing errors, and provides a series of sophisticated filtering options to remove false positives resulting from mapping errors. Finally, RES-Scanner can improve the completeness and accuracy of editing site identification when the data of multiple samples are available. RES-Scanner, as a software package written in the Perl programming language, provides a comprehensive solution that addresses read mapping, homozygous genotype calling, de novo RNA-editing site identification and annotation for any species with matching RNA-seq and DNA-seq data. The package is freely available.

Journal ArticleDOI
TL;DR: Variation in conopeptides among different specimens of C. betulinus was observed, suggesting intraspecific variability in toxin production at the genetic level; the novel conopeptides provide a potentially fertile resource for the development of new pharmaceuticals and a pathway for the discovery of new conotoxins.
Abstract: The venom of predatory marine cone snails mainly contains a diverse array of unique bioactive peptides commonly referred to as conopeptides or conotoxins. These peptides have proven to be valuable pharmacological probes and potential drugs because of their high specificity and affinity to important ion channels, receptors and transporters of the nervous system. Most previous studies have focused specifically on the conopeptides from piscivorous and molluscivorous cone snails, but little attention has been devoted to the dominant vermivorous species. The vermivorous Chinese tubular cone snail, Conus betulinus, is the dominant Conus species inhabiting the South China Sea. The transcriptomes of venom ducts and venom bulbs from a variety of specimens of this species were sequenced using both next-generation sequencing and traditional Sanger sequencing technologies, resulting in the identification of a total of 215 distinct conopeptides. Among these, 183 were novel conopeptides, including nine new superfamilies. It appeared that most of the identified conopeptides were synthesized in the venom duct, while a handful of conopeptides were identified only in the venom bulb and at very low levels. We identified 215 unique putative conopeptide transcripts from the combination of five transcriptomes and one EST sequencing dataset. Variation in conopeptides from different specimens of C. betulinus was observed, which suggested the presence of intraspecific variability in toxin production at the genetic level. These novel conopeptides provide a potentially fertile resource for the development of new pharmaceuticals, and a pathway for the discovery of new conotoxins.

Journal ArticleDOI
TL;DR: An international project known as the “Transcriptomes of 1,000 Fishes” (Fish-T1K) project has been established to generate RNA-seq transcriptome sequences for 1,000 diverse species of ray-finned fishes.
Abstract: Ray-finned fishes (Actinopterygii) represent more than 50 % of extant vertebrates and are of great evolutionary, ecologic and economic significance, but they are relatively underrepresented in ‘omics studies. Increased availability of transcriptome data for these species will allow researchers to better understand changes in gene expression, and to carry out functional analyses. An international project known as the “Transcriptomes of 1,000 Fishes” (Fish-T1K) project has been established to generate RNA-seq transcriptome sequences for 1,000 diverse species of ray-finned fishes. The first phase of this project has produced transcriptomes from more than 180 ray-finned fishes, representing 142 species and covering 51 orders and 109 families. Here we provide an overview of the goals of this project and the work done so far.

Journal ArticleDOI
TL;DR: The utility of skull-stripped anatomical images from the Neurofeedback sample is illustrated as a reference for comparing various automatic methods and the performance of the newly created library on independent data is evaluated.
Abstract: Skull-stripping is the procedure of removing non-brain tissue from anatomical MRI data. This procedure can be useful for calculating brain volume and for improving the quality of other image processing steps. Developing new skull-stripping algorithms and evaluating their performance requires gold standard data from a variety of different scanners and acquisition methods. We complement existing repositories with manually corrected brain masks for 125 T1-weighted anatomical scans from the Nathan Kline Institute Enhanced Rockland Sample Neurofeedback Study. Skull-stripped images were obtained using a semi-automated procedure that involved skull-stripping the data using the brain extraction based on nonlocal segmentation technique (BEaST) software, and manually correcting the worst results. Corrected brain masks were added into the BEaST library and the procedure was repeated until acceptable brain masks were available for all images. In total, 85 of the skull-stripped images were hand-edited and 40 were deemed to not need editing. The results are brain masks for the 125 images along with a BEaST library for automatically skull-stripping other data. Skull-stripped anatomical images from the Neurofeedback sample are available for download from the Preprocessed Connectomes Project. The resulting brain masks can be used by researchers to improve preprocessing of the Neurofeedback data, as training and testing data for developing new skull-stripping algorithms, and for evaluating the impact on other aspects of MRI preprocessing. We have illustrated the utility of these data as a reference for comparing various automatic methods and evaluated the performance of the newly created library on independent data.

Journal ArticleDOI
TL;DR: Brainhack events offer a novel workshop format with participant-generated content that caters to the rapidly growing open neuroscience community, including components from hackathons and unconferences, as well as parallel educational sessions.
Abstract: Brainhack events offer a novel workshop format with participant-generated content that caters to the rapidly growing open neuroscience community. Including components from hackathons and unconferences, as well as parallel educational sessions, Brainhack fosters novel collaborations around the interests of its attendees. Here we provide an overview of its structure, past events, and example projects. Additionally, we outline current innovations such as regional events and post-conference publications. Through introducing Brainhack to the wider neuroscience community, we hope to provide a unique conference format that promotes the features of collaborative, open science.

Journal ArticleDOI
TL;DR: The genomes of P. nicotianae races 0 and 1 are assembled and annotated to provide not only high-quality reference genomes of the pathogen, but also insights into the infection mechanisms of this soil-borne pathogen and its co-evolution with the host plant.
Abstract: Black shank is a severe plant disease caused by the soil-borne pathogen Phytophthora nicotianae. Two physiological races of P. nicotianae, races 0 and 1, are predominantly observed in cultivated tobacco fields around the world. Race 0 has been reported to be more aggressive, having a shorter incubation period, and causing worse root rot symptoms, while race 1 causes more severe necrosis. The molecular mechanisms underlying the difference in virulence between race 0 and 1 remain elusive. We assembled and annotated the genomes of P. nicotianae races 0 and 1, which were obtained by a combination of PacBio single-molecule real-time sequencing and second-generation sequencing (both HiSeq and MiSeq platforms). Gene family analysis revealed a highly expanded ATP-binding cassette transporter gene family in P. nicotianae. Specifically, more RxLR effector genes were found in the genome of race 0 than in that of race 1. In addition, RxLR effector genes were found to be mainly distributed in gene-sparse, repeat-rich regions of the P. nicotianae genome. These results provide not only high quality reference genomes of P. nicotianae, but also insights into the infection mechanisms of P. nicotianae and its co-evolution with the host plant. They also reveal insights into the difference in virulence between the two physiological races.

Journal ArticleDOI
TL;DR: A high-quality genome assembly is reported for a channel catfish from a breeding stock inbred in China for more than three generations and originally imported from North America; the assembly is comparable to a recent report of the “Coco” channel catfish.
Abstract: The channel catfish (Ictalurus punctatus), a species native to North America, is one of the most important commercial freshwater fish in the world, especially in the United States’ aquaculture industry. Since its introduction into China in 1984, both cultivation area and yield of this species have been dramatically increased such that China is now the leading producer of channel catfish. To aid genomic research in this species, data sets such as genetic linkage groups, long-insert libraries, physical maps, bacterial artificial chromosome (BAC) end sequences (BES), transcriptome assemblies, and reference genome sequences have been generated. Here, using diverse assembly methods, we provide a comparable high-quality genome assembly for a channel catfish from a breeding stock inbred in China for more than three generations, which was originally imported to China from North America. Approximately 201.6 gigabases (Gb) of genome reads were sequenced by the Illumina HiSeq 2000 platform. Subsequently, we generated high quality, cost-effective and easily assembled sequences of the channel catfish genome with a scaffold N50 of 7.2 Mb and 95.6 % completeness. We also predicted that the channel catfish genome contains 21,556 protein-coding genes and 275.3 Mb (megabase pairs) of repetitive sequences. We report a high-quality genome assembly of the channel catfish, which is comparable to a recent report of the “Coco” channel catfish. These generated genome data could be used as an initial platform for molecular breeding to obtain novel catfish varieties using genomic approaches.

Journal ArticleDOI
TL;DR: The only echinoderm species with a genome sequence available to date is Strongylocentrotus purpuratus (Echinoidea); echinoderms are known for their pentaradial symmetry as adults, unique water vascular system, mutable collagenous tissues, and endoskeletons of high magnesium calcite.
Abstract: There are five major extant groups of Echinodermata: Crinoidea (feather stars and sea lilies), Ophiuroidea (brittle stars and basket stars), Asteroidea (sea stars), Echinoidea (sea urchins, sea biscuits, and sand dollars), and Holothuroidea (sea cucumbers). These animals are known for their pentaradial symmetry as adults, unique water vascular system, mutable collagenous tissues, and endoskeletons of high magnesium calcite. To our knowledge, the only echinoderm species with a genome sequence available to date is Strongylocentrotus purpuratus (Echinoidea). The availability of additional echinoderm genome sequences is crucial for understanding the biology of these animals. Here we present assembled draft genomes of the brittle star Ophionereis fasciata, the sea star Patiriella regularis, and the sea cucumber Australostichopus mollis from Illumina sequence data with coverages of 125x, 225x, and 214x, respectively. These data provide a resource for mining gene superfamilies, identifying non-coding RNAs, confirming gene losses, and designing experimental constructs. They will be important comparative resources for future genomic studies in echinoderms.

Journal ArticleDOI
TL;DR: Results show that circadian controls affect diurnal CO2 and H2O flux patterns in entire canopies in field-like conditions, and its consideration significantly improves model performance.
Abstract: Molecular clocks drive oscillations in leaf photosynthesis, stomatal conductance, and other cell and leaf-level processes over ~24 h under controlled laboratory conditions. The influence of such circadian regulation over whole-canopy fluxes remains uncertain; diurnal CO2 and H2O vapor flux dynamics in the field are currently interpreted as resulting almost exclusively from direct physiological responses to variations in light, temperature and other environmental factors. We tested whether circadian regulation would affect plant and canopy gas exchange at the Montpellier European Ecotron. Canopy and leaf-level fluxes were constantly monitored under field-like environmental conditions, and under constant environmental conditions (no variation in temperature, radiation, or other environmental cues). We show direct experimental evidence at canopy scales of the circadian regulation of daytime gas exchange: 20–79 % of the daily variation range in CO2 and H2O fluxes occurred under circadian entrainment in canopies of an annual herb (bean) and of a perennial shrub (cotton). We also observed that considering circadian regulation improved performance by 8–17 % in commonly used stomatal conductance models. Our results show that circadian controls affect diurnal CO2 and H2O flux patterns in entire canopies in field-like conditions, and its consideration significantly improves model performance. Circadian controls act as a ‘memory’ of the past conditions experienced by the plant, which synchronizes metabolism across entire plant canopies.

Journal ArticleDOI
TL;DR: It is demonstrated that the extreme climate anomalies observed in most parts of South America during the current epidemic are not caused exclusively by El Niño or climate change, but by a combination of climate signals acting at multiple timescales.
Abstract: The emergence of Zika virus (ZIKV) in Latin America and the Caribbean in 2014–2016 occurred during a period of severe drought and unusually high temperatures, conditions that have been associated with the 2015–2016 El Niño event, and/or climate change; however, no quantitative assessment has been made to date. Analysis of related flaviviruses transmitted by the same vectors suggests that ZIKV dynamics are sensitive to climate seasonality and longer-term variability and trends. A better understanding of the climate conditions conducive to the 2014–2016 epidemic may permit the development of climate-informed short and long-term strategies for ZIKV prevention and control. Using a novel timescale-decomposition methodology, we demonstrate that the extreme climate anomalies observed in most parts of South America during the current epidemic are not caused exclusively by El Niño or climate change, but by a combination of climate signals acting at multiple timescales. In Brazil, the dry conditions present in 2013–2015 are primarily explained by year-to-year variability superimposed on decadal variability, but with little contribution of long-term trends. In contrast, the warm temperatures of 2014–2015 resulted from the compound effect of climate change, decadal and year-to-year climate variability. ZIKV response strategies made in Brazil during the drought concurrent with the 2015–2016 El Niño event may require revision in light of the likely return of rainfall associated with the borderline La Niña event expected in 2016–2017. Temperatures are likely to remain warm given the importance of long term and decadal scale climate signals.

Journal ArticleDOI
TL;DR: Extensive genomic resources for the scabies mite are developed, including reference genomes and a preliminary annotation of this reference comprising 13,226 putative coding sequences based on sequence similarity to known proteins.
Abstract: The scabies mite, Sarcoptes scabiei, is a parasitic arachnid and cause of the infectious skin disease scabies in humans and mange in other animal species. Scabies infections are a major health problem, particularly in remote Indigenous communities in Australia, where secondary group A streptococcal and Staphylococcus aureus infections of scabies sores are thought to drive the high rate of rheumatic heart disease and chronic kidney disease. We sequenced the genome of two samples of Sarcoptes scabiei var. hominis obtained from unrelated patients with crusted scabies located in different parts of northern Australia using the Illumina HiSeq. We also sequenced samples of Sarcoptes scabiei var. suis from a pig model. Because of the small size of the scabies mite, these data are derived from pools of thousands of mites and are metagenomic, including host and microbiome DNA. We performed cleaning and de novo assembly and present Sarcoptes scabiei var. hominis and var. suis draft reference genomes. We have constructed a preliminary annotation of this reference comprising 13,226 putative coding sequences based on sequence similarity to known proteins. We have developed extensive genomic resources for the scabies mite, including reference genomes and a preliminary annotation.