
Showing papers in "GigaScience in 2014"


Journal ArticleDOI
TL;DR: How to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize the gene and species trees of the 1KP project.
Abstract: The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees. Users can develop computational pipelines to analyse these data, in conjunction with data of their own that they can upload. Computationally estimated protein-protein interactions and biochemical pathways can be visualized at another site. Finally, we comment on our future plans and how they fit within this scalable system for the dissemination, visualization, and analysis of large multi-species data sets.

521 citations


Journal ArticleDOI
TL;DR: An international resequencing effort of 3,000 rice genomes serves as a foundation for large-scale discovery of novel alleles for important rice phenotypes using various bioinformatics and/or genetic approaches and to understand the genomic diversity within O. sativa at a higher level of detail.
Abstract: Background: Rice, Oryza sativa L., is the staple food for half the world’s population. By 2030, the production of rice must increase by at least 25% in order to keep up with global population growth and demand. Accelerated genetic gains in rice improvement are needed to mitigate the effects of climate change and loss of arable land, as well as to ensure a stable global food supply.

385 citations


Journal ArticleDOI
TL;DR: The immediate challenge now is to comprehensively and systematically mine this dataset to link genotypic variation to functional variation with the ultimate goal of creating new and sustainable rice varieties that can support a future world population that will approach 9.6 billion by 2050.
Abstract: Rice is the world’s most important staple grown by millions of small-holder farmers. Sustaining rice production relies on the intelligent use of rice diversity. The 3,000 Rice Genomes Project is a giga-dataset of publicly available genome sequences (averaging 14× depth of coverage) derived from 3,000 accessions of rice with global representation of genetic and functional diversity. The seed of these accessions is available from the International Rice Genebank Collection. Together, they are an unprecedented resource for advancing rice science and breeding technology. Our immediate challenge now is to comprehensively and systematically mine this dataset to link genotypic variation to functional variation with the ultimate goal of creating new and sustainable rice varieties that can support a future world population that will approach 9.6 billion by 2050.

297 citations


Journal ArticleDOI
TL;DR: A read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 is presented to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding.
Abstract: The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. The MinION™ measures the change in current resulting from DNA strands interacting with a charged protein nanopore. These measurements can then be used to deduce the underlying nucleotide sequence. We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION™ Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods.

224 citations


Journal ArticleDOI
TL;DR: This study highlights genome mapping technology as a comprehensive and cost-effective method for detecting structural variation and studying complex regions in the human genome, as well as deciphering viral integration into the host genome.
Abstract: Structural variants (SVs) are less common than single nucleotide polymorphisms and indels in the population, but collectively account for a significant fraction of genetic polymorphism and diseases. Base pair differences arising from SVs are on a much higher order (>100 fold) than point mutations; however, none of the current detection methods are comprehensive, and currently available methodologies are incapable of providing sufficient resolution and unambiguous information across complex regions in the human genome. To address these challenges, we applied a high-throughput, cost-effective genome mapping technology to comprehensively discover genome-wide SVs and characterize complex regions of the YH genome using long single molecules (>150 kb) in a global fashion. Utilizing nanochannel-based genome mapping technology, we obtained 708 insertions/deletions and 17 inversions larger than 1 kb. Excluding the 59 SVs (54 insertions/deletions, 5 inversions) that overlap with N-base gaps in the reference assembly hg19, 666 non-gap SVs remained, and 396 of them (60%) were verified by paired-end data from whole-genome sequencing-based re-sequencing or de novo assembly sequence from fosmid data. Of the remaining 270 SVs, 260 are insertions and 213 overlap known SVs in the Database of Genomic Variants. Overall, 609 out of 666 (90%) variants were supported by experimental orthogonal methods or historical evidence in public databases. At the same time, genome mapping also provides valuable information for complex regions with haplotypes in a straightforward fashion. In addition, with long single-molecule labeling patterns, exogenous viral sequences were mapped on a whole-genome scale, and sample heterogeneity was analyzed at a new level. 
Our study highlights genome mapping technology as a comprehensive and cost-effective method for detecting structural variation and studying complex regions in the human genome, as well as deciphering viral integration into the host genome.

164 citations


Journal ArticleDOI
TL;DR: In this article, the authors have implemented a complete genomics toolkit and annotation in a cloud-based Galaxy, called CGtag (Complete Genomics Toolkit and Annotation in a Cloudbased Galaxy), for the selection of candidate mutations from Complete Genomics data.
Abstract: Complete Genomics provides an open-source suite of command-line tools for the analysis of their CG-formatted mapped sequencing files. Determining, for example, the functional impact of detected variants requires annotation with various databases that often require command-line and/or programming experience, thus limiting their use by the average research scientist. We have therefore implemented this CG toolkit, together with a number of annotation, visualisation and file manipulation tools, in Galaxy as CGtag (Complete Genomics Toolkit and Annotation in a Cloud-based Galaxy). In order to provide research scientists with web-based, simple and accurate analytical and visualisation applications for the selection of candidate mutations from Complete Genomics data, we have implemented the open-source Complete Genomics tool set, CGATools, in Galaxy. In addition, we implemented some of the most popular command-line annotation and visualisation tools to allow research scientists to select candidate pathological mutations (SNVs and indels). Furthermore, we have developed a cloud-based public Galaxy instance to host the CGtag toolkit and other associated modules. CGtag provides a user-friendly interface to all research scientists wishing to select candidate variants from CG or other next-generation sequencing platforms’ data. By using a cloud-based infrastructure, we can also assure sufficient and on-demand computation and storage resources to handle the analysis tasks. The tools are freely available for use from an NBIC/CTMM-TraIT (The Netherlands Bioinformatics Center/Center for Translational Molecular Medicine) cloud-based Galaxy instance, or can be installed to a local (production) Galaxy via the NBIC Galaxy tool shed.

132 citations


Journal ArticleDOI
TL;DR: The Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date and the genomic data presented here is expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, development biology, and other related areas.
Abstract: Background: The evolutionary relationships of modern birds are among the most challenging to understand in systematic biology and have been debated for centuries. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders, and used the genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomics analyses (Jarvis et al. in press; Zhang et al. in press). Here we release assemblies and datasets associated with the comparative genome analyses, which include 38 newly sequenced avian genomes plus previously released or simultaneously released genomes of Chicken, Zebra finch, Turkey, Pigeon, Peregrine falcon, Duck, Budgerigar, Adelie penguin, Emperor penguin and the Medium Ground Finch. We hope that this resource will serve future efforts in phylogenomics and comparative genomics. Findings: The 38 bird genomes were sequenced using the Illumina HiSeq 2000 platform and assembled using a whole genome shotgun strategy. The 48 genomes were categorized into two groups according to the N50 scaffold size of the assemblies: a high depth group comprising 23 species sequenced at high coverage (>50X) with multiple insert size libraries resulting in N50 scaffold sizes greater than 1 Mb (except the White-throated Tinamou and Bald Eagle); and a low depth group comprising 25 species sequenced at a low coverage (~30X) with two insert size libraries resulting in an average N50 scaffold size of about 50 kb. Repetitive elements comprised 4%-22% of the bird genomes. The assembled scaffolds allowed the homology-based annotation of 13,000 to 17,000 protein coding genes in each avian genome relative to chicken, zebra finch and human, as well as comparative and sequence conservation analyses.
Conclusions: Here we release full genome assemblies of 38 newly sequenced avian species, link to genome assembly downloads for 7 of the remaining 10 species, and provide a guideline of genomic data that has been generated and used in our Avian Phylogenomics Project. To the best of our knowledge, the Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date. The genomic data presented here is expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, development biology, and other related areas.
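The N50 scaffold size used in this abstract to split the assemblies into two groups can be illustrated with a minimal sketch. It assumes the common definition of N50 (the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly length). The function name `n50` and the toy lengths are hypothetical and are not taken from the project's data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Assumed definition: sort scaffolds from longest to shortest and
    return the length at which the cumulative sum first reaches at
    least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly: total length 20, so half is 10.
# Longest first: 6, then 6 + 5 = 11 >= 10, so the N50 is 5.
toy_scaffolds = [2, 3, 4, 5, 6]
print(n50(toy_scaffolds))
```

Under this definition a handful of long scaffolds dominates the statistic, which is consistent with the multi-library, high-coverage group reaching much larger N50 values than the low-coverage group.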

129 citations


Journal Article
TL;DR: This work has implemented the open-source Complete Genomics tool set, CGATools, in Galaxy, and implemented some of the most popular command-line annotation and visualisation tools to allow research scientists to select candidate pathological mutations.
Abstract: Background: Complete Genomics provides an open-source suite of command-line tools for the analysis of their CG-formatted mapped sequencing files. Determining, for example, the functional impact of detected variants requires annotation with various databases that often require command-line and/or programming experience, thus limiting their use by the average research scientist. We have therefore implemented this CG toolkit, together with a number of annotation, visualisation and file manipulation tools, in Galaxy as CGtag (Complete Genomics Toolkit and Annotation in a Cloud-based Galaxy). Findings: In order to provide research scientists with web-based, simple and accurate analytical and visualisation applications for the selection of candidate mutations from Complete Genomics data, we have implemented the open-source Complete Genomics tool set, CGATools, in Galaxy. In addition, we implemented some of the most popular command-line annotation and visualisation tools to allow research scientists to select candidate pathological mutations (SNVs and indels). Furthermore, we have developed a cloud-based public Galaxy instance to host the CGtag toolkit and other associated modules. Conclusions: CGtag provides a user-friendly interface to all research scientists wishing to select candidate variants from CG or other next-generation sequencing platforms' data. By using a cloud-based infrastructure, we can also assure sufficient and on-demand computation and storage resources to handle the analysis tasks. The tools are freely available for use from an NBIC/CTMM-TraIT (The Netherlands Bioinformatics Center/Center for Translational Molecular Medicine) cloud-based Galaxy instance, or can be installed to a local (production) Galaxy via the NBIC Galaxy tool shed.

120 citations


Journal ArticleDOI
TL;DR: Predictive models have been used on neuroimaging data to ask new questions, i.e., to uncover new aspects of cognitive organization, and a statistical learning perspective on this progress and on the remaining gaping holes is given.
Abstract: Functional brain images are rich and noisy data that can capture indirect signatures of neural activity underlying cognition in a given experimental setting. Can data mining leverage them to build models of cognition? Only if it is applied to well-posed questions, crafted to reveal cognitive mechanisms. Here we review how predictive models have been used on neuroimaging data to ask new questions, i.e., to uncover new aspects of cognitive organization. We also give a statistical learning perspective on these progresses and on the remaining gaping holes.

108 citations


Journal ArticleDOI
TL;DR: This study provides the first evaluation of the clinical outcomes of NGS-based preimplantation genetic diagnosis/screening compared with single nucleotide polymorphism (SNP) array-based PGD/PGS and shows the reliability of this method in a clinical and array- based laboratory setting.
Abstract: Background: Next generation sequencing (NGS) is now being used for detecting chromosomal abnormalities in blastocyst trophectoderm (TE) cells from in vitro fertilized embryos. However, few data are available regarding the clinical outcome, which provides vital reference for further application of the methodology. Here, we present a clinical evaluation of NGS-based preimplantation genetic diagnosis/screening (PGD/PGS) compared with single nucleotide polymorphism (SNP) array-based PGD/PGS as a control. Results: A total of 395 couples participated. They were carriers of either translocation or inversion mutations, or were patients with recurrent miscarriage and/or advanced maternal age. A total of 1,512 blastocysts were biopsied on D5 after fertilization, with 1,058 blastocysts set aside for SNP array testing and 454 blastocysts for NGS testing. In the NGS cycles group, the implantation, clinical pregnancy and miscarriage rates were 52.6% (60/114), 61.3% (49/80) and 14.3% (7/49), respectively. In the SNP array cycles group, the implantation, clinical pregnancy and miscarriage rates were 47.6% (139/292), 56.7% (115/203) and 14.8% (17/115), respectively. The outcome measures of both the NGS and SNP array cycles were the same with insignificant differences. There were 150 blastocysts that underwent both NGS and SNP array analysis, of which seven blastocysts were found with inconsistent signals. All other signals obtained from NGS analysis were confirmed to be accurate by validation with qPCR. The relative copy number of mitochondrial DNA (mtDNA) for each blastocyst that underwent NGS testing was evaluated, and a significant difference was found between the copy number of mtDNA for the euploid and the chromosomally abnormal blastocysts. So far, out of 42 ongoing pregnancies, 24 babies were born in NGS cycles; all of these babies are healthy and free of any developmental problems. 
Conclusions: This study provides the first evaluation of the clinical outcomes of NGS-based pre-implantation genetic diagnosis/screening, and shows the reliability of this method in a clinical and array-based laboratory setting. NGS provides an accurate approach to detect embryonic imbalanced segmental rearrangements, to avoid the potential risks of false signals from SNP array in this study.

102 citations


Journal ArticleDOI
TL;DR: Across several quality metrics, these budgerigar assemblies are comparable to or better than the chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to analyze regions that are difficult to sequence and assemble.
Abstract: Background: Parrots belong to a group of behaviorally advanced vertebrates and have an advanced ability of vocal learning relative to other vocal-learning birds. They can imitate human speech, synchronize their body movements to a rhythmic beat, and understand complex concepts of referential meaning to sounds. However, little is known about the genetics of these traits. Elucidating the genetic bases would require whole genome sequencing and a robust assembly of a parrot genome. Findings: We present a genomic resource for the budgerigar, an Australian Parakeet (Melopsittacus undulatus) – the most widely studied parrot species in neuroscience and behavior. We present genomic sequence data that includes over 300× raw read coverage from multiple sequencing technologies and chromosome optical maps from a single male animal. The reads and optical maps were used to create three hybrid assemblies representing some of the largest genomic scaffolds to date for a bird; two of which were annotated based on similarities to reference sets of non-redundant human, zebra finch and chicken proteins, and budgerigar transcriptome sequence assemblies. The sequence reads for this project were in part generated and used for both the Assemblathon 2 competition and the first de novo assembly of a giga-scale vertebrate genome utilizing PacBio single-molecule sequencing. Conclusions: Across several quality metrics, these budgerigar assemblies are comparable to or better than the chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to analyze regions that are difficult to sequence and assemble, including those not yet assembled in prior bird genomes, and promoter regions of genes differentially regulated in vocal learning brain regions. This work provides valuable data and material for genome technology development and for investigating the genomics of complex behavioral traits.

Journal ArticleDOI
TL;DR: Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion.
Abstract: Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of “living fossils.” As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses. Here we apply a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab, Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers grouped into 1,876 distinct genetic intervals and 5,775 candidate conserved protein coding genes. Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion. These results provide a counter-example to the often noted correlation between whole genome duplication and evolutionary radiations. The new, low-cost genetic mapping method for obtaining a chromosome-scale view of non-model organism genomes that we demonstrate here does not require laboratory culture, and is potentially applicable to a broad range of other species.

Journal ArticleDOI
TL;DR: An open dataset is presented – the first of its kind – to the radiation oncology community, which will allow researchers to compare methods for optimizing radiation dose delivery.
Abstract: We provide common datasets (which we call the CORT dataset: common optimization for radiation therapy) that researchers can use when developing and contrasting radiation treatment planning optimization algorithms. The datasets allow researchers to make one-to-one comparisons of algorithms in order to solve various instances of the radiation therapy treatment planning problem in intensity modulated radiation therapy (IMRT), including beam angle optimization, volumetric modulated arc therapy and direct aperture optimization. We provide datasets for a prostate case, a liver case, a head and neck case, and a standard IMRT phantom. We provide the dose-influence matrix from a variety of beam/couch angle pairs for each dataset. The dose-influence matrix is the main entity needed to perform optimizations: it contains the dose to each patient voxel from each pencil beam. In addition, the original Digital Imaging and Communications in Medicine (DICOM) computed tomography (CT) scan, as well as the DICOM structure file, are provided for each case. Here we present an open dataset – the first of its kind – to the radiation oncology community, which will allow researchers to compare methods for optimizing radiation dose delivery.
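The role of the dose-influence matrix described in this abstract can be sketched in a few lines. The example assumes the usual linear model in which the dose to each voxel is the matrix entry for that voxel and pencil beam multiplied by the beam's weight, summed over beams. The matrix values, the weights and the function name `dose_per_voxel` are hypothetical toy choices, not values from the CORT dataset.

```python
def dose_per_voxel(dose_influence, beamlet_weights):
    """Compute the dose to each voxel as a matrix-vector product.

    dose_influence[i][j] is assumed to hold the dose delivered to
    voxel i by pencil beam j at unit weight.
    """
    return [
        sum(entry * weight for entry, weight in zip(row, beamlet_weights))
        for row in dose_influence
    ]


# Hypothetical toy case: 3 voxels (rows) and 3 pencil beams (columns).
toy_matrix = [
    [1.0, 0.0, 0.5],
    [0.25, 1.0, 0.0],
    [0.0, 0.5, 1.0],
]
toy_weights = [2.0, 1.0, 4.0]

dose = dose_per_voxel(toy_matrix, toy_weights)
print(dose)
```

Because the dose is linear in the weights under this assumption, different optimization algorithms can be compared on the same matrix by swapping only the weight vector they produce.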

Journal ArticleDOI
TL;DR: Some of the ways in which GO can change should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets.
Abstract: The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.

Journal ArticleDOI
TL;DR: Analysis of effective population sizes reveals that the two penguin species experienced population expansions from ~1 million years ago to ~100 thousand years ago, but responded differently to the climatic cooling of the last glacial period.
Abstract: Background: Penguins are flightless aquatic birds widely distributed in the Southern Hemisphere. The distinctive morphological and physiological features of penguins allow them to live an aquatic life, and some of them have successfully adapted to the hostile environments in Antarctica. To study the phylogenetic and population history of penguins and the molecular basis of their adaptations to Antarctica, we sequenced the genomes of the two Antarctic dwelling penguin species, the Adelie penguin [Pygoscelis adeliae] and emperor penguin [Aptenodytes forsteri]. Results: Phylogenetic dating suggests that early penguins arose ~60 million years ago, coinciding with a period of global warming. Analysis of effective population sizes reveals that the two penguin species experienced population expansions from ~1 million years ago to ~100 thousand years ago, but responded differently to the climatic cooling of the last glacial period. Comparative genomic analyses with other available avian genomes identified molecular changes in genes related to epidermal structure, phototransduction, lipid metabolism, and forelimb morphology. Conclusions: Our sequencing and initial analyses of the first two penguin genomes provide insights into the timing of penguin origin, fluctuations in effective population sizes of the two penguin species over the past 10 million years, and the potential associations between these biological patterns and global climate change. The molecular changes compared with other avian genomes reflect both shared and diverse adaptations of the two penguin species to the Antarctic environment.

Journal ArticleDOI
TL;DR: It is explained why the fern clade is pivotal for understanding genome evolution across land plants, and a rationale for how knowledge of fern genomes will enable progress in research beyond the ferns themselves is provided.
Abstract: Ferns are the only major lineage of vascular plants not represented by a sequenced nuclear genome. This lack of genome sequence information significantly impedes our ability to understand and reconstruct genome evolution not only in ferns, but across all land plants. Azolla and Ceratopteris are ideal and complementary candidates to be the first ferns to have their nuclear genomes sequenced. They differ dramatically in genome size, life history, and habit, and thus represent the immense diversity of extant ferns. Together, this pair of genomes will facilitate myriad large-scale comparative analyses across ferns and all land plants. Here we review the unique biological characteristics of ferns and describe a number of outstanding questions in plant biology that will benefit from the addition of ferns to the set of taxa with sequenced nuclear genomes. We explain why the fern clade is pivotal for understanding genome evolution across land plants, and we provide a rationale for how knowledge of fern genomes will enable progress in research beyond the ferns themselves.

Journal ArticleDOI
TL;DR: Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region, and the approach is applied to the GWAS analysis of height.
Abstract: The aim of a genome-wide association study (GWAS) is to isolate DNA markers for variants affecting phenotypes of interest. This is constrained by the fact that the number of markers often far exceeds the number of samples. Compressed sensing (CS) is a body of theory regarding signal recovery when the number of predictor variables (i.e., genotyped markers) exceeds the sample size. Its applicability to GWAS has not been investigated. Using CS theory, we show that all markers with nonzero coefficients can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability equal to one (h² = 1), there is a sharp phase transition from poor performance to complete selection as the sample size is increased. For heritability below one, complete selection still occurs, but the transition is smoothed. We find for h² ∼ 0.5 that a sample size of approximately thirty times the number of markers with nonzero coefficients is sufficient for full selection. This boundary is only weakly dependent on the number of genotyped markers. Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region. Given a limited sample size, it is possible to discover a phase transition by increasing the penalization; in this case a subset of the support may be recovered. Applying this approach to the GWAS analysis of height, we show that 70-100% of the selected markers are strongly correlated with height-associated markers identified by the GIANT Consortium.
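The idea of selecting a sparse set of markers when markers outnumber samples can be illustrated with a small simulation. The abstract does not name its algorithm, so this sketch assumes an L1-penalized (lasso) regression solved by cyclic coordinate descent, which is one standard compressed-sensing approach and not necessarily the authors' method. The simulated Gaussian markers, the noise-free phenotype (corresponding to heritability of one), the effect sizes, the penalty of 0.1 and the reporting cutoff of 0.5 are all hypothetical choices.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimise (1/2n)*||y - X b||^2 + penalty*||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    columns = [[X[i][j] for i in range(n)] for j in range(p)]
    col_norm = [sum(v * v for v in col) / n for col in columns]
    beta = [0.0] * p
    residual = list(y)  # equals y - X b because b starts at zero
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Correlation of marker j with the residual that excludes its own effect.
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_norm[j] * beta[j]
            new_beta = soft_threshold(rho, penalty) / col_norm[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Hypothetical simulation: 60 samples, 150 markers, 3 causal markers.
rng = random.Random(0)
n_samples, n_markers = 60, 150
causal = {5: 1.0, 40: -1.0, 99: 1.0}
X = [[rng.gauss(0.0, 1.0) for _ in range(n_markers)] for _ in range(n_samples)]
y = [sum(effect * row[j] for j, effect in causal.items()) for row in X]

beta = lasso_coordinate_descent(X, y, penalty=0.1)
selected = sorted(j for j, b in enumerate(beta) if abs(b) > 0.5)
print(selected)
```

Here the sample size is twenty times the number of causal markers even though markers outnumber samples, which is the sparse regime the abstract describes. Real genotype data are discrete and correlated, so this toy design only illustrates the selection step.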

Journal ArticleDOI
TL;DR: An extensive prediction of effective TALEN and CRISPR/Cas9 target sites in the genomes of a broad range of taxonomic species is generated, which may offer an exciting prospect for connecting the gap between DNA sequence and phenotype in the near future.
Abstract: Genetic modification has long provided an approach for “reverse genetics”, analyzing gene function and linking DNA sequence to phenotype. However, traditional genome editing technologies have not kept pace with the soaring progress of the genome sequencing era, because their methods are inefficient, time-consuming and labor-intensive. Recently invented genome modification technologies, such as ZFN (Zinc Finger Nuclease), TALEN (Transcription Activator-Like Effector Nuclease), and CRISPR/Cas9 nuclease (Clustered Regularly Interspaced Short Palindromic Repeats/Cas9 nuclease), can initiate genome editing easily, precisely and with no limitations by organism. These new tools have also offered intriguing possibilities for conducting functional large-scale experiments. In this review, we begin with a brief introduction of ZFN, TALEN, and CRISPR/Cas9 technologies, then generate an extensive prediction of effective TALEN and CRISPR/Cas9 target sites in the genomes of a broad range of taxonomic species. Based on the evidence, we highlight the potential and practicalities of TALEN and CRISPR/Cas9 editing in non-model organisms, and also compare the technologies and test interesting issues such as the functions of candidate domesticated, as well as candidate genes in life-environment interactions. When accompanied with a high-throughput sequencing platform, we forecast their potential revolutionary impacts on evolutionary and ecological research, which may offer an exciting prospect for connecting the gap between DNA sequence and phenotype in the near future.

Journal ArticleDOI
TL;DR: This paper presents a curated repository of multielectrode array recordings of spontaneous activity in developing mouse and ferret retina, and describes the structure of the data, along with examples of reproducible research using these data files.
Abstract: During early development, neural circuits fire spontaneously, generating activity episodes with complex spatiotemporal patterns. Recordings of spontaneous activity have been made in many parts of the nervous system over the last 25 years, reporting developmental changes in activity patterns and the effects of various genetic perturbations. We present a curated repository of multielectrode array recordings of spontaneous activity in developing mouse and ferret retina. The data have been annotated with minimal metadata and converted into HDF5. This paper describes the structure of the data, along with examples of reproducible research using these data files. We also demonstrate how these data can be analysed in the CARMEN workflow system. This article is written as a literate programming document; all programs and data described here are freely available. 1. We hope this repository will lead to novel analysis of spontaneous activity recorded in different laboratories. 2. We encourage published data to be added to the repository. 3. This repository serves as an example of how multielectrode array recordings can be stored for long-term reuse.

Journal ArticleDOI
Neil M Davies, Dawn Field, Linda A. Amaral-Zettler, Melody S. Clark, John Deck, Alexei J. Drummond, Daniel P. Faith, Jonathan B. Geller, Jack A. Gilbert, Frank Oliver Glöckner, Penny R. Hirsch, Jo-Ann Leong, Christopher P. Meyer, Matthias Obst, Serge Planes, Chris Scholin, Alfried P. Vogler, Ruth D. Gates, Robert J. Toonen, Véronique Berteaux-Lecellier, Michèle Barbier, Katherine Barker, Stefan Bertilsson, Mesude Bicak, Matthew J. Bietz, Jason Bobe, Levente Bodrossy, Ángel Borja, Jonathan A. Coddington, Jed A. Fuhrman, Gunnar Gerdts, Rosemary G. Gillespie, Kelly D. Goodwin, Paul C. Hanson, Jean-Marc Hero, David Hoekman, Janet K. Jansson, Christian Jeanthon, Rebecca Hufft Kao, Anna Klindworth, Rob Knight, Renzo Kottmann, Michelle S. Koo, Georgios Kotoulas, Andrew J. Lowe, Viggo Marteinsson, Folker Meyer, Norman Morrison, David D. Myrold, Evangelos Pafilis, Stephanie M. Parker, J. Jacob Parnell, Paraskevi N. Polymenakou, Sujeevan Ratnasingham, George K. Roderick, Naiara Rodríguez-Ezpeleta, Karsten Schönrogge, Nathalie Simon, Nathalie J. Valette-Silver, Yuri P. Springer, Graham N. Stone, Steve Stones-Havas, Susanna-Assunta Sansone, Kate M Thibault, Patricia Wecker, Antje Wichels, John Wooley, Tetsukazu Yahara, Adriana Zingone
TL;DR: The co-authors of this paper state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter, and to describe their shared vision for its future.
Abstract: The co-authors of this paper hereby state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter. We define a Genomic Observatory as an ecosystem and/or site subject to long-term scientific research, including (but not limited to) the sustained study of genomic biodiversity from single-celled microbes to multicellular organisms. An international group of 64 scientists first published the call for a global network of Genomic Observatories in January 2012. The vision for such a network was expanded in a subsequent paper and developed over a series of meetings in Bremen (Germany), Shenzhen (China), Moorea (French Polynesia), Oxford (UK), Pacific Grove (California, USA), Washington (DC, USA), and London (UK). While this community-building process continues, here we express our mutual intent to establish the GOs Network formally, and to describe our shared vision for its future. The views expressed here are ours alone as individual scientists, and do not necessarily represent those of the institutions with which we are affiliated.

Journal ArticleDOI
TL;DR: Several algorithms and methods for building consensus optical maps and aligning restriction patterns to a reference map are reviewed, as well as methods for using optical maps with sequence assemblies.
Abstract: Optical mapping and newer genome mapping technologies based on nicking enzymes provide low-resolution but long-range genomic information. The optical mapping technique has been successfully used for assessing the quality of genome assemblies and for detecting large-scale structural variants and rearrangements that cannot be detected using current paired-end sequencing protocols. Here, we review several algorithms and methods for building consensus optical maps and aligning restriction patterns to a reference map, as well as methods for using optical maps with sequence assemblies.
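
Comparing a sequence assembly with an optical map generally starts by converting the sequence into the same representation, an ordered list of restriction fragment lengths. The sketch below shows that in silico digestion step only. The function names are hypothetical, and the default enzyme (EcoRI, recognition site GAATTC, cutting after the first base) is chosen purely for illustration; the review itself does not prescribe this code.

```python
def find_sites(seq, site):
    """Return the 0-based start positions of every occurrence of `site` in `seq`."""
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)   # step by one so overlapping sites are found
    return positions

def in_silico_digest(seq, site="GAATTC", cut_offset=1):
    """Return ordered fragment lengths after cutting `seq` at each recognition site.

    `cut_offset` is the number of bases into the site at which the cut falls
    (1 for the illustrative EcoRI default, G^AATTC).
    """
    cuts = [i + cut_offset for i in find_sites(seq.upper(), site)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

if __name__ == "__main__":
    contig = "AAGAATTCAAGAATTCAA"   # hypothetical toy contig with two EcoRI sites
    print(in_silico_digest(contig))
```

The resulting list of lengths is the kind of pattern that would then be aligned against the fragment sizes measured in an optical map, allowing for sizing error and missed or spurious cuts.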

Journal ArticleDOI
TL;DR: The presented genome annotation extends beyond earlier ones by closing gaps of sequence that were unavoidable with previous low-coverage shotgun genome sequencing and offer an important resource for connecting the rich veterinary and natural history of cats to genome discovery.
Abstract: Domestic cats enjoy an extensive veterinary medical surveillance which has described nearly 250 genetic diseases analogous to human disorders. Feline infectious agents offer powerful natural models of deadly human diseases, which include feline immunodeficiency virus, feline sarcoma virus and feline leukemia virus. A rich veterinary literature of feline disease pathogenesis and the demonstration of a highly conserved ancestral mammal genome organization make the cat genome annotation a highly informative resource that facilitates multifaceted research endeavors. Here we report a preliminary annotation of the whole genome sequence of Cinnamon, a domestic cat living in Columbia (MO, USA), bisulfite sequencing of Boris, a male cat from St. Petersburg (Russia), and light 30× sequencing of Sylvester, a European wildcat progenitor of cat domestication. The annotation includes 21,865 protein-coding genes identified by a comparative approach, 217 loci of endogenous retrovirus-like elements, repetitive elements which comprise about 55.7% of the whole genome, 99,494 new SNVs, 8,355 new indels, 743,326 evolutionary constrained elements, and 3,182 microRNA homologues. The methylation sites study shows that 10.5% of cat genome cytosines are methylated. An assisted assembly of a European wildcat, Felis silvestris silvestris, was performed; variants between F. silvestris and F. catus genomes were derived and compared to F. catus. The presented genome annotation extends beyond earlier ones by closing gaps of sequence that were unavoidable with previous low-coverage shotgun genome sequencing. The assembly and its annotation offer an important resource for connecting the rich veterinary and natural history of cats to genome discovery.

Journal ArticleDOI
TL;DR: Recent crowdfunding efforts to sequence the Azolla genome, a little fern with massive green potential, are described, showing that Crowdfunding is a worthy platform not only for obtaining seed money for exploratory research, but also for engaging directly with the general public as a rewarding form of outreach.
Abstract: Much of science progresses within the tight boundaries of what is often seen as a “black box”. Though familiar to funding agencies, researchers and the academic journals they publish in, it is an entity that outsiders rarely get to peek into. Crowdfunding is a novel means that allows the public to participate in, as well as to support and witness advancements in science. Here we describe our recent crowdfunding efforts to sequence the Azolla genome, a little fern with massive green potential. Crowdfunding is a worthy platform not only for obtaining seed money for exploratory research, but also for engaging directly with the general public as a rewarding form of outreach.

Journal ArticleDOI
TL;DR: mirnaTA is an open-source, bioinformatics tool to aid scientists in identifying differentially expressed miRNAs which could be further mined for biological significance and is expected to provide researchers with a means of interpreting raw data to statistical summaries in a fast and intuitive manner.
Abstract: Background: Understanding the biological roles of microRNAs (miRNAs) is an active area of research that has produced a surge of publications in PubMed, particularly in cancer research. Along with this increasing interest, many open-source bioinformatics tools to identify existing and/or discover novel miRNAs in next-generation sequencing (NGS) reads have become available. While miRNA identification and discovery tools have significantly improved, the development of miRNA differential expression analysis tools, especially in temporal studies, remains substantially challenging. Further, the installation of currently available software is non-trivial, and the steps of testing with example datasets, trying with one’s own dataset, and interpreting the results require notable expertise and time. Consequently, there is a strong need for a tool that allows scientists to normalize raw data, perform statistical analyses, and obtain intuitive results without having to invest significant effort.

Journal ArticleDOI
TL;DR: The presented datasets together with their metadata provide researchers with an opportunity to study the P300 component from different perspectives and can be used for BCI research.
Abstract: The event-related potentials technique is widely used in cognitive neuroscience research. The P300 waveform has been explored in many research articles because of its wide applications, such as lie detection or brain-computer interfaces (BCI). However, very few datasets are publicly available. Therefore, most researchers use only their private datasets for their analysis. This leads to minimally comparable results, particularly in brain-computer interface research. Here we present electroencephalography/event-related potentials (EEG/ERP) data. The data were obtained from 20 healthy subjects and were acquired using an odd-ball hardware stimulator. The visual stimulation was based on a three-stimulus paradigm and included target, non-target and distracter stimuli. The data and collected metadata are shared in the EEG/ERP Portal. The paper also describes the process and validation results of the presented data. The data were validated using two different methods. The first method evaluated the data by measuring the percentage of artifacts. The second method tested if the expectation of the experimental results was fulfilled (i.e., if the target trials contained the P300 component). The validation showed that most datasets were suitable for subsequent analysis. The presented datasets together with their metadata provide researchers with an opportunity to study the P300 component from different perspectives. Furthermore, they can be used for BCI research.

Journal ArticleDOI
TL;DR: This work proposes a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers.
Abstract: Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence, their application to model organisms is often restricted to well-characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but they raise further algorithmic problems and require resources that cannot be satisfied with simple off-the-shelf computers. We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.
The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.

Journal ArticleDOI
TL;DR: The dataset presented here shows that earthworms constitute suitable candidates for μCT scanning in combination with soft tissue staining and is comparable to results derived from traditional dissection techniques, but due to their digital nature the data also permit computer-based interactive exploration of earthworm morphology and anatomy.
Abstract: Background: Although molecular tools are increasingly employed to decipher invertebrate systematics, earthworm (Annelida: Clitellata: ‘Oligochaeta’) taxonomy is still largely based on conventional dissection, resulting in data that are mostly unsuitable for dissemination through online databases. In order to evaluate if micro-computed tomography (μCT) in combination with soft tissue staining techniques could be used to expand the existing set of tools available for studying internal and external structures of earthworms, μCT scans of freshly fixed and museum specimens were gathered. Findings: Scout images revealed full penetration of tissues by the staining agent. The attained isotropic voxel resolutions permit identification of internal and external structures conventionally used in earthworm taxonomy. The μCT projection and reconstruction images have been deposited in the online data repository GigaDB and are publicly available for download. Conclusions: The dataset presented here shows that earthworms constitute suitable candidates for μCT scanning in combination with soft tissue staining. Not only are the data comparable to results derived from traditional dissection techniques, but due to their digital nature the data also permit computer-based interactive exploration of earthworm morphology and anatomy. The approach pursued here can be applied to freshly fixed as well as museum specimens, which is of particular importance when considering the use of rare or valuable material. Finally, a number of aspects related to the deposition of digital morphological data are briefly discussed.

Journal ArticleDOI
TL;DR: The successful beginnings of an international interdisciplinary venture, the Avian Phylogenomics Project that lets us view, through a genomics lens, modern bird species and the evolutionary events that produced them are presented.
Abstract: Everyone loves the birds of the world. From their haunting songs and majesty of flight to dazzling plumage and mating rituals, bird watchers, both amateurs and professionals, have marveled for centuries at their considerable adaptations. Now, we are offered a special treat with the publication of a series of papers in dedicated issues of Science, Genome Biology and GigaScience (which also included pre-publication data release). These present the successful beginnings of an international interdisciplinary venture, the Avian Phylogenomics Project, that lets us view, through a genomics lens, modern bird species and the evolutionary events that produced them.

Journal ArticleDOI
TL;DR: Ten recommendations to ensure the usability, sustainability and practicality of research software are addressed, in particular for young researchers new to programming.
Abstract: Research in the context of data-driven science requires a backbone of well-written software, but scientific researchers are typically not trained at length in software engineering, the principles for creating better software products. To address this gap, in particular for young researchers new to programming, we give ten recommendations to ensure the usability, sustainability and practicality of research software.

Journal ArticleDOI
TL;DR: The availability of elephant genome sequence data from all three elephant species will complement studies of behaviour, genetic diversity, evolution and disease resistance, and are an important addition to the available genetic and genomic information on Asian and African elephants.
Abstract: Background: There are three extant species of elephant: the Asian elephant (Elephas maximus) and two species of African elephant (Loxodonta africana and Loxodonta cyclotis). The populations of all three species are dwindling, and are under threat due to factors such as habitat destruction and ivory hunting. The species differ in many respects, including in their morphology and response to disease. The availability of elephant genome sequence data from all three elephant species will complement studies of behaviour, genetic diversity, evolution and disease resistance. Findings: We present low-coverage Illumina sequence data from two Asian elephants, representing approximately 5X and 2.5X coverage respectively. Both raw and aligned data are available, using the African elephant (L. africana) genome as a reference. Conclusions: The data presented here are an important addition to the available genetic and genomic information on Asian and African elephants.