Showing papers in "Genome Research in 2008"


Journal ArticleDOI
TL;DR: Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies; on real Solexa data without read pairs, its contig lengths were in close agreement with simulated results.
Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.

9,389 citations
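
The Velvet abstract above relies on two ideas that a short example may make concrete: a de Bruijn graph built from k-mers, and the N50 statistic used to report contig lengths. The following is a minimal sketch of the textbook construction, not Velvet's implementation; the function names `de_bruijn_graph` and `n50` are hypothetical, and the sketch ignores reverse complements, sequencing errors, and read pairs.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a textbook de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer seen in a read adds one directed
    edge from its (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L of the contig at which the running total,
    taken from longest to shortest, first reaches half of the summed
    length of all contigs.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    graph = de_bruijn_graph(["ACGTAC", "GTACGG"], k=4)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
    print("N50:", n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Storing successors in a set collapses repeated k-mers into a single edge, which is why the representation stays compact at high coverage; a real assembler would also track k-mer multiplicity.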


Journal ArticleDOI
TL;DR: This work describes MAQ, software that builds assemblies by mapping shotgun short reads to a reference genome and uses quality scores to derive genotype calls for the consensus sequence of a diploid genome, e.g., from a human sample.
Abstract: New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.

2,927 citations
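
The mapping quality introduced in the MAQ abstract above is conventionally reported on the Phred scale, Q = -10 log10(P), where P is the probability that the read is aligned to the wrong position. The sketch below shows only that conversion; it does not reproduce how MAQ estimates P. The function names and the cap of 99 are assumptions made for illustration.

```python
import math


def phred_scaled(p_error, cap=99):
    """Convert an error probability to an integer Phred-scaled quality.

    `cap` is an illustrative upper bound (an assumption of this sketch)
    so that a probability of exactly zero still yields a finite score.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must lie in [0, 1]")
    if p_error == 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_error)))


def error_probability(quality):
    """Invert the Phred scale: quality Q back to probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01, 0.001):
        print(f"P(wrong position) = {p:<6} -> Q = {phred_scaled(p)}")
```

On this scale a quality of 20 corresponds to a 1-in-100 chance that the alignment is wrong, and 30 to 1 in 1000, which makes thresholds easy to state when filtering alignments.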


Journal ArticleDOI
TL;DR: It is found that the Illumina sequencing data are highly replicable, with relatively little technical variation, and thus, for many purposes, it may suffice to sequence each mRNA sample only once (i.e., using one lane).
Abstract: Ultra-high-throughput sequencing is emerging as an attractive alternative to microarrays for genotyping, analysis of methylation patterns, and identification of transcription factor binding sites. Here, we describe an application of the Illumina sequencing (formerly Solexa sequencing) platform to study mRNA expression levels. Our goals were to estimate technical variance associated with Illumina sequencing in this context and to compare its ability to identify differentially expressed genes with existing array technologies. To do so, we estimated gene expression differences between liver and kidney RNA samples using multiple sequencing replicates, and compared the sequencing data to results obtained from Affymetrix arrays using the same RNA samples. We find that the Illumina sequencing data are highly replicable, with relatively little technical variation, and thus, for many purposes, it may suffice to sequence each mRNA sample only once (i.e., using one lane). The information in a single lane of Illumina sequencing data appears comparable to that in a single array in enabling identification of differentially expressed genes, while allowing for additional analyses such as detection of low-expressed genes, alternative splice variants, and novel transcripts. Based on our observations, we propose an empirical protocol and a statistical framework for the analysis of gene expression using ultra-high-throughput sequencing technology.

2,834 citations


Journal ArticleDOI
TL;DR: The results demonstrate that MAKER provides a simple and effective means to convert a genome sequence into a community-accessible genome database, and should prove especially useful for emerging model organism genome projects for which extensive bioinformatics resources may not be readily available.
Abstract: We have developed a portable and easily configurable genome annotation pipeline called MAKER. Its purpose is to allow investigators to independently annotate eukaryotic genomes and create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER is also easily trainable: Outputs of preliminary runs are used to automatically retrain its gene-prediction algorithm, producing higher-quality gene-models on subsequent runs. MAKER’s inputs are minimal, and its outputs can be used to create a GMOD database. Its outputs can also be viewed in the Apollo Genome browser; this feature of MAKER provides an easy means to annotate, view, and edit individual contigs and BACs without the overhead of a database. As proof of principle, we have used MAKER to annotate the genome of the planarian Schmidtea mediterranea and to create a new genome database, SmedGD. We have also compared MAKER’s performance to other published annotation pipelines. Our results demonstrate that MAKER provides a simple and effective means to convert a genome sequence into a community-accessible genome database. MAKER should prove especially useful for emerging model organism genome projects for which extensive bioinformatics resources may not be readily available.

1,503 citations


Journal ArticleDOI
TL;DR: Application of this approach to RNA from human embryonic stem cells obtained before and after their differentiation into embryoid bodies revealed the sequences and expression levels of 334 known plus 104 novel miRNA genes, representing the deepest miRNA sampling to date.
Abstract: MicroRNAs (miRNAs) are emerging as important, albeit poorly characterized, regulators of biological processes. Key to further elucidation of their roles is the generation of more complete lists of their numbers and expression changes in different cell states. Here, we report a new method for surveying the expression of small RNAs, including microRNAs, using Illumina sequencing technology. We also present a set of methods for annotating sequences deriving from known miRNAs, identifying variability in mature miRNA sequences, and identifying sequences belonging to previously unidentified miRNA genes. Application of this approach to RNA from human embryonic stem cells obtained before and after their differentiation into embryoid bodies revealed the sequences and expression levels of 334 known plus 104 novel miRNA genes. One hundred seventy-one known and 23 novel microRNA sequences exhibited significant expression differences between these two developmental states. Owing to the increased number of sequence reads, these libraries represent the deepest miRNA sampling to date, spanning nearly six orders of magnitude of expression. The predicted targets of those miRNAs enriched in either sample shared common features. Included among the high-ranked predicted gene targets are those implicated in differentiation, cell cycle control, programmed cell death, and transcriptional regulation.

1,102 citations


Journal ArticleDOI
TL;DR: A general method for genome assembly is described that can be applied to all types of DNA sequence data, including both short reads and conventional sequence reads.
Abstract: New DNA sequencing technologies deliver data at dramatically lower costs but demand new analytical methods to take full advantage of the very short reads that they produce. We provide an initial, theoretical solution to the challenge of de novo assembly from whole-genome shotgun “microreads.” For 11 genomes of sizes up to 39 Mb, we generated high-quality assemblies from 80× coverage by paired 30-base simulated reads modeled after real Illumina-Solexa reads. The bacterial genomes of Campylobacter jejuni and Escherichia coli assemble optimally, yielding single perfect contigs, and larger genomes yield assemblies that are highly connected and accurate. Assemblies are presented in a graph form that retains intrinsic ambiguities such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. For both C. jejuni and E. coli, this assembly graph is a single edge encompassing the entire genome. Larger genomes produce more complicated graphs, but the vast majority of the bases in their assemblies are present in long edges that are nearly always perfect. We describe a general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads.

880 citations


Journal ArticleDOI
TL;DR: This study of healthy human skin microbiota will serve to direct future research addressing the role of skin microbiota in health and disease, and metagenomic projects addressing the complex physiological interactions between the skin and the microbes that inhabit this environment.
Abstract: The many layers and structures of the skin serve as elaborate hosts to microbes, including a diversity of commensal and pathogenic bacteria that contribute to both human health and disease. To determine the complexity and identity of the microbes inhabiting the skin, we sequenced bacterial 16S small-subunit ribosomal RNA genes isolated from the inner elbow of five healthy human subjects. This analysis revealed 113 operational taxonomic units (OTUs; "phylotypes") at the level of 97% similarity that belong to six bacterial divisions. To survey all depths of the skin, we sampled using three methods: swab, scrape, and punch biopsy. Proteobacteria dominated the skin microbiota at all depths of sampling. Interpersonal variation is approximately equal to intrapersonal variation when considering bacterial community membership and structure. Finally, we report strong similarities in the complexity and identity of mouse and human skin microbiota. This study of healthy human skin microbiota will serve to direct future research addressing the role of skin microbiota in health and disease, and metagenomic projects addressing the complex physiological interactions between the skin and the microbes that inhabit this environment.

853 citations


Journal ArticleDOI
TL;DR: Major changes in the topology of the parsimony tree are described and names for new and rearranged lineages within the tree following the rules presented by the Y Chromosome Consortium in 2002 are provided.
Abstract: Markers on the non-recombining portion of the human Y chromosome continue to have applications in many fields including evolutionary biology, forensics, medical genetics, and genealogical reconstruction. In 2002, the Y Chromosome Consortium published a single parsimony tree showing the relationships among 153 haplogroups based on 243 binary markers and devised a standardized nomenclature system to name lineages nested within this tree. Here we present an extensively revised Y chromosome tree containing 311 distinct haplogroups, including two new major haplogroups (S and T), and incorporating approximately 600 binary markers. We describe major changes in the topology of the parsimony tree and provide names for new and rearranged lineages within the tree following the rules presented by the Y Chromosome Consortium in 2002. Several changes in the tree topology have important implications for studies of human ancestry. We also present demography-independent age estimates for 11 of the major clades in the new Y chromosome tree.

831 citations


Journal ArticleDOI
TL;DR: Promising applications of protein networks to disease in four major areas are reviewed: identifying new disease genes; the study of their network properties; identifying disease-related subnetworks; and network-based disease classification.
Abstract: During a decade of proof-of-principle analysis in model organisms, protein networks have been used to further the study of molecular evolution, to gain insight into the robustness of cells to perturbation, and for assignment of new protein functions. Following these analyses, and with the recent rise of protein interaction measurements in mammals, protein networks are increasingly serving as tools to unravel the molecular basis of disease. We review promising applications of protein networks to disease in four major areas: identifying new disease genes; the study of their network properties; identifying disease-related subnetworks; and network-based disease classification. Applications in infectious disease, personalized medicine, and pharmacology are also forthcoming as the available protein network information improves in quality and coverage.

800 citations


Journal ArticleDOI
TL;DR: GeneMark-ES version 2 is a new ab initio algorithm that identifies protein-coding genes in fungal genomes without requiring a predetermined training set to estimate the parameters of the underlying hidden Markov model (HMM).
Abstract: We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the algorithm accuracy, both at the exon and the whole gene level, is favorably compared to the accuracy of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.

737 citations


Journal ArticleDOI
TL;DR: The data indicate that long ncRNAs are likely to be important in processes directing pluripotency and alternative differentiation programs, in some cases through engagement of the epigenetic machinery.
Abstract: The transcriptional networks that regulate embryonic stem (ES) cell pluripotency and lineage specification are the subject of considerable attention. To date such studies have focused almost exclusively on protein-coding transcripts. However, recent transcriptome analyses show that the mammalian genome contains thousands of long noncoding RNAs (ncRNAs), many of which appear to be expressed in a developmentally regulated manner. The functions of these remain untested. To identify ncRNAs involved in ES cell biology, we used a custom-designed microarray to examine the expression profiles of mouse ES cells differentiating as embryoid bodies (EBs) over a 16-d time course. We identified 945 ncRNAs expressed during EB differentiation, of which 174 were differentially expressed, many correlating with pluripotency or specific differentiation events. Candidate ncRNAs were identified for further characterization by an integrated examination of expression profiles, genomic context, chromatin state, and promoter analysis. Many ncRNAs showed coordinated expression with genomically associated developmental genes, such as Dlx1, Dlx4, Gata6, and Ecsit. We examined two novel developmentally regulated ncRNAs, Evx1as and Hoxb5/6as, which are derived from homeotic loci and share similar expression patterns and localization in mouse embryos with their associated protein-coding genes. Using chromatin immunoprecipitation, we provide evidence that both ncRNAs are associated with trimethylated H3K4 histones and histone methyltransferase MLL1, suggesting a role in epigenetic regulation of homeotic loci during ES cell differentiation. Taken together, our data indicate that long ncRNAs are likely to be important in processes directing pluripotency and alternative differentiation programs, in some cases through engagement of the epigenetic machinery.

Journal ArticleDOI
TL;DR: Evidence is presented that the organization of nucleosomes throughout genes is largely a consequence of statistical packing principles, and a high-resolution genome-wide map of TFIIB locations that implicates 3' NFRs in gene looping is presented.
Abstract: Most nucleosomes are well-organized at the 5' ends of S. cerevisiae genes where "-1" and "+1" nucleosomes bracket a nucleosome-free promoter region (NFR). How nucleosomal organization is specified by the genome is less clear. Here we establish and inter-relate rules governing genomic nucleosome organization by sequencing DNA from more than one million immunopurified S. cerevisiae nucleosomes (displayed at http://atlas.bx.psu.edu/). Evidence is presented that the organization of nucleosomes throughout genes is largely a consequence of statistical packing principles. The genomic sequence specifies the location of the -1 and +1 nucleosomes. The +1 nucleosome forms a barrier against which nucleosomes are packed, resulting in uniform positioning, which decays at farther distances from the barrier. We present evidence for a novel 3' NFR that is present at >95% of all genes. 3' NFRs may be important for transcription termination and anti-sense initiation. We present a high-resolution genome-wide map of TFIIB locations that implicates 3' NFRs in gene looping.

Journal ArticleDOI
TL;DR: This study proposes de novo assembly software for the Illumina sequencing platform, which produces millions of very short sequences that are 35 bases in length; the software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome.
Abstract: Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome during a single experiment and at a moderate cost. However, the increase in sequencing throughput that is allowed by using such platforms is obtained at the expense of individual sequence read length, which must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform that produces millions of very short sequences that are 35 bases in length. We propose a de novo assembler software that is dedicated to process such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets that were obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with that of their published genomes acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced.

Journal ArticleDOI
TL;DR: These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution and release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.
Abstract: Using the massively parallel technique of sequencing by oligonucleotide ligation and detection (SOLiD; Applied Biosystems), we have assessed the in vivo positions of more than 44 million putative nucleosome cores in the multicellular genetic model organism Caenorhabditis elegans. These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution. While we observe some degree of reproducible positioning throughout the genome in our mixed stage population of animals, we note that the major chromatin feature in the worm is a diversity of allowed nucleosome positions at the vast majority of individual loci. While absolute positioning of nucleosomes can vary substantially, relative positioning of nucleosomes (in a repeated array structure likely to be maintained at least in part by steric constraints) appears to be a significant property of chromatin structure. The high density of nucleosomal reads enabled a substantial extension of previous analysis describing the usage of individual oligonucleotide sequences along the span of the nucleosome core and linker. We release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.

Journal ArticleDOI
TL;DR: It is shown that a shared ancient hexaploidy event (or perhaps two roughly concurrent genome fusions) can be inferred based on the sequences from several divergent plant genomes, laying the foundation for approximating the number and arrangement of genes in the last universal common ancestor of angiosperms.
Abstract: Large-scale (segmental or whole) genome duplication has been recurring in angiosperm evolution. Subsequent gene loss and rearrangements further affect gene copy numbers and fractionate ancestral gene linkages across multiple chromosomes. The fragmented "multiple-to-multiple" correspondences resulting from this distinguishing feature of angiosperm evolution complicate comparative genomic studies. Using a robust computational framework that combines information from multiple orthologous and duplicated regions to construct local syntenic networks, we show that a shared ancient hexaploidy event (or perhaps two roughly concurrent genome fusions) can be inferred based on the sequences from several divergent plant genomes. This "paleo-hexaploidy" clearly preceded the rosid-asterid split, but it remains equivocal whether it also affected monocots. The model resulting from our multi-alignments lays the foundation for approximating the number and arrangement of genes in the last universal common ancestor of angiosperms. Comparative analysis of inferred homologous genes derived from this model shows patterns of preferential gene retention or loss after polyploidy and reveals large variability of nucleotide substitution rates among plant nuclear genomes.

Journal ArticleDOI
TL;DR: It is established that these repeat-associated binding sites (RABS) have been associated with significant regulatory expansions throughout the mammalian phylogeny and that transposable elements play an important role in expanding the repertoire of binding sites.
Abstract: Identification of lineage-specific innovations in genomic control elements is critical for understanding transcriptional regulatory networks and phenotypic heterogeneity. We analyzed, from an evolutionary perspective, the binding regions of seven mammalian transcription factors (ESR1, TP53, MYC, RELA, POU5F1, SOX2, and CTCF) identified on a genome-wide scale by different chromatin immunoprecipitation approaches and found that only a minority of sites appear to be conserved at the sequence level. Instead, we uncovered a pervasive association with genomic repeats by showing that a large fraction of the bona fide binding sites for five of the seven transcription factors (ESR1, TP53, POU5F1, SOX2, and CTCF) are embedded in distinctive families of transposable elements. Using the age of the repeats, we established that these repeat-associated binding sites (RABS) have been associated with significant regulatory expansions throughout the mammalian phylogeny. We validated the functional significance of these RABS by showing that they are over-represented in proximity of regulated genes and that the binding motifs within these repeats have undergone evolutionary selection. Our results demonstrate that transcriptional regulatory networks are highly dynamic in eukaryotic genomes and that transposable elements play an important role in expanding the repertoire of binding sites.

Journal ArticleDOI
TL;DR: The Velvet assembler was incorporated into a targeted de novo assembly method and yielded 10,921 high-confidence contigs that were anchored to flanking sequences and harbored indels as large as 641 bp, and the methods are broadly applicable for polymorphism discovery in moderate to large genomes even at highly diverged loci.
Abstract: Whole-genome hybridization studies have suggested that the nuclear genomes of accessions (natural strains) of Arabidopsis thaliana can differ by several percent of their sequence. To examine this variation, and as a first step in the 1001 Genomes Project for this species, we produced 15- to 25-fold coverage in Illumina sequencing-by-synthesis (SBS) reads for the reference accession, Col-0, and two divergent strains, Bur-0 and Tsu-1. We aligned reads to the reference genome sequence to assess data quality metrics and to detect polymorphisms. Alignments revealed 823,325 unique single nucleotide polymorphisms (SNPs) and 79,961 unique 1- to 3-bp indels in the divergent accessions at a specificity of >99%, and over 2000 potential errors in the reference genome sequence. We also identified >3.4 Mb of the Bur-0 and Tsu-1 genomes as being either extremely dissimilar, deleted, or duplicated relative to the reference genome. To obtain sequences for these regions, we incorporated the Velvet assembler into a targeted de novo assembly method. This approach yielded 10,921 high-confidence contigs that were anchored to flanking sequences and harbored indels as large as 641 bp. Our methods are broadly applicable for polymorphism discovery in moderate to large genomes even at highly diverged loci, and we established by subsampling the Illumina SBS coverage depth required to inform a broad range of functional and evolutionary studies. Our pipeline for aligning reads and predicting SNPs and indels, SHORE, is available for download at http://1001genomes.org.

Journal ArticleDOI
TL;DR: The genome of the M strain of M. marinum comprises a 6,636,827-bp circular chromosome with 5424 CDS, 10 prophages, and a 23-kb mercury-resistance plasmid.
Abstract: Mycobacterium marinum, a ubiquitous pathogen of fish and amphibia, is a near relative of Mycobacterium tuberculosis, the etiologic agent of tuberculosis in humans. The genome of the M strain of M. marinum comprises a 6,636,827-bp circular chromosome with 5424 CDS, 10 prophages, and a 23-kb mercury-resistance plasmid. Prominent features are the very large number of genes (57) encoding polyketide synthases (PKSs) and nonribosomal peptide synthases (NRPSs) and the most extensive repertoire yet reported of the mycobacteria-restricted PE and PPE proteins, and related-ESX secretion systems. Some of the NRPS genes comprise a novel family and seem to have been acquired horizontally. M. marinum is used widely as a model organism to study M. tuberculosis pathogenesis, and genome comparisons confirmed the close genetic relationship between these two species, as they share 3000 orthologs with an average amino acid identity of 85%. Comparisons with the more distantly related Mycobacterium avium subspecies paratuberculosis and Mycobacterium smegmatis reveal how an ancestral generalist mycobacterium evolved into M. tuberculosis and M. marinum. M. tuberculosis has undergone genome downsizing and extensive lateral gene transfer to become a specialized pathogen of humans and other primates without retaining an environmental niche. M. marinum has maintained a large genome so as to retain the capacity for environmental survival while becoming a broad host range pathogen that produces disease strikingly similar to M. tuberculosis. The work described herein provides a foundation for using M. marinum to better understand the determinants of pathogenesis of tuberculosis.

Journal ArticleDOI
TL;DR: The results indicate that the amphioxus genome is elemental to an understanding of the biology and evolution of nonchordate deuterostomes, invertebrate chordates, and vertebrates.
Abstract: Cephalochordates, urochordates, and vertebrates evolved from a common ancestor over 520 million years ago. To improve our understanding of chordate evolution and the origin of vertebrates, we intensively searched for particular genes, gene families, and conserved noncoding elements in the sequenced genome of the cephalochordate Branchiostoma floridae, commonly called amphioxus or lancelets. Special attention was given to homeobox genes, opsin genes, genes involved in neural crest development, nuclear receptor genes, genes encoding components of the endocrine and immune systems, and conserved cis-regulatory enhancers. The amphioxus genome contains a basic set of chordate genes involved in development and cell signaling, including a fifteenth Hox gene. This set includes many genes that were co-opted in vertebrates for new roles in neural crest development and adaptive immunity. However, where amphioxus has a single gene, vertebrates often have two, three, or four paralogs derived from two whole-genome duplication events. In addition, several transcriptional enhancers are conserved between amphioxus and vertebrates--a very wide phylogenetic distance. In contrast, urochordate genomes have lost many genes, including a diversity of homeobox families and genes involved in steroid hormone function. The amphioxus genome also exhibits derived features, including duplications of opsins and genes proposed to function in innate immunity and endocrine systems. Our results indicate that the amphioxus genome is elemental to an understanding of the biology and evolution of nonchordate deuterostomes, invertebrate chordates, and vertebrates.

Journal ArticleDOI
TL;DR: A new Eulerian assembler is presented that generates nearly optimal short read assemblies of bacterial genomes, along with an approach for assembling reads under the popular hybrid protocol in which short reads and long Sanger-based reads are combined.
Abstract: In the last year, high-throughput sequencing technologies have progressed from proof-of-concept to production quality. While these methods produce high-quality reads, they have yet to produce reads comparable in length to Sanger-based sequencing. Current fragment assembly algorithms have been implemented and optimized for mate-paired Sanger-based reads, and thus do not perform well on short reads produced by short read technologies. We present a new Eulerian assembler that generates nearly optimal short read assemblies of bacterial genomes and describe an approach to assemble reads in the case of the popular hybrid protocol when short and long Sanger-based reads are combined.
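
The "Eulerian" in the abstract above refers to treating assembly as finding a path that uses every edge of a de Bruijn graph exactly once. The sketch below illustrates that idea with Hierholzer's algorithm on error-free toy reads. It is a textbook illustration under strong assumptions (no sequencing errors, no repeated (k-1)-mers, a single strand, non-empty input) and is not the assembler described in the paper; the names `eulerian_path` and `assemble` are hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a node sequence that traverses every directed edge once.

    Uses Hierholzer's algorithm and assumes `edges` is non-empty and
    that an Eulerian path exists.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for src, dst in edges:
        out_edges[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # Start at the node with one surplus outgoing edge if there is one;
    # otherwise the graph has an Eulerian cycle and any node will do.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in path
    return path[::-1]


def assemble(reads, k):
    """Spell a sequence from the Eulerian path over the reads' k-mers."""
    edges = [
        (read[i:i + k - 1], read[i + 1:i + k])
        for read in reads
        for i in range(len(read) - k + 1)
    ]
    edges = list(dict.fromkeys(edges))  # keep each distinct k-mer once
    path = eulerian_path(edges)
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ACGTT", "GTTCA"], k=3))
```

Collapsing duplicate k-mers is what lets overlapping reads merge into one path; it is also why this toy version cannot resolve repeats, which practical assemblers handle with multiplicity estimates and paired reads.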

Journal ArticleDOI
TL;DR: This study uses high-throughput pyrosequencing to identify conserved and nonconserved miRNAs and other short RNAs in tomato fruit and leaf and raises the possibility that fruit development and ripening may be under miRNA regulation.
Abstract: In plants there are several classes of 21–24-nt short RNAs that regulate gene expression. The most conserved class is the microRNAs (miRNAs), although some miRNAs are found only in specific species. We used high-throughput pyrosequencing to identify conserved and nonconserved miRNAs and other short RNAs in tomato fruit and leaf. Several conserved miRNAs showed tissue-specific expression, which, combined with target gene validation results, suggests that miRNAs may play a role in fleshy fruit development. We also identified four new nonconserved miRNAs. One of the validated targets of a novel miRNA is a member of the CTR family involved in fruit ripening. However, 62 predicted targets showing near perfect complementarity to potential new miRNAs did not validate experimentally. This suggests that target prediction of plant short RNAs could have a high false-positive rate and must therefore be validated experimentally. We also found short RNAs from a Solanaceae-specific foldback transposon, which showed a miRNA/miRNA*-like distribution, suggesting that this element may function as a miRNA gene progenitor. The other Solanaceae-specific class of short RNA was derived from an endogenous pararetrovirus sequence inserted into the tomato chromosomes. This study opens a new avenue in the field of fleshy fruit biology by raising the possibility that fruit development and ripening may be under miRNA regulation.

Journal ArticleDOI
TL;DR: This study found that with an original array design strategy using tiling arrays and statistical procedures that average information from neighboring genomic locations, much improved specificity and sensitivity could be achieved, e.g., approximately 100% sensitivity at 90% specificity with McrBC.
Abstract: This study was originally conceived to test in a rigorous way the specificity of three major approaches to high-throughput array-based DNA methylation analysis: (1) MeDIP, or methylated DNA immunoprecipitation, an example of antibody-mediated methyl-specific fractionation; (2) HELP, or HpaII tiny fragment enrichment by ligation-mediated PCR, an example of differential amplification of methylated DNA; and (3) fractionation by McrBC, an enzyme that cuts most methylated DNA. These results were validated using 1466 Illumina methylation probes on the GoldenGate methylation assay and further resolved discrepancies among the methods through quantitative methylation pyrosequencing analysis. While all three methods provide useful information, there were significant limitations to each, specifically bias toward CpG islands in MeDIP, relatively incomplete coverage in HELP, and location imprecision in McrBC. However, we found that with an original array design strategy using tiling arrays and statistical procedures that average information from neighboring genomic locations, much improved specificity and sensitivity could be achieved, e.g., approximately 100% sensitivity at 90% specificity with McrBC. We term this approach "comprehensive high-throughput arrays for relative methylation" (CHARM). While this approach was applied to McrBC analysis, the array design and computational algorithms are fractionation method-independent and make this a simple, general, relatively inexpensive tool suitable for genome-wide analysis, and in which individual samples can be assayed reliably at very high density, allowing locus-level genome-wide epigenetic discrimination of individuals, not just groups of samples. Furthermore, unlike the other approaches, CHARM is highly quantitative, a substantial advantage in application to the study of human disease.
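The abstract mentions statistical procedures that average information from neighboring genomic locations. A minimal sketch of that general idea, assuming a simple unweighted mean over all probes within a fixed genomic distance, is shown below. The function name `smooth_by_neighbors`, the window size, and the example values are hypothetical; the actual CHARM procedure may weight or model the data differently.

```python
def smooth_by_neighbors(positions, values, half_window):
    """Replace each probe value with the mean of all probes within half_window bp.

    Assumes positions are sorted in ascending order and lie on one chromosome.
    """
    smoothed = []
    lo = 0       # index of the first probe inside the current window
    hi = 0       # index one past the last probe inside the current window
    total = 0.0  # running sum of values[lo:hi]
    n = len(positions)
    for pos in positions:
        # Extend the right edge to include probes up to pos + half_window.
        while hi < n and positions[hi] <= pos + half_window:
            total += values[hi]
            hi += 1
        # Shrink the left edge to drop probes before pos - half_window.
        while positions[lo] < pos - half_window:
            total -= values[lo]
            lo += 1
        smoothed.append(total / (hi - lo))
    return smoothed


# Hypothetical probe positions (bp) and raw log-ratios.
positions = [0, 50, 100, 1000]
values = [1.0, 2.0, 3.0, 10.0]
smoothed = smooth_by_neighbors(positions, values, half_window=60)
```

Averaging over neighbors trades positional resolution for lower noise, which is why an isolated probe (the one at 1000 bp here) keeps its raw value.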

Journal ArticleDOI
TL;DR: The connection between patterns of nucleosome occupancy and the capacity to modulate gene expression upon changing conditions, i.e., transcriptional plasticity, is examined and two distinct strategies for gene regulation by chromatin are suggested, which are selectively employed by different genes.
Abstract: Chromatin structure is central for the regulation of gene expression, but its genome-wide organization is only beginning to be understood. Here, we examine the connection between patterns of nucleosome occupancy and the capacity to modulate gene expression upon changing conditions, i.e., transcriptional plasticity. By analyzing genome-wide data of nucleosome positioning in yeast, we find that the presence of nucleosomes close to the transcription start site is associated with high transcriptional plasticity, while nucleosomes at more distant upstream positions are negatively correlated with transcriptional plasticity. Based on this, we identify two typical promoter structures associated with low or high plasticity, respectively. The first class is characterized by a relatively large nucleosome-free region close to the start site coupled with well-positioned nucleosomes further upstream, whereas the second class displays a more evenly distributed and dynamic nucleosome positioning, with high occupancy close to the start site. The two classes are further distinguished by multiple promoter features, including histone turnover, binding site locations, H2A.Z occupancy, expression noise, and expression diversity. Analysis of nucleosome positioning in human promoters reproduces the main observations. Our results suggest two distinct strategies for gene regulation by chromatin, which are selectively employed by different genes.

Journal ArticleDOI
TL;DR: Genome comparisons between these and other Salmonella isolates indicate that S. Gallinarum 287/91 is a recently evolved descendent of S. Enteritidis, and it is proposed that experimental analysis in chickens and mice could provide an experimentally tractable route toward unraveling the genetic basis of host adaptation in S. enterica.
Abstract: We have determined the complete genome sequences of a host-promiscuous Salmonella enterica serovar Enteritidis PT4 isolate P125109 and a chicken-restricted Salmonella enterica serovar Gallinarum isolate 287/91. Genome comparisons between these and other Salmonella isolates indicate that S. Gallinarum 287/91 is a recently evolved descendent of S. Enteritidis. Significantly, the genome of S. Gallinarum has undergone extensive degradation through deletion and pseudogene formation. Comparison of the pseudogenes in S. Gallinarum with those identified previously in other host-adapted bacteria reveals the loss of many common functional traits and provides insights into possible mechanisms of host and tissue adaptation. We propose that experimental analysis in chickens and mice of S. Enteritidis-harboring mutations in functional homologs of the pseudogenes present in S. Gallinarum could provide an experimentally tractable route toward unraveling the genetic basis of host adaptation in S. enterica.

Journal ArticleDOI
TL;DR: The utility and implications of the findings with respect to the regulatory potential of regions with varied CpG density, gene expression, transcription factor motifs, gene ontology, and correlation with other epigenetic marks such as histone modifications are discussed.
Abstract: We report a novel resource (methylation profiles of DNA, or mPod) for human genome-wide tissue-specific DNA methylation profiles. mPod consists of three fully integrated parts: genome-wide DNA methylation reference profiles of 13 normal somatic tissues, placenta, sperm, and an immortalized cell line; a visualization tool that has been integrated with the Ensembl genome browser; and a new algorithm for the analysis of immunoprecipitation-based DNA methylation profiles. We demonstrate the utility of our resource by identifying the first comprehensive genome-wide set of tissue-specific differentially methylated regions (tDMRs) that may play a role in cellular identity and the regulation of tissue-specific genome function. We also discuss the implications of our findings with respect to the regulatory potential of regions with varied CpG density, gene expression, transcription factor motifs, gene ontology, and correlation with other epigenetic marks such as histone modifications.

Journal ArticleDOI
TL;DR: A putative mirtron is identified, indicating that plants may also use spliced introns as a source of miRNAs, and a miRNA-like long hairpin is identified that generates phased 21 nt small RNAs, strongly expressed in developing grains; these small RNAs are shown to act in trans to cleave target mRNAs.
Abstract: Endogenous small RNAs, including microRNAs (miRNAs) and short-interfering RNAs (siRNAs), function as post-transcriptional or transcriptional regulators in plants. miRNA function is essential for normal plant development and therefore is likely to be important in the growth of the rice grain. To investigate the roles of miRNAs in rice grain development, we carried out deep sequencing of the small RNA populations of rice grains at two developmental stages. In a data set of ∼5.5 million sequences, we found representatives of all 20 conserved plant miRNA families. We used an approach based on the presence of miRNA and miRNA* sequences to identify 39 novel, nonconserved rice miRNA families expressed in grains. Cleavage of predicted target mRNAs was confirmed for a number of the new miRNAs. We identified a putative mirtron, indicating that plants may also use spliced introns as a source of miRNAs. We also identified a miRNA-like long hairpin that generates phased 21 nt small RNAs, strongly expressed in developing grains, and show that these small RNAs act in trans to cleave target mRNAs. Comparison of the population of miRNAs and miRNA-like siRNAs in grains to those in other parts of the rice plant reveals that many are expressed in an organ-specific manner.

Journal ArticleDOI
TL;DR: A strong correlation is found between ²H₂O incorporation into islet DNA in vivo and the expression pattern of the cell cycle module, and the pattern is highly correlated with that of several individual genes in insulin target tissues, including Igf2, which has been shown to promote beta-cell proliferation.
Abstract: Insulin resistance is necessary but not sufficient for the development of type 2 diabetes. Diabetes results when pancreatic beta-cells fail to compensate for insulin resistance by increasing insulin production through an expansion of beta-cell mass or increased insulin secretion. Communication between insulin target tissues and beta-cells may initiate this compensatory response. Correlated changes in gene expression between tissues can provide evidence for such intercellular communication. We profiled gene expression in six tissues of mice from an obesity-induced diabetes-resistant and a diabetes-susceptible strain before and after the onset of diabetes. We studied the correlation structure of mRNA abundance and identified 105 co-expression gene modules. We provide an interactive gene network model showing the correlation structure between the expression modules within and among the six tissues. This resource also provides a searchable database of gene expression profiles for all genes in six tissues in lean and obese diabetes-resistant and diabetes-susceptible mice, at 4 and 10 wk of age. A cell cycle regulatory module in islets predicts diabetes susceptibility. The module predicts islet replication; we found a strong correlation between ²H₂O incorporation into islet DNA in vivo and the expression pattern of the cell cycle module. This pattern is highly correlated with that of several individual genes in insulin target tissues, including Igf2, which has been shown to promote beta-cell proliferation, suggesting that these genes may provide a link between insulin resistance and beta-cell proliferation.

Journal ArticleDOI
TL;DR: The first comprehensive genomic survey of the immune gene repertoire of the amphioxus Branchiostoma floridae suggests that the amphioxus, a species without vertebrate-type adaptive immunity, holds extraordinary innate complexity and diversity.
Abstract: It has been speculated that before vertebrates evolved somatic diversity-based adaptive immunity, the germline-encoded diversity of innate immunity may have been more developed. Amphioxus occupies the basal position of the chordate phylum and hence is an important reference to the evolution of vertebrate immunity. Here we report the first comprehensive genomic survey of the immune gene repertoire of the amphioxus Branchiostoma floridae. It has been reported that the purple sea urchin has a vastly expanded innate receptor repertoire not previously seen in other species, which includes 222 toll-like receptors (TLRs), 203 NOD/NALP-like receptors (NLRs), and 218 scavenger receptors (SRs). We discovered that the amphioxus genome contains comparable expansion with 71 TLR gene models, 118 NLR models, and 270 SR models. Amphioxus also expands other receptor-like families, including 1215 C-type lectin models, 240 LRR and IGcam-containing models, 1363 other LRR-containing models, 75 C1q-like models, 98 ficolin-like models, and hundreds of models containing complement-related domains. The expansion is not restricted to receptors but is likely to extend to intermediate signal transducers because there are 58 TIR adapter-like models, 36 TRAF models, 44 initiator caspase models, and 541 death-fold domain-containing models in the genome. Amphioxus also has a sophisticated TNF system and a complicated complement system not previously seen in other invertebrates. Besides the increase of gene number, domain combinations of immune proteins are also increased. Altogether, this survey suggests that the amphioxus, a species without vertebrate-type adaptive immunity, holds extraordinary innate complexity and diversity.

Journal ArticleDOI
TL;DR: The authors' analysis of Tribolium indicates that, during insect evolution, genes for neuropeptides and protein hormones are often duplicated or lost.
Abstract: Neuropeptides and protein hormones are ancient molecules that mediate cell-to-cell communication. The whole genome sequence from the red flour beetle Tribolium castaneum, along with those from other insect species, provides an opportunity to study the evolution of the genes encoding neuropeptide and protein hormones. We identified 41 of these genes in the Tribolium genome by using a combination of bioinformatic and peptidomic approaches. These genes encode >80 mature neuropeptides and protein hormones, 49 peptides of which were experimentally identified by peptidomics of the central nervous system and other neuroendocrine organs. Twenty-three genes have orthologs in Drosophila melanogaster: Sixteen genes in five different groups are likely the result of recent gene expansions during beetle evolution. These five groups contain peptides related to antidiuretic factor-b (ADF-b), CRF-like diuretic hormone (DH37 and DH47 of Tribolium), adipokinetic hormone (AKH), eclosion hormone, and insulin-like peptide. In addition, we found a gene encoding an arginine-vasopressin-like (AVPL) peptide and one for its receptor. Both genes occur only in Tribolium and not in other holometabolous insects with a sequenced genome. The presence of many additional osmoregulatory peptides in Tribolium agrees well with its ability to live in very dry surroundings. In contrast to these extra genes, there are at least nine neuropeptide genes missing in Tribolium, including the genes encoding the prepropeptides for corazonin, kinin, and allatostatin-A. The cognate receptor genes for these three peptides also appear to be absent in the Tribolium genome. Our analysis of Tribolium indicates that, during insect evolution, genes for neuropeptides and protein hormones are often duplicated or lost.

Journal ArticleDOI
TL;DR: This work develops a SNP detection method, with variants for low coverage, high coverage, and PCR amplicon applications, evaluates it on known-truth data sets, and demonstrates good specificity in single reads and excellent specificity in high-coverage data.
Abstract: Promising new sequencing technologies, based on sequencing-by-synthesis (SBS), are starting to deliver large amounts of DNA sequence at very low cost. Polymorphism detection is a key application. We describe general methods for improved quality scores and accurate automated polymorphism detection, and apply them to data from the Roche (454) Genome Sequencer 20. We assess our methods using known-truth data sets, which is critical to the validity of the assessments. We developed informative, base-by-base error predictors for this sequencer and used a variant of the phred binning algorithm to combine them into a single empirically derived quality score. These quality scores are more useful than those produced by the system software: They both better predict actual error rates and identify many more high-quality bases. We developed a SNP detection method, with variants for low coverage, high coverage, and PCR amplicon applications, and evaluated it on known-truth data sets. We demonstrate good specificity in single reads, and excellent specificity (no false positives in 215 kb of genome) in high-coverage data.
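For readers unfamiliar with phred-style quality scores, the sketch below shows the standard relationship between a quality score Q and an error probability p, Q = -10 log10(p), and how an empirical score could be derived for a bin of bases with a known error count. The function names and the add-one smoothing in `empirical_quality` are assumptions for illustration, not the binning algorithm used in the paper.

```python
import math


def phred_from_error_prob(p):
    """Standard phred scale: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse of the phred scale: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def empirical_quality(n_errors, n_bases):
    """Empirical quality for a bin of bases scored against a known-truth reference.

    Add-one smoothing is an assumption made here so that a bin with zero
    observed errors does not produce log10(0).
    """
    return phred_from_error_prob((n_errors + 1) / (n_bases + 1))


q30 = phred_from_error_prob(0.001)    # an error rate of 1 in 1000
p20 = error_prob_from_phred(20)       # Q20 corresponds to 1 error in 100
q_bin = empirical_quality(9, 9999)    # hypothetical bin: 9 errors in 9999 bases
```

Deriving scores from observed error counts, rather than from a model alone, is what makes a quality score "empirical" in the sense the abstract uses.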