
Showing papers in "Genome Research in 2009"


Journal ArticleDOI
TL;DR: Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements.
Abstract: We created a visualization tool called Circos to facilitate the identification and analysis of similarities and differences arising from comparisons of genomes. Our tool is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies. Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line, and histogram plots, heat maps, tiles, connectors, and text. Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.

8,315 citations


Journal ArticleDOI
TL;DR: This work overhauled its tool for finding preferential conservation of sequence motifs and applied it to the analysis of human 3'UTRs, increasing by nearly threefold the detected number of preferentially conserved miRNA target sites.
Abstract: MicroRNAs (miRNAs) are small endogenous RNAs that pair to sites in mRNAs to direct post-transcriptional repression. Many sites that match the miRNA seed (nucleotides 2–7), particularly those in 3′ untranslated regions (3′UTRs), are preferentially conserved. Here, we overhauled our tool for finding preferential conservation of sequence motifs and applied it to the analysis of human 3′UTRs, increasing by nearly threefold the detected number of preferentially conserved miRNA target sites. The new tool more efficiently incorporates new genomes and more completely controls for background conservation by accounting for mutational biases, dinucleotide conservation rates, and the conservation rates of individual UTRs. The improved background model enabled preferential conservation of a new site type, the “offset 6mer,” to be detected. In total, >45,000 miRNA target sites within human 3′UTRs are conserved above background levels, and >60% of human protein-coding genes have been under selective pressure to maintain pairing to miRNAs. Mammalian-specific miRNAs have far fewer conserved targets than do the more broadly conserved miRNAs, even when considering only more recently emerged targets. Although pairing to the 3′ end of miRNAs can compensate for seed mismatches, this class of sites constitutes less than 2% of all preferentially conserved sites detected. The new tool enables statistically powerful analysis of individual miRNA target sites, with the probability of preferentially conserved targeting (PCT) correlating with experimental measurements of repression. Our expanded set of target predictions (including conserved 3′-compensatory sites) is available at the TargetScan website, which displays the PCT for each site and each predicted target.

7,744 citations
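
As an illustration of the seed-pairing rule mentioned in the abstract above (sites that match miRNA nucleotides 2–7), the following minimal Python sketch locates plain 6mer seed matches in a UTR sequence. It is not the conservation tool the authors describe: the function names `seed_match_site` and `find_sites`, the example UTR string, and the let-7a-like miRNA sequence are assumptions made for illustration, and no conservation scoring or background modeling is attempted.

```python
# Sketch only: exact 6mer seed-match search, with hypothetical helper names.
# Sequences are RNA strings (A, C, G, U) written 5' to 3'.

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna: str) -> str:
    """Return the mRNA 6mer that pairs with miRNA nucleotides 2-7."""
    seed = mirna[1:7]  # nucleotides 2-7 in 1-based numbering
    # The pairing site on the mRNA is the reverse complement of the seed.
    return seed.translate(COMPLEMENT)[::-1]


def find_sites(utr: str, mirna: str) -> list[int]:
    """Return 0-based start positions of exact 6mer seed matches in the UTR."""
    site = seed_match_site(mirna)
    return [
        i
        for i in range(len(utr) - len(site) + 1)
        if utr[i:i + len(site)] == site
    ]


# Illustrative inputs (assumed, not taken from the paper).
EXAMPLE_MIRNA = "UGAGGUAGUAGGUUGUAUAGUU"
EXAMPLE_UTR = "AAUACCUCGG"

if __name__ == "__main__":
    print(seed_match_site(EXAMPLE_MIRNA))
    print(find_sites(EXAMPLE_UTR, EXAMPLE_MIRNA))
```

A real analysis would also distinguish the other site types and weigh conservation against background, which is the contribution of the tool described in the abstract.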


Journal ArticleDOI
TL;DR: The results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
Abstract: Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used to perform a statistical correction for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program EIGENSTRAT, relies on Principal Component Analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part owing to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments, we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real-world data sets, ADMIXTURE's estimates are directly comparable to those from structure and EIGENSTRAT. Taken together, our results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.

5,846 citations


Journal ArticleDOI
TL;DR: ABySS (Assembly By Short Sequences), a parallelized sequence assembler, was developed and used to assemble 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc., yielding contigs that represent 68% of the reference human genome.
Abstract: Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.

3,483 citations
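
The abstract above reports an N50 contig size. As a reminder of how that statistic is commonly computed, here is a minimal Python sketch. It assumes the usual definition (the largest length L such that contigs of length at least L together cover at least half of the total assembled bases); the function name `n50` is hypothetical and the code is unrelated to the ABySS implementation.

```python
# Sketch only: N50 under the common "half of total assembled bases" definition.


def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

Some tools compute N50 against an estimated genome size rather than the assembly total, so values from different reports are not always directly comparable.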


Journal ArticleDOI
TL;DR: The ultimate objective of the HMP is to demonstrate that there are opportunities to improve human health through monitoring or manipulation of the human microbiome.
Abstract: The Human Microbiome Project (HMP), funded as an initiative of the NIH Roadmap for Biomedical Research (http://nihroadmap.nih.gov), is a multi-component community resource. The goals of the HMP are: (1) to take advantage of new, high-throughput technologies to characterize the human microbiome more fully by studying samples from multiple body sites from each of at least 250 "normal" volunteers; (2) to determine whether there are associations between changes in the microbiome and health/disease by studying several different medical conditions; and (3) to provide both a standardized data resource and new technological approaches to enable such studies to be undertaken broadly in the scientific community. The ethical, legal, and social implications of such research are being systematically studied as well. The ultimate objective of the HMP is to demonstrate that there are opportunities to improve human health through monitoring or manipulation of the human microbiome. The history and implementation of this new program are described here.

1,820 citations


Journal ArticleDOI
TL;DR: A comprehensive gene-oriented phylogenetic resource, EnsemblCompara GeneTrees, is developed, based on a computational pipeline that handles clustering, multiple alignment, and tree generation, including for large gene families.
Abstract: The use of phylogenetic trees to describe the evolution of biological processes was established in the 1950s (Hennig 1952) and remains a fundamental approach to understanding the evolution of individual genes through to complete genomes; for example, in the mouse (Mouse Genome Sequencing Consortium 2002), rat (Gibbs et al. 2004), chicken (International Chicken Genome Sequencing Consortium 2004), and monodelphis (Mikkelsen et al. 2007) genome papers, and numerous papers on individual sequences. Now routine, the determination of vertebrate genome sequences provides a rich data source to understand evolution, and using phylogenetic trees of the genes is one of the best ways to organize these data. However, the increased set of genomes makes the compute and engineering tasks to form all the gene trees progressively more complex and harder for individual groups to use. The Ensembl project provides an accurate and consistent protein-coding gene set for all vertebrate genomes (International Human Genome Sequencing Consortium 2001; Dehal et al. 2002; Mouse Genome Sequencing Consortium 2002; Gibbs et al. 2004; Xie et al. 2005; Mikkelsen et al. 2007; Rhesus Macaque Genome Sequencing and Analysis Consortium 2007). Previously (until April 2006), Ensembl provided a basic method for tracing orthologs via the Best Reciprocal BLAST method, similar to approaches used in other genome analyses, such as Drosophila melanogaster (Adams et al. 2000) or human (International Human Genome Sequencing Consortium 2001). In June 2006 (Hubbard et al. 2007), we replaced this system with a phylogenetically sound, gene tree-based approach, providing a complete set of phylogenetic trees spanning 91% of genes across vertebrates. In addition to the vertebrates we have included a few important non-vertebrate species (fly, worm, and yeast) to act both as out groups and provide links to these model organisms. 
In this paper we provide the motivation, implementation, and benchmarking of this method and document the display and access methods for these trees. There have been a number of methods proposed for routine generation of genomewide orthology descriptions, including Inparanoid (Remm et al. 2001), MSOAR (Fu et al. 2007), OrthoMCL (Li et al. 2003), HomoloGene (Wheeler et al. 2008), TreeFam (Li et al. 2006), PhyOP (Goodstadt and Ponting 2006), and PhiGs (Dehal and Boore 2006). The first four, Inparanoid, MSOAR, OrthoMCL, and HomoloGene, focus on providing clusters (or linked clusters) of genes, without an explicit tree topology. PhyOP (Goodstadt and Ponting 2006) uses a tree-based method, but between pairs of closely related species, resolving paralogs accurately by using neutral substitution (as measured by dS, the synonymous substitution rate). TreeFam provides an explicit gene tree across multiple species, using both dS, dN (nonsynonymous substitution rate), nucleotide and protein distance measures, and the standard species tree to balance duplications vs. deletions to inform the tree construction, using the program TreeBeST (http://treesoft.sourceforge.net/treebest.shtml; L. Heng, A.J. Vilella, E. Birney, and R. Durbin, in prep.). The PhiGs method (Dehal and Boore 2006) is a leading phylogenetic-based method that produced a comprehensive phylogenetic resource for the genomes at the time it was run, and the basic outline of its analysis, which was clustering of protein sequences, followed by phylogenetic trees, is similar to the method presented here. However, the PhiGs resource covered a smaller number of species (23 vs. 45) and has been difficult to keep up to date with the advances in gene sets and genomes. Another major difference between PhiGs-based phylogenetic trees and the phylogenetic trees presented here is that the former were calculated using a single maximum likelihood method based on protein evolution.
In contrast, the Ensembl gene trees are calculated using a new method, TreeBeST, which integrates multiple tree topologies, in particular both DNA level and protein level models, and combines this with a species-tree aware penalization of topologies that are inconsistent with known species relationships. We show in this paper that this method produces trees that are more consistent with synteny relationships and have fewer anomalous topologies than single protein-based phylogenetic methods. There are also many single phylogenetic tree-building approaches, many of them based on maximum likelihood methods; one leading method is PhyML (Guindon and Gascuel 2003). It is unclear which method is best, in particular in the context of genome-wide tree building with constraints on computational costs and the need to robustly handle many complex scenarios usually involving large families with heterogeneous phylogenetic depths. In this paper, we benchmark in vertebrates the tree programs TreeBeST and PhyML, and compare the resulting trees to basic best reciprocal hit (BRH) methods and cluster frameworks, in particular Inparanoid and HomoloGene. We also benchmark to a recent PhyOP data set. The PhyOP pipeline has recently switched to use the same tree-building program (TreeBeST) that we use, but differs in its input clusters. Although we adopted this same tree-building method, we describe here considerable novel engineering in the deployment of these methods across all vertebrates. Similar to the PhiGs resource, we have used the dense coverage of genomes to provide topologically based timings (i.e., the standard use of outgroups vs. subsequent lineages to bracket a duplication), in order to label duplication events.

1,135 citations


Journal ArticleDOI
TL;DR: A consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology that has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.
Abstract: Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology. We designed this method by carefully considering the data quality, alignment, and experimental errors common to this technology. All of this information was integrated into a single quality score for each base under Bayesian theory to measure the accuracy of consensus calling. We tested this methodology using a large-scale human resequencing data set of 36× coverage and assembled a high-quality nonrepetitive consensus sequence for 92.25% of the diploid autosomes and 88.07% of the haploid X chromosome. Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that 98.6% of the 37,933 genotyped alleles on the X chromosome and 98% of 999,981 genotyped alleles on autosomes were covered at 99.97% and 99.84% consistency, respectively. At a low sequencing depth, we used prior probability of dbSNP alleles and were able to improve coverage of the dbSNP sites significantly as compared to that obtained using a nonimputation model. Our analyses demonstrate that our method has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.

968 citations


Journal ArticleDOI
TL;DR: Using a comparative genomics data set of 32 vertebrate species, it is shown that a likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences and are likely to be unconditionally deleterious.
Abstract: Each human carries a large number of deleterious mutations. Together, these mutations make a significant contribution to human disease. Identification of deleterious mutations within individual genome sequences could substantially impact an individual’s health through personalized prevention and treatment of disease. Yet, distinguishing deleterious mutations from the massive number of nonfunctional variants that occur within a single genome is a considerable challenge. Using a comparative genomics data set of 32 vertebrate species we show that a likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious. The LRT is also able to identify known human disease alleles and performs as well as two commonly used heuristic methods, SIFT and PolyPhen. Application of the LRT to three human genomes reveals 796–837 deleterious mutations per individual, ~40% of which are estimated to be at <5% allele frequency. However, the overlap between predictions made by the LRT, SIFT, and PolyPhen is low; 76% of predictions are unique to one of the three methods, and only 5% of predictions are shared across all three methods. Our results indicate that only a small subset of deleterious mutations can be reliably identified, but that this subset provides the raw material for personalized medicine. [Supplemental material is available online at http://www.genome.org.]

956 citations


Journal ArticleDOI
TL;DR: Some of the different approaches to community profiling are discussed, highlighting strengths and weaknesses of various experimental approaches, sequencing methodologies, and analytical methods and addressing one key question emerging from various Human Microbiome Projects.
Abstract: High-throughput sequencing studies and new software tools are revolutionizing microbial community analyses, yet the variety of experimental and computational methods can be daunting. In this review, we discuss some of the different approaches to community profiling, highlighting strengths and weaknesses of various experimental approaches, sequencing methodologies, and analytical methods. We also address one key question emerging from various Human Microbiome Projects: Is there a substantial core of abundant organisms or lineages that we all share? It appears that in some human body habitats, such as the hand and the gut, the diversity among individuals is so great that we can rule out the possibility that any species is at high abundance in all individuals: It is possible that the focus should instead be on higher-level taxa or on functional genes.

895 citations


Journal ArticleDOI
TL;DR: A high-throughput genome-based method for genotyping recombinant populations utilizing whole-genome resequencing data generated by the Illumina Genome Analyzer is developed and located a quantitative trait locus of large effect on plant height in a 100-kb region containing the rice "green revolution" gene.
Abstract: The next-generation sequencing technology coupled with the growing number of genome sequences opens the opportunity to redesign genotyping strategies for more effective genetic mapping and genome analysis. We have developed a high-throughput method for genotyping recombinant populations utilizing whole-genome resequencing data generated by the Illumina Genome Analyzer. A sliding window approach is designed to collectively examine genome-wide single nucleotide polymorphisms for genotype calling and recombination breakpoint determination. Using this method, we constructed a genetic map for 150 rice recombinant inbred lines with an expected genotype calling accuracy of 99.94% and a resolution of recombination breakpoints within an average of 40 kb. In comparison to the genetic map constructed with 287 PCR-based markers for the rice population, the sequencing-based method was approximately 20x faster in data collection and 35x more precise in recombination breakpoint determination. Using the sequencing-based genetic map, we located a quantitative trait locus of large effect on plant height in a 100-kb region containing the rice "green revolution" gene. Through computer simulation, we demonstrate that the method is robust for different types of mapping populations derived from organisms with variable quality of genome sequences and is feasible for organisms with large genome sizes and low polymorphisms. With continuous advances in sequencing technologies, this genome-based method may replace the conventional marker-based genotyping approach to provide a powerful tool for large-scale gene discovery and for addressing a wide range of biological questions.

773 citations
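
The abstract above describes a sliding window that examines neighboring SNPs collectively to call genotypes and locate recombination breakpoints. The following minimal Python sketch shows the general idea under strong simplifying assumptions: each SNP of an inbred line is already labeled with its parental origin (`A` or `B`), the window holds a fixed odd number of SNPs, and each window is called by simple majority. The window size, the majority rule, and the function names `call_windows` and `breakpoints` are illustrative assumptions, not the authors' published parameters or code.

```python
# Sketch only: majority-vote sliding window over parental SNP labels.
from collections import Counter


def call_windows(snps: list[str], window: int = 5) -> list[str]:
    """Call one genotype ('A' or 'B') per window of consecutive SNPs."""
    if window % 2 == 0:
        raise ValueError("use an odd window size so majority votes cannot tie")
    calls = []
    for start in range(len(snps) - window + 1):
        counts = Counter(snps[start:start + window])
        # Majority vote smooths over isolated miscalled SNPs.
        calls.append("A" if counts["A"] > counts["B"] else "B")
    return calls


def breakpoints(calls: list[str]) -> list[int]:
    """Return window indices at which the called genotype switches."""
    return [i for i in range(1, len(calls)) if calls[i] != calls[i - 1]]


if __name__ == "__main__":
    # Two parental blocks, each containing one deliberately miscalled SNP.
    example = list("AAAAABAAAA" + "BBBBBABBBB")
    window_calls = call_windows(example, window=5)
    print("".join(window_calls))
    print(breakpoints(window_calls))
```

In this toy input the two isolated discordant SNPs do not change any window call, so only the true switch between parental blocks is reported as a breakpoint.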


Journal ArticleDOI
TL;DR: Analysis of recent selection in a global sample of 53 populations, using genotype data from the Human Genome Diversity-CEPH Panel, suggests that there has been selection on loci involved in susceptibility to type II diabetes.
Abstract: Genome-wide scans for recent positive selection in humans have yielded insight into the mechanisms underlying the extensive phenotypic diversity in our species, but have focused on a limited number of populations. Here, we present an analysis of recent selection in a global sample of 53 populations, using genotype data from the Human Genome Diversity-CEPH Panel. We refine the geographic distributions of known selective sweeps, and find extensive overlap between these distributions for populations in the same continental region but limited overlap between populations outside these groupings. We present several examples of previously unrecognized candidate targets of selection, including signals at a number of genes in the NRG-ERBB4 developmental pathway in non-African populations. Analysis of recently identified genes involved in complex diseases suggests that there has been selection on loci involved in susceptibility to type II diabetes. Finally, we search for local adaptation between geographically close populations, and highlight several examples.

Journal ArticleDOI
TL;DR: An open source, portable, JavaScript-based genome browser that can be used to navigate genome annotations over the web, and a simple wiki plug-in that allows users to upload and share annotation tracks, are described.
Abstract: We describe an open source, portable, JavaScript-based genome browser, JBrowse, that can be used to navigate genome annotations over the web. JBrowse helps preserve the user's sense of location by avoiding discontinuous transitions, instead offering smoothly animated panning, zooming, navigation, and track selection. Unlike most existing genome browsers, where the genome is rendered into images on the webserver and the role of the client is restricted to displaying those images, JBrowse distributes work between the server and client and therefore uses significantly less server overhead than previous genome browsers. We report benchmark results empirically comparing server- and client-side rendering strategies, review the architecture and design considerations of JBrowse, and describe a simple wiki plug-in that allows users to upload and share annotation tracks.

Journal ArticleDOI
TL;DR: The data indicate that CTCF may play important roles in the barrier activity of insulators, and this study provides a resource for further investigation of the CTCF function in organizing chromatin in the human genome.
Abstract: Insulators, which are DNA elements that prevent inappropriate interactions between the neighboring regions of the genome, can be functionally classified into enhancer blockers and barriers. The enhancer-blocking insulators prevent enhancers from interacting with unrelated genes, and the barrier insulators protect genes and regulatory regions from the adjacent heterochromatin or repressive domain-mediated effects, thus preventing position effects (Gerasimova and Corces 1996; Bell et al. 1999; Felsenfeld et al. 2004). Identified originally in Drosophila, insulators are known to bind proteins that mediate the insulator activity (Gerasimova and Corces 2001). While several such proteins have been identified in Drosophila, the only major insulator-binding protein identified in vertebrates is CTCF (CCCTC-binding factor) (Bell et al. 1999; Gerasimova and Corces 2001; West et al. 2002; Felsenfeld et al. 2004). CTCF, a ubiquitously-expressed 11-zinc finger protein, is a critical transcription factor, which is involved in transcriptional activation and repression in addition to binding the chromatin insulators (Ohlsson et al. 2001; Gaszner and Felsenfeld 2006; Williams and Flavell 2008). It was originally identified as a repressor (Lobanenkov et al. 1990; Filippova et al. 1996) and later shown to be an activator of transcription (Vostrov and Quitschke 1997). Recently, it has been implicated in X chromosome inactivation (Filippova et al. 2005; Xu et al. 2007). The enhancer-blocking insulator activity of CTCF was first demonstrated at the HS4 insulator located at the 5′ end of the chicken beta-globin locus (Bell et al. 1999). The insulator function of CTCF has also been implicated in imprinting at the Igf2/H19 locus (Bell and Felsenfeld 2000; Hark et al. 2000; Kanduri et al. 2000; Fedoriw et al. 2004). Recently, several genome-scale mapping experiments for CTCF-binding sites have been performed for a better understanding of the CTCF function. 
A study in mouse identified ∼200 CTCF-bound DNA fragments displaying enhancer-blocking activity (Mukhopadhyay et al. 2004). In a computational analysis of the human conserved noncoding elements, nearly 15,000 potential CTCF-binding sites were identified (Xie et al. 2007). A recent chromatin immunoprecipitation with microarray hybridization (ChIP-chip) study in human IMR90 cells identified 13,804 CTCF-binding regions (Kim et al. 2007). A cell-type invariance of CTCF binding was reported in this study by comparing the binding sites in IMR90 cells with that of the 232 sites identified in U937 cells (Kim et al. 2007). In our earlier chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) studies, we had observed CTCF-binding sites flanking active domains with the region outside being histone H3K27 trimethylated (H3K27me3), a modification associated with the repressed regions of chromatin (Barski et al. 2007). Even though initial studies of chicken HS4 insulator suggested the importance of the CTCF-binding sites for its barrier activity, later dissection of this insulator showed that CTCF was not required for this activity (Recillas-Targa et al. 2002). While a few other studies in the recent past have suggested a barrier activity for CTCF (Cho et al. 2005; Filippova et al. 2005), there has been no direct evidence for this (Gaszner and Felsenfeld 2006). In order to examine whether CTCF is indeed involved in the barrier activity, it is important to delineate the relationship between CTCF-binding sites and the repressive and active domains of the genome.
In this study we investigated the potential role of CTCF in delimiting the repressive genomic domains. To identify CTCF-bound genomic sites at high resolution, we analyzed the ChIP-seq data from HeLa and Jurkat cells obtained in this study along with the ChIP-seq data from resting human CD4+ T cells (Barski et al. 2007) using the binding-site identification algorithm, SISSRs (site identification from short sequence reads) (Jothi et al. 2008). Our data revealed an extensive overlap of the CTCF-binding sites across the genome between the different cell types studied. A subset of the CTCF-binding sites was significantly associated with the boundaries of H3K27me3 domains, suggesting a possible repressive domain barrier function. Interestingly, the potential domain barrier activity of CTCF was cell-type-specific. We observed strong cell-type-specific phasing of nucleosomes at the CTCF-binding sites. We found that the histone H2AK5 acetylation (H2AK5ac) marked the active regions of the genome and was complementary to H3K27me3. CTCF binding in between these two domains further reinforces its potential role in the barrier insulator function.

Journal ArticleDOI
TL;DR: The results suggest that analysis of read depth is an effective approach for the detection of CNVs, and it captures structural variants that are refractory to established PEM-based methods.
Abstract: Methods for the direct detection of copy number variation (CNV) genome-wide have become effective instruments for identifying genetic risk factors for disease. The application of next-generation sequencing platforms to genetic studies promises to improve sensitivity to detect CNVs as well as inversions, indels, and SNPs. New computational approaches are needed to systematically detect these variants from genome sequence data. Existing sequence-based approaches for CNV detection are primarily based on paired-end read mapping (PEM) as reported previously by Tuzun et al. and Korbel et al. Due to limitations of the PEM approach, some classes of CNVs are difficult to ascertain, including large insertions and variants located within complex genomic regions. To overcome these limitations, we developed a method for CNV detection using read depth of coverage. Event-wise testing (EWT) is a method based on significance testing. In contrast to standard segmentation algorithms that typically operate by performing likelihood evaluation for every point in the genome, EWT works on intervals of data points, rapidly searching for specific classes of events. Overall false-positive rate is controlled by testing the significance of each possible event and adjusting for multiple testing. Deletions and duplications detected in an individual genome by EWT are examined across multiple genomes to identify polymorphism between individuals. We estimated error rates using simulations based on real data, and we applied EWT to the analysis of chromosome 1 from paired-end shotgun sequence data (30×) on five individuals. Our results suggest that analysis of read depth is an effective approach for the detection of CNVs, and it captures structural variants that are refractory to established PEM-based methods.

Journal ArticleDOI
TL;DR: Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual.
Abstract: We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.

Journal ArticleDOI
TL;DR: The findings indicate that the MEN epsilon/beta non-coding RNAs are essential structural/organizational components of paraspeckles.
Abstract: Studies of the transcriptional output of the human and mouse genomes have revealed that there are many more transcripts produced than can be accounted for by predicted protein-coding genes. Using a custom microarray, we have identified 184 non-coding RNAs that exhibit more than twofold up- or down-regulation upon differentiation of C2C12 myoblasts into myotubes. Here, we focus on the Men epsilon/beta locus, which is up-regulated 3.3-fold during differentiation. Two non-coding RNA isoforms are produced from a single RNA polymerase II promoter, differing in the location of their 3' ends. Men epsilon is a 3.2-kb polyadenylated RNA, whereas Men beta is an approximately 20-kb transcript containing a genomically encoded poly(A)-rich tract at its 3'-end. The 3'-end of Men beta is generated by RNase P cleavage. The Men epsilon/beta transcripts are localized to nuclear paraspeckles and directly interact with NONO. Knockdown of MEN epsilon/beta expression results in the disruption of nuclear paraspeckles. Furthermore, the formation of paraspeckles, after release from transcriptional inhibition by DRB treatment, was suppressed in MEN epsilon/beta-depleted cells. Our findings indicate that the MEN epsilon/beta non-coding RNAs are essential structural/organizational components of paraspeckles.

Journal ArticleDOI
TL;DR: This screen validated the hypothesis that the authors can simultaneously assay every gene in the genome to identify niche-specific essential genes and generate a genome-wide list of candidate essential genes.
Abstract: Very high-throughput sequencing technologies need to be matched by high-throughput functional studies if we are to make full use of the current explosion in genome sequences. We have generated a very large bacterial mutant pool, consisting of an estimated 1.1 million transposon mutants and we have used genomic DNA from this mutant pool, and Illumina nucleotide sequencing to prime from the transposon and sequence into the adjacent target DNA. With this method, which we have called TraDIS (transposon directed insertion-site sequencing), we have been able to map 370,000 unique transposon insertion sites to the Salmonella enterica serovar Typhi chromosome. The unprecedented density and resolution of mapped insertion sites, an average of one every 13 base pairs, has allowed us to assay simultaneously every gene in the genome for essentiality and generate a genome-wide list of candidate essential genes. In addition, the semiquantitative nature of the assay allowed us to identify genes that are advantageous and those that are disadvantageous for growth under standard laboratory conditions. Comparison of the mutant pool following growth in the presence or absence of ox bile enabled every gene to be assayed for its contribution toward bile tolerance, a trait required of any enteric bacterium and for carriage of S. Typhi in the gall bladder. This screen validated our hypothesis that we can simultaneously assay every gene in the genome to identify niche-specific essential genes.
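
One way to picture the per-gene readout described here is to count mapped insertion sites inside each gene and normalize by gene length. The sketch below is only an illustration under stated assumptions: the function name `insertion_index`, the half-open gene coordinates, and the toy data are hypothetical, and the abstract does not specify how essentiality was actually scored.

```python
from bisect import bisect_left

def insertion_index(insertion_sites, genes):
    """Return insertions per base pair for each gene.

    insertion_sites: iterable of integer chromosome positions.
    genes: dict mapping gene name -> (start, end), half-open [start, end).
    """
    sites = sorted(set(insertion_sites))  # unique sites, sorted for binary search
    result = {}
    for name, (start, end) in genes.items():
        # Number of sites with start <= position < end.
        count = bisect_left(sites, end) - bisect_left(sites, start)
        result[name] = count / (end - start)
    return result

# Toy example: geneB has no insertions, so it would be a candidate essential gene.
toy_sites = [5, 18, 31, 44, 57, 70, 205, 218]
toy_genes = {"geneA": (0, 100), "geneB": (100, 200), "geneC": (200, 300)}
print(insertion_index(toy_sites, toy_genes))
```

Genes with an index far below the genome-wide density would be the ones to inspect first; any cutoff applied to this index would be a further assumption.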

Journal ArticleDOI
TL;DR: A model in which host silencing of TEs near genes has deleterious effects on neighboring gene expression, resulting in the preferential loss of methylated TEs from gene-rich chromosomal regions is presented.
Abstract: Transposable elements (TEs) are ubiquitous genomic parasites. The deleterious consequences of the presence and activity of TEs have fueled debate about the evolutionary forces countering their expansion. Purifying selection is thought to purge TE insertions from the genome, and TE sequences are targeted by hosts for epigenetic silencing. However, the interplay between epigenetic and evolutionary forces countering TE expansion remains unexplored. Here we analyze genomic, epigenetic, and population genetic data from Arabidopsis thaliana to yield three observations. First, gene expression is negatively correlated with the density of methylated TEs. Second, the signature of purifying selection is detectable for methylated TEs near genes but not for unmethylated TEs or for TEs far from genes. Third, TE insertions are distributed by age and methylation status, such that older, methylated TEs are farther from genes. Based on these observations, we present a model in which host silencing of TEs near genes has deleterious effects on neighboring gene expression, resulting in the preferential loss of methylated TEs from gene-rich chromosomal regions. This mechanism implies an evolutionary tradeoff in which the benefit of TE silencing imposes a fitness cost via deleterious effects on the expression of nearby genes.

Journal ArticleDOI
TL;DR: The CCDS database centralizes the function of identifying well-supported, identically-annotated, protein-coding regions and indicates that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS.
Abstract: Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.

Journal ArticleDOI
TL;DR: Genome-wide determination and analysis of GR:DNA binding and transcriptional response to hormone reveals new insights into the complexities of gene regulatory activities managed by GR, and shows that the GR can respond to different levels of corticosteroids in a gene-specific manner.
Abstract: The glucocorticoid steroid hormone cortisol is released by the adrenal glands in response to stress and serves as a messenger in circadian rhythms. Transcriptional responses to this hormonal signal are mediated by the glucocorticoid receptor (GR). We determined GR binding throughout the human genome by using chromatin immunoprecipitation followed by next-generation DNA sequencing, and measured related changes in gene expression with mRNA sequencing in response to the glucocorticoid dexamethasone (DEX). We identified 4392 genomic positions occupied by the GR and 234 genes with significant changes in expression in response to DEX. This genomic census revealed striking differences between gene activation and repression by the GR. While genes activated with DEX treatment have GR bound within a median distance of 11 kb from the transcriptional start site (TSS), the nearest GR binding for genes repressed with DEX treatment is a median of 146 kb from the TSS, suggesting that DEX-mediated repression occurs independently of promoter-proximal GR binding. In addition to the dramatic differences in proximity of GR binding, we found differences in the kinetics of gene expression response for induced and repressed genes, with repression occurring substantially after induction. We also found that the GR can respond to different levels of corticosteroids in a gene-specific manner. For example, low doses of DEX selectively induced PER1, a transcription factor involved in regulating circadian rhythms. Overall, the genome-wide determination and analysis of GR:DNA binding and transcriptional response to hormone reveals new insights into the complexities of gene regulatory activities managed by GR.

Journal ArticleDOI
TL;DR: Whole-genome sequencing of an MDR ST313 isolate showed evidence of genome degradation, including pseudogene formation and chromosomal deletions, when compared with other S. Typhimurium genome sequences, and genome analysis of other epidemic ST313 isolates from Malawi and Kenya provided evidence for microevolution and clonal replacement in the field.
Abstract: Whereas most nontyphoidal Salmonella (NTS) are associated with gastroenteritis, there has been a dramatic increase in reports of NTS-associated invasive disease in sub-Saharan Africa. Salmonella enterica serovar Typhimurium isolates are responsible for a significant proportion of the reported invasive NTS in this region. Multilocus sequence analysis of invasive S. Typhimurium from Malawi and Kenya identified a dominant type, designated ST313, which currently is rarely reported outside of Africa. Whole-genome sequencing of a multiple drug resistant (MDR) ST313 NTS isolate, D23580, identified a distinct prophage repertoire and a composite genetic element encoding MDR genes located on a virulence-associated plasmid. Further, there was evidence of genome degradation, including pseudogene formation and chromosomal deletions, when compared with other S. Typhimurium genome sequences. Some of this genome degradation involved genes previously implicated in virulence of S. Typhimurium or genes for which the orthologs in S. Typhi are either pseudogenes or are absent. Genome analysis of other epidemic ST313 isolates from Malawi and Kenya provided evidence for microevolution and clonal replacement in the field.

Journal ArticleDOI
TL;DR: This work synthesized and tested hundreds of ZFNs to target dozens of different sites in the human CCR5 gene, a co-receptor required for HIV infection, and found that many of these nucleases induced site-specific mutations in the CCR5 sequence.
Abstract: Broad applications of zinc finger nuclease (ZFN) technology, which allows targeted genome editing, in research, medicine, and biotechnology are hampered by the lack of a convenient, rapid, and publicly available method for the synthesis of functional ZFNs. Here we describe an efficient and easy-to-practice modular-assembly method using publicly available zinc fingers to make ZFNs that can modify the DNA sequences of predetermined genomic sites in human cells. We synthesized and tested hundreds of ZFNs to target dozens of different sites in the human CCR5 gene, a co-receptor required for HIV infection, and found that many of these nucleases induced site-specific mutations in the CCR5 sequence. Because human cells that harbor CCR5 null mutations are functional and normal, these ZFNs might be used for (1) knocking out CCR5 to produce T-cells that are resistant to HIV infection in AIDS patients or (2) inserting therapeutic genes at "safe sites" in gene therapy applications.

Journal ArticleDOI
TL;DR: This work uses GERMLINE, a robust algorithm for identifying segmental sharing indicative of recent common ancestry between pairs of individuals, to comprehensively survey hidden relatedness both in the HapMap as well as in a densely typed island population of 3000 individuals.
Abstract: We present GERMLINE, a robust algorithm for identifying segmental sharing indicative of recent common ancestry between pairs of individuals. Unlike methods with comparable objectives, GERMLINE scales linearly with the number of samples, enabling analysis of whole-genome data in large cohorts. Our approach is based on a dictionary of haplotypes that is used to efficiently discover short exact matches between individuals. We then expand these matches using dynamic programming to identify long, nearly identical segmental sharing that is indicative of relatedness. We use GERMLINE to comprehensively survey hidden relatedness both in the HapMap as well as in a densely typed island population of 3000 individuals. We verify that GERMLINE is in concordance with other methods when they can process the data, and also facilitates analysis of larger scale studies. We bolster these results by demonstrating novel applications of precise analysis of hidden relatedness for (1) identification and resolution of phasing errors and (2) exposing polymorphic deletions that are otherwise challenging to detect. This finding is supported by concordance of detected deletions with other evidence from independent databases and statistical analyses of fluorescence intensity not used by GERMLINE.
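
The seed-and-extend idea in this abstract can be sketched as follows. This is a simplified illustration, not the published implementation: haplotypes are assumed to be phased, equal-length strings of alleles, the names `seed_matches` and `extend` and their parameters are hypothetical, and the extension step is a plain greedy scan with a mismatch budget standing in for the dynamic programming mentioned in the abstract.

```python
from collections import defaultdict
from itertools import combinations

def seed_matches(haplotypes, window=4):
    """Find pairs of haplotypes that are identical within fixed windows.

    haplotypes: dict mapping sample name -> allele string (all equal length).
    Returns a dict mapping (name_a, name_b) -> list of matching window indices.
    """
    length = len(next(iter(haplotypes.values())))
    seeds = defaultdict(list)
    for w, start in enumerate(range(0, length - window + 1, window)):
        # Dictionary of haplotype slices: samples in the same bucket match exactly.
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                seeds[(a, b)].append(w)
    return dict(seeds)

def extend(hap_a, hap_b, start, end, max_mismatch=1):
    """Greedily grow the half-open segment [start, end) in both directions
    while the total number of mismatches stays within max_mismatch."""
    mismatches = 0
    while end < len(hap_a):
        if hap_a[end] != hap_b[end]:
            if mismatches == max_mismatch:
                break
            mismatches += 1
        end += 1
    while start > 0:
        if hap_a[start - 1] != hap_b[start - 1]:
            if mismatches == max_mismatch:
                break
            mismatches += 1
        start -= 1
    return start, end

toy = {"A": "01101001", "B": "01101101", "C": "11100110"}
print(seed_matches(toy, window=4))
print(extend(toy["A"], toy["B"], 0, 4, max_mismatch=1))
```

Because each window is hashed once per sample, the seeding pass costs time proportional to the number of samples rather than the number of pairs, which is consistent with the linear scaling the abstract claims; only samples that land in the same bucket are ever compared directly.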

Journal ArticleDOI
TL;DR: In this paper, the 3'-end of conserved miRNAs in particular has significant interaction sites in the human-enriched, less conserved 5'-UTR miRNA motifs.
Abstract: MicroRNAs (miRNAs) are known to post-transcriptionally regulate target mRNAs through the 3'-UTR, which interacts mainly with the 5'-end of miRNA in animals. Here we identify many endogenous motifs within human 5'-UTRs specific to the 3'-ends of miRNAs. The 3'-end of conserved miRNAs in particular has significant interaction sites in the human-enriched, less conserved 5'-UTR miRNA motifs, while human-specific miRNAs have significant interaction sites only in the conserved 5'-UTR motifs, implying both miRNA and 5'-UTR are actively evolving in response to each other. Additionally, many miRNAs with their 3'-end interaction sites in the 5'-UTRs turn out to simultaneously contain 5'-end interaction sites in the 3'-UTRs. Based on these findings we demonstrate combinatory interactions between a single miRNA and both end regions of an mRNA using model systems. We further show that genes exhibiting large-scale protein changes due to miRNA overexpression or deletion contain both UTR interaction sites predicted. We provide the predicted targets of this new miRNA target class, miBridge, as an efficient way to screen potential targets, especially for nonconserved miRNAs, since the target search space is reduced by an order of magnitude compared with the 3'-UTR alone. Efficacy is confirmed by showing SEC24D regulation with hsa-miR-605, a miRNA identified only in primate, opening the door to the study of nonconserved miRNAs. Finally, miRNAs (and associated proteins) involved in this new targeting class may prevent 40S ribosome scanning through the 5'-UTR and keep it from reaching the start-codon, preventing 60S association.

Journal ArticleDOI
TL;DR: The recent history of the burgeoning field of human population genomics is chronicled, genome-wide scans for positive selection in humans are critically assessed, important gaps in knowledge are identified, and both short- and long-term strategies for traversing the path from the low-resolution, incomplete, and error-prone maps of selection today to the ultimate goal of a detailed molecular, mechanistic, phenotypic, and population genetics characterization of adaptive alleles are discussed.
Abstract: Identifying targets of positive selection in humans has, until recently, been frustratingly slow, relying on the analysis of individual candidate genes. Genomics, however, has provided the necessary resources to systematically interrogate the entire genome for signatures of natural selection. To date, 21 genome-wide scans for recent or ongoing positive selection have been performed in humans. A key challenge is to begin synthesizing these newly constructed maps of positive selection into a coherent narrative of human evolutionary history and derive a deeper mechanistic understanding of how natural populations evolve. Here, I chronicle the recent history of the burgeoning field of human population genomics, critically assess genome-wide scans for positive selection in humans, identify important gaps in knowledge, and discuss both short- and long-term strategies for traversing the path from the low-resolution, incomplete, and error-prone maps of selection today to the ultimate goal of a detailed molecular, mechanistic, phenotypic, and population genetics characterization of adaptive alleles.

Journal ArticleDOI
TL;DR: It is argued that the distribution of effect size of common variants is the same for all phenotypes regardless of species, and the importance of epistasis, pleiotropy, and gene by environment interactions is discussed.
Abstract: We compare and contrast the genetic architecture of quantitative phenotypes in two genetically well-characterized model organisms, the laboratory mouse, Mus musculus, and the fruit fly, Drosophila melanogaster, with that found in our own species from recent successes in genome-wide association studies. We show that the current model of large numbers of loci, each of small effect, is true for all species examined, and that discrepancies can be largely explained by differences in the experimental designs used. We argue that the distribution of effect size of common variants is the same for all phenotypes regardless of species, and we discuss the importance of epistasis, pleiotropy, and gene by environment interactions. Despite substantial advances in mapping quantitative trait loci, the identification of the quantitative trait genes and ultimately the sequence variants has proved more difficult, so that our information on the molecular basis of quantitative variation remains limited. Nevertheless, available data indicate that many variants lie outside genes, presumably in regulatory regions of the genome, where they act by altering gene expression. As yet there are very few instances where homologous quantitative trait loci, or quantitative trait genes, have been identified in multiple species, but the availability of high-resolution mapping data will soon make it possible to test the degree of overlap between species.

Journal ArticleDOI
TL;DR: Direct sequence-based immunoprofiling will likely prove to be a useful tool for understanding repertoire dynamics in response to immune challenge, without a priori knowledge of antigen.
Abstract: T-cell receptor (TCR) genomic loci undergo somatic V(D)J recombination, plus the addition/subtraction of nontemplated bases at recombination junctions, in order to generate the repertoire of structurally diverse T cells necessary for antigen recognition. TCR beta subunits can be unambiguously identified by their hypervariable CDR3 (Complementarity Determining Region 3) sequence. This is the site of V(D)J recombination encoding the principal site of antigen contact. The complexity and dynamics of the T-cell repertoire remain unknown because the potential repertoire size has made conventional sequence analysis intractable. Here, we use 5'-RACE, Illumina sequencing, and a novel short read assembly strategy to sample CDR3(beta) diversity in human T lymphocytes from peripheral blood. Assembly of 40.5 million short reads identified 33,664 distinct TCR(beta) clonotypes and provides precise measurements of CDR3(beta) length diversity, usage of nontemplated bases, sequence convergence, and preferences for TRBV (T-cell receptor beta variable gene) and TRBJ (T-cell receptor beta joining gene) gene usage and pairing. CDR3 length between conserved residues of TRBV and TRBJ ranged from 21 to 81 nucleotides (nt). TRBV gene usage ranged from 0.01% for TRBV17 to 24.6% for TRBV20-1. TRBJ gene usage ranged from 1.6% for TRBJ2-6 to 17.2% for TRBJ2-1. We identified 1573 examples of convergence where the same amino acid translation was specified by distinct CDR3(beta) nucleotide sequences. Direct sequence-based immunoprofiling will likely prove to be a useful tool for understanding repertoire dynamics in response to immune challenge, without a priori knowledge of antigen.
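
Convergence, as defined in this abstract, means that distinct nucleotide sequences share one amino acid translation. A minimal sketch of how such cases could be collected is given below. It assumes in-frame DNA sequences over A, C, G, T and uses the standard genetic code; the function names and the toy sequences are hypothetical, not data from the study.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(nt):
    """Translate an in-frame DNA string codon by codon."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))

def convergent_clonotypes(cdr3_nt_sequences):
    """Group distinct nucleotide sequences by translation and keep the groups
    with more than one member. Sequences whose length is not a multiple of
    three are skipped (an assumption of this sketch)."""
    by_protein = defaultdict(set)
    for nt in cdr3_nt_sequences:
        nt = nt.upper()
        if len(nt) % 3:
            continue
        by_protein[translate(nt)].add(nt)
    return {aa: sorted(nts) for aa, nts in by_protein.items() if len(nts) > 1}

toy_reads = ["TGTGCCAGC", "TGCGCCAGC", "TGTGCCAGT", "TGTGCCTGG"]
print(convergent_clonotypes(toy_reads))
```

Storing each group as a set means repeated reads of the same nucleotide sequence do not count as convergence; only genuinely different sequences do.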

Journal ArticleDOI
Ivan Nasidze, Jing Li, Dominique Quinque, Kun Tang, Mark Stoneking
TL;DR: In this paper, the authors analyzed 14,115 partial (∼500 bp) 16S ribosomal RNA (rRNA) sequences from saliva samples from 120 healthy individuals (10 individuals from each of 12 worldwide locations).
Abstract: The human salivary microbiome may play a role in diseases of the oral cavity and interact with microbiomes from other parts of the human body (in particular, the intestinal tract), but little is known about normal variation in the salivary microbiome. We analyzed 14,115 partial (∼500 bp) 16S ribosomal RNA (rRNA) sequences from saliva samples from 120 healthy individuals (10 individuals from each of 12 worldwide locations). These sequences could be assigned to 101 known bacterial genera, of which 39 were not previously reported from the human oral cavity; phylogenetic analysis suggests that an additional 64 unknown genera are present. There is high diversity in the salivary microbiome within and between individuals, but little geographic structure. Overall, ∼13.5% of the total variance in the composition of genera is due to differences among individuals, which is remarkably similar to the fraction of the total variance in neutral genetic markers that can be attributed to differences among human populations. Investigation of some environmental variables revealed a significant association between the genetic distances among locations and the distance of each location from the equator. Further characterization of the enormous diversity revealed here in the human salivary microbiome will aid in elucidating the role it plays in human health and disease, and in the identification of potentially informative species for studies of human population history.

Journal ArticleDOI
TL;DR: High-resolution binding profiles for 89 known and predicted yeast TFs are determined and proteins that bind the PAC (Polymerase A and C) motif (GATGAG) and regulate ribosomal RNA transcription and processing, core cellular processes that are constituent to ribosome biogenesis are discovered.
Abstract: Transcription factors (TFs) regulate the expression of genes through sequence-specific interactions with DNA-binding sites. However, despite recent progress in identifying in vivo TF binding sites by microarray readout of chromatin immunoprecipitation (ChIP-chip), nearly half of all known yeast TFs are of unknown DNA-binding specificities, and many additional predicted TFs remain uncharacterized. To address these gaps in our knowledge of yeast TFs and their cis regulatory sequences, we have determined high-resolution binding profiles for 89 known and predicted yeast TFs, over more than 2.3 million gapped and ungapped 8-bp sequences ("k-mers"). We report 50 new or significantly different direct DNA-binding site motifs for yeast DNA-binding proteins and motifs for eight proteins for which only a consensus sequence was previously known; in total, this corresponds to over a 50% increase in the number of yeast DNA-binding proteins with experimentally determined DNA-binding specificities. Among other novel regulators, we discovered proteins that bind the PAC (Polymerase A and C) motif (GATGAG) and regulate ribosomal RNA (rRNA) transcription and processing, core cellular processes that are constituent to ribosome biogenesis. In contrast to earlier data types, these comprehensive k-mer binding data permit us to consider the regulatory potential of genomic sequence at the individual word level. These k-mer data allowed us to reannotate in vivo TF binding targets as direct or indirect and to examine TFs' potential effects on gene expression in ~1700 environmental and cellular conditions. These approaches could be adapted to identify TFs and cis regulatory elements in higher eukaryotes.

Journal ArticleDOI
TL;DR: Paired-end tag (PET) sequencing for various applications, collectively called the PET sequencing strategy, in which short and paired tags are extracted from the ends of long DNA fragments for ultra-high-throughput sequencing, has a bright future ahead.
Abstract: Comprehensive understanding of functional elements in the human genome will require thorough interrogation and comparison of individual human genomes and genomic structures. Such an endeavor will require improvements in the throughputs and costs of DNA sequencing. Next-generation sequencing platforms have impressively low costs and high throughputs but are limited by short read lengths. An immediate and widely recognized solution to this critical limitation is the paired-end tag (PET) sequencing for various applications, collectively called the PET sequencing strategy, in which short and paired tags are extracted from the ends of long DNA fragments for ultra-high-throughput sequencing. The PET sequences can be accurately mapped to the reference genome, thus demarcating the genomic boundaries of PET-represented DNA fragments and revealing the identities of the target DNA elements. PET protocols have been developed for the analyses of transcriptomes, transcription factor binding sites, epigenetic sites such as histone modification sites, and genome structures. The exclusive advantage of the PET technology is its ability to uncover linkages between the two ends of DNA fragments. Using this unique feature, unconventional fusion transcripts, genome structural variations, and even molecular interactions between distant genomic elements can be unraveled by PET analysis. Extensive use of PET data could lead to efficient assembly of individual human genomes, transcriptomes, and interactomes, enabling new biological and clinical insights. With its versatile and powerful nature for DNA analysis, the PET sequencing strategy has a bright future ahead.