
Showing papers in "Genome Research in 2006"


Journal ArticleDOI
TL;DR: A high-throughput 3C approach, 3C-Carbon Copy (5C), that employs microarrays or quantitative DNA sequencing using 454-technology as detection methods that should be widely applicable for large-scale mapping of cis- and trans- interaction networks of genomic elements and for the study of higher-order chromosome structure.
Abstract: Physical interactions between genetic elements located throughout the genome play important roles in gene regulation and can be identified with the Chromosome Conformation Capture (3C) methodology. 3C converts physical chromatin interactions into specific ligation products, which are quantified individually by PCR. Here we present a high-throughput 3C approach, 3C-Carbon Copy (5C), that employs microarrays or quantitative DNA sequencing using 454-technology as detection methods. We applied 5C to analyze a 400-kb region containing the human beta-globin locus and a 100-kb conserved gene desert region. We validated 5C by detection of several previously identified looping interactions in the beta-globin locus. We also identified a new looping interaction in K562 cells between the beta-globin Locus Control Region and the gamma-beta-globin intergenic region. Interestingly, this region has been implicated in the control of developmental globin gene switching. 5C should be widely applicable for large-scale mapping of cis- and trans- interaction networks of genomic elements and for the study of higher-order chromosome structure.

1,178 citations


Journal ArticleDOI
TL;DR: Current efforts are directed toward a more comprehensive cataloging and characterization of CNVs that will provide the basis for determining how genomic diversity impacts biological function, evolution, and common human diseases.
Abstract: DNA copy number variation has long been associated with specific chromosomal rearrangements and genomic disorders, but its ubiquity in mammalian genomes was not fully realized until recently. Although our understanding of the extent of this variation is still developing, it seems likely that, at least in humans, copy number variants (CNVs) account for a substantial amount of genetic variation. Since many CNVs include genes that result in differential levels of gene expression, CNVs may account for a significant proportion of normal phenotypic variation. Current efforts are directed toward a more comprehensive cataloging and characterization of CNVs that will provide the basis for determining how genomic diversity impacts biological function, evolution, and common human diseases.

855 citations


Journal ArticleDOI
TL;DR: Genetic analyses provided evidence of the global regulation of subsets of the sexually dimorphic genes, as the transcript levels of a large number of these genes were controlled by several expression quantitative trait loci (eQTL) hotspots that exhibited tissue-specific control.
Abstract: We report a comprehensive analysis of gene expression differences between sexes in multiple somatic tissues of 334 mice derived from an intercross between inbred mouse strains C57BL/6J and C3H/HeJ. The analysis of a large number of individuals provided the power to detect relatively small differences in expression between sexes, and the use of an intercross allowed analysis of the genetic control of sexually dimorphic gene expression. Microarray analysis of 23,574 transcripts revealed that the extent of sexual dimorphism in gene expression was much greater than previously recognized. Thus, thousands of genes showed sexual dimorphism in liver, adipose, and muscle, and hundreds of genes were sexually dimorphic in brain. These genes exhibited highly tissue-specific patterns of expression and were enriched for distinct pathways represented in the Gene Ontology database. They also showed evidence of chromosomal enrichment, not only on the sex chromosomes, but also on several autosomes. Genetic analyses provided evidence of the global regulation of subsets of the sexually dimorphic genes, as the transcript levels of a large number of these genes were controlled by several expression quantitative trait loci (eQTL) hotspots that exhibited tissue-specific control. Moreover, many tissue-specific transcription factor binding sites were found to be enriched in the sexually dimorphic genes.

801 citations


Journal ArticleDOI
TL;DR: Cross-species sequence divergence estimates suggest that synonymous substitution rates in the basal angiosperms are less than half those previously reported for core eudicots and members of Poaceae, and lower substitution rates permit inference of older duplication events.
Abstract: Genomic comparisons provide evidence for ancient genome-wide duplications in a diverse array of animals and plants. We developed a birth–death model to identify evidence for genome duplication in EST data, and applied a mixture model to estimate the age distribution of paralogous pairs identified in EST sets for species representing the basal-most extant flowering plant lineages. We found evidence for episodes of ancient genome-wide duplications in the basal angiosperm lineages including Nuphar advena (yellow water lily: Nymphaeaceae) and the magnoliids Persea americana (avocado: Lauraceae), Liriodendron tulipifera (tulip poplar: Magnoliaceae), and Saruma henryi (Aristolochiaceae). In addition, we detected independent genome duplications in the basal eudicot Eschscholzia californica (California poppy: Papaveraceae) and the basal monocot Acorus americanus (Acoraceae), both of which were distinct from duplications documented for ancestral grass (Poaceae) and core eudicot lineages. Among gymnosperms, we found equivocal evidence for ancient polyploidy in Welwitschia mirabilis (Gnetales) and no evidence for polyploidy in pine, although gymnosperms generally have much larger genomes than the angiosperms investigated. Cross-species sequence divergence estimates suggest that synonymous substitution rates in the basal angiosperms are less than half those previously reported for core eudicots and members of Poaceae. These lower substitution rates permit inference of older duplication events. We hypothesize that evidence of an ancient duplication observed in the Nuphar data may represent a genome duplication in the common ancestor of all or most extant angiosperms, except Amborella.

677 citations


Journal ArticleDOI
TL;DR: The results demonstrate the effectiveness of the method for reliably profiling many CpG sites in parallel for the discovery of informative methylation markers and should prove useful for DNA methylation analyses in large populations.
Abstract: We have developed a high-throughput method for analyzing the methylation status of hundreds of preselected genes simultaneously and have applied it to the discovery of methylation signatures that distinguish normal from cancer tissue samples. Through an adaptation of the GoldenGate genotyping assay implemented on a BeadArray platform, the methylation state of 1536 specific CpG sites in 371 genes (one to nine CpG sites per gene) was measured in a single reaction by multiplexed genotyping of 200 ng of bisulfite-treated genomic DNA. The assay was used to obtain a quantitative measure of the methylation level at each CpG site. After validating the assay in cell lines and normal tissues, we analyzed a panel of lung cancer biopsy samples (N = 22) and identified a panel of methylation markers that distinguished lung adenocarcinomas from normal lung tissues with high specificity. These markers were validated in a second sample set (N = 24). These results demonstrate the effectiveness of the method for reliably profiling many CpG sites in parallel for the discovery of informative methylation markers. The technology should prove useful for DNA methylation analyses in large populations, with potential application to the classification and diagnosis of a broad range of cancers and other diseases.

676 citations


Journal ArticleDOI
TL;DR: An initial map of human INDEL variation that contains 415,436 unique INDEL polymorphisms, which range from 1 bp to 9989 bp in length and are split almost equally between insertions and deletions, relative to the chimpanzee genome sequence.
Abstract: Although many studies have been conducted to identify single nucleotide polymorphisms (SNPs) in humans, few studies have been conducted to identify alternative forms of natural genetic variation, such as insertion and deletion (INDEL) polymorphisms. In this report, we describe an initial map of human INDEL variation that contains 415,436 unique INDEL polymorphisms. These INDELs were identified with a computational approach using DNA re-sequencing traces that originally were generated for SNP discovery projects. They range from 1 bp to 9989 bp in length and are split almost equally between insertions and deletions, relative to the chimpanzee genome sequence. Five major classes of INDELs were identified, including (1) insertions and deletions of single-base pairs, (2) monomeric base pair expansions, (3) multi-base pair expansions of 2–15 bp repeat units, (4) transposon insertions, and (5) INDELs containing random DNA sequences. Our INDELs are distributed throughout the human genome with an average density of one INDEL per 7.2 kb of DNA. Variation hotspots were identified with up to 48-fold regional increases in INDEL and/or SNP variation compared with the chromosomal averages for the same chromosomes. Over 148,000 INDELs (35.7%) were identified within known genes, and 5542 of these INDELs were located in the promoters and exons of genes, where gene function would be expected to be influenced the greatest. All INDELs in this study have been deposited into dbSNP and have been integrated into maps of human genetic variation that are available to the research community.

644 citations


Journal ArticleDOI
TL;DR: The utility of SNP-CGH is demonstrated with two Infinium whole-genome genotyping BeadChips, assaying 109,000 and 317,000 SNP loci, and the statistical ability to detect common aberrations was modeled by analysis of an X chromosome titration model system, and sensitivity was modeled by titration of gDNA from a tumor cell with that of its paired normal cell line.
Abstract: Array-CGH is a powerful tool for the detection of chromosomal aberrations. The introduction of high-density SNP genotyping technology to genomic profiling, termed SNP-CGH, represents a further advance, since simultaneous measurement of both signal intensity variations and changes in allelic composition makes it possible to detect both copy number changes and copy-neutral loss-of-heterozygosity (LOH) events. We demonstrate the utility of SNP-CGH with two Infinium whole-genome genotyping BeadChips, assaying 109,000 and 317,000 SNP loci, to detect chromosomal aberrations in samples bearing constitutional aberrations as well as tumor samples at sub-100 kb effective resolution. Detected aberrations include homozygous deletions, hemizygous deletions, copy-neutral LOH, duplications, and amplifications. The statistical ability to detect common aberrations was modeled by analysis of an X chromosome titration model system, and sensitivity was modeled by titration of gDNA from a tumor cell with that of its paired normal cell line. Analysis was facilitated by using a genome browser that plots log ratios of normalized intensities and allelic ratios along the chromosomes. We developed two modes of SNP-CGH analysis, a single sample and a paired sample mode. The single sample mode computes log intensity ratios and allelic ratios by referencing to canonical genotype clusters generated from ∼120 reference samples, whereas the paired sample mode uses a paired normal reference sample from the same individual. Finally, the two analysis modes are compared and contrasted for their utility in analyzing different types of input gDNA: low input amounts, fragmented gDNA, and Phi29 whole-genome pre-amplified DNA.

547 citations


Journal ArticleDOI
TL;DR: The honey bee genome sequence reveals a remarkable expansion of the insect odorant receptor (Or) family relative to the repertoires of the flies Drosophila melanogaster and Anopheles gambiae, which have 62 and 79 Ors respectively.
Abstract: The honey bee genome sequence reveals a remarkable expansion of the insect odorant receptor (Or) family relative to the repertoires of the flies Drosophila melanogaster and Anopheles gambiae, which have 62 and 79 Ors respectively. A total of 170 Or genes were annotated in the bee, of which seven are pseudogenes. These constitute five bee-specific subfamilies in an insect Or family tree, one of which has expanded to a total of 157 genes encoding proteins with 15%-99% amino acid identity. Most of the Or genes are in tandem arrays, including one with 60 genes. This bee-specific expansion of the Or repertoire presumably underlies their remarkable olfactory abilities, including perception of several pheromone blends, kin recognition signals, and diverse floral odors. The number of Apis mellifera Ors is approximately equal to the number of glomeruli in the bee antennal lobe (160-170), consistent with a general one-receptor/one-neuron/one-glomerulus relationship. The bee genome encodes just 10 gustatory receptors (Grs) compared with the D. melanogaster and A. gambiae repertoires of 68 and 76 Grs, respectively. A lack of Gr gene family expansion primarily accounts for this difference. A nurturing hive environment and a mutualistic relationship with plants may explain the lack of Gr family expansion. The Or family is the most dramatic example of gene family expansion in the bee genome, and characterizing their caste- and sex-specific gene expression may provide clues to their specific roles in detection of pheromone, kin, and floral odors.

542 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that Oryza australiensis, a wild relative of the Asian cultivated rice O. sativa, has undergone recent bursts of three LTR-retrotransposon families, leading to a rapid twofold increase of its size.
Abstract: Retrotransposons are the main components of eukaryotic genomes, representing up to 80% of some large plant genomes. These mobile elements transpose via a “copy and paste” mechanism, thus increasing their copy number while active. Their accumulation is now accepted as the main factor of genome size increase in higher eukaryotes, besides polyploidy. However, the dynamics of this process are poorly understood. In this study, we show that Oryza australiensis, a wild relative of the Asian cultivated rice O. sativa, has undergone recent bursts of three LTR-retrotransposon families. This genome has accumulated more than 90,000 retrotransposon copies during the last three million years, leading to a rapid twofold increase of its size. In addition, phenetic analyses of these retrotransposons clearly confirm that the genomic bursts occurred posterior to the radiation of the species. This provides direct evidence of retrotransposon-mediated variation of genome size within a plant genus.

541 citations


Journal ArticleDOI
TL;DR: High-throughput analysis, using massively parallel signature sequencing (MPSS), of 230,000 tags from a DNase library generated from quiescent human CD4+ T cells identifies 14,190 clusters of sequences that group within close proximity to each other that represent valid DNase HS sites.
Abstract: A major goal in genomics is to understand how genes are regulated in different tissues, stages of development, diseases, and species. Mapping DNase I hypersensitive (HS) sites within nuclear chromatin is a powerful and well-established method of identifying many different types of regulatory elements, but in the past it has been limited to analysis of single loci. We have recently described a protocol to generate a genome-wide library of DNase HS sites. Here, we report high-throughput analysis, using massively parallel signature sequencing (MPSS), of 230,000 tags from a DNase library generated from quiescent human CD4+ T cells. Of the tags that uniquely map to the genome, we identified 14,190 clusters of sequences that group within close proximity to each other. By using a real-time PCR strategy, we determined that the majority of these clusters represent valid DNase HS sites. Approximately 80% of these DNase HS sites uniquely map within one or more annotated regions of the genome believed to contain regulatory elements, including regions 2 kb upstream of genes, CpG islands, and highly conserved sequences. Most DNase HS sites identified in CD4+ T cells are also HS in CD8+ T cells, B cells, hepatocytes, human umbilical vein endothelial cells (HUVECs), and HeLa cells. However, ∼10% of the DNase HS sites are lymphocyte specific, indicating that this procedure can identify gene regulatory elements that control cell type specificity. This strategy, which can be applied to any cell line or tissue, will enable a better understanding of how chromatin structure dictates cell function and fate.

513 citations


Journal ArticleDOI
TL;DR: Analysis of the genomic landscape around these sequences indicates that some cDNA clones were produced not from terminal poly(A) tracts but internal priming sites within longer transcripts, only a minority of which is encompassed by known genes.
Abstract: Recent large-scale analyses of mainly full-length cDNA libraries generated from a variety of mouse tissues indicated that almost half of all representative cloned sequences did not contain an apparent protein-coding sequence, and were putatively derived from non-protein-coding RNA (ncRNA) genes. However, many of these clones were singletons and the majority were unspliced, raising the possibility that they may be derived from genomic DNA or unprocessed pre-mRNA contamination during library construction, or alternatively represent nonspecific "transcriptional noise." Here we show, using reverse transcriptase-dependent PCR, microarray, and Northern blot analyses, that many of these clones were derived from genuine transcripts of unknown function whose expression appears to be regulated. The ncRNA transcripts have larger exons and fewer introns than protein-coding transcripts. Analysis of the genomic landscape around these sequences indicates that some cDNA clones were produced not from terminal poly(A) tracts but internal priming sites within longer transcripts, only a minority of which is encompassed by known genes. A significant proportion of these transcripts exhibit tissue-specific expression patterns, as well as dynamic changes in their expression in macrophages following lipopolysaccharide stimulation. Taken together, the data provide strong support for the conclusion that ncRNAs are an important, regulated component of the mammalian transcriptome.

Journal ArticleDOI
TL;DR: The findings suggest that use of alternate promoters and consequent alternative use of first exons should play a pivotal role in generating the complexity required for the highly elaborated molecular systems in humans.
Abstract: By analyzing 1,780,295 5'-end sequences of human full-length cDNAs derived from 164 kinds of oligo-cap cDNA libraries, we identified 269,774 independent positions of transcriptional start sites (TSSs) for 14,628 human RefSeq genes. These TSSs were clustered into 30,964 clusters that were separated from each other by more than 500 bp and thus are very likely to constitute mutually distinct alternative promoters. To our surprise, at least 7674 (52%) human RefSeq genes were subject to regulation by putative alternative promoters (PAPs). On average, there were 3.1 PAPs per gene, with the composition of one CpG-island-containing promoter per 2.6 CpG-less promoters. In 17% of the PAP-containing loci, tissue-specific use of the PAPs was observed. The richest tissue sources of the tissue-specific PAPs were testis and brain. It was also intriguing that the PAP-containing promoters were enriched in the genes encoding signal transduction-related proteins and were rarer in the genes encoding extracellular proteins, possibly reflecting the varied functional requirement for and the restricted expression of those categories of genes, respectively. The patterns of the first exons were highly diverse as well. On average, there were 7.7 different splicing types of first exons per locus partly produced by the PAPs, suggesting that a wide variety of transcripts can be achieved by this mechanism. Our findings suggest that use of alternate promoters and consequent alternative use of first exons should play a pivotal role in generating the complexity required for the highly elaborated molecular systems in humans.
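The clustering step described in this abstract (TSS positions grouped into clusters separated by more than 500 bp) can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so the single-linkage gap rule below is an assumption, as are the function name `cluster_tss`, the `min_gap` parameter, and the example coordinates; strand and per-gene grouping are ignored for brevity.

```python
from typing import List


def cluster_tss(positions: List[int], min_gap: int = 500) -> List[List[int]]:
    """Group TSS positions into clusters (hypothetical helper, not the authors' code).

    A new cluster starts whenever the distance to the previous position
    exceeds min_gap, so neighbouring clusters end up more than min_gap bp apart.
    """
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= min_gap:
            # Close enough to the previous TSS: same putative promoter.
            clusters[-1].append(pos)
        else:
            # Gap larger than min_gap: start a new putative alternative promoter.
            clusters.append([pos])
    return clusters


# Made-up coordinates on one strand of one locus.
example_positions = [100, 150, 400, 1200, 1300, 5000]
example_clusters = cluster_tss(example_positions)
print(example_clusters)
```

Under this rule, the number of clusters per gene would correspond to the count of putative alternative promoters; a different linkage rule or strand-aware grouping could give different counts.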

Journal ArticleDOI
TL;DR: It is argued that "balanced gene drive" is a sufficient explanation for the trend that the maximums of morphological complexity have gone up, and not down, in both plant and animal eukaryotic lineages.
Abstract: Controversy surrounds the apparent rising maximums of morphological complexity during eukaryotic evolution, with organisms increasing the number and nestedness of developmental areas as evidenced by morphological elaborations reflecting area boundaries. No "predictable drive" to increase this sort of complexity has been reported. Recent genetic data and theory in the general area of gene dosage effects has engendered a robust "gene balance hypothesis," with a theoretical base that makes specific predictions as to gene content changes following different types of gene duplication. Genomic data from both chordate and angiosperm genomes fit these predictions: Each type of duplication provides a one-way injection of a biased set of genes into the gene pool. Tetraploidies and balanced segments inject bias for those genes whose products are the subunits of the most complex biological machines or cascades, like transcription factors (TFs) and proteasome core proteins. Most duplicate genes are removed after tetraploidy. Genic balance is maintained by not removing those genes that are dose-sensitive, which tends to leave duplicate "functional modules" as the indirect products (spandrels) of purifying selection. Functional modules are the likely precursors of coadapted gene complexes, a unit of natural selection. The result is a predictable drive mechanism where "drive" is used rigorously, as in "meiotic drive." Rising morphological gain is expected given a supply of duplicate functional modules. All flowering plants have survived at least three large-scale duplications/diploidizations over the last 300 million years (Myr). An equivalent period of tetraploidy and body plan evolution may have ended for animals 500 million years ago (Mya). We argue that "balanced gene drive" is a sufficient explanation for the trend that the maximums of morphological complexity have gone up, and not down, in both plant and animal eukaryotic lineages.

Journal ArticleDOI
TL;DR: It is found that islands of retention contain "connected genes," those genes predicted by the gene balance hypothesis to be resistant to removal because the products they encode interact with other products in a dose-sensitive manner, creating a web of dependency.
Abstract: Approximately 90% of Arabidopsis’ unique gene content is found in syntenic blocks that were formed during the most recent whole-genome duplication. Within these blocks, 28.6% of the genes have a retained pair; the remaining genes have been lost from one of the homeologs. We create a minimized genome by condensing local duplications to one gene, removing transposons, and including only genes within blocks defined by retained pairs. We use a moving average of retained and non-retained genes to find clusters of retention and then identify the types of genes that appear in clusters at frequencies above expectations. Significant clusters of retention exist for almost all chromosomal segments. Detailed alignments show that, for 85% of the genome, one homeolog was preferentially (1.6×) targeted for fractionation. This homeolog fractionation bias suggests an epigenetic mechanism. We find that islands of retention contain “connected genes,” those genes predicted—by the gene balance hypothesis—to be resistant to removal because the products they encode interact with other products in a dose-sensitive manner, creating a web of dependency. Gene families that are overrepresented in clusters include those encoding components of the proteasome/protein modification complexes, signal transduction machinery, ribosomes, and transcription factor complexes. Gene pair fractionation following polyploidy or segmental duplication leaves a genome enriched for “connected” genes. These clusters of duplicate genes may help explain the evolutionary origin of coregulated chromosomal regions and new functional modules.
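The moving-average scan over retained and non-retained genes mentioned in this abstract can be sketched as follows. This is an illustration only: the window size, the threshold, and the names `moving_average` and `retention_windows` are assumptions, and the abstract does not state how expected frequencies or significance were computed.

```python
from typing import List


def moving_average(flags: List[int], window: int) -> List[float]:
    """Fraction of retained genes (flag 1) in each sliding window along a segment."""
    if window <= 0 or window > len(flags):
        raise ValueError("window must be between 1 and len(flags)")
    total = sum(flags[:window])
    averages = [total / window]
    for i in range(window, len(flags)):
        # Slide the window by one gene: add the new flag, drop the oldest.
        total += flags[i] - flags[i - window]
        averages.append(total / window)
    return averages


def retention_windows(flags: List[int], window: int, threshold: float) -> List[int]:
    """Start indices of windows whose retained fraction reaches the threshold."""
    return [i for i, frac in enumerate(moving_average(flags, window)) if frac >= threshold]


# Made-up gene order: 1 = duplicate pair retained, 0 = one copy lost.
example_flags = [0, 0, 1, 1, 1, 0, 0, 0]
print(retention_windows(example_flags, window=3, threshold=0.6))
```

In a real analysis the threshold would presumably be derived from the genome-wide retention rate (28.6% in the abstract) rather than fixed by hand.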

Journal ArticleDOI
TL;DR: The HpaII tiny fragment Enrichment by Ligation-mediated PCR assay is robust, quantitative, and accurate and is providing new insights into the distribution and dynamic nature of cytosine methylation in the genome.
Abstract: The distribution of cytosine methylation in 6.2 Mb of the mouse genome was tested using cohybridization of genomic representations from a methylation-sensitive restriction enzyme and its methylation-insensitive isoschizomer. This assay, termed HELP (HpaII tiny fragment Enrichment by Ligation-mediated PCR), allows both intragenomic profiling and intergenomic comparisons of cytosine methylation. The intragenomic profile shows most of the genome to be contiguous methylated sequence with occasional clusters of hypomethylated loci, usually but not exclusively at promoters and CpG islands. Intergenomic comparison found marked differences in cytosine methylation between spermatogenic and brain cells, identifying 223 new candidate tissue-specific differentially methylated regions (T-DMRs). Bisulfite pyrosequencing confirmed the four candidates tested to be T-DMRs, while quantitative RT-PCR for two genes with T-DMRs located at their promoters showed the HELP data to be correlated with gene activity at these loci. The HELP assay is robust, quantitative, and accurate and is providing new insights into the distribution and dynamic nature of cytosine methylation in the genome.

Journal ArticleDOI
TL;DR: An extended analysis of these interacting networks by bioinformatics and experimentation should provide new insights and novel strategies for E. coli systems biology.
Abstract: Escherichia coli is one of the best characterized organisms and has served as a model system to study many aspects of bacterial physiology and genetics of fundamental and applied interest. Among the 4339 predicted ORFs in E. coli, including previous predictions (Riley et al. 2006), nearly 50% are experimentally uncharacterized. In addition to functional analysis of individual ORFs, systematic analyses of relationships between constituent elements, such as gene regulatory networks, protein–protein interactions (PPIs), and metabolic networks, have only recently become feasible. To date, comprehensive PPI studies have been based on the yeast two-hybrid system that detects binary interactions through activation of reporter gene expression (Fields and Song 1989; Uetz et al. 2000; Ito et al. 2001), and pull-down assays that detect large complexes by copurification of prey proteins through their interactions with bait proteins (Gavin et al. 2002; Ho et al. 2002) or protein chips (Zhu et al. 2001). In E. coli, a large-scale protein interaction network analysis was recently carried out by pull-down assay using TAP-tagged bait proteins (Butland et al. 2005). We have already described a comprehensive E. coli ORF library (the ASKA library) as a new resource for E. coli biology (Kitagawa et al. 2005). Here, we report the use of this resource in a systematic analysis of PPIs using pull-down assays. With the advent of matrix-assisted laser desorption ionization time of flight (MALDI-TOF) mass spectrometry methods, it is feasible to identify PPIs on a proteome-wide scale. We have carried out a large-scale identification of protein–protein interactions to gain further understanding of the E. coli model cell at the system level. Because E. coli is one of the best studied organisms, it should also be an excellent target for approaches from the fields of systems biology (Kitano 2002) and synthetic biology (Silver and Way 2004).

Journal ArticleDOI
TL;DR: To understand the modes and mechanisms that underlie variation in genome composition, sequence data from whole genome shotgun libraries for three representative diploid members of Gossypium were generated and a pattern of lineage-specific amplification of particular subfamilies of retrotransposons within each species studied was demonstrated.
Abstract: The DNA content of eukaryotic nuclei (C-value) varies ∼200,000-fold, but there is only a ∼20-fold variation in the number of protein-coding genes. Hence, most C-value variation is ascribed to the repetitive fraction, although little is known about the evolutionary dynamics of the specific components that lead to genome size variation. To understand the modes and mechanisms that underlie variation in genome composition, we generated sequence data from whole genome shotgun (WGS) libraries for three representative diploid (n = 13) members of Gossypium that vary in genome size from 880 to 2460 Mb (1C) and from a phylogenetic outgroup, Gossypioides kirkii, with an estimated genome size of 588 Mb. Copy number estimates including all dispersed repetitive sequences indicate that 40%–65% of each genome is composed of transposable elements. Inspection of individual sequence types revealed differential, lineage-specific expansion of various families of transposable elements among the different plant lineages. Copia-like retrotransposable element sequences have differentially accumulated in the Gossypium species with the smallest genome, G. raimondii, while gypsy-like sequences have proliferated in the lineages with larger genomes. Phylogenetic analyses demonstrated a pattern of lineage-specific amplification of particular subfamilies of retrotransposons within each species studied. One particular group of gypsy-like retrotransposon sequences, Gorge3 (Gossypium retrotransposable gypsy-like element), appears to have undergone a massive proliferation in two plant lineages, accounting for a major fraction of genome-size change. Like maize, Gossypium has undergone a threefold increase in genome size due to the accumulation of LTR retrotransposons over the 5–10 Myr since its origin. [The sequence data described in this paper have been submitted to the GSS Division of GenBank under accessions DX390732–DX406528.]

Journal ArticleDOI
TL;DR: This work considered a coalescent model of directional selection in a sensible demographic setting, allowing for selection on standing variation as well as on a new mutation, and concluded that, insofar as attributes of the beneficial mutation affect the power to detect targets of selection, genomic scans will yield an unrepresentative subset of loci that contribute to adaptations.
Abstract: The beneficial substitution of an allele shapes patterns of genetic variation at linked sites. Thus, in principle, adaptations can be mapped by looking for the signature of directional selection in polymorphism data. In practice, such efforts are hampered by the need for an accurate characterization of the demographic history of the species and of the effects of positive selection. In an attempt to circumvent these difficulties, researchers are increasingly taking a purely empirical approach, in which a large number of genomic regions are ordered by summaries of the polymorphism data, and loci with extreme values are considered to be likely targets of positive selection. We evaluated the reliability of the "empirical" approach, focusing on applications to human data and to maize. To do so, we considered a coalescent model of directional selection in a sensible demographic setting, allowing for selection on standing variation as well as on a new mutation. Our simulations suggest that while empirical approaches will identify several interesting candidates, they will also miss many--in some cases, most--loci of interest. The extent of the trade-off depends on the mode of positive selection and the demographic history of the population. Specifically, the false-discovery rate is higher when directional selection involves a recessive rather than a co-dominant allele, when it acts on a previously neutral rather than a new allele, and when the population has experienced a population bottleneck rather than maintained a constant size. One implication of these results is that, insofar as attributes of the beneficial mutation (e.g., the dominance coefficient) affect the power to detect targets of selection, genomic scans will yield an unrepresentative subset of loci that contribute to adaptations.
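The "empirical" approach evaluated in this abstract (rank genomic regions by a summary of the polymorphism data, then treat the extremes as candidate targets) can be sketched as below. This is a toy illustration, not the authors' coalescent simulations: the scores, region names, and the functions `empirical_outliers` and `evaluate` are hypothetical, and the sketch assumes that larger scores are more extreme.

```python
from typing import Dict, Set, Tuple


def empirical_outliers(scores: Dict[str, float], n_top: int) -> Set[str]:
    """Return the n_top regions with the most extreme (here: largest) scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_top])


def evaluate(candidates: Set[str], true_targets: Set[str]) -> Tuple[float, float]:
    """Return (false-discovery proportion, fraction of true targets missed)."""
    hits = candidates & true_targets
    fdp = 1 - len(hits) / len(candidates) if candidates else 0.0
    missed = 1 - len(hits) / len(true_targets) if true_targets else 0.0
    return fdp, missed


# Made-up summary statistics for five regions; three are "truly" selected.
example_scores = {"r0": 0.1, "r1": 0.5, "r2": 0.9, "r3": 0.8, "r4": 0.2}
example_truth = {"r1", "r3", "r4"}
example_candidates = empirical_outliers(example_scores, n_top=2)
print(evaluate(example_candidates, example_truth))
```

In this toy case one of the two candidates is a false positive and two of the three true targets are missed, which mirrors the trade-off the abstract describes: an outlier list can contain interesting candidates while still missing many loci of interest.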

Journal ArticleDOI
TL;DR: The enrichment of regulatory sequences in the relatively small unmethylated compartment suggests that cytosine methylation constrains the effective size of the genome through the selective exposure of regulatory sequences.
Abstract: The mammalian genome depends on patterns of methylated cytosines for normal function, but the relationship between genomic methylation patterns and the underlying sequence is unclear. We have characterized the methylation landscape of the human genome by global analysis of patterns of CpG depletion and by direct sequencing of 3073 unmethylated domains and 2565 methylated domains from human brain DNA. The genome was found to consist of short (<4 kb) unmethylated domains embedded in a matrix of long methylated domains. Unmethylated domains were enriched in promoters, CpG islands, and first exons, while methylated domains comprised interspersed and tandem-repeated sequences, exons other than first exons, and non-annotated single-copy sequences that are depleted in the CpG dinucleotide. The enrichment of regulatory sequences in the relatively small unmethylated compartment suggests that cytosine methylation constrains the effective size of the genome through the selective exposure of regulatory sequences. This buffers regulatory networks against changes in total genome size and provides an explanation for the C value paradox, which concerns the wide variations in genome size that scale independently of gene number. This suggestion is compatible with the finding that cytosine methylation is universal among large-genome eukaryotes, while many eukaryotes with genome sizes <5 × 10^8 bp do not methylate their DNA.

Journal ArticleDOI
TL;DR: Graemlin, the first algorithm capable of scalable multiple network alignment, is developed together with the first quantitative benchmarks for network alignment, which allow comparisons of algorithms in terms of their ability to recapitulate the KEGG database of conserved functional modules.
Abstract: The recent proliferation of protein interaction networks has motivated research into network alignment: the cross-species comparison of conserved functional modules. Previous studies have laid the foundations for such comparisons and demonstrated their power on a select set of sparse interaction networks. Recently, however, new computational techniques have produced hundreds of predicted interaction networks with interconnection densities that push existing alignment algorithms to their limits. To find conserved functional modules in these new networks, we have developed Graemlin, the first algorithm capable of scalable multiple network alignment. Graemlin's explicit model of functional evolution allows both the generalization of existing alignment scoring schemes and the location of conserved network topologies other than protein complexes and metabolic pathways. To assess Graemlin's performance, we have developed the first quantitative benchmarks for network alignment, which allow comparisons of algorithms in terms of their ability to recapitulate the KEGG database of conserved functional modules. We find that Graemlin achieves substantial scalability gains over previous methods while improving sensitivity.

Journal ArticleDOI
TL;DR: Deep sequencing of mutants provides a genetic approach for the dissection and characterization of diverse small RNA populations and the identification of low abundance miRNAs.
Abstract: The Arabidopsis genome contains a highly complex and abundant population of small RNAs, and many of the endogenous siRNAs are dependent on RNA-Dependent RNA Polymerase 2 (RDR2) for their biogenesis. By analyzing an rdr2 loss-of-function mutant using two different parallel sequencing technologies, MPSS and 454, we characterized the complement of miRNAs expressed in Arabidopsis inflorescence to considerable depth. Nearly all known miRNAs were enriched in this mutant and we identified 13 new miRNAs, all of which were relatively low abundance and constitute new families. Trans-acting siRNAs (ta-siRNAs) were even more highly enriched. Computational and gel blot analyses suggested that the minimal number of miRNAs in Arabidopsis is approximately 155. The size profile of small RNAs in rdr2 reflected enrichment of 21-nt miRNAs and other classes of siRNAs like ta-siRNAs, and a significant reduction in 24-nt heterochromatic siRNAs. Other classes of small RNAs were found to be RDR2-independent, particularly those derived from long inverted repeats and a subset of tandem repeats. The small RNA populations in other Arabidopsis small RNA biogenesis mutants were also examined; a dcl2/3/4 triple mutant showed a similar pattern to rdr2, whereas dcl1-7 and rdr6 showed reductions in miRNAs and ta-siRNAs consistent with their activities in the biogenesis of these types of small RNAs. Deep sequencing of mutants provides a genetic approach for the dissection and characterization of diverse small RNA populations and the identification of low abundance miRNAs.

Journal ArticleDOI
TL;DR: Surprisingly, the PRC complexes can be localized to discrete binding sites or spread through large regions of the mouse and human genomes, and it is suggested that OCT4 maintains stem cell self-renewal, in part, by recruiting PRC complexes to certain genes that promote differentiation.
Abstract: Suz12 is a component of the Polycomb group complexes 2, 3, and 4 (PRC 2/3/4). These complexes are critical for proper embryonic development, but very few target genes have been identified in either mouse or human cells. Using a variety of ChIP-chip approaches, we have identified a large set of Suz12 target genes in five different human and mouse cell lines. Interestingly, we found that Suz12 target promoters are cell type specific, with transcription factors and homeobox proteins predominating in embryonal cells and glycoproteins and immunoglobulin-related proteins predominating in adult tumors. We have also characterized the localization of other components of the PRC complex with Suz12 and investigated the overall relationship between Suz12 binding and markers of active versus inactive chromatin, using both promoter arrays and custom tiling arrays. Surprisingly, we find that the PRC complexes can be localized to discrete binding sites or spread through large regions of the mouse and human genomes. Finally, we have shown that some Suz12 target genes are bound by OCT4 in embryonal cells and suggest that OCT4 maintains stem cell self-renewal, in part, by recruiting PRC complexes to certain genes that promote differentiation.

Journal ArticleDOI
TL;DR: Analysis of a selected subset of clinical material suggests that a simple genomic calculation, based on the number and proximity of genomic alterations, correlates with life-table estimates of the probability of overall survival in patients with primary breast cancer.
Abstract: Representational Oligonucleotide Microarray Analysis (ROMA) detects genomic amplifications and deletions with boundaries defined at a resolution of approximately 50 kb. We have used this technique to examine 243 breast tumors from two separate studies for which detailed clinical data were available. The very high resolution of this technology has enabled us to identify three characteristic patterns of genomic copy number variation in diploid tumors and to measure correlations with patient survival. One of these patterns is characterized by multiple closely spaced amplicons, or "firestorms," limited to single chromosome arms. These multiple amplifications are highly correlated with aggressive disease and poor survival even when the rest of the genome is relatively quiet. Analysis of a selected subset of clinical material suggests that a simple genomic calculation, based on the number and proximity of genomic alterations, correlates with life-table estimates of the probability of overall survival in patients with primary breast cancer. Based on this sample, we generate the working hypothesis that copy number profiling might provide information useful in making clinical decisions, especially regarding the use or not of systemic therapies (hormonal therapy, chemotherapy), in the management of operable primary breast cancer with ostensibly good prognosis, for example, small, node-negative, hormone-receptor-positive diploid cases.

Journal ArticleDOI
TL;DR: Cyanobacterial genomes reveal a complex evolutionary history, which cannot be represented by a single strictly bifurcating tree for all genes or even most genes, although a single completely resolved phylogeny was recovered from the quartets' plurality signals.
Abstract: Using 1128 protein-coding gene families from 11 completely sequenced cyanobacterial genomes, we attempt to quantify horizontal gene transfer events within cyanobacteria, as well as between cyanobacteria and other phyla. A novel method of detecting and enumerating potential horizontal gene transfer events within a group of organisms based on analyses of “embedded quartets” allows us to identify phylogenetic signal consistent with a plurality of gene families, as well as to delineate cases of conflict to the plurality signal, which include horizontally transferred genes. To infer horizontal gene transfer events between cyanobacteria and other phyla, we added homologs from 168 available genomes. We screened phylogenetic trees reconstructed for each of these extended gene families for highly supported monophyly of cyanobacteria (or lack of it). Cyanobacterial genomes reveal a complex evolutionary history, which cannot be represented by a single strictly bifurcating tree for all genes or even most genes, although a single completely resolved phylogeny was recovered from the quartets’ plurality signals. We find more conflicts within cyanobacteria than between cyanobacteria and other phyla. We also find that genes from all functional categories are subject to transfer. However, in interphylum as compared to intraphylum transfers, the proportion of metabolic (operational) gene transfers increases, while the proportion of informational gene transfers decreases.

Journal ArticleDOI
TL;DR: It is proposed that L1 families with different 5'UTRs can coexist because they do not rely on the same host-encoded factors for their transcription and therefore do not compete with each other.
Abstract: We investigated the evolution of the families of LINE-1 (L1) retrotransposons that have amplified in the human lineage since the origin of primates. We identified two phases in the evolution of L1. From approximately 70 million years ago (Mya) until approximately 40 Mya, three distinct L1 lineages were simultaneously active in the genome of ancestral primates. In contrast, during the last 40 million years (Myr), i.e., during the evolution of anthropoid primates, a single lineage of families has evolved and amplified. We found that novel (i.e., unrelated) regulatory regions (5'UTRs) have been frequently recruited during the evolution of L1, whereas the two open-reading frames (ORF1 and ORF2) have remained relatively conserved. We found that L1 families coexisted and formed independently evolving L1 lineages only when they had different 5'UTRs. We propose that L1 families with different 5'UTRs can coexist because they do not rely on the same host-encoded factors for their transcription and therefore do not compete with each other. The most prolific L1 families (families L1PA8 to L1PA3) amplified between 40 and 12 Mya. This period of high activity corresponds to an episode of adaptive evolution in a segment of ORF1. The correlation between the high activity of L1 families and adaptive evolution could result from the coevolution of L1 and a host-encoded repressor of L1 activity.

Journal ArticleDOI
TL;DR: The results suggest that E2F1 is recruited to promoters via a method distinct from recognition of the known consensus site and point toward a new understanding of E2F1 as a factor that contributes to the regulation of a large fraction of human genes.
Abstract: The E2F family of transcription factors regulates basic cellular processes. Here, we take an unbiased approach towards identifying E2F1 target genes by examining localization of E2F1-binding sites using high-density oligonucleotide tiling arrays. To begin, we developed a statistically-based methodology for analysis of ChIP-chip data obtained from arrays that represent 30 Mb of the human genome. Using this methodology, we identified regions bound by E2F1, MYC, and RNA Polymerase II (POLR2A). We found a large number of binding sites for all three factors; extrapolation suggests there may be approximately 20,000-30,000 E2F1- and MYC-binding sites and approximately 12,000-17,000 active promoters in HeLa cells. In contrast to our results for MYC, we find that the majority of E2F1-binding sites (>80%) are located in core promoters and that 50% of the sites overlap transcription starts. Only a small fraction of E2F1 sites possess the canonical binding motif. Surprisingly, we found that approximately 30% of genes in the 30-Mb region possessed an E2F1 binding site in a core promoter and E2F1 was bound near to 83% of POLR2A-bound sites. To determine if these results were representative of the entire human genome, we performed ChIP-chip analyses of approximately 24,000 promoters and confirmed that greater than 20% of the promoters were bound by E2F1. Our results suggest that E2F1 is recruited to promoters via a method distinct from recognition of the known consensus site and point toward a new understanding of E2F1 as a factor that contributes to the regulation of a large fraction of human genes.

Journal ArticleDOI
TL;DR: Sodalis represents an evolutionary intermediate transitioning from a free-living to a mutualistic lifestyle, and its chromosome encodes a complete flagella structure, key components of which are expressed in immature host developmental stages.
Abstract: Sodalis glossinidius is a maternally transmitted endosymbiont of tsetse flies (Glossina spp.), an insect of medical and veterinary significance. Analysis of the complete sequence of Sodalis' chromosome (4,171,146 bp, encoding 2,432 protein coding sequences) indicates a reduced coding capacity of 51%. Furthermore, the chromosome contains 972 pseudogenes, an inordinately high number compared with that of other bacterial species. A high proportion of these pseudogenes are homologs of known proteins that function either in defense or in the transport and metabolism of carbohydrates and inorganic ions, suggesting Sodalis' degenerative adaptations to the immunity and restricted nutritional status of the host. Sodalis possesses three chromosomal symbiosis regions (SSR): SSR-1, SSR-2, and SSR-3, with gene inventories similar to the Type-III secretion system (TTSS) ysa from Yersinia enterocolitica and SPI-1 and SPI-2 from Salmonella, respectively. While core components of the needle structure have been conserved, some of the effectors and regulators typically associated with these systems in pathogenic microbes are modified or eliminated in Sodalis. Analysis of SSR-specific invA transcript abundance in Sodalis during host development indicates that the individual symbiosis regions may exhibit different temporal expression profiles. In addition, the Sodalis chromosome encodes a complete flagella structure, key components of which are expressed in immature host developmental stages. These features may be important for the transmission and establishment of symbiont infections in the intra-uterine progeny. The data suggest that Sodalis represents an evolutionary intermediate transitioning from a free-living to a mutualistic lifestyle.

Journal ArticleDOI
TL;DR: Significant differences between the strains include numerous novel mobile elements and genes encoding metabolic capabilities, strain-specific extracellular polysaccharide capsule, sporulation factors, toxins, and other secreted enzymes, providing substantial insight into this medically important bacterial pathogen.
Abstract: Clostridium perfringens is a Gram-positive, anaerobic spore-forming bacterium commonly found in soil, sediments, and the human gastrointestinal tract. C. perfringens is responsible for a wide spectrum of disease, including food poisoning, gas gangrene (clostridial myonecrosis), enteritis necroticans, and non-foodborne gastrointestinal infections. The complete genome sequences of Clostridium perfringens strain ATCC 13124, a gas gangrene isolate and the species type strain, and the enterotoxin-producing food poisoning strain SM101, were determined and compared with the published C. perfringens strain 13 genome. Comparison of the three genomes revealed considerable genomic diversity with >300 unique "genomic islands" identified, with the majority of these islands unusually clustered on one replichore. PCR-based analysis indicated that the large genomic islands are widely variable across a large collection of C. perfringens strains. These islands encode genes that correlate to differences in virulence and phenotypic characteristics of these strains. Significant differences between the strains include numerous novel mobile elements and genes encoding metabolic capabilities, strain-specific extracellular polysaccharide capsule, sporulation factors, toxins, and other secreted enzymes, providing substantial insight into this medically important bacterial pathogen.

Journal ArticleDOI
TL;DR: A genome-wide screen to further define the functional mammalian CArGome is described with the discovery of an array of cyto-contractile genes that coordinate normal cytoskeletal homeostasis and suggests one function of SRF is that of an ancient master regulator of the actin cytoskeleton.
Abstract: Serum response factor (SRF) binds a 1216-fold degenerate cis element known as the CArG box. CArG boxes are found primarily in muscle- and growth-factor-associated genes although the full spectrum of functional CArG elements in the genome (the CArGome) has yet to be defined. Here we describe a genome-wide screen to further define the functional mammalian CArGome. A computational approach involving comparative genomic analyses of human and mouse orthologous genes uncovered >100 hypothetical SRF-dependent genes, including 10 previously identified SRF targets, harboring a conserved CArG element within 4000 bp of the annotated transcription start site (TSS). We PCR-cloned 89 hypothetical SRF targets and subjected each of them to at least two of several validations including luciferase reporter, gel shift, chromatin immunoprecipitation, and mRNA expression following RNAi knockdown of SRF; 60/89 (67%) of the targets were validated. Interestingly, 26 of the validated SRF target genes encode for cytoskeletal/contractile or adhesion proteins. RNAi knockdown of SRF diminishes expression of several SRF-dependent cytoskeletal genes and elicits an attending perturbation in the cytoarchitecture of both human and rodent cells. These data illustrate the power of integrating existing algorithms to interrogate the genome in a relatively unbiased fashion for cis-regulatory element discovery. In this manner, we have further expanded the mammalian CArGome with the discovery of an array of cyto-contractile genes that coordinate normal cytoskeletal homeostasis. We suggest one function of SRF is that of an ancient master regulator of the actin cytoskeleton.

Journal ArticleDOI
TL;DR: Comparisons between the human genes and ZNF loci mined from the draft mouse, dog, and chimpanzee genomes identified 103 KRAB-ZNF genes that are conserved in mammals but also highlighted a substantial level of lineage-specific change.
Abstract: Kruppel-type zinc finger (ZNF) motifs are prevalent components of transcription factor proteins in all eukaryotes. KRAB-ZNF proteins, in which a potent repressor domain is attached to a tandem array of DNA-binding zinc-finger motifs, are specific to tetrapod vertebrates and represent the largest class of ZNF proteins in mammals. To define the full repertoire of human KRAB-ZNF proteins, we searched the genome sequence for key motifs and then constructed and manually curated gene models incorporating those sequences. The resulting gene catalog contains 423 KRAB-ZNF protein-coding loci, yielding alternative transcripts that altogether predict at least 742 structurally distinct proteins. Active rounds of segmental duplication, involving single genes or larger regions and including both tandem and distributed duplication events, have driven the expansion of this mammalian gene family. Comparisons between the human genes and ZNF loci mined from the draft mouse, dog, and chimpanzee genomes not only identified 103 KRAB-ZNF genes that are conserved in mammals but also highlighted a substantial level of lineage-specific change; at least 136 KRAB-ZNF coding genes are primate specific, including many recent duplicates. KRAB-ZNF genes are widely expressed and clustered genes are typically not coregulated, indicating that paralogs have evolved to fill roles in many different biological processes. To facilitate further study, we have developed a Web-based public resource with access to gene models, sequences, and other data, including visualization tools to provide genomic context and interaction with other public data sets.