Author
Rachel Maupin
Bio: Rachel Maupin is an academic researcher from Washington University in St. Louis. The author has contributed to research in topics: Chromosome 20 & Sequence analysis. The author has an h-index of 5 and has co-authored 6 publications receiving 2,562 citations.
Papers
••
TL;DR: The male-specific region of the Y chromosome, the MSY, differentiates the sexes and comprises 95% of the chromosome's length; it is a mosaic of heterochromatic sequences and three classes of euchromatic sequences: X-transposed, X-degenerate and ampliconic.
Abstract: The male-specific region of the Y chromosome, the MSY, differentiates the sexes and comprises 95% of the chromosome's length. Here, we report that the MSY is a mosaic of heterochromatic sequences and three classes of euchromatic sequences: X-transposed, X-degenerate and ampliconic. These classes contain all 156 known transcription units, which include 78 protein-coding genes that collectively encode 27 distinct proteins. The X-transposed sequences exhibit 99% identity to the X chromosome. The X-degenerate sequences are remnants of ancient autosomes from which the modern X and Y chromosomes evolved. The ampliconic class includes large regions (about 30% of the MSY euchromatin) where sequence pairs show greater than 99.9% identity, which is maintained by frequent gene conversion (non-reciprocal transfer). The most prominent features here are eight massive palindromes, at least six of which contain testis genes.
2,022 citations
••
TL;DR: The euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far, has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence.
Abstract: Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.
244 citations
••
TL;DR: The use of an unbiased high-resolution genomic screen identified many genes not previously implicated in AML that may be relevant for pathogenesis, along with many known oncogenes and tumor suppressor genes.
Abstract: Cytogenetic analysis of acute myeloid leukemia (AML) cells has accelerated the identification of genes important for AML pathogenesis. To complement cytogenetic studies and to identify genes altered in AML genomes, we performed genome-wide copy number analysis with paired normal and tumor DNA obtained from 86 adult patients with de novo AML using 1.85 million feature SNP arrays. Acquired copy number alterations (CNAs) were confirmed using an ultra-dense array comparative genomic hybridization platform. A total of 201 somatic CNAs were found in the 86 AML genomes (mean, 2.34 CNAs per genome), with French-American-British system M6 and M7 genomes containing the most changes (10–29 CNAs per genome). Twenty-four percent of AML patients with normal cytogenetics had CNA, whereas 40% of patients with an abnormal karyotype had additional CNA detected by SNP array, and several CNA regions were recurrent. The mRNA expression levels of 57 genes were significantly altered in 27 of 50 recurrent CNA regions <5 megabases in size. A total of 8 uniparental disomy (UPD) segments were identified in the 86 genomes; 6 of 8 UPD calls occurred in samples with a normal karyotype. Collectively, 34 of 86 AML genomes (40%) contained alterations not found with cytogenetics, and 98% of these regions contained genes. Of 86 genomes, 43 (50%) had no CNA or UPD at this level of resolution. In this study of 86 adult AML genomes, the use of an unbiased high-resolution genomic screen identified many genes not previously implicated in AML that may be relevant for pathogenesis, along with many known oncogenes and tumor suppressor genes.
241 citations
••
TL;DR: Extensive analyses confirm the underlying construction of the sequence, and expand the understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
Abstract: Human chromosome 2 is unique to the human lineage in being the product of a head-to-head fusion of two intermediate-sized ancestral chromosomes. Chromosome 4 has received attention primarily related to the search for the Huntington's disease gene, but also for genes associated with Wolf-Hirschhorn syndrome, polycystic kidney disease and a form of muscular dystrophy. Here we present approximately 237 million base pairs of sequence for chromosome 2, and 186 million base pairs for chromosome 4, representing more than 99.6% of their euchromatic sequences. Our initial analyses have identified 1,346 protein-coding genes and 1,239 pseudogenes on chromosome 2, and 796 protein-coding genes and 778 pseudogenes on chromosome 4. Extensive analyses confirm the underlying construction of the sequence, and expand our understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
107 citations
••
TL;DR: The generated sequence reveals the precise architecture of genes residing near CFTR/Cftr, including one known gene (WNT2/Wnt2) and two previously unknown genes that immediately flank CFTR or Cftr.
Abstract: The identification of the cystic fibrosis transmembrane conductance regulator gene (CFTR) in 1989 represents a landmark accomplishment in human genetics. Since that time, there have been numerous advances in elucidating the function of the encoded protein and the physiological basis of cystic fibrosis. However, numerous areas of cystic fibrosis biology require additional investigation, some of which would be facilitated by information about the long-range sequence context of the CFTR gene. For example, the latter might provide clues about the sequence elements responsible for the temporal and spatial regulation of CFTR expression. We thus sought to establish the sequence of the chromosomal segments encompassing the human CFTR and mouse Cftr genes, with the hope of identifying conserved regions of biologic interest by sequence comparison. Bacterial clone-based physical maps of the relevant human and mouse genomic regions were constructed, and minimally overlapping sets of clones were selected and sequenced, eventually yielding ≈1.6 Mb and ≈358 kb of contiguous human and mouse sequence, respectively. These efforts have produced the complete sequence of the ≈189-kb and ≈152-kb segments containing the human CFTR and mouse Cftr genes, respectively, as well as significant amounts of flanking DNA. Analyses of the resulting data provide insights about the organization of the CFTR/Cftr genes and potential sequence elements regulating their expression. Furthermore, the generated sequence reveals the precise architecture of genes residing near CFTR/Cftr, including one known gene (WNT2/Wnt2) and two previously unknown genes that immediately flank CFTR/Cftr.
81 citations
Cited by
••
TL;DR: The results of an international collaboration to produce a high-quality draft sequence of the mouse genome are reported, and an initial comparative analysis of the mouse and human genomes is presented, describing some of the insights that can be gleaned from the two sequences.
Abstract: The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
6,643 citations
••
TL;DR: This work introduces Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner and constitutes a starting point to build pathway-centric models of biology.
Abstract: Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state of the art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.
6,125 citations
••
TL;DR: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments, and the voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.
Abstract: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well as or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.
4,475 citations
••
TL;DR: The current human genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps and is accurate to an error rate of approximately 1 event per 100,000 bases.
Abstract: The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
3,989 citations
••
Washington University in St. Louis, Brown University, University of British Columbia, University of North Carolina at Chapel Hill, University of Southern California, Massachusetts Institute of Technology, Seattle Cancer Care Alliance, Johns Hopkins University, University of Texas MD Anderson Cancer Center, Nationwide Children's Hospital, National Institutes of Health, SRA International, Temple University, University of Chicago, University of Pennsylvania
TL;DR: It is found that a complex interplay of genetic events contributes to AML pathogenesis in individual patients, and the databases from this study are widely available to serve as a foundation for further investigations of AML pathogenesis, classification, and risk stratification.
Abstract: BACKGROUND: Many mutations that contribute to the pathogenesis of acute myeloid leukemia (AML) are undefined. The relationships between patterns of mutations and epigenetic phenotypes are not yet clear. METHODS: We analyzed the genomes of 200 clinically annotated adult cases of de novo AML, using either whole-genome sequencing (50 cases) or whole-exome sequencing (150 cases), along with RNA and microRNA sequencing and DNA-methylation analysis. RESULTS: AML genomes have fewer mutations than most other adult cancers, with an average of only 13 mutations found in genes. Of these, an average of 5 are in genes that are recurrently mutated in AML. A total of 23 genes were significantly mutated, and another 237 were mutated in two or more samples. Nearly all samples had at least 1 nonsynonymous mutation in one of nine categories of genes that are almost certainly relevant for pathogenesis, including transcription-factor fusions (18% of cases), the gene encoding nucleophosmin (NPM1) (27%), tumor-suppressor genes (16%), DNA-methylation–related genes (44%), signaling genes (59%), chromatin-modifying genes (30%), myeloid transcription-factor genes (22%), cohesin-complex genes (13%), and spliceosome-complex genes (14%). Patterns of cooperation and mutual exclusivity suggested strong biologic relationships among several of the genes and categories. CONCLUSIONS: We identified at least one potential driver mutation in nearly all AML samples and found that a complex interplay of genetic events contributes to AML pathogenesis in individual patients. The databases from this study are widely available to serve as a foundation for further investigations of AML pathogenesis, classification, and risk stratification. (Funded by the National Institutes of Health.)
The molecular pathogenesis of acute myeloid leukemia (AML) has been studied with the use of cytogenetic analysis for more than three decades. Recurrent chromosomal structural variations are well established as diagnostic and prognostic markers, suggesting that acquired genetic abnormalities (i.e., somatic mutations) have an essential role in pathogenesis [1,2]. However, nearly 50% of AML samples have a normal karyotype, and many of these genomes lack structural abnormalities, even when assessed with high-density comparative genomic hybridization or single-nucleotide polymorphism (SNP) arrays [3-5] (see Glossary). Targeted sequencing has identified recurrent mutations in FLT3, NPM1, KIT, CEBPA, and TET2 [6-8]. Massively parallel sequencing enabled the discovery of recurrent mutations in DNMT3A [9,10] and IDH1 [11]. Recent studies have shown that many patients with
3,980 citations