Author

Hui Du

Bio: Hui Du is an academic researcher from Washington University in St. Louis. The author has contributed to research in topics: Biology & Medicine. The author has an h-index of 4 and has co-authored 4 publications receiving 2,651 citations.
Topics: Biology, Medicine, Genome, Gene, Chromosome 4

Papers
Journal Article•DOI•
19 Jun 2003-Nature
TL;DR: The male-specific region of the Y chromosome, the MSY, differentiates the sexes and comprises 95% of the chromosome's length; it is a mosaic of heterochromatic sequences and three classes of euchromatic sequences: X-transposed, X-degenerate and ampliconic.
Abstract: The male-specific region of the Y chromosome, the MSY, differentiates the sexes and comprises 95% of the chromosome's length. Here, we report that the MSY is a mosaic of heterochromatic sequences and three classes of euchromatic sequences: X-transposed, X-degenerate and ampliconic. These classes contain all 156 known transcription units, which include 78 protein-coding genes that collectively encode 27 distinct proteins. The X-transposed sequences exhibit 99% identity to the X chromosome. The X-degenerate sequences are remnants of ancient autosomes from which the modern X and Y chromosomes evolved. The ampliconic class includes large regions (about 30% of the MSY euchromatin) where sequence pairs show greater than 99.9% identity, which is maintained by frequent gene conversion (non-reciprocal transfer). The most prominent features here are eight massive palindromes, at least six of which contain testis genes.

2,022 citations

Journal Article•DOI•
Klaus F. X. Mayer1, C. Schüller1, R. Wambutt, George Murphy2  +230 more•Institutions (21)
16 Dec 1999-Nature
TL;DR: Analysis of 17.38 megabases of unique sequence, representing about 17% of the Arabidopsis genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements.
Abstract: The higher plant Arabidopsis thaliana (Arabidopsis) is an important model for identifying plant genes and determining their function. To assist biological investigations and to define chromosome structure, a coordinated effort to sequence the Arabidopsis genome was initiated in late 1996. Here we report one of the first milestones of this project, the sequence of chromosome 4. Analysis of 17.38 megabases of unique sequence, representing about 17% of the genome, reveals 3,744 protein coding genes, 81 transfer RNAs and numerous repeat elements. Heterochromatic regions surrounding the putative centromere, which has not yet been completely sequenced, are characterized by an increased frequency of a variety of repeats, new repeats, reduced recombination, lowered gene density and lowered gene expression. Roughly 60% of the predicted protein-coding genes have been functionally characterized on the basis of their homology to known genes. Many genes encode predicted proteins that are homologous to human and Caenorhabditis elegans proteins.

411 citations

Journal Article•DOI•
LaDeana W. Hillier1, Robert S. Fulton1, Lucinda Fulton1, Tina Graves1, Kymberlie H. Pepin1, Caryn Wagner-McPherson1, Dan Layman1, Jason Maas1, Sara Jaeger1, Rebecca S. Walker1, Kristine M. Wylie1, Mandeep Sekhon1, Michael C. Becker1, Michelle O'Laughlin1, Mark E. Schaller1, Ginger A. Fewell1, Kimberly D. Delehaunty1, Tracie L. Miner1, William E. Nash1, Matt Cordes1, Hui Du1, Hui Sun1, Jennifer Edwards1, Holland Bradshaw-Cordum1, Johar Ali1, Stephanie Andrews1, Amber Isak1, Andrew Vanbrunt1, Christine Nguyen1, Feiyu Du1, Betty Lamar1, Laura Courtney1, Joelle Kalicki1, Philip Ozersky1, Lauren Bielicki1, Kelsi Scott1, Andrea Holmes1, Richard Harkins1, Anthony R. Harris1, Cindy Strong1, Shunfang Hou1, Chad Tomlinson1, Sara Dauphin-Kohlberg1, Amy Kozlowicz-Reilly1, Shawn Leonard1, Theresa Rohlfing1, Susan M. Rock1, Aye-Mon Tin-Wollam1, Amanda Abbott1, Patrick Minx1, Rachel Maupin1, Catrina Strowmatt1, Phil Latreille1, Nancy Miller1, Doug Johnson1, Jennifer Murray1, Jeffrey Woessner1, Michael C. Wendl1, Shiaw-Pyng Yang1, Brian Schultz1, John W. Wallis1, John Spieth1, Tamberlyn Bieri1, Joanne O. Nelson1, Nicolas Berkowicz1, Patricia Wohldmann1, Lisa Cook1, Matthew T. Hickenbotham1, James M. Eldred1, Donald Williams1, Joseph A. Bedell1, Elaine R. Mardis1, Sandra W. Clifton1, Stephanie L. Chissoe1, Marco A. Marra1, Marco A. Marra2, Christopher K. Raymond3, Eric Haugen3, Will Gillett3, Yang Zhou3, R. James3, Karen A. Phelps3, Shawn Iadanoto3, Kerry L. Bubb3, Elizabeth Simms3, Ruth Levy3, James B. Clendenning3, Rajinder Kaul3, W. James Kent4, Terrence S. Furey4, Robert Baertsch4, Michael R. Brent1, Evan Keibler1, Paul Flicek1, Peer Bork5, Mikita Suyama5, Jeffrey A. Bailey6, Matthew E. Portnoy7, David Torrents5, Asif T. Chinwalla1, Warren Gish1, Sean R. Eddy1, John Douglas Mcpherson8, John Douglas Mcpherson1, Maynard V. Olson3, Evan E. Eichler6, Eric D. Green7, Robert H. Waterston3, Robert H. Waterston1, Richard K. Wilson1 •
10 Jul 2003-Nature
TL;DR: The euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far, has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence.
Abstract: Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.

244 citations

Journal Article•DOI•
LaDeana W. Hillier1, Tina Graves1, Robert S. Fulton1, Lucinda Fulton1, Kymberlie H. Pepin1, Patrick Minx1, Caryn Wagner-McPherson1, Dan Layman1, Kristine M. Wylie1, Mandeep Sekhon1, Michael C. Becker1, Ginger A. Fewell1, Kimberly D. Delehaunty1, Tracie L. Miner1, William E. Nash1, Colin Kremitzki1, Lachlan G. Oddy1, Hui Du1, Hui Sun1, Holland Bradshaw-Cordum1, Johar Ali1, Jason Carter1, Matt Cordes1, Anthony R. Harris1, Amber Isak1, Andrew Van Brunt1, Christine Nguyen1, Feiyu Du1, Laura Courtney1, Joelle Kalicki1, Philip Ozersky1, Scott Abbott1, Jon R. Armstrong1, Edward A. Belter1, Lauren Caruso1, Maria Cedroni1, Marc Cotton1, Teresa Davidson1, Anu Desai1, Glendoria Elliott1, Thomas Erb1, Catrina Fronick1, Tony Gaige1, William Haakenson1, Krista Haglund1, Andrea Holmes1, Richard Harkins1, Kyung Kim1, Scott Kruchowski1, Cindy Strong1, Neenu Grewal1, Ernest Goyea1, Shunfang Hou1, Andrew Levy1, Scott Martinka1, Kelly Mead1, Michael D. McLellan1, Rick Meyer1, Jennifer Randall-Maher1, Chad Tomlinson1, Sara Dauphin-Kohlberg1, Amy Kozlowicz-Reilly1, Neha Shah1, Sharhonda Swearengen-Shahid1, Jacqueline E. Snider1, Joseph T. Strong1, Johanna Thompson1, Martin Yoakum1, Shawn Leonard1, Charlene Pearman1, Lee Trani1, Maxim Radionenko1, Jason Waligorski1, Chunyan Wang1, Susan M. Rock1, Aye Mon Tin-Wollam1, Rachel Maupin1, Phil Latreille1, Michael C. Wendl1, Shiaw Pyng Yang1, Craig Pohl1, John W. Wallis1, John Spieth1, Tamberlyn Bieri1, Nicolas Berkowicz1, Joanne O. Nelson1, John R. Osborne1, Li Ding1, Rekha Meyer1, Aniko Sabo1, Yoram Shotland1, Prashant R. Sinha1, Patricia Wohldmann1, Lisa Cook1, Matthew T. Hickenbotham1, James M. Eldred1, Donald Williams1, Thomas A. Jones1, Xinwei She2, Francesca D. Ciccarelli, Elisa Izaurralde, James Taylor3, Jeremy Schmutz4, Richard M. Myers4, David R. Cox4, Xiaoqiu Huang5, John Douglas Mcpherson1, John Douglas Mcpherson6, Elaine R. Mardis1, Sandra W. Clifton1, Wesley C. Warren1, Asif T. Chinwalla1, Sean R. Eddy1, Marco A. Marra7, Marco A. Marra1, Ivan Ovcharenko8, Terrence S. Furey9, Webb Miller3, Evan E. Eichler2, Peer Bork, Mikita Suyama, David Torrents, Robert H. Waterston2, Robert H. Waterston1, Richard K. Wilson1 •
07 Apr 2005-Nature
TL;DR: Extensive analyses confirm the underlying construction of the sequence, and expand the understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
Abstract: Human chromosome 2 is unique to the human lineage in being the product of a head-to-head fusion of two intermediate-sized ancestral chromosomes. Chromosome 4 has received attention primarily related to the search for the Huntington's disease gene, but also for genes associated with Wolf-Hirschhorn syndrome, polycystic kidney disease and a form of muscular dystrophy. Here we present approximately 237 million base pairs of sequence for chromosome 2, and 186 million base pairs for chromosome 4, representing more than 99.6% of their euchromatic sequences. Our initial analyses have identified 1,346 protein-coding genes and 1,239 pseudogenes on chromosome 2, and 796 protein-coding genes and 778 pseudogenes on chromosome 4. Extensive analyses confirm the underlying construction of the sequence, and expand our understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.

107 citations

Journal Article•DOI•
TL;DR: Pan-genomics can encompass most of the genetic diversity of a species or population and has proved to be a powerful tool for studying genomic evolution and the origin and domestication of species, and for providing information for plant improvement.

11 citations


Cited by
Journal Article•DOI•
14 Dec 2000-Nature
TL;DR: This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.
Abstract: The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the 125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000 families, similar to the functional diversity of Drosophila and Caenorhabditis elegans--the other sequenced multicellular eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

8,742 citations

Journal Article•DOI•
TL;DR: This work introduces Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner and constitutes a starting point to build pathway-centric models of biology.
Abstract: Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state-of-the-art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.

6,125 citations

Journal Article•DOI•
TL;DR: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments, and the voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.
Abstract: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.

4,475 citations

Journal Article•DOI•
TL;DR: A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed and it is estimated that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%.

4,268 citations

Journal Article•DOI•
21 Oct 2004-Nature
TL;DR: The current human genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps and is accurate to an error rate of approximately 1 event per 100,000 bases.
Abstract: The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

3,989 citations