Journal ArticleDOI

Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.


Citations
Journal ArticleDOI
TL;DR: This work generates primary data, creates bioinformatics tools and provides analysis to support the work of expert manual gene annotators and automated gene annotation pipelines to identify and characterise gene loci to the highest standard.
Abstract: The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
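
As a concrete illustration of the programmatic access routes listed above, the sketch below queries the public Ensembl REST API for a single gene record. The /lookup/symbol endpoint is part of the documented Ensembl REST service, but the gene symbol and the fields printed are illustrative choices, not anything prescribed by the GENCODE paper.

```python
# Minimal sketch: fetch gene annotation via the public Ensembl REST API.
# The /lookup/symbol endpoint is documented at rest.ensembl.org; the gene
# symbol (BRCA2) and printed fields are illustrative only.
import json
import urllib.request

def lookup_gene(symbol, species="homo_sapiens"):
    url = (f"https://rest.ensembl.org/lookup/symbol/{species}/{symbol}"
           "?content-type=application/json")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    gene = lookup_gene("BRCA2")
    # Ensembl stable ID plus the annotated coordinates of the locus
    print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
```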

2,095 citations


Cites methods from "Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly"

  • ...Over the last two years mouse annotation has been dominated by the clone-by-clone approach while the human genome has been refined entirely via targeted reannotation except for the annotation of human assembly patches and haplotypes released by the Genome Reference Consortium (15), which take a clone-by-clone approach....


Journal ArticleDOI
TL;DR: Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
Abstract: We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
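
The NG50 figures quoted above differ from the more familiar N50 in that the halfway point is taken against an assumed genome size rather than the total assembly size, so an incomplete assembly cannot inflate the statistic. A minimal sketch of the calculation, with toy contig lengths and a placeholder genome size:

```python
# Sketch of the NG50 statistic quoted above: the length of the contig at
# which the cumulative length of contigs (sorted longest-first) first
# reaches half of the *genome* size, rather than half of the assembly
# size as in N50.
def ng50(contig_lengths, genome_size=3_100_000_000):
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return None  # the assembly covers less than half the genome

# Toy example (lengths in bases); real assemblies have thousands of contigs.
print(ng50([8_000_000, 6_000_000, 5_000_000, 2_000_000],
           genome_size=30_000_000))  # -> 5000000
```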

1,425 citations

Journal ArticleDOI
TL;DR: MUMmer4 is described, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences.
Abstract: The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
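
For readers unfamiliar with the toolchain, a typical whole-genome comparison with MUMmer4 runs nucmer and then summarizes the resulting delta file with show-coords. The sketch below drives that pipeline from Python; the file names are placeholders and the flags are the commonly documented MUMmer4 options, so treat this as an assumption-laden example rather than the paper's exact workflow.

```python
# Hypothetical driver for a whole-genome alignment with MUMmer4's nucmer.
# File names are placeholders; nucmer and show-coords must be on PATH.
# nucmer writes <prefix>.delta; show-coords prints an alignment summary.
import subprocess

def align_genomes(reference, query, prefix="ref_vs_qry", threads=8):
    subprocess.run(
        ["nucmer", "-t", str(threads), "-p", prefix, reference, query],
        check=True,
    )
    coords = subprocess.run(
        ["show-coords", "-rcl", f"{prefix}.delta"],
        check=True, capture_output=True, text=True,
    )
    return coords.stdout

if __name__ == "__main__":
    # Placeholder inputs echoing the human/chimpanzee comparison above.
    print(align_genomes("human.fa", "chimp.fa")[:1000])
```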

1,131 citations

Journal ArticleDOI
TL;DR: The optimization of circular consensus sequencing (CCS) is reported to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb).
Abstract: The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels), and 95.99% for structural variants. We also generated a de novo assembly with a contig N50 >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads. High-fidelity reads improve variant detection and genome assembly on the PacBio platform.
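
The core idea behind CCS, polling many error-prone passes over the same circularized molecule to recover one accurate read, can be illustrated with a deliberately naive per-column majority vote. The real pipeline aligns subreads and applies a probabilistic consensus model, so this is only a conceptual sketch:

```python
# Simplified illustration of the circular-consensus idea described above:
# each pass over the same molecule is an error-prone read of one underlying
# sequence, and polling the passes per position suppresses random errors.
# Real CCS uses alignment and a probabilistic model, not a per-column vote.
from collections import Counter

def naive_consensus(passes):
    # Assumes the passes are already aligned to equal length
    # (a large simplification of the real subread-alignment step).
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*passes)
    )

subread_passes = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
print(naive_consensus(subread_passes))  # -> ACGTACGT
```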

876 citations

Journal ArticleDOI
01 Apr 2022-Science
TL;DR: The Telomere-to-Telomere (T2T) Consortium presented a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, which includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding.
Abstract: Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

717 citations

References
Journal ArticleDOI
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, and 245 more authors (29 institutions)
15 Feb 2001-Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations


"Evaluation of GRCh38 and de novo ha..." refers background in this paper

  • ...The human reference genome assembly remains a critical resource for the biological and clinical research communities (International Human Genome Sequencing Consortium 2004; Lander et al. 2001)....


Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genomes pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
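
The design point at the center of this abstract, keeping the per-locus analysis separate from the traversal and data-management machinery, can be sketched as a toy walker. The class name, methods, and pileup representation below are invented for illustration and are not the GATK's actual API:

```python
# Conceptual sketch of the walker pattern the GATK abstract describes:
# the framework owns data access and traversal, while a tool supplies only
# a map() over each locus and a reduce() to fold results together.
# The names and pileup representation are illustrative, not GATK's API.
from functools import reduce

class CoverageWalker:
    def map(self, locus_pileup):
        # Per-locus analysis: depth is the number of overlapping reads.
        return len(locus_pileup)

    def reduce(self, left, right):
        # Aggregation: total bases of coverage across the traversed loci.
        return left + right

def traverse(walker, pileups):
    # Stand-in for the engine's traversal over sorted reads and loci.
    return reduce(walker.reduce, (walker.map(p) for p in pileups), 0)

# Toy pileups: read names overlapping three successive loci.
print(traverse(CoverageWalker(),
               [["r1", "r2"], ["r2"], ["r1", "r2", "r3"]]))  # -> 6
```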

20,557 citations

Journal ArticleDOI
Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, and 514 more authors (90 institutions)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Posted ContentDOI
TL;DR: BWA-MEM automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment, which is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases.
Abstract: Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70 bp to a few megabases. For mapping 100 bp sequences, BWA-MEM shows better performance than several state-of-the-art read aligners to date. Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at this http URL. Contact: hengli@broadinstitute.org
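
A conventional paired-end invocation looks like the sketch below: bwa index builds the reference index once, and bwa mem streams SAM to standard output. The file names are placeholders and the wrapper is an assumed convenience, not part of BWA itself:

```python
# Typical BWA-MEM invocation wrapped in Python. "bwa index" is run once per
# reference; "bwa mem" writes SAM to stdout, redirected here into a file.
# All file names are placeholders; bwa must be on PATH.
import subprocess

def bwa_mem_paired(reference, fq1, fq2, out_sam, threads=8):
    subprocess.run(["bwa", "index", reference], check=True)
    with open(out_sam, "w") as sam:
        subprocess.run(
            ["bwa", "mem", "-t", str(threads), reference, fq1, fq2],
            check=True, stdout=sam,
        )

bwa_mem_paired("GRCh38.fa", "reads_1.fastq", "reads_2.fastq", "aln.sam")
```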

8,090 citations


"Evaluation of GRCh38 and de novo ha..." refers methods in this paper

  • ...In one analysis, we aligned reads from the Ashkenazi female sample NA24143 (Zook et al. 2016) with BWA-MEM (Li 2013) and evaluated ClinVar sites that have coverage with at least one MAPQ 20 or greater alignment in the GRCh37 and GRCh38 primary assemblies....

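
A rough reconstruction of the evaluation quoted above might look like the following: for each ClinVar site, ask whether a BAM contains at least one covering alignment with MAPQ of 20 or greater. It assumes pysam is installed and an indexed BAM exists; the file name and coordinates are placeholders, and this is not the authors' actual pipeline:

```python
# Sketch of the quoted evaluation: for each site, check whether the BAM has
# at least one alignment with MAPQ >= 20 covering it. pysam is an assumed
# dependency; the BAM path and coordinates are placeholders, and the BAM
# must be coordinate-sorted and indexed for fetch() to work.
import pysam

def covered_sites(bam_path, sites, min_mapq=20):
    covered = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for chrom, pos in sites:  # pos is 0-based
            reads = bam.fetch(chrom, pos, pos + 1)
            if any(r.mapping_quality >= min_mapq for r in reads):
                covered.append((chrom, pos))
    return covered

clinvar_sites = [("chr17", 43044294), ("chr13", 32315473)]  # illustrative
print(covered_sites("NA24143.GRCh38.bam", clinvar_sites))
```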

Journal ArticleDOI
01 Nov 2012-Nature
TL;DR: It is shown that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.
Abstract: By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

7,710 citations