Genome-wide genetic data on ~500,000 UK Biobank participants

Clare Bycroft (1,*), Colin Freeman (1,*), Desislava Petkova (1,^,*), Gavin Band (1),
Lloyd T. Elliott (2), Kevin Sharp (2), Allan Motyer (3), Damjan Vukcevic (3,4),
Olivier Delaneau (5,6,7), Jared O'Connell (8), Adrian Cortes (1,9), Samantha Welsh (10),
Gil McVean (1,11), Stephen Leslie (3,4), Peter Donnelly (1,2,†), Jonathan Marchini (2,1,†,‡)

1 Wellcome Trust Center for Human Genetics, University of Oxford, UK
2 Department of Statistics, University of Oxford, UK
3 Centre for Systems Genomics and the Schools of Mathematics and Statistics, and
  BioSciences, The University of Melbourne, Parkville, Victoria, Australia.
4 Murdoch Children’s Research Institute, Parkville, Victoria, Australia.
5 Department of Genetic Medicine and Development, University of Geneva, 1 Michel
  Servet, Geneva, CH1211, Switzerland.
6 Swiss Institute of Bioinformatics, University of Geneva, 1 Michel Servet, Geneva,
  CH1211, Switzerland.
7 Institute of Genetics and Genomics in Geneva, University of Geneva, 1 Michel
  Servet, Geneva, CH1211, Switzerland.
8 Illumina Ltd, Chesterford Research Park, Little Chesterford, Essex, CB10 1XL, United
  Kingdom.
9 Nuffield Department of Clinical Neurosciences, Division of Clinical Neurology, John
  Radcliffe Hospital, University of Oxford, Oxford OX3 9DU, United Kingdom.
10 UK Biobank, Units 1-4 Spectrum Way, Adswood, Stockport, Cheshire, SK3 0SA, UK
11 Big Data Institute, Li Ka Shing Centre for Health Information and Discovery,
   University of Oxford, Oxford OX3 7LF, United Kingdom.
^ Current address: Procter & Gamble, Brussels, Belgium

* These authors contributed equally to this work.
† These authors jointly directed this work.
‡ To whom correspondence should be addressed: marchini@stats.ox.ac.uk

bioRxiv preprint doi: https://doi.org/10.1101/166298; this version posted July 20, 2017.
The copyright holder for this preprint (which was not certified by peer review) is the
author/funder. All rights reserved. No reuse allowed without permission.

Abstract

The UK Biobank project is a large prospective cohort study of ~500,000 individuals
from across the United Kingdom, aged between 40 and 69 at recruitment. A rich variety
of phenotypic and health-related information is available on each participant,
making the resource unprecedented in its size and scope. Here we describe the
genome-wide genotype data (~805,000 markers) collected on all individuals in the
cohort and its quality control procedures. Genotype data on this scale offers novel
opportunities for assessing quality issues, although the wide range of ancestries of
the individuals in the cohort also creates particular challenges. We also conducted a
set of analyses that reveal properties of the genetic data, such as population
structure and relatedness, that can be important for downstream analyses. In
addition, we phased and imputed genotypes into the dataset, using computationally
efficient methods combined with the Haplotype Reference Consortium (HRC) and
UK10K haplotype resources. This increases the number of testable variants by over
100-fold to ~96 million variants. We also imputed classical allelic variation at 11
human leukocyte antigen (HLA) genes, and as a quality control check of this
imputation, we replicate signals of known associations between HLA alleles and
many common diseases. We describe tools that allow efficient genome-wide
association studies (GWAS) of multiple traits and fast phenome-wide association
studies (PheWAS), which work together with a new compressed file format that has
been used to distribute the dataset. As a further check of the genotyped and
imputed datasets, we performed a test-case genome-wide association scan on a
well-studied human trait, standing height.

Keywords

UK Biobank, Genotypes, Quality control, Population structure, Relatedness, Phasing,
Imputation, HLA Imputation, GWAS, PheWAS

1 Introduction
2 Results
  2.1 High quality genotype calling on novel array
  2.2 Ancestral diversity and cryptic relatedness
  2.3 Phasing and Imputation of SNPs, short indels and CNVs
  2.4 Imputation of classical HLA alleles
  2.5 GWAS for standing height
  2.6 Multiple trait GWAS and PheWAS
3 Data provision and access
4 URLs
5 Author contributions
6 Acknowledgements
7 Conflicts of Interest
8 References

1 Introduction

The UK Biobank project is a large prospective cohort study of ~500,000 individuals
from across the United Kingdom, aged between 40 and 69 at recruitment [1]. A rich
variety of phenotypic and health-related information is available on each participant,
making the resource unprecedented in its size and scope. The data contain
self-reported information, including basic demographics, diet, and exercise habits,
together with extensive physical and cognitive measurements. Other sources of
health-related information, such as medical records and cancer registers, are being
integrated and followed up over the course of the participants’ lives [2]. The baseline
information has been, and will continue to be, extended in a number of ways [3]. For
example, many blood and urine biomarkers are being measured, and medical imaging
of brain [4], heart, bones, carotid arteries and abdominal fat is being carried out on a
large subset (~100,000) of participants [5].

Understanding the role that genetics plays in phenotypic and disease variation, and
its potential interactions with other factors, provides a critical route to a better
understanding of human biology. It is anticipated that this will lead to more
successful drug development [6], and potentially to more efficient and personalised
treatments and to better diagnoses. As such, a key component of the UK Biobank
resource has been the collection of genome-wide genetic data on every participant
using a purpose-designed genotyping array [7]. An interim release of genotype data
on ~150,000 UK Biobank participants (May 2015) [8] has already facilitated
numerous studies [9]. These exploit the UK Biobank’s substantial sample size,
extensive phenotype information, and genome-wide genetic information to study
the often subtle and complex effects of genetics on human traits and disease, and its
potential interactions with other factors [10-15].

In this paper we describe the genetic dataset on the full ~500,000 participants,
together with a range of quality control procedures that have been undertaken on
the genotype data in the hope of facilitating its wider use. To achieve this we
designed and implemented a quality control (QC) pipeline that addresses challenges
specific to the experimental design, scale, and diversity of this dataset. Raw data
from the genotyping experiments will be available from UK Biobank. We also
conducted a set of analyses that reveal properties of the genetic data, such as
population structure and relatedness, that can be important for downstream
analyses. In addition, we phased and imputed genotypes into the dataset, using
computationally efficient methods combined with the Haplotype Reference
Consortium (HRC) [16] and UK10K haplotype resources [17]. This increases the
number of testable variants by over 100-fold to ~96 million variants. We also
imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as
a QC check of this imputation, we replicate signals of known associations between
HLA alleles and many common diseases. We describe tools that allow efficient
genome-wide association studies (GWAS) of multiple traits and fast phenome-wide
association studies (PheWAS), which work together with a new compressed file
format that has been used to distribute the dataset. As a further check of the
genotyped and imputed datasets, we performed a test-case genome-wide
association scan on a well-studied human trait, standing height.

2 Results
2.1 High quality genotype calling on novel array
2.1.1 Purpose-designed genotyping array
The data release contains genotypes of 488,377 UK Biobank participants. These
were assayed using two very similar genotyping arrays. A subset of 49,950
participants involved in the UK Biobank Lung Exome Variant Evaluation (UK BiLEVE)
study were genotyped using the Applied Biosystems™ UK BiLEVE Axiom™ Array by
Affymetrix¹ (807,411 markers), which is described elsewhere [15]. Following this,
438,427 participants were genotyped using the closely related Applied Biosystems™
UK Biobank Axiom™ Array (825,927 markers). Both arrays were purpose-designed
for the UK Biobank genotyping project and share 95% of marker content
[7]. The marker content of the UK Biobank Axiom™ array was chosen to capture
genome-wide genetic variation (single nucleotide polymorphisms (SNPs) and short
insertions and deletions (indels)), and is summarised in Figure 1. Many markers
were included because of known associations with, or possible roles in, phenotypic
variation, particularly disease. A notable example is the inclusion of two variants,
rs429358 and rs7412, which define the isoforms of the apolipoprotein E (APOE) gene
known to be associated with risk of Alzheimer’s disease [7] and other conditions.
Neither marker is easy to type using array technologies; as a consequence,
they have not always been assayed on earlier arrays. The array also includes coding
variants across a range of minor allele frequencies (MAFs), including rare markers
(<1% MAF), and markers that provide good genome-wide coverage for imputation in
European populations in the common (>5%) and low frequency (1-5%) MAF ranges.
Further details of the array design are in [7].
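
To illustrate how the two APOE variants mentioned above jointly define an isoform, the following is a minimal sketch rather than anything taken from the UK Biobank pipeline. It assumes forward-strand allele coding (rs429358 T/C, rs7412 C/T) and phased haplotypes as input; the function names and the plain-text labels e1 to e4 (for ε1 to ε4) are hypothetical choices made for this example.

```python
"""Sketch: label APOE isoforms from alleles at rs429358 and rs7412.

Assumptions (not from the paper): alleles are reported on the forward
strand, and each haplotype is given as (rs429358 allele, rs7412 allele).
"""

# Haplotype -> isoform label. e1 is listed for completeness; it is very rare.
APOE_ISOFORM_BY_HAPLOTYPE = {
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
    ("C", "T"): "e1",
}


def apoe_isoform(rs429358_allele, rs7412_allele):
    """Return the isoform label for a single haplotype."""
    key = (rs429358_allele.upper(), rs7412_allele.upper())
    if key not in APOE_ISOFORM_BY_HAPLOTYPE:
        raise ValueError(f"unexpected alleles for rs429358/rs7412: {key}")
    return APOE_ISOFORM_BY_HAPLOTYPE[key]


def apoe_diplotype(haplotype_a, haplotype_b):
    """Return a sorted label such as 'e3/e4' for a phased haplotype pair."""
    labels = sorted(apoe_isoform(*hap) for hap in (haplotype_a, haplotype_b))
    return "/".join(labels)


if __name__ == "__main__":
    # One e3 haplotype (T, C) and one e4 haplotype (C, C).
    print(apoe_diplotype(("T", "C"), ("C", "C")))
```

The sketch takes phased haplotypes because unphased genotypes that are heterozygous at both variants are ambiguous between e2/e4 and e1/e3; in practice the former is usually assumed because e1 is so rare.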

¹ Now part of Thermo Fisher Scientific.