
Genome-wide genetic data on ~500,000 UK Biobank participants

Abstract
The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged between 40-69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Here we describe the genome-wide genotype data (~805,000 markers) collected on all individuals in the cohort and its quality control procedures. Genotype data on this scale offers novel opportunities for assessing quality issues, although the wide range of ancestries of the individuals in the cohort also creates particular challenges. We also conducted a set of analyses that reveal properties of the genetic data – such as population structure and relatedness – that can be important for downstream analyses. In addition, we phased and imputed genotypes into the dataset, using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resource. This increases the number of testable variants by over 100-fold to ~96 million variants. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as a quality control check of this imputation, we replicate signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies (PheWAS), which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.


Clare Bycroft(1,*), Colin Freeman(1,*), Desislava Petkova(1,^,*), Gavin Band(1), Lloyd T. Elliott(2), Kevin Sharp(2), Allan Motyer(3), Damjan Vukcevic(3,4), Olivier Delaneau(5,6,7), Jared O'Connell(8), Adrian Cortes(1,9), Samantha Welsh(10), Gil McVean(1,11), Stephen Leslie(3,4), Peter Donnelly(1,2,†), Jonathan Marchini(2,1,†,‡)

1 Wellcome Trust Center for Human Genetics, University of Oxford, UK
2 Department of Statistics, University of Oxford, UK
3 Centre for Systems Genomics and the Schools of Mathematics and Statistics, and BioSciences, The University of Melbourne, Parkville, Victoria, Australia.
4 Murdoch Children’s Research Institute, Parkville, Victoria, Australia.
5 Department of Genetic Medicine and Development, University of Geneva, 1 Michel Servet, Geneva, CH1211, Switzerland.
6 Swiss Institute of Bioinformatics, University of Geneva, 1 Michel Servet, Geneva, CH1211, Switzerland.
7 Institute of Genetics and Genomics in Geneva, University of Geneva, 1 Michel Servet, Geneva, CH1211, Switzerland.
8 Illumina Ltd, Chesterford Research Park, Little Chesterford, Essex, CB10 1XL, United Kingdom.
9 Nuffield Department of Clinical Neurosciences, Division of Clinical Neurology, John Radcliffe Hospital, University of Oxford, Oxford OX3 9DU, United Kingdom.
10 UK Biobank, Units 1-4 Spectrum Way, Adswood, Stockport, Cheshire, SK3 0SA, UK
11 Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom.
^ Current address: Procter & Gamble, Brussels, Belgium

* These authors contributed equally to this work.
† These authors jointly directed this work.
‡ To whom correspondence should be addressed: marchini@stats.ox.ac.uk

bioRxiv preprint doi: https://doi.org/10.1101/166298; this version posted July 20, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Keywords

UK Biobank, Genotypes, Quality control, Population structure, Relatedness, Phasing, Imputation, HLA Imputation, GWAS, PheWAS

1 Introduction
2 Results
    2.1 High quality genotype calling on novel array
    2.2 Ancestral diversity and cryptic relatedness
    2.3 Phasing and Imputation of SNPs, short indels and CNVs
    2.4 Imputation of classical HLA alleles
    2.5 GWAS for standing height
    2.6 Multiple trait GWAS and PheWAS
3 Data provision and access
4 URLs
5 Author contributions
6 Acknowledgements
7 Conflicts of Interest
8 References

1 Introduction
The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment [1]. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. The data contain self-reported information, including basic demographics, diet, and exercise habits, as well as extensive physical and cognitive measurements; other sources of health-related information, such as medical records and cancer registers, are being integrated and followed up over the course of the participants’ lives [2]. The baseline information has been, and will continue to be, extended in a number of ways [3]. For example, many blood and urine biomarkers are being measured, and medical imaging of the brain [4], heart, bones, carotid arteries and abdominal fat is being carried out on a large subset (~100,000) of participants [5].

Understanding the role that genetics plays in phenotypic and disease variation, and its potential interactions with other factors, provides a critical route to a better understanding of human biology. It is anticipated that this will lead to more successful drug development [6], and potentially to more efficient and personalised treatments and to better diagnoses. As such, a key component of the UK Biobank resource has been the collection of genome-wide genetic data on every participant using a purpose-designed genotyping array [7]. An interim release of genotype data on ~150,000 UK Biobank participants (May 2015) [8] has already facilitated numerous studies [9]. These exploit the UK Biobank’s substantial sample size, extensive phenotype information, and genome-wide genetic information to study the often subtle and complex effects of genetics on human traits and disease, and its potential interactions with other factors [10-15].

In this paper we describe the genetic dataset on the full ~500,000 participants, together with a range of quality control procedures that have been undertaken on the genotype data in the hope of facilitating its wider use. To achieve this we designed and implemented a quality control (QC) pipeline that addresses challenges specific to the experimental design, scale, and diversity of this dataset. Raw data from the genotyping experiments will be available from UK Biobank. We also conducted a set of analyses that reveal properties of the genetic data, such as population structure and relatedness, that can be important for downstream analyses. In addition, we phased and imputed genotypes into the dataset, using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) [16] and UK10K haplotype resources [17]. This increases the number of testable variants by over 100-fold to ~96 million variants. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as a QC check of this imputation, we replicate signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies (PheWAS), which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.
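
As a rough sanity check on the "over 100-fold" figure quoted above, the ratio can be computed directly from the two rounded counts given in the text (~805,000 genotyped markers and ~96 million imputed variants). This is only an illustrative sketch: both inputs are approximations, and the variable names are our own.

```python
# Rounded counts quoted in the text; both are approximate.
genotyped_markers = 805_000      # directly genotyped markers (~805,000)
imputed_variants = 96_000_000    # variants available after imputation (~96 million)

# Fold increase in the number of testable variants.
fold_increase = imputed_variants / genotyped_markers

print(f"Approximate fold increase: {fold_increase:.0f}x")
```

With these rounded inputs the ratio comes out a little above 100, consistent with the statement in the text.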
2 Results
2.1 High quality genotype calling on novel array
2.1.1 Purpose-designed genotyping array
The data release contains genotypes of 488,377 UK Biobank participants. These were assayed using two very similar genotyping arrays. A subset of 49,950 participants involved in the UK Biobank Lung Exome Variant Evaluation (UK BiLEVE) study were genotyped using the Applied Biosystems™ UK BiLEVE Axiom™ Array by Affymetrix (now part of Thermo Fisher Scientific), which has 807,411 markers and is described elsewhere [15]. Following this, 438,427 participants were genotyped using the closely related Applied Biosystems™ UK Biobank Axiom™ Array (825,927 markers). Both arrays were purpose-designed for the UK Biobank genotyping project and share 95% of marker content [7]. The marker content of the UK Biobank Axiom™ array was chosen to capture genome-wide genetic variation (single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels)), and is summarised in Figure 1. Many markers were included because of known associations with, or possible roles in, phenotypic variation, particularly disease. A notable example is the inclusion of two variants, rs429358 and rs7412, which define the isoforms of the apolipoprotein E (APOE) gene known to be associated with risk of Alzheimer’s disease [7] and other conditions. Neither marker is easy to type using array technologies, so they have not always been assayed on earlier arrays. The array also includes coding variants across a range of minor allele frequencies (MAFs), including rare markers (<1% MAF), and markers that provide good genome-wide coverage for imputation in European populations in the common (>5%) and low frequency (1-5%) MAF ranges. Further details of the array design are in [7].
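
The frequency bands and participant counts described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the authors' pipeline: the function name `maf_class` is hypothetical, and treating the 1% and 5% boundaries as belonging to the low frequency band is our assumption, since the text only gives the ranges <1%, 1-5% and >5%.

```python
def maf_class(maf: float) -> str:
    """Assign a minor allele frequency to one of the bands named in the text.

    Boundary handling is an assumption: exactly 1% and exactly 5% are
    placed in the "low frequency" band.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie in [0, 0.5]")
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "low frequency"
    return "common"


# Participant counts per array, as reported in the text.
uk_bileve_participants = 49_950
uk_biobank_axiom_participants = 438_427

# The two arrays together should account for every genotyped participant.
total_participants = uk_bileve_participants + uk_biobank_axiom_participants

for example_maf in (0.002, 0.03, 0.25):
    print(f"MAF {example_maf:.3f}: {maf_class(example_maf)}")
print(f"Total genotyped participants: {total_participants:,}")
```

The sum of the two per-array counts matches the 488,377 participants stated for the data release.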