
Exploration of haplotype research consortium imputation for genome-wide association studies in 20,032 Generation Scotland participants

Abstract
The Generation Scotland: Scottish Family Health Study (GS:SFHS) is a family-based population cohort with DNA, biological samples, socio-demographic, psychological and clinical data from approximately 24,000 adult volunteers across Scotland. Although data collection was cross-sectional, GS:SFHS became a prospective cohort due to the ability to link to routine Electronic Health Record (EHR) data. Over 20,000 participants were selected for genotyping using a large genome-wide array. GS:SFHS was analysed using genome-wide association studies (GWAS) to test the effects of a large spectrum of variants, imputed using the Haplotype Research Consortium (HRC) dataset, on medically relevant traits measured directly or obtained from EHRs. The HRC dataset is the largest available haplotype reference panel for imputation of variants in populations of European ancestry and allows investigation of variants with low minor allele frequencies within the entire GS:SFHS genotyped cohort. Genome-wide associations were run on 20,032 individuals using both genotyped and HRC imputed data. We present results for a range of well-studied quantitative traits obtained from clinic visits and for serum urate measures obtained from data linkage to EHRs collected by the Scottish National Health Service. Results replicated known associations and additionally revealed novel findings, mainly with rare variants, validating the use of the HRC imputation panel. For example, we identified two new associations with fasting glucose at variants near to Y_RNA and WDR4 and four new associations with heart rate at SNPs within CSMD1 and ASPH, upstream of HTR1F and between PROKR2 and GPCPD1. All were driven by rare variants (minor allele frequencies in the range of 0.08–1%). Proof of principle for use of EHRs was verification of the highly significant association of urate levels with the well-established urate transporter SLC2A9.
GS:SFHS provides genetic data on over 20,000 participants alongside a range of phenotypes as well as linkage to National Health Service laboratory and clinical records. We have shown that the combination of deeper genotype imputation and extended phenotype availability makes GS:SFHS an attractive resource to carry out association studies to gain insight into the genetic architecture of complex traits.


University of Dundee
Exploration of haplotype research consortium imputation for genome-wide association
studies in 20,032 Generation Scotland participants
Nagy, Reka; Boutin, Thibaud S.; Marten, Jonathan; Huffman, Jennifer E.; Kerr, Shona M.;
Campbell, Archie
Published in:
Genome Medicine
DOI:
10.1186/s13073-017-0414-4
Publication date:
2017
Licence:
CC BY
Document Version
Publisher's PDF, also known as Version of record
Link to publication in Discovery Research Portal
Citation for published version (APA):
Nagy, R., Boutin, T. S., Marten, J., Huffman, J. E., Kerr, S. M., Campbell, A., Evenden, L., Gibson, J., Amador,
C., Howard, D. M., Navarro, P., Morris, A., Deary, I. J., Hocking, L. J., Padmanabhan, S., Smith, B. H., Joshi, P.,
Wilson, J. F., Hastie, N. D., ... Hayward, C. (2017). Exploration of haplotype research consortium imputation for
genome-wide association studies in 20,032 Generation Scotland participants. Genome Medicine, 9, 1-14. [23].
https://doi.org/10.1186/s13073-017-0414-4

RESEARCH Open Access
Exploration of haplotype research
consortium imputation for genome-wide
association studies in 20,032 Generation
Scotland participants
Reka Nagy¹, Thibaud S. Boutin¹, Jonathan Marten¹, Jennifer E. Huffman¹, Shona M. Kerr¹, Archie Campbell²,
Louise Evenden³, Jude Gibson³, Carmen Amador¹, David M. Howard⁴, Pau Navarro¹, Andrew Morris⁵, Ian J. Deary⁶,
Lynne J. Hocking⁷, Sandosh Padmanabhan⁸, Blair H. Smith⁹, Peter Joshi¹⁰, James F. Wilson¹⁰, Nicholas D. Hastie¹,
Alan F. Wright¹, Andrew M. McIntosh⁴,⁶, David J. Porteous²,⁶, Chris S. Haley¹, Veronique Vitart¹ and Caroline Hayward¹*
Abstract
Background: The Generation Scotland: Scottish Family Health Study (GS:SFHS) is a family-based population cohort
with DNA, biological samples, socio-demographic, psychological and clinical data from approximately 24,000 adult
volunteers across Scotland. Although data collection was cross-sectional, GS:SFHS became a prospective cohort due
to the ability to link to routine Electronic Health Record (EHR) data. Over 20,000 participants were selected for
genotyping using a large genome-wide array.
Methods: GS:SFHS was analysed using genome-wide association studies (GWAS) to test the effects of a large spectrum
of variants, imputed using the Haplotype Research Consortium (HRC) dataset, on medically relevant traits measured
directly or obtained from EHRs. The HRC dataset is the largest available haplotype reference panel for imputation of
variants in populations of European ancestry and allows investigation of variants with low minor allele frequencies
within the entire GS:SFHS genotyped cohort.
Results: Genome-wide associations were run on 20,032 individuals using both genotyped and HRC imputed data. We
present results for a range of well-studied quantitative traits obtained from clinic visits and for serum urate measures
obtained from data linkage to EHRs collected by the Scottish National Health Service. Results replicated known
associations and additionally revealed novel findings, mainly with rare variants, validating the use of the HRC imputation
panel. For example, we identified two new associations with fasting glucose at variants near to Y_RNA and WDR4 and
four new associations with heart rate at SNPs within CSMD1 and ASPH, upstream of HTR1F and between PROKR2 and
GPCPD1. All were driven by rare variants (minor allele frequencies in the range of 0.08–1%). Proof of principle
for use of EHRs was verification of the highly significant association of urate levels with the well-established
urate transporter SLC2A9.
* Correspondence: caroline.hayward@igmm.ed.ac.uk
¹ MRC Human Genetics Unit, University of Edinburgh, Institute of Genetics
and Molecular Medicine, Western General Hospital, Crewe Road, Edinburgh
EH4 2XU, UK
Full list of author information is available at the end of the article
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Nagy et al. Genome Medicine (2017) 9:23
DOI 10.1186/s13073-017-0414-4

Conclusions: GS:SFHS provides genetic data on over 20,000 participants alongside a range of phenotypes as
well as linkage to National Health Service laboratory and clinical records. We have shown that the combination of
deeper genotype imputation and extended phenotype availability makes GS:SFHS an attractive resource to carry out
association studies to gain insight into the genetic architecture of complex traits.
Keywords: Genome-wide association studies (GWAS), Electronic health records, Imputation, Quantitative trait, Genetics,
Urate, Heart rate, Glucose, Haplotype Research Consortium (HRC)
Background
Generation Scotland is a multi-institution collaboration
that has created an ethically sound, family-based and
population-based resource for identifying the genetic
basis of common complex diseases [1–3]. The Scottish
Family Health Study component (GS:SFHS) has DNA and
sociodemographic, psychological and clinical data from
~24,000 adult volunteers from across Scotland. The ethni-
city of the cohort is 99% Caucasian, with 96% born in the
UK and 87% in Scotland. Features of GS:SFHS include the
family-based recruitment, breadth and depth of phenotype
information, broad consent from participants to use their
data and samples for a wide range of medical research and
for re-contact, and consent and mechanisms for linkage of
all data to comprehensive routine healthcare records.
These features were designed to maximise the power of
the resource to identify, replicate or control for genetic
factors associated with a wide spectrum of illnesses and
risk factors [3].
GS:SFHS can also be utilised as a longitudinal cohort
due to the ability to link to routine Scottish National
Health Service (NHS) data. Electronic Health Record
(EHR) linkage uses the ten-digit community health index
(CHI) number, a unique identifying number allocated to
every person in Scotland registered with a General Practi-
tioner (GP), and used for all NHS procedures (registrations,
attendances, samples, prescribing and investigations). This
unique patient identifier allows healthcare records for indi-
viduals to be linked across time and location [4]. The
population is relatively stable with comparatively low levels
of geographic mobility and there is relatively little uptake
of private healthcare in the population. Few countries,
other than Scotland, have health service information which
combines high quality data, consistency, national coverage
and the ability to link data to allow for genetic and clinical
patient-based analysis and follow-up.
The Haplotype Reference Consortium (HRC) data set
is a large haplotype reference panel for imputation of
genetic variants in populations of European ancestry,
recently made available to the research community [5].
Within a simulated genome-wide association study
(GWAS) dataset, it allowed an increased rate of accurate
imputation at minor allele frequencies as low as 0.1%,
which will allow better interrogation of genetic variation
across the allele spectrum. A selected subset of 428
GS:SFHS participants had their exomes sequenced at
high depth and contributed reference haplotypes to the
HRC dataset, making it ideal for more accurate imput-
ation of this cohort [6].
This paper describes genome-wide association analysis of
over 20,000 GS:SFHS participants using two genetic data-
sets (common, genotyped Single Nucleotide Polymorphisms
(SNPs) and HRC-imputed data) across a range of medically
relevant quantitative phenotypes measured at recruitment
in research clinics. To illustrate the quality and potential of
the many EHR linkage-derived phenotypes available, we
selected serum urate as an exemplar due to its direct associ-
ation with disease, gout, and its strong well-studied genetic
associations. About 10% of people with hyperuricemia
develop gout, an inflammatory arthritis that results from
deposition of monosodium urate crystals in the joint.
Genome-wide meta-analyses have identified 31 genome-
wide significant urate-associated SNPs, with SLC2A9 alone
explaining ~3% of the phenotypic variance [7].
Methods
Sample selection
Selection criteria for genome-wide genotype analysis of
the participants were: Caucasian ethnicity; born in the
UK (prioritising those born in Scotland); and full phenotype
data available from attendance at a Generation Scotland
research clinic. The participants were also selected to have
consented for their data to be linkable to their NHS
electronic medical records using the CHI number. The
GS:SFHS genotyped set consisted of 20,195 participants,
before quality control exclusions.
DNA extraction and genotyping
Blood (or occasionally saliva) samples from GS:SFHS
participants were collected, processed and stored using
standard operating procedures and managed through a
laboratory information management system at the
Edinburgh Clinical Research Facility, University of
Edinburgh [8]. DNA was quantitated using picogreen
and diluted to 50 ng/μL; 4 μL were then used in genotyping.
The genotyping of the first 9863 samples used
the Illumina HumanOmniExpressExome-8 v1.0 BeadChip
and the remainder were genotyped using the Illumina
HumanOmniExpressExome-8 v1.2 BeadChip, with Infi-
nium chemistry for both [9].
Phenotype measures
Measurement of total cholesterol, HDL cholesterol, urea
and creatinine was from serum prepared from 5 mL of
venous blood collected into a tube containing clot acti-
vator and gel separator at the time of the visit by the
participant to the research clinic. For glucose measurement,
2 mL of venous blood was collected in a sodium
fluoride/potassium oxalate tube, with fasting duration
recorded. Resting heart rate (pulse) was recorded using
an Omron digital blood pressure monitor. Two readings
were taken and the second reading was used in the analyses.
All other cardiometabolic and anthropometric phenotype
measures (see Table 1) are described in [3].
The EHR biochemistry dataset was extracted on 28th
September 2015 and covers 11,125 participants. EHR data
are held in the Tayside Safe Haven, which is fully accredited
and utilises a VMware Horizon client environment. Data
are placed on a server within a secure IT environment,
where the data user is given secure remote access for its
analysis [4]. For serum urate, records were available from
October 1988 to August 2015. Any data entries in the EHR
relating to pregnancy (keywords one or more of pregna/
labour/GEST/PET, total of 117 entries in the urate dataset),
were manually removed, as data obtained during pregnancy
are usually not included in a GWAS. Many of the
participant IDs have multiple readings, spread over time.
For extraction of serum urate data for analysis, the highest
reading was used, as a high reading would trigger a treat-
ment (such as allopurinol) to lower the urate level, which is
then checked by the clinician requesting a subsequent test.
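
The urate extraction described above can be sketched as follows. This is a minimal illustration only: the record layout (`participant_id`, `urate`, `note`) and the function names are hypothetical and are not taken from the GS:SFHS data model. The study removed pregnancy-related entries manually, so the keyword match here should be read as a way of flagging candidates for review, not as the authors' procedure.

```python
# Keyword stems quoted in the text; matched case-insensitively as substrings.
# Substring matching over-flags (e.g. "gest" also matches "suggest"),
# so flagged entries would still need manual review.
PREGNANCY_KEYWORDS = ("pregna", "labour", "gest", "pet")


def is_pregnancy_related(note):
    """Return True if the free-text note contains any pregnancy keyword stem."""
    text = note.lower()
    return any(keyword in text for keyword in PREGNANCY_KEYWORDS)


def drop_pregnancy_entries(records):
    """Keep only records whose note is not flagged as pregnancy-related."""
    return [r for r in records if not is_pregnancy_related(r["note"])]


def highest_urate_per_participant(records):
    """Map each participant ID to the highest urate reading on record."""
    highest = {}
    for r in records:
        pid = r["participant_id"]
        if pid not in highest or r["urate"] > highest[pid]:
            highest[pid] = r["urate"]
    return highest
```

Taking the maximum rather than the mean or the latest value follows the rationale given in the text: later readings may reflect urate-lowering treatment triggered by an earlier high result.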
Genotype data quality control
Genotyping quality control was performed using the fol-
lowing procedures: individuals with a call rate less than
98% were removed, as were SNPs with a call rate less than
98% or Hardy-Weinberg equilibrium p value less than
1 × 10⁻⁶. Mendelian errors, determined using relationships
recorded in the pedigree, were removed by setting the
individual-level genotypes at erroneous SNPs to missing.
Ancestry outliers who were more than six standard devia-
tions away from the mean, in a principal component ana-
lysis of GS:SFHS [10] merged with 1092 individuals from
the 1000 Genomes Project [11], were excluded. A total of
20,032 individuals (8227 male participants and 11,805
female participants) passed all quality control thresholds.
The number of genotyped autosomal SNPs that
passed all quality control parameters was 604,858.
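
The SNP-level thresholds above (call rate of at least 98% and Hardy-Weinberg p value of at least 1 × 10⁻⁶) can be illustrated with a short sketch. The text does not say which Hardy-Weinberg test or software was used; the one-degree-of-freedom chi-square approximation below is an assumption chosen for brevity, and GWAS tools commonly use an exact test instead. Genotypes are assumed to be coded as 0, 1 or 2 copies of an allele, with `None` for a missing call.

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) p value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def snp_passes_qc(genotypes, min_call_rate=0.98, min_hwe_p=1e-6):
    """Apply the call-rate and Hardy-Weinberg filters to one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False
    counts = [called.count(k) for k in (0, 1, 2)]
    return hwe_chi2_p(*counts) >= min_hwe_p
```

The same call-rate check applies per individual by passing one person's genotypes across SNPs instead of one SNP's genotypes across people.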
Pedigree correction
Sample identity was verified by comparing the genetic
and recorded gender in the first instance and pedigrees
were checked for unknown or incorrectly recorded rela-
tionships based on estimated genome-wide identity-by-
descent (IBD).
Unrecorded first-degree or second-degree relationships
(calculated IBD ≥ 25%) were identified and entered into
the pedigree. Pedigree links to first-degree or second-
degree relatives were broken or adjusted if the difference
between the calculated and expected amount of IBD
was ≥ 25%. After these corrections, any remaining pedigree
outliers as determined by examination of the plots
of expected versus observed IBD sharing were identified
and corrected in the pedigree. Due to some missing parental
genotypes, autosomal SNP sharing was not always
enough to unambiguously determine whether individuals
were related through the maternal or paternal line. In
such cases, mitochondrial and/or Y-chromosome markers
were compared to help determine the correct lineage.
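
The pedigree checks described in this section can be sketched as a comparison of observed and expected IBD sharing for each pair. The sketch assumes the thresholds are read as "at least 25%", uses the standard expected sharing proportions for each relationship type, and returns hypothetical action labels; none of the names below come from the study's own pipeline.

```python
# Expected genome-wide proportion of alleles shared identical-by-descent.
EXPECTED_IBD = {
    "parent_child": 0.5,
    "full_sibling": 0.5,
    "half_sibling": 0.25,
    "grandparent_grandchild": 0.25,
    "avuncular": 0.25,
    "first_cousin": 0.125,
}


def check_pair(recorded_relationship, observed_ibd, threshold=0.25):
    """Classify one pair of individuals.

    recorded_relationship: a key of EXPECTED_IBD, or None if the pedigree
    records no relationship between the two individuals.
    observed_ibd: estimated genome-wide IBD proportion (0 to 1).
    """
    if recorded_relationship is None:
        # Unrecorded first- or second-degree relatives share at least 25%.
        return "add_to_pedigree" if observed_ibd >= threshold else "ok"
    expected = EXPECTED_IBD[recorded_relationship]
    if abs(observed_ibd - expected) >= threshold:
        return "review_link"
    return "ok"
```

A pair flagged `review_link` would still need the manual follow-up described in the text, including mitochondrial or Y-chromosome comparison where the autosomal data cannot resolve the lineage.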
The full pedigree contains 42,662 individuals (22,383
female participants) in 6863 families, across five generations
(average 2.34 generations per family). Family sizes
were in the range of 1–66 individuals, with an average of
6.22 individuals per family. The final genotyped dataset
contains 9853 parent–child pairs, 8495 full siblings (52
monozygotic twins), 381 half siblings,
848 grandparent–grandchild pairs, 2443 first cousins and 6599 avuncular
(niece/nephew–aunt/uncle) relationships.
Imputation
In order to increase the density of variants throughout
the genome, the genotyped data were imputed utilising
the Sanger Imputation Service [12] using the HRC panel
v1.1 [5, 13]. The exome sequence data contributed to this panel by
GS:SFHS participants will have greatly improved imputation
quality across the whole cohort.
Autosomal haplotypes were checked to ensure consistency
with the reference panel (strand orientation, reference
allele, position) then pre-phased using Shapeit2 v2r837
[14, 15] using the Shapeit2 duohmm option11 [16], taking
advantage of the cohort family structure in order to
improve the imputation quality [17]. Monogenic and low
imputation quality (INFO < 0.4) variants were removed
from the imputed dataset leaving 24,111,857 variants
available for downstream analysis.
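
The post-imputation filter can be sketched as below. The sketch assumes each variant carries an INFO score and an alternate allele frequency, and it interprets the excluded "monogenic" variants as those with no variation in the cohort (allele frequency of 0 or 1); that reading, like the field and function names, is an assumption made for illustration.

```python
def keep_variant(info, alt_allele_freq, min_info=0.4):
    """Return True if an imputed variant passes both filters."""
    has_variation = 0.0 < alt_allele_freq < 1.0
    return info >= min_info and has_variation


def filter_variants(variants, min_info=0.4):
    """Filter a list of dicts with hypothetical keys 'id', 'info' and 'af'."""
    return [v for v in variants if keep_variant(v["info"], v["af"], min_info)]
```

A threshold of 0.4 retains low-frequency variants with moderate imputation certainty, which is why imputation quality is reported alongside each hit in Table 1.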
Phenotype quality control and exclusions
Prior to analysis, extreme outliers (those with values
more than three times the interquartile distances away
from either the 75th or the 25th percentile values) were
removed for each phenotypic measure to account for errors
in quantification and to remove individuals not representative
of normal variation within the population.
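
The outlier rule above amounts to fences at three interquartile ranges beyond the quartiles. A minimal sketch follows; the text does not state how the quartiles were computed, so the use of the "inclusive" method from Python's `statistics` module is an assumption.

```python
import statistics


def remove_extreme_outliers(values, k=3.0):
    """Drop values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]
```

With k = 3 the rule is more permissive than the common 1.5 × IQR convention, so only extreme values are removed.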
Approximately 4000 glucose measures were from people
who had not fasted for at least 4 h, so these were ex-
cluded from the fasting glucose analysis. Additionally,
948 individuals were identified as having diabetes, as

Table 1 Top GWAS hits

| Baseline characteristic | N | dbSNP ID | Minor allele frequency | p value | Gene | Imputation quality | Gene association reported previously? | Region significant in genotyped data? |
|---|---|---|---|---|---|---|---|---|
| Cardiometabolic | | | | | | | | |
| Diastolic blood pressure | 19,546 | rs142892876 | 0.0010 | 4.97E-08 | CNTN6 | 0.75 | No | No |
| | | rs528908640 | 0.0005 | 1.93E-08 | OPA1 | 0.80 | No | No |
| | | rs568998724 | 0.0007 | 2.91E-08 | - | 0.78 | No | No |
| | | rs187680191 | 0.0006 | 2.94E-09 | NRG4 | 0.51 | No | No |
| Systolic blood pressure | 19,547 | None | | | | | | None |
| Pulse pressure | 19,546 | None | | | | | | None |
| Heart rate | 19,920 | rs9970334 | 0.4474 | 4.38E-08 | ICMT | 0.90 | Yes | No |
| | | rs755291044 | 0.0017 | 1.80E-08 | - | 0.90 | No | No |
| | | rs145669495 | 0.0022 | 2.01E-08 | CSMD1 | 0.90 | No | No |
| | | rs142916219 | 0.0037 | 2.21E-08 | ASPH | 0.85 | No | No |
| | | rs365990 | 0.3637 | 4.04E-10 | MYH6 | 0.99 | Yes | GWS |
| | | rs148397504 | 0.0008 | 3.21E-09 | - | 0.45 | No | No |
| Biochemistry | | | | | | | | |
| Serum creatinine | 16,347 | rs548873184 | 0.0010 | 1.47E-08 | LINC00626 | 0.96 | No | No |
| | | rs573421908 | 0.0027 | 1.35E-08 | SLC35F3 | 0.80 | Yes | No |
| | | rs62412107 | 0.0660 | 1.87E-08 | - | 0.79 | No | No |
| | | rs3812036 | 0.2301 | 1.13E-10 | SLC34A1 | 0.96 | Yes | GWS |
| Fasting plasma glucose (with diabetics) | 16,174 | rs560887 | 0.2907 | 6.02E-68 | G6PC2 | 1.00 | Yes | GWS |
| | | rs9873618 | 0.2871 | 9.83E-12 | SLC2A2 | 0.99 | Yes | GWS |
| | | rs917793 | 0.1831 | 2.51E-24 | YKT6 | 0.98 | Yes | GWS |
| | | rs13266634 | 0.3153 | 3.66E-11 | SLC30A8 | 1.00 | Yes | GWS |
| | | rs533883198 | 0.0027 | 3.86E-08 | - | 0.84 | No | No |
| | | rs7981781 | 0.2337 | 1.40E-08 | PDX1 | 0.98 | Yes | GWS |
| | | rs370189685 | 0.0014 | 7.32E-09 | WDR4 | 0.63 | No | No |
| Fasting plasma glucose (diabetics removed) | 15,226 | rs79687284 | 0.0364 | 1.87E-08 | - | 0.78 | Yes | GWS |
| | | rs780095 | 0.4267 | 8.20E-09 | GCKR | 1.00 | Yes | GWS |
| | | rs560887 | 0.2907 | 2.09E-75 | G6PC2 | 1.00 | Yes | GWS |
| | | rs8192675 | 0.2839 | 8.41E-11 | SLC2A2 | 1.00 | Yes | GWS |
| | | rs917793 | 0.1831 | 1.46E-28 | YKT6 | 0.98 | Yes | GWS |
| | | rs11558471 | 0.3227 | 4.63E-13 | SLC30A8 | 1.00 | Yes | GWS |
| | | rs143399767 | 0.0108 | 1.42E-08 | Y_RNA | 0.89 | No | No |
| | | rs7981781 | 0.2337 | 5.01E-10 | PDX1 | 0.98 | Yes | GWS |
| | | rs370189685 | 0.0014 | 2.75E-08 | WDR4 | 0.63 | No | Suggestive |
| HDL cholesterol | 19,223 | rs149963466 | 0.0016 | 3.18E-08 | - | 0.76 | No | No |
| | | rs76183280 | 0.0048 | 4.14E-08 | AC016735.2 | 0.78 | No | No |
| | | rs4841132 | 0.0925 | 1.08E-08 | RP11-115J16.1 | 1.00 | Yes | Suggestive |
| | | rs15285 | 0.2675 | 1.16E-18 | LPL | 1.00 | Yes | GWS |
| | | rs2740488 | 0.2745 | 2.53E-08 | ABCA1 | 1.00 | Yes | GWS |
| | | rs138326449 | 0.0032 | 2.92E-20 | APOC3 | 0.85 | Yes | No |
| | | rs114529226 | 0.0038 | 6.98E-09 | IGHVII-33-1 | 0.64 | No | No |
| | | rs261290 | 0.3442 | 2.78E-25 | ALDH1A2 | 1.00 | Yes | GWS |
| | | rs3764261 | 0.3261 | 1.40E-113 | CETP | 1.00 | Yes | GWS |