Application of t-SNE to Human Genetic Data
Wentian Li (1), Jane E Cerise (1), Yaning Yang (2), Henry Han (3)
1. The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
2. Department of Statistics and Finance, University of Science and Technology of China, Hefei, China
3. Department of Computer and Information Sciences, Fordham University, Lincoln Center, New York, NY, USA
February 3, 2017
Abstract
The t-SNE (t-distributed stochastic neighbor embedding) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) similar to previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust with respect to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
keywords: t-SNE, PCA, SNP, dimension reduction
bioRxiv preprint, doi: https://doi.org/10.1101/114884; this version posted March 8, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Background
A genome-wide association study (GWAS) [1, 2, 3, 4, 5, 6] types common genetic variants in a general human population and, by comparing allele frequencies between a group of patients (cases) and a group of unaffected people (controls), discovers statistical association signals. The most commonly encountered genetic variant is the single nucleotide polymorphism (SNP) [7]. An associated variant may reside in a protein-coding region, indicating a possible change of transcription products which may play a role in the disease. It may also sit between genes, where it probably changes a binding motif of a transcription factor, leading to a change in the transcription (expression) level [8]. A comprehensive database of statistically significant variants for many human diseases is the GWAS Catalog, first maintained by the NHGRI/NIH (National Human Genome Research Institute of the National Institutes of Health, USA) and then by the European Bioinformatics Institute (UK); it can be found at https://www.ebi.ac.uk/gwas/ [9, 10].
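As a rough illustration of the case-control allele frequency comparison described above, the sketch below computes a Pearson chi-square statistic (1 degree of freedom) on a 2x2 table of allele counts. The function name and the counts are hypothetical and chosen only for illustration; real GWAS pipelines use dedicated software and additional corrections.

```python
def allelic_chi2(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square (1 d.f.) for a 2x2 table of allele counts."""
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    row = [sum(r) for r in table]                      # alleles per group
    col = [table[0][j] + table[1][j] for j in range(2)]  # alleles per type
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical SNP: 1000 cases and 1000 controls, two alleles per person.
# Minor allele frequency is 0.30 in cases and 0.25 in controls.
chi2_stat = allelic_chi2(case_minor=600, case_major=1400,
                         ctrl_minor=500, ctrl_major=1500)
print(round(chi2_stat, 3))
```

A larger statistic indicates a larger allele frequency difference relative to sampling noise; whether that difference reflects disease or population history is the stratification problem discussed next.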
A key step in GWAS analysis is to match the ethnic background of cases and controls. Failure to do so confounds allele frequency differences due to disease-causing mutations with differences due to population genetic history. There have been many attempts to correct this “spurious association” due to “population stratification” [11, 4]: “genomic control” uses non-functional variants to estimate the amount of population genetic history difference and corrects the association signal accordingly [12]; “family-based association” uses untransmitted alleles as controls, thus circumventing the population stratification issue completely when this type of data is available [13, 14, 15]; K-means clustering groups sample genotype data into the presumed K populations [16]; other methods incorporate covariance among samples to correct the association signal [17, 18, 19]; etc.
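The genomic control idea can be sketched as follows, assuming the commonly used median-based estimator: the inflation factor is the median of chi-square statistics at presumed non-functional markers divided by the median of a chi-square distribution with 1 degree of freedom (about 0.4549), and test statistics are divided by that factor. The function name and the numbers are hypothetical; the paper itself does not spell out this formula.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom (approximate).
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(null_stats, test_stats):
    """Return (inflation factor, corrected test statistics).

    null_stats: chi-square statistics at presumed non-functional markers.
    test_stats: chi-square statistics to be corrected.
    The factor is floored at 1 so that statistics are never inflated.
    """
    lam = max(1.0, median(null_stats) / CHI2_1DF_MEDIAN)
    return lam, [s / lam for s in test_stats]


# Hypothetical statistics from five null markers and one candidate SNP.
lam, corrected = genomic_control([0.2, 0.5, 0.91, 1.3, 2.0], [12.5])
print(round(lam, 2), [round(c, 2) for c in corrected])
```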
However, the most common practice in dealing with population stratification or other subtle/hidden structures in genetic data is to apply dimension reduction techniques, such as MDS (multidimensional scaling), PCA (principal component analysis), or SVD (singular value decomposition). The reduced dimensions can be directly visualized and, in the case of PCA, can be used as covariates in the association analysis [20, 21, 22, 23, 24, 25, 26]. Within the European populations, it is well established that the first PC (PC1) aligns with the north-south direction (latitude), and the second PC (PC2) aligns with the east-west direction (longitude) [27].
Though the use of PCA is mostly satisfactory, the method is not without problems. Most notably, PCA is highly affected by the presence of outliers. If most of the samples belong to one homogeneous population with a minority from a different population, the presence of genetically different minority samples can completely change the principal axes, thus changing
the distribution of samples along the main PCs. This is because as a linear holistic dimension
reduction method, PCA cannot capture local data characteristics well. Other problems include
the determination of the number of SNPs to be included, the role common vs. rare variants
play in the result, and the number of PCs to be kept.
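To make the outlier sensitivity concrete, here is a toy sketch on synthetic 2-D points (not genetic data; all names and numbers are hypothetical). It uses the closed-form orientation of the leading eigenvector of a 2x2 covariance matrix and shows how a handful of distant points can rotate the first principal axis by roughly 90 degrees.

```python
import math


def first_pc_angle(points):
    """Angle in degrees (0 to 90) between the first principal axis and the x-axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return abs(math.degrees(angle))


# 200 "homogeneous" samples spread along the x-axis with small y-noise.
main_group = [(-1 + 2 * i / 199, 0.05 * (-1) ** i) for i in range(200)]
# 5 outliers (about 2.4% of the samples) far away along the y-axis.
outliers = [(0.0, 10.0)] * 5

base_angle = first_pc_angle(main_group)
outlier_angle = first_pc_angle(main_group + outliers)
print(round(base_angle, 2), round(outlier_angle, 2))
```

Without the outliers the first axis follows the spread of the main group; with them, it points towards the outliers instead, so the main group is compressed along PC1.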
Here we explore the application of a new dimension reduction technique, t-SNE (t-distributed stochastic neighbor embedding) [28], to genetic data. Similar to MDS, the aim of t-SNE is to preserve pairwise distances in the high-dimensional space when mapping to 2 or 3 dimensions. Unlike MDS or PCA, the preservation in t-SNE is non-linear: t-SNE minimizes the Kullback-Leibler divergence between two distributions, one that measures pairwise similarities of input samples in the high-dimensional space, and another, based on the heavy-tailed Student’s t-distribution, that measures pairwise similarities of the corresponding samples in the low-dimensional embedding space. t-SNE has demonstrated its built-in advantages in capturing local data characteristics and revealing subtle data structures in visualization, as shown in the original publication [28].
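The objective described above can be sketched in a few lines. This is a simplified illustration, not the Rtsne implementation: it assumes a single fixed Gaussian bandwidth `sigma` for all points (the published algorithm tunes a bandwidth per point from a perplexity setting) and only evaluates the Kullback-Leibler divergence for given embeddings rather than minimizing it. All function names and data are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def high_dim_similarities(X, sigma=1.0):
    """Symmetrized Gaussian similarities p_ij (fixed sigma is an assumption)."""
    n = len(X)
    cond = []
    for i in range(n):
        w = [math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2)) if j != i else 0.0
             for j in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])  # conditional p_{j|i}
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def low_dim_similarities(Y):
    """Student-t (1 d.f.) similarities q_ij, normalized over all pairs."""
    n = len(Y)
    w = [[1.0 / (1.0 + sq_dist(Y[i], Y[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(P, Q):
    return sum(p * math.log(p / q)
               for prow, qrow in zip(P, Q)
               for p, q in zip(prow, qrow) if p > 0)


# Two tight clusters in 3-D, and two candidate 2-D embeddings.
X = [(0, 0, 0), (0.1, 0, 0), (5, 5, 5), (5.1, 5, 5)]
Y_good = [(0, 0), (0.1, 0), (5, 0), (5.1, 0)]   # keeps neighbors together
Y_bad = [(0, 0), (5, 0), (0.1, 0), (5.1, 0)]    # splits each cluster

P = high_dim_similarities(X)
kl_good = kl_divergence(P, low_dim_similarities(Y_good))
kl_bad = kl_divergence(P, low_dim_similarities(Y_bad))
print(kl_good < kl_bad)
```

The embedding that keeps high-dimensional neighbors close has the lower divergence, which is the quantity t-SNE drives down by gradient descent.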
t-SNE is a popular choice in the analysis of single-cell RNA-seq data (e.g. [29, 30, 31]), but has not been applied extensively to genetic data. In the single previous publication where t-SNE was applied to genetic data [32], the main conclusion is that, if t-SNE is considered as a clustering technique, it performs better than PCA. We are particularly interested in t-SNE’s claimed ability to “reveal structure at many different scales” [28], as major population stratification co-exists with other small-scale shared evolutionary history among samples.
Results
Continental separation
We first examine the ability of t-SNE to separate continental populations (Africa, Asia, Europe, etc.). GWAS usually refers to genetic association studies using common variants. The newer next-generation-sequencing (NGS) technique, though aiming at typing all variants, also produces common variants. We use one of the major public NGS datasets, the 1000 Genomes Project [33, 34, 35]. We extracted 3825 common variants which pass quality control (QC) criteria and are also present on the Illumina Global Screening Array chip. We use the KING program to calculate the inter-sample genetic distances. We override the default setting of KING, which retains only 3 significant digits, so that 9 significant digits are kept, in order to improve our ability to distinguish subtle structures.
Figure 1: A comparison of different low-dimensional displays of 1000 Genomes Project data with 3825 SNPs. The number of samples (dots) is 2504, with 661 AFR (African, green), 347 AMR (American, orange), 504 EAS (East Asian, purple), 489 SAS (South Asian, blue), and 503 EUR (European, red). (A) PCA (PC1-PC2); (B) PCA (PC2-PC3); (C) t-SNE (1-2); (D) t-SNE (2-3).
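To illustrate why the number of significant digits retained in the distance output can matter, the sketch below rounds two hypothetical pairwise distances to 3 and to 9 significant digits. It does not show how KING is configured; the helper name and the values are assumptions made only for this example.

```python
def round_sig(x, digits):
    """Round x to the given number of significant digits."""
    return float(f"{x:.{digits}g}")


# Hypothetical genetic distances that differ only from the 5th significant digit.
d_ab, d_ac = 0.123412, 0.123449

same_at_3 = round_sig(d_ab, 3) == round_sig(d_ac, 3)   # both become 0.123
same_at_9 = round_sig(d_ab, 9) == round_sig(d_ac, 9)   # remain distinct
print(same_at_3, same_at_9)
```

With 3 digits the two pairs become tied, so any method relying on the ranking of neighbors loses the ability to tell them apart.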
Fig.1 shows the results from PCA (top row) and t-SNE (bottom row), with the first and second major dimensions in the left column and the second and third dimensions in the right column. The numbers of samples in each continent (though Asia is split into East and South Asia) are more or less balanced: 661 Africans (AFR), 347 “Americans” (AMR), 504 East Asians (EAS), 489 South Asians (SAS), and 503 Europeans (EUR). Note that some subgroups living in continental America are not grouped with AMR: African-Americans in the Southwest of the US (ASW) and African-Caribbeans in Barbados are in the AFR group, Utah CEPH families are in the EUR group, etc.
Figure 2: 3-dimensional t-SNE, which combines information from Fig.1(C) and (D). Axes: t_SNE_1, t_SNE_2, and t_SNE_3. Color scheme: green for AFR, orange for AMR, purple for EAS, blue for SAS, and red for EUR.
Although all methods are able to separate continental populations, PCA (dimensions 1-2) shows an overlap between South Asians and Americans, whereas t-SNE shows that AMR has more overlap with Europeans. In dimensions 2-3, PCA shows some link between AMR and EUR, whereas t-SNE continues to show a strong connection between AMR and EUR. As some AMR samples, such as those from Colombia, Puerto Rico, and to some extent Mexico (the samples are actually Mexican-American), are expected to contain European ancestry, the t-SNE result is consistent with our external knowledge. As can be seen in the 3D version of t-SNE (Fig.2), the overlap between AMR and AFR in Fig.1(D) is an artifact, as AMR and AFR are actually separated.
Treatment of outliers
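We compare t-SNE and PCA in a more realistic setting where most of the samples belong to one ethnic group, whereas a few are either from distinct ethnic groups or of mixed ancestry. For that purpose, we extract 99 Utah residents with North/West European ancestry (CEU), 91 England/Scotland samples (GBR), 5 African-Americans in the Southwest (ASW), 5 Mexican-Americans in Los Angeles (MXL), 2 Chinese in Beijing (CHB), and 2 Chinese in South China (CHS).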
References

Visualizing Data using t-SNE
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

Inference of population structure using multilocus genotype data
TL;DR: Pritchard et al. proposed a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations, which can be applied to most of the commonly used genetic markers, provided that they are not closely linked.

PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses
TL;DR: This work introduces PLINK, an open-source C/C++ WGAS tool set, and describes the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation, focusing on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies.

A global reference for human genetic variation.
Adam Auton et al., 01 Oct 2015
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.

Principal components analysis corrects for stratification in genome-wide association studies
TL;DR: This work describes a method that enables explicit detection and correction of population stratification on a genome-wide scale and uses principal components analysis to explicitly model ancestry differences between cases and controls.
Frequently Asked Questions (11)
Q1. What are the contributions in "Application of t-sne to human genetic data" ?

The paper applies the t-distributed stochastic neighbor embedding (t-SNE), a dimension reduction and visualization technique for high-dimensional data, to human genetic data.

The purpose of imposing extra constraints in NPCA is that positive and negative terms in classic PCA may cancel each other, leading to a loss of local feature. 

The authors are particularly interested in t-SNE’s claimed ability to “reveal structure at many different scales” [28], as major population stratification co-exists with other small-scale shared evolutionary history among samples.

The main feature of t-SNE, the ability to exhibit structures at multiple scales, is responsible for two observations in this paper. 

The phase 3 1000 Genomes Project data for 2504 individuals was downloaded from http://www.internationalgenome.org/data.
Genetic distance between samples: KING (Kinship-based INference for Genome-wide association studies), a computationally efficient program to calculate the person-person distance based on genetic data, was used (http://people.virginia.edu/~wc9c/KING/) [52].
t-SNE: The R (http://www.r-project.org/) implementation of t-SNE, Rtsne version 0.11 (June 30, 2016) (https://cran.r-project.org/web/packages/Rtsne/ or https://github.com/jkrijthe/Rtsne), was used.

t-SNE has demonstrated its built-in advantages in capturing local data characteristics and revealing subtle data structures in visualization, as shown in the original publication [28]. t-SNE is a popular choice in the analysis of single-cell RNA-seq data (e.g. [29, 30, 31]), but has not been applied extensively to genetic data.

For that purpose, the authors extract 99 Utah residents with North/West European ancestry (CEU), 91 England/Scotland samples (GBR), 5 African-Americans in the Southwest (ASW), 5 Mexican-Americans in Los Angeles (MXL), 2 Chinese in Beijing (CHB), and 2 Chinese in South China (CHS).

For that, the authors extract a denser set of SNPs from chromosome 1 in the 1000 Genomes Project with the criteria of alternative/minor allele frequency > 0.2 and spacing between neighboring SNPs > 20,000 bases.
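The two selection criteria can be sketched as a simple filter. The greedy left-to-right thinning below is an assumption, since the text does not say how the spacing rule was enforced; the function name, input layout, and example SNPs are hypothetical.

```python
def thin_snps(snps, min_maf=0.2, min_spacing=20_000):
    """Keep SNPs with MAF > min_maf lying > min_spacing bases after the last kept SNP.

    snps: iterable of (position, minor_allele_frequency) on one chromosome.
    """
    kept = []
    last_pos = None
    for pos, maf in sorted(snps):
        if maf <= min_maf:
            continue  # too rare
        if last_pos is None or pos - last_pos > min_spacing:
            kept.append((pos, maf))
            last_pos = pos
    return kept


# Hypothetical SNPs on one chromosome: (base position, minor allele frequency).
example = [(1000, 0.35), (15000, 0.40), (30000, 0.10), (45000, 0.25), (70000, 0.45)]
selected = thin_snps(example)
print(selected)
```

Here the SNP at 15000 is dropped for being too close to the one at 1000, and the SNP at 30000 is dropped for its low frequency.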

The computational time complexity of standard PCA/SVD is O(min(Np^2 + p^3, pN^2 + N^3)), where N is the number of samples and p the number of factors [48].
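As a back-of-the-envelope illustration of this expression, the sketch below plugs in the sample and SNP counts quoted elsewhere in the paper (N = 2504, p = 3825). It ignores constant factors, so the result is only a leading-order operation count, and the function name is hypothetical.

```python
def pca_cost(n_samples, n_features):
    """Leading-order cost min(N*p^2 + p^3, p*N^2 + N^3), constants ignored."""
    N, p = n_samples, n_features
    return min(N * p**2 + p**3, p * N**2 + N**3)


# N = 2504 samples and p = 3825 SNPs, as in the continental analysis.
cost = pca_cost(2504, 3825)
cheaper_side = "sample-based" if cost == 3825 * 2504**2 + 2504**3 else "feature-based"
print(f"{cost:.3e}", cheaper_side)
```

Because there are fewer samples than SNPs in this setting, the term built on the N x N matrix is the smaller of the two.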

American samples are the only ones that do not form their own cluster, owing to the well-known admixture between indigenous American populations and European colonists.

Another observation, that both continental and sub-continental population structures can be viewed in a single t-SNE plot, is also a consequence of this property of t-SNE.