What are the contributions in "Application of t-SNE to human genetic data"?

The t-distributed stochastic neighbor embedding (t-SNE) is a dimension reduction and visualization technique for high-dimensional data. This paper explores the applicability of t-SNE to human genetic data.

What is the purpose of imposing extra constraints in NPCA?

Extra constraints are imposed in NPCA because positive and negative terms in classic PCA may cancel each other, leading to a loss of local features.

Why do the authors want to use t-SNE?

The authors are particularly interested in t-SNE’s claimed ability to “reveal structure at many different scales” [28], as major population stratification co-exists with other small-scaled shared evolutionary history among samples.

What is the main feature of t-SNE?

The main feature of t-SNE, the ability to exhibit structures at multiple scales, is responsible for two observations in this paper.

How many individuals were used in the phase 3 1000 Genomes Project?

The phase 3 1000 Genomes Project data for 2504 individuals was downloaded from http://www.internationalgenome.org/data. Genetic distance between samples: KING (Kinship-based INference for Genome-wide association studies), a computationally efficient program to calculate the person-person distance based on genetic data, was used (http://people.virginia.edu/~wc9c/KING/) [52]. t-SNE: the R (http://www.r-project.org/) implementation of t-SNE, Rtsne version 0.11 (June 30, 2016) (https://cran.r-project.org/web/packages/Rtsne/ or https://github.com/jkrijthe/Rtsne), was used.

What is the purpose of the study?

For that purpose, the authors extract 99 Utah residents with North/West European ancestry (CEU), 91 England/Scotland samples (GBR), 5 African-Americans in the Southwest (ASW), 5 Mexican-Americans in Los Angeles (MXL), 2 Chinese in Beijing (CHB), and 2 Chinese in South China (CHS).

How many SNPs are in the 1000 Genomes Project?

For that, the authors extract a denser set of SNPs from chromosome 1 in the 1000 Genomes Project with the criteria of alternative/minor allele frequency > 0.2 and spacing between neighboring SNPs > 20,000 bases.

What is the computational complexity of t-SNE?

The computational time complexity of standard PCA/SVD is $O(\min(Np^2 + p^3,\ pN^2 + N^3))$, where N is the number of samples and p the number of factors [48].

Why do the American samples not form their own cluster?

The American samples are the only ones that do not form their own cluster, owing to the well-known admixture between indigenous American populations and European colonists.

What is the main effect of t-SNE?

Another observation, that both continental and sub-continental population structures can be viewed in a single t-SNE plot, is also a consequence of this property of t-SNE.

(Open Access) Application of t-SNE to human genetic data. (2017) | Wentian Li

Application of t-SNE to Human Genetic Data

Wentian Li, Jane E Cerise, Yaning Yang, Henry Han

1. The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA

2. Department of Statistics and Finance, University of Science and Technology of China, Hefei, China

3. Department of Computer and Information Sciences, Fordham University, Lincoln Center, New York, NY, USA

February 3, 2017

Abstract

The t-SNE (t-distributed stochastic neighbor embedding) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) similar to previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust with respect to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.

keywords: t-SNE, PCA, SNP, dimension reduction

Background

Genome-wide association study (GWAS) [1, 2, 3, 4, 5, 6] is an approach to type common genetic variants in a general human population and, by comparing the allele frequency difference between a group of patients (cases) and a group of normal people (controls), discover statistical association signals. The genetic variant mostly encountered is the single nucleotide polymorphism (SNP) [7]. These associated variants may reside in a protein coding region, indicating a possible change of transcription products which may play a role in the disease. They may also sit between genes, where they probably change a binding motif of a transcription factor, which leads to a change of the transcription (expression) level [8]. A comprehensive database on statistically significant variants for many human diseases is the GWAS Catalog, first maintained by the NHGRI/NIH (National Human Genome Research Institute of the National Institutes of Health (USA)), then by the European Bioinformatics Institute (UK), which can be found at https://www.ebi.ac.uk/gwas/ [9, 10].

bioRxiv preprint doi: https://doi.org/10.1101/114884; this version posted March 8, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

A key step in GWAS analysis is to match the ethnicity background of cases and controls. Failure to do so would confound allele frequency differences due to disease-causing mutations with differences due to population genetic history. There have been many attempts to correct this “spurious association” due to “population stratification” [11, 4]: “genomic control” uses non-functional variants to estimate the amount of population genetic history differences and the association signal is corrected accordingly [12]; “family-based association” uses untransmitted alleles as controls, thus circumventing the population stratification issue completely when this type of data is available [13, 14, 15]; K-means clustering groups sample genotype data towards the presumed K populations [16]; incorporating co-variance among samples corrects the association signal [17, 18, 19]; etc.

However, the most common practice in dealing with population stratification or other subtle/hidden structures in genetic data is to apply dimension reduction techniques, such as MDS (multidimensional scaling), PCA (principal component analysis), and SVD (singular value decomposition). The reduced dimensions can be directly visualized and, in the case of PCA, can be used as covariates in the association analysis [20, 21, 22, 23, 24, 25, 26]. Within the European populations, it is well established that the first PC (PC1) aligns with the north-south direction (latitude), and the second PC (PC2) aligns with the east-west direction (longitude) [27].

Though the use of PCA is mostly satisfactory, the method is not without problems. Most notably, PCA is highly affected by the presence of outliers. If most of the samples belong to one homogeneous population with a minority from another different population, the presence of genetically different minority samples can completely change the principal axes, thus changing the distribution of samples along the main PCs. This is because, as a linear holistic dimension reduction method, PCA cannot capture local data characteristics well. Other problems include the determination of the number of SNPs to be included, the role common vs. rare variants play in the result, and the number of PCs to be kept.
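The outlier sensitivity described above can be illustrated with a small toy example. This is only a sketch under stated assumptions: the points are synthetic 2-D coordinates rather than genotypes, and the helper name `principal_axis_angle` is hypothetical. It computes the direction of the first principal axis from the 2x2 covariance matrix, first for a homogeneous majority and then after adding five distant samples.

```python
import math


def principal_axis_angle(points):
    """Angle in degrees of the first principal axis of 2-D points.

    Uses the closed form for a symmetric 2x2 covariance matrix:
    theta = 0.5 * atan2(2 * cxy, cxx - cyy).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return math.degrees(0.5 * math.atan2(2 * cxy, cxx - cyy))


# Synthetic "majority" population: spread along x, almost flat in y.
majority = [(-1 + 2 * i / 99, 0.1 if i % 2 == 0 else -0.1) for i in range(100)]
# Synthetic "minority" samples: five identical points far away along y.
outliers = [(0.0, 10.0)] * 5

angle_majority = principal_axis_angle(majority)
angle_with_outliers = principal_axis_angle(majority + outliers)
print(f"PC1 angle, majority only: {angle_majority:.1f} degrees")
print(f"PC1 angle, with outliers: {angle_with_outliers:.1f} degrees")
```

In this toy setting the first axis is close to horizontal for the majority alone and close to vertical once the five distant points are included, so the spread within the majority is pushed to a later component.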

Here we explore the application of a new dimension reduction technique, t-SNE (t-distributed stochastic neighbor embedding) [28], to genetic data. Similar to MDS, the aim of t-SNE is to preserve the pairwise distances of the high-dimensional space in 2 or 3 dimensions. Unlike MDS or PCA, the preservation in t-SNE is non-linear. t-SNE minimizes the Kullback-Leibler divergence between two distributions: one distribution that measures pairwise similarities of input samples in high-dimensional space, and another, heavy-tailed Student’s t-distribution that measures pairwise similarities of corresponding samples in the low-dimensional embedding space. t-SNE has demonstrated its built-in advantages in capturing local data characteristics and revealing subtle data structures in visualization, as shown in the original publication [28].
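A minimal sketch of this objective follows, assuming a simplified form. It uses a single fixed Gaussian bandwidth `sigma` for all samples, whereas the method in [28] tunes a per-sample bandwidth from a perplexity setting and symmetrizes conditional probabilities. The function names are hypothetical, and the code only evaluates the cost for a given embedding without optimizing it.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_similarities(points, kernel):
    """Apply kernel to every ordered pair i != j and normalize to sum to 1."""
    n = len(points)
    weights = {
        (i, j): kernel(sq_dist(points[i], points[j]))
        for i in range(n)
        for j in range(n)
        if i != j
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def tsne_cost(high, low, sigma=1.0):
    """Kullback-Leibler divergence KL(P || Q) for a candidate embedding."""
    # P: Gaussian similarities in the input space (fixed sigma is an assumption).
    p = joint_similarities(high, lambda d2: math.exp(-d2 / (2 * sigma ** 2)))
    # Q: heavy-tailed Student's t similarities (one degree of freedom).
    q = joint_similarities(low, lambda d2: 1.0 / (1.0 + d2))
    # Terms with p_ij == 0 contribute nothing to the divergence.
    return sum(pij * math.log(pij / q[pair]) for pair, pij in p.items() if pij > 0)


# Two tight pairs of samples in a 3-D input space.
high = [(0, 0, 0), (1, 0, 0), (5, 5, 5), (6, 5, 5)]
good = [(0, 0), (1, 0), (8, 0), (9, 0)]  # keeps each pair together
bad = [(0, 0), (8, 0), (1, 0), (9, 0)]   # splits each pair apart

cost_good = tsne_cost(high, good)
cost_bad = tsne_cost(high, bad)
print(cost_good, cost_bad)
```

An embedding that keeps input-space neighbours close receives a lower cost than one that separates them, which is the quantity a t-SNE optimizer drives down by moving the low-dimensional points.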

t-SNE is a popular choice in the analysis of single-cell RNA-seq data (e.g. [29, 30, 31]), but has not been applied extensively to genetic data. In the single previous publication where t-SNE was applied to genetic data [32], the main conclusion is that if t-SNE is considered as a clustering technique, it performs better than PCA. We are particularly interested in t-SNE’s claimed ability to “reveal structure at many different scales” [28], as major population stratification co-exists with other small-scaled shared evolutionary history among samples.

Results

Continental separation

We first examine the ability of t-SNE to separate continental populations (Africa, Asia, Europe, etc.). GWAS usually refers to genetic association studies using common variants. The new next-generation-sequencing (NGS) technique, though aiming at typing all variants, also produces common variants. We use one of the major public NGS datasets, the 1000 Genomes Project [33, 34, 35]. We extracted 3825 common variants which pass quality control (QC) criteria and are also present in the Illumina Global Screening Array chip. We use the KING program to calculate the inter-sample genetic distances. We override the default setting of KING, which retains only 3 significant digits, so that 9 significant digits are kept, in order to improve our ability to distinguish subtle structures.

[Figure 1: scatter plots in four panels, (A) PCA 1-2, (B) PCA 2-3, (C) t-SNE 1-2, (D) t-SNE 2-3; legend: AFR, EUR, EAS, AMR, SAS.]

Figure 1: A comparison of three different low-dimensional displays of 1000 Genomes Project data with 3825 SNPs. The number of samples (dots) is 2504, with 661 AFR (African, green), 347 AMR (American, orange), 504 EAS (East Asia, purple), 489 SAS (South Asia, blue), and 503 EUR (European, red). (A) PCA (PC1-PC2); (B) PCA (PC2-PC3); (C) t-SNE (1-2); (D) t-SNE (2-3).
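To make the role of the inter-sample distance matrix concrete, here is a hedged sketch of how a symmetric matrix of pairwise distances can be turned into low-dimensional coordinates with classical MDS, one of the techniques named in the Background. It is not the authors' pipeline: the function name `classical_mds`, the power-iteration solver, and the four toy points are illustrative assumptions, and the input is an ordinary Euclidean distance matrix rather than KING output.

```python
import math
import random


def classical_mds(dist, k=2, iters=500, seed=0):
    """Embed n items in k dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    d2 = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in d2]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Power iteration for the leading eigenvector of the (deflated) B.
        # Assumes B is positive semi-definite, as for Euclidean distances.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy input: four corners of a 3 x 1 rectangle, given only by their distances.
pts = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
embedding = classical_mds(dist, k=2)
for row in embedding:
    print([round(c, 3) for c in row])
```

The recovered coordinates match the original layout only up to rotation, reflection, and translation, because a distance matrix carries no information about absolute position or orientation.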

Fig.1 shows the results from PCA (top) and t-SNE (bottom) with the first and the second major dimensions (left column) and the second and third dimensions (right column). The numbers of samples in each continent (though Asia is split into East and South Asia) are more or less balanced: 661 Africans (AFR), 347 “Americans” (AMR), 504 East Asians (EAS), 489 South Asians (SAS), and 503 Europeans (EUR). Note that some subgroups living in continental America are not grouped with AMR: African-American in Southwest of US (ASW) and African-Caribbean in Barbados are in the AFR group, Utah CEPH families are in the EUR group, etc.

[Figure 2: 3-dimensional scatter plot; labelled axis: t_SNE_1.]

Figure 2: 3-dimensional t-SNE which combines information from Fig.1(E) and (F). Color scheme: green for AFR, orange for AMR, purple for EAS, blue for SAS, and red for EUR.

Although all methods are able to separate continental populations, PCA (1-2 dimensions) shows an overlap between South Asian and American samples, whereas t-SNE shows that AMR has more overlap with Europeans. In the 2-3 dimensions, PCA shows some link between AMR and EUR, whereas t-SNE continues to show a strong connection between AMR and EUR. As some AMR samples, such as those from Colombia, Puerto Rico, and to some extent, Mexico (actually the samples are Mexican-American), are expected to have European ancestry, the t-SNE result is consistent with our external knowledge. As can be seen in the 3D version of t-SNE (Fig.2), the overlap between AMR and AFR in Fig.1(F) is an artifact, as AMR and AFR are actually separated.

Treatment of outliers

We compare t-SNE and PCA in a more realistic setting where most of the samples belong to one ethnic group, whereas a few are either from distinct ethnic groups or a mixed race. For



References

Visualizing Data using t-SNE

Inference of population structure using multilocus genotype data

PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses

A global reference for human genetic variation.

Principal components analysis corrects for stratification in genome-wide association studies

