An exhaustive analysis of single amino acid variants in helical transmembrane proteins

Oscar Llorian-Salvador 1*, Michael Bernhofer 1, Yannick Mahlich 1,2 and Burkhard Rost 1,3,4
1 TUM, Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr.
3, 85748 Garching/Munich, Germany
2 Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Dr, New
Brunswick, NJ, 08901, USA
3 Institute of Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
4 Institute for Food and Plant Sciences WZW Weihenstephan, Alte Akademie 8, Freising,
Germany
* Corresponding author: osalvador@rostlab.org, http://www.rostlab.org/
Tel: +49-289-17-811 (email rost: assistant@rostlab.org)
Running Title: Analysis of SAVs in helical TMPs
Document statistics: Abstract = 196 words, Text = 5629 words, 14 references, 3 figures; 10 tables
Abstract
Single nucleotide variants (SNVs) have been widely studied in the past due to being the
main source of human genetic variation. Less is known about the effect of single amino
acid variants (SAVs) due to the immense resources required for comprehensive
experimental studies. In contrast, in silico methods predicting the effects of sequence
variants upon molecular function and upon the organism are readily available and have
contributed unexpected suggestions, e.g. that SAVs common to a human population
(shared by >5% of the population) have, on average, more significant impact on the
molecular function of proteins than do rare SAVs (shared by <1% of the population).
Here, we investigated the impact of variants in a human population upon helical
transmembrane proteins (TMPs). Three main results stood out. Firstly, common SAVs,
on average, have stronger effects than rare SAVs for TMPs, and are enriched, in
particular, in the membrane helices. Secondly, proteins with seven transmembrane
helices (7TM, including GPCRs, i.e. G protein-coupled receptors) are depleted of SAVs
in comparison to other proteins, possibly due to increased evolutionary constraints in
these important proteins. Thirdly, rare SAVs with strong effect are significantly depleted
(relative to common SAVs) in signal peptide regions.
Key words: helical transmembrane proteins, single amino acid variants, functional features
prediction, human proteome, topology prediction.
bioRxiv preprint doi: https://doi.org/10.1101/2019.12.18.881318; this version posted December 19, 2019.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder,
who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
CC-BY-NC-ND 4.0 International license.

Abbreviations used: GPCR, G protein-coupled receptor (7-pass TMP); non-TMP,
non-transmembrane protein (due to detection errors, these may include some beta-barrel
membrane proteins); SAV, single amino acid variant; SNV, single nucleotide variant; TMH,
transmembrane helix; TMP, transmembrane protein.
Introduction
Single amino acid variants (SAVs) have been found to be relevant for the molecular function
of proteins [1]. The impact of different SAVs has been predicted in silico, showing that, within
the human population, common SAVs (observed in more than 5% of the population) tend to have a
larger impact on protein function than rare ones (observed in less than 1% of the population)
[1]. The effect of common SAVs (gain or loss of functionality) on the survival of the species
remains, at least partially, unknown.
It was critical to continue the previous research on the functional impact of common SAVs to
have a better understanding of their effect on the survival of species. A more thorough analysis
of the effect of common and rare SAVs may lead to a significant connection between these
micro-molecular variations and their phenotypic impact on an organism.
A group of proteins with a different proportion of variants, such as an enrichment in common
SAVs associated with strong effect, can show locations or specific functionalities where a
higher variability helps the whole species. It can also show locations or functionalities with less
variability, indicating a benefit only for individuals. In addition, this analysis can be extended
not only to groups of proteins but regions of proteins: is it possible to find an enrichment in
common SAVs in a certain type of structure?
Here, we analyzed the proportion of common and rare human SAVs in helical transmembrane
proteins (referred to here as TMPs; note that a very small fraction of all human transmembrane
proteins cross the membrane with beta-strands). About 24% of all human proteins have at
least one transmembrane helix (TMH), and about 13% have two or more TMHs [2]. A
separate analysis of SAVs in globular, water-soluble proteins and in TMPs is supported by the
functional differences between these types of proteins, in particular by the many roles of TMPs
in the cell, e.g. in regulation, signalling and transport across membranes [3-5]. We largely
refuted our initial hypothesis, namely that TMPs account for a substantial fraction of the signal
that common SAVs have, on average, more effect than rare SAVs, although we observed an
over-representation of common SAVs with strong effect in TMPs.
Materials & Methods
First, we analyse common SAVs with strong effect on protein function in a specific group of
proteins: TMPs. Multiple comparisons are used to show differences in functionality between
common and rare SAVs in all proteins and only in TMPs.
Second, we study the SAV distribution within TMPs: is it possible that certain regions, for
instance transmembrane helices (TMHs), are enriched for a specific type of SAV?
Starting dataset
We took all SAVs in the human population from a previous study [1] based upon a raw dataset
with all proteins and SAVs reported by the Exome Aggregation Consortium (ExAC) [6]
collecting 60,706 human exomes. The dataset contained 10,474,468 SAVs. For about 73% of
these, namely for 7,599,572, we could find sequences in the UniProt repository [7]. For all of
those we predicted their effect upon molecular function using the machine learning based
method SNAP2 [8]. Transmembrane helices (TMHs) were identified through the method
TMSEG [2], also used to distinguish between globular-soluble and transmembrane proteins
(TMPs).
For each SAV, the dataset also contained the following information: allele frequency, location
within the protein, mutation description, type of mutation and the expected effect of the
mutation. Allele frequency was used to divide SAVs according to their abundance in the human
population (here proxied by 60,706 individuals) into three types: common SAVs, observed in
over 5% of the population; rare SAVs, observed in less than 1% of the population; and
intermediate SAVs, all in between (5%≥i≥1%), which were ignored in this study. SAVs were
classified as predicted to have a strong effect for SNAP2 scores above 0 (probability
gradually increasing for more positive scores), and as predicted to be neutral for SNAP2
scores below 0 (probability gradually increasing for more negative scores).
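The two classifications described above can be illustrated with a minimal sketch. The function and constant names below are hypothetical and not taken from the study's code; the cut-offs follow the text (over 5% is common, under 1% is rare, SNAP2 score above 0 is effect, below 0 is neutral). How a score of exactly 0 is treated is not stated in the text, so the sketch leaves it unassigned as an assumption.

```python
# Hypothetical helper names; thresholds follow the description in the text.
COMMON_MIN = 0.05  # allele frequency strictly above this counts as "common"
RARE_MAX = 0.01    # allele frequency strictly below this counts as "rare"


def frequency_class(allele_frequency: float) -> str:
    """Assign a SAV to the common, rare, or intermediate frequency class."""
    if allele_frequency > COMMON_MIN:
        return "common"
    if allele_frequency < RARE_MAX:
        return "rare"
    # 5% >= i >= 1%: ignored in the analysis described in the text
    return "intermediate"


def effect_class(snap2_score: float, threshold: float = 0.0) -> str:
    """Label a SAV by its SNAP2 score relative to a threshold (default 0)."""
    if snap2_score > threshold:
        return "effect"
    if snap2_score < threshold:
        return "neutral"
    # Assumption: a score exactly at the threshold is left unassigned.
    return "unassigned"


if __name__ == "__main__":
    for freq, score in [(0.20, 35), (0.001, -60), (0.03, 10)]:
        print(freq, score, frequency_class(freq), effect_class(score))
```

Keeping the threshold as a parameter makes it straightforward to repeat the classification with cut-offs other than 0.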

Table 1. Protein count during the pre-processing step of the dataset. Last two columns show the
amino acid and variant counts for the pre-processed dataset (16,644 proteins).
Dataset refining
The starting dataset contains 70,339 proteins with an available transmembrane prediction and
at least one SAV. Since there are only about 20,000 different proteins in the human exome, the
dataset likely contains multiple isoforms of the same proteins as well as small peptides, both of
which needed to be removed from the analysis.
To this purpose, we filtered the dataset by mapping the potential isoforms to their canonical
proteins: we mapped SAVs occurring in isoforms to the canonical protein sequences by
retrieving the respective protein sequences from UniProtKB and aligning them with Kalign [9].
Then we calculated the positions of all SAVs according to those alignments, ignoring regions
that differed in their amino acids between the isoforms and canonical sequences.
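The position mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the isoform and canonical sequences have already been aligned (for example with Kalign) into two equal-length strings that use "-" as the gap character, and the function name is hypothetical.

```python
from typing import Dict


def map_isoform_to_canonical(aligned_isoform: str,
                             aligned_canonical: str) -> Dict[int, int]:
    """Map 1-based isoform positions to 1-based canonical positions.

    Columns with a gap in either sequence, or with different residues in
    the two sequences, are skipped, so SAVs in those regions are ignored.
    """
    if len(aligned_isoform) != len(aligned_canonical):
        raise ValueError("aligned sequences must have the same length")

    mapping: Dict[int, int] = {}
    iso_pos = 0  # residues seen so far in the isoform
    can_pos = 0  # residues seen so far in the canonical sequence
    for iso_res, can_res in zip(aligned_isoform, aligned_canonical):
        if iso_res != "-":
            iso_pos += 1
        if can_res != "-":
            can_pos += 1
        if iso_res != "-" and iso_res == can_res:
            mapping[iso_pos] = can_pos
    return mapping


if __name__ == "__main__":
    # Toy alignment: the isoform lacks two residues and differs at one column.
    print(map_isoform_to_canonical("MK--LVA", "MKTTLIA"))
```

A SAV reported at an isoform position that is missing from the returned mapping would simply be dropped, which mirrors the rule of ignoring regions that differ between isoform and canonical sequence.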
In addition, proteins shorter than 50 amino acids were also discarded since they are, most
likely, not complete TMPs. Therefore, the total number of proteins in the dataset with available
topology prediction and SAVs was reduced to 16,644 (Table 1).
Validating the dataset
To validate the current dataset against the results obtained by the prior study [1], we
compared common and rare SAVs with strong effect on protein function. To obtain results
that are as comparable as possible, the same parameter values were used:
To define which SAVs are common and rare, we used the same allele frequency
thresholds: SAVs with an allele frequency of 0.05 or higher are classified as common,
SAVs with a frequency of 0.01 or lower are classified as rare. SAVs with a frequency
between 0.01 and 0.05 are classified as uncommon, and will not be used in order to
have two clearly separated groups.
To define which SAVs have a significant effect on protein function, we used the same
thresholds for the SNAP2 prediction score used in the previous work: SAVs with a
SNAP2 score higher than 0 are likely to have a more significant effect on protein
function than SAVs with a SNAP2 score below 0.

Total Proteins | Proteins After Refining | Proteins with SAVs + Prediction | Amino Acid Count | SAV Count
70,339 | 16,649 | 16,644 | 8,086,918 | 4,566,032
Filtering the TMP dataset
Once the dataset had been validated, we continued the study with the proteins that were
predicted to be TMPs by TMSEG. The number of TMPs in the current dataset is 4,527
(27.2% of all proteins).
In a more thorough look at this TMP dataset, we compare the proportion of amino acids and
variants to find possible enrichments. In addition, we repeat this comparison for different TMP
types (single-pass, multi-pass and 7-pass TMPs), paying special attention to 7-pass TMPs
(multi-pass TMPs that go through the membrane 7 times). This last group has special
biological relevance because it includes the G protein-coupled receptors (GPCRs) [10-13].
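The grouping by TMP type can be sketched as below, assuming the number of predicted TMHs per protein is already available (for instance from the TMSEG output). The function name and the decision to report 7-pass proteins as their own label, although they are a subset of multi-pass TMPs, are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def tmp_type(n_tmh: int) -> str:
    """Classify a protein by its number of predicted transmembrane helices."""
    if n_tmh < 0:
        raise ValueError("number of TMHs cannot be negative")
    if n_tmh == 0:
        return "non-TMP"
    if n_tmh == 1:
        return "single-pass"
    if n_tmh == 7:
        # Subset of multi-pass TMPs, reported separately in this sketch.
        return "7-pass"
    return "multi-pass"


def type_fractions(tmh_counts: Iterable[int]) -> Dict[str, float]:
    """Return the share of proteins falling into each TMP type."""
    counts = Counter(tmp_type(n) for n in tmh_counts)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Toy input: predicted TMH counts for eight hypothetical proteins.
    print(type_fractions([0, 0, 0, 0, 1, 2, 7, 12]))
```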
SAV analysis on the TMP dataset
Once the TMP group is examined, we investigate possible SAV enrichments within the protein
types and regions. To find these enrichments, we extract the number of common and rare
SAVs in the whole dataset and within each TMP type. Further, we compare the same groups
regarding SAVs that have been predicted to have a significant effect on protein function.
However, these thresholds do not separate the classes cleanly: a SAV with a predicted SNAP2
score of 10 may still have a near-neutral effect on protein function, and the threshold was
ultimately chosen arbitrarily. Likewise, the allele frequency thresholds used to define which
SAVs are common and which ones are rare were chosen arbitrarily, although an intermediate group
(uncommon variants) was defined so that common and rare variants are clearly separated.
For this reason, we also analyse the effect of different SNAP2 score thresholds: using
several graphs, we show the proportion of SAVs that are predicted to have an effect on protein
function for all threshold values from -100 to 100 with a step of 1.
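The threshold scan described above can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, the input is assumed to be a plain list of SNAP2 scores for one group of SAVs (e.g. common SAVs in TMHs), and a SAV is counted as having an effect when its score is strictly above the threshold.

```python
from typing import Dict, Sequence


def effect_fraction_by_threshold(scores: Sequence[float],
                                 start: int = -100,
                                 stop: int = 100,
                                 step: int = 1) -> Dict[int, float]:
    """Fraction of SAVs with score > threshold, for each threshold value."""
    if not scores:
        raise ValueError("at least one SNAP2 score is required")
    fractions: Dict[int, float] = {}
    for threshold in range(start, stop + 1, step):  # stop is inclusive
        n_effect = sum(1 for s in scores if s > threshold)
        fractions[threshold] = n_effect / len(scores)
    return fractions


if __name__ == "__main__":
    toy_scores = [-80, -20, 5, 40, 90]  # made-up values for illustration
    curve = effect_fraction_by_threshold(toy_scores)
    print(curve[0], curve[50])
```

Computing one such curve per group (common versus rare, per region or TMP type) gives the values that would be plotted against the threshold.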
Results and Discussion
Here we discuss the analysis of single amino acid variants (SAVs) in TMPs. First, we
validated our results against those from the previous work [1]. Then, we compared the effect

References
A global reference for human genetic variation. Adam Auton et al., 01 Oct 2015.
Analysis of protein-coding genetic variation in 60,706 humans. Monkol Lek et al., 18 Aug 2016.
How many drug targets are there?
High-Resolution Crystal Structure of an Engineered Human β2-Adrenergic G Protein–Coupled Receptor.
Kalign – an accurate and fast multiple sequence alignment algorithm.