How many ranks does a gene need to improve?

For instance, a 𝛽 value of 0.1 in a PPI network with 10,000 nodes requires a gene’s position to improve by a minimum of 1,000 ranks.
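The arithmetic behind this example can be sketched in a few lines. This is only a minimal sketch: it assumes the minimum rank improvement is 𝛽 multiplied by the number of network nodes, which matches the example above but is not spelled out further in this text. The function name `min_rank_improvement` is hypothetical.

```python
def min_rank_improvement(beta: float, n_nodes: int) -> int:
    """Smallest rank jump required, ASSUMING threshold = beta * n_nodes."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to be a fraction between 0 and 1")
    # round() guards against floating-point noise in the product
    return round(beta * n_nodes)


# The example from the text: beta = 0.1 in a network with 10,000 nodes
print(min_rank_improvement(0.1, 10_000))
```

Under this assumption, a smaller PPI network lowers the absolute number of ranks a gene has to climb for the same 𝛽.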

Which UMGs are likely to be drug targets?

UMGs connected to driver subsets (i) and (ii) (olive and orange edges) and ones with no mutation score (e.g. POLR2E) are likely to be drug targets.

What is the best-case scenario for a node to rank higher?

For a node to rank higher, the best case scenario involves having near exclusive connections with multiple neighbors (k ≥ 1 steps) whose initial score is high.

What are the most important factors that determine a node’s score after convergence?

In the propagation framework the authors use, two of the most important factors that determine a node’s score after convergence are the number of high scoring nodes within its neighborhood and the connectivity of these neighbors.

What is the significance of the biological enrichment analysis of UMGs?

Biological enrichment analysis of UMGs, separately and in combination with known drivers, confirms already known functional importance of the UMGs and suggests new associations between cancer types and biological pathway alterations.

Which metrics were used to monitor assay performance?

Signal-to-background (S/B), coefficient of variation (CV) and Z prime factor (Z’) were calculated for each screening plate using mean and standard deviation values of the positive and negative controls to monitor assay performance.
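The three plate-level metrics named above can be computed from control wells as sketched below. The formulas are the commonly used textbook definitions (S/B as a ratio of means, CV as standard deviation over mean, Z' as 1 - 3(σ_signal + σ_background) / |μ_signal - μ_background|); the text does not state which control serves as signal or whether sample or population standard deviation was used, so both choices here are assumptions. The function name `assay_metrics` and the readout values are made up for illustration and are not data from the paper.

```python
from statistics import mean, stdev


def assay_metrics(signal_ctrl, background_ctrl):
    """Plate QC metrics from control wells (sample standard deviation assumed)."""
    mu_s, mu_b = mean(signal_ctrl), mean(background_ctrl)
    sd_s, sd_b = stdev(signal_ctrl), stdev(background_ctrl)
    return {
        "S/B": mu_s / mu_b,                       # signal-to-background ratio
        "CV_signal": sd_s / mu_s,                 # coefficient of variation
        "CV_background": sd_b / mu_b,
        "Z_prime": 1 - 3 * (sd_s + sd_b) / abs(mu_s - mu_b),
    }


# Hypothetical readouts for one screening plate
metrics = assay_metrics([100, 110, 90], [10, 12, 8])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A Z' close to 1 indicates well-separated controls with little variation; which cutoff counts as acceptable is a per-assay decision that this sketch does not encode.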

What are the common categories of UMGs?

While most UMGs are designated either potential drug targets or weak drivers, others are connected to multiple types of driver genes and accordingly might be considered for both (e.g. RBBP5 with multi-colored edges in Figure 5).

What is the rank threshold for a gene to be considered a UMG?

These values ensure that to be considered a UMG, a gene has to jump hundreds to thousands of ranks during propagation depending on the PPI network and cancer type under study.
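The rank-jump criterion can be illustrated with a small sketch that compares a gene's rank before and after propagation. It shows only the rank-improvement part; the paper's full UMG definition may include further criteria that are not reproduced here. The helper names (`ranks`, `upward_movers`), the tie-breaking rule, the gene labels, and the scores are all hypothetical.

```python
def ranks(scores):
    """Map gene -> rank (1 = highest score); ties are broken by gene name."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def upward_movers(initial, propagated, min_jump):
    """Return genes whose rank improved by at least `min_jump` positions."""
    before, after = ranks(initial), ranks(propagated)
    jumps = {gene: before[gene] - after[gene] for gene in before}
    return {gene: jump for gene, jump in jumps.items() if jump >= min_jump}


# Toy scores: gene "D" starts with no mutation score but gains rank
initial = {"A": 9.0, "B": 5.0, "C": 1.0, "D": 0.0}
propagated = {"A": 6.0, "B": 2.0, "C": 3.0, "D": 4.0}
print(upward_movers(initial, propagated, min_jump=2))
```

In a real network the value of `min_jump` would be in the hundreds or thousands, as the answer above describes, rather than the toy value used here.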

(Open Access) A Network Propagation Approach to Prioritize Long Tail Genes in Cancer (2021) | Hussein Mohsen

Q: Who has a license to display the preprint in perpetuity?

In this paper, the authors rely on a framework that iteratively propagates information signals (i.e. mutation scores or other quantitative metrics) between each network node (i.e. gene product) and its neighbors. The copyright holder for the preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Q: Who has granted bioRxiv a license to display the preprint in perpetuity?

Manual curation of literature further validates UMGs’ connection to cancer, which could be overlooked by automated literature mining alone. The copyright holder for the preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Network Propagation-based Prioritization of Long Tail Genes in 17 Cancer Types

Hussein Mohsen 1,*, Vignesh Gunasekharan, Tao Qing, Montrell Seay, Yulia Surovtseva, Sahand Negahban, Zoltan Szallasi, Lajos Pusztai 2,*, Mark B. Gerstein 1,6,7,4,*

Computational Biology & Bioinformatics Program, Yale University, New Haven, CT 06511, USA

Breast Medical Oncology, Yale School of Medicine, New Haven, CT 06511, USA

Yale Center for Molecular Discovery, Yale University, West Haven, CT 06516, USA

Department of Statistics & Data Science, Yale University, New Haven, CT 06511, USA

Children’s Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA 02115, USA

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06511, USA

Department of Computer Science, Yale University, New Haven, CT 06511, USA

* Co-corresponding author

bioRxiv preprint doi: https://doi.org/10.1101/2021.02.05.429983; this version posted April 1, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Abstract

Introduction. The diversity of genomic alterations in cancer poses challenges to fully understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that reside in the “long tail” of the mutational distribution, uncovered new genes with significant implication in cancer development. The study of these genes often requires integrative approaches with multiple types of biological data. Network propagation methods have demonstrated high efficacy in uncovering genomic patterns underlying cancer using biological interaction networks. Yet, the majority of these analyses have focused their assessment on detecting known cancer genes or identifying altered subnetworks. In this paper, we introduce a network propagation approach that entirely focuses on long tail genes with potential functional impact on cancer development.

Results. We identify sets of often overlooked, rarely to moderately mutated genes whose biological interactions significantly propel their mutation-frequency-based rank upwards during propagation in 17 cancer types. We call these sets “upward mobility genes” (UMGs, 28-83 genes per cancer type) and hypothesize that their significant rank improvement indicates functional importance. We report new cancer-pathway associations based on UMGs that were not previously identified using driver genes alone, validate UMGs’ role in cancer cell survival in vitro (alone and compared to other network methods) using extensive genome-wide RNAi and CRISPR data repositories, and further conduct in vitro functional screenings resulting in the validation of 8 previously unreported genes.

Conclusion. Our analysis extends the spectrum of cancer-relevant genes and identifies novel potential therapeutic targets.

1. Background

Rapid developments in sequencing technologies allowed comprehensive cataloguing of somatic mutations in cancer. Early mutation-frequency-based methods identified highly recurrent mutations in different cancer types, many of which were experimentally validated as functionally important in the transformation process and are commonly referred to as cancer driver mutations. However, the biological hypothesis that recurrent mutations in a few driver genes account fully for malignant transformation turned out to be overly simplistic. Recent studies indicate that some cancers do not harbor any known cancer driver mutations, and all cancers carry a large number of rarely recurrent mutations in unique combinations in hundreds of potentially cancer relevant genes [1-7]. These genes are part of a long tail in mutation frequency distributions and referred to as “long tail” genes.

Many long tail mutations demonstrated functional importance in laboratory experiments, but studying them all and assessing their combined impact is a daunting task for experimentalists. This creates a need for new ways to estimate the functional importance and to prioritize long tail mutations for functional studies. A central theme in finding new associations between genes and diseases relies on the integration of multiple data types derived from gene expression analysis, transcription factor binding, chromatin conformation, or genome sequencing and mechanistic laboratory experiments. Protein-protein interaction (PPI) networks are comprehensive and readily available repositories of biological data that capture interactions between gene products and can be useful to identify novel gene-disease associations or to prioritize genes for functional studies. In this paper, we rely on a framework that iteratively propagates information signals (i.e. mutation scores or other quantitative metrics) between each network node (i.e. gene product) and its neighbors.
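To make the idea of iterative propagation concrete, the sketch below uses a random walk with restart, a formulation widely used for network propagation: F ← α·W·F + (1 − α)·F0. This passage does not state the exact update rule, normalization, or parameter values the authors use, so the rule, the restart weight `alpha`, the convergence tolerance, and the toy network are all assumptions made for illustration.

```python
def propagate(adjacency, initial, alpha=0.8, tol=1e-9, max_iter=1000):
    """Iterate F <- alpha * W F + (1 - alpha) * F0 until scores stabilise.

    adjacency: dict mapping each node to the set of its neighbours
               (assumed undirected, i.e. symmetric).
    initial:   dict mapping each node to its starting score F0.
    W is taken to be degree-normalised: every node splits its current
    score equally among its neighbours (an assumption of this sketch).
    """
    scores = dict(initial)
    for _ in range(max_iter):
        updated = {}
        for node, neighbours in adjacency.items():
            incoming = sum(scores[nb] / len(adjacency[nb]) for nb in neighbours)
            updated[node] = alpha * incoming + (1 - alpha) * initial[node]
        change = max(abs(updated[n] - scores[n]) for n in adjacency)
        scores = updated
        if change < tol:
            break
    return scores


# Toy path network: "driver" - "x" - "y"; only "driver" starts with a score
network = {"driver": {"x"}, "x": {"driver", "y"}, "y": {"x"}}
start = {"driver": 1.0, "x": 0.0, "y": 0.0}
for gene, score in sorted(propagate(network, start).items()):
    print(gene, round(score, 4))
```

In this toy run, the node adjacent to the high-scoring node ends up with a higher score than the node two steps away, which illustrates how neighbourhood scores and connectivity shape the converged values.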

Propagation methods have often leveraged information from genomic variation, biological interactions derived from functional experiments, and pathway associations derived from the biomedical literature. Studies consistently demonstrate the effectiveness of this type of method in uncovering new gene-disease and gene-drug associations using different network and score types. Nitsch et al. [8] is one of the early examples that used differential expression-based scores to suggest genes implicated in disease phenotypes of transgenic mice. A study by Lee et al. shortly followed to suggest candidate genes using similar propagation algorithms in Crohn’s disease and type 2 diabetes [9]. Other early works that use propagation account for network properties such as degree distributions [10] and topological similarity between genes [11-13] to predict protein function or to suggest new candidate genes.

Cancer has been the focus of numerous network propagation studies. We divide these studies into two broad categories: (A) methods that initially introduced network propagation into the study of cancer, often requiring several data types, and (B) recent methods that utilize genomic variation, often focusing on patient stratification and gene module detection (for a complete list, see [14]). Köhler et al. [15] used random walks and diffusion kernels to highlight the efficacy of propagation in suggesting gene-disease associations in multiple disease families including cancer. The authors made comprehensive suggestions and had to choose a relatively low threshold (0.4) for edge quality filtering to retain a large number of edges given the limitations in PPI data availability in 2008. Shortly afterwards, Vanunu et al. [16] introduced PRINCE, a propagation approach that leverages disease similarity information, known disease-gene associations, and PPI networks to infer relationships between complex traits (including prostate cancer) and genes. Propagation-based studies in cancer rapidly cascaded to connect gene sequence variations to gene expression changes using multiple diffusions [17], to generate features used to train machine learning models that predict gene-disease associations in breast cancer, glioblastoma multiforme, and other cancer types [18, 19], or to suggest drug targets in acute myeloid leukemia by estimating gene knockout effects in silico [20].

Hofree et al. introduced network-based stratification (NBS) [21], an approach that runs propagation over a PPI network to smoothen somatic mutation signals in a cohort of patients before clustering samples into subtypes using non-negative matrix factorization. Hierarchical HotNet [22] is another approach that detects significantly altered subnetworks in PPI networks. It utilizes propagation and scores derived from somatic mutation profiles as its first step to build a similarity matrix between network nodes, constructs a threshold-based hierarchy of strongly connected components, then selects the most significant hierarchy cutoff according to which mutated subnetworks are returned. Hierarchical HotNet makes better gene selections than its counterparts with respect to simultaneously considering known and candidate cancer genes, and it builds on two earlier versions of HotNet (HotNet [23] and HotNet2 [24]).

These studies have addressed varying biological questions towards a better understanding of cancer, and they have faced limitations with respect to (i) relying on multiple data types that might not be readily available [17, 18], (ii) limited scope of biological analysis that often focused on a single cancer type [17, 20], (iii) suggesting too many [20] or too few [19] candidate genes, or (iv) being focused on finding connected subnetworks, which despite its demonstrated strength as an approach to study cancer at a systems level might miss lone players or understudied genes [17, 22-

Citations

Network propagation-based prioritization of long tail genes in 17 cancer types.

A network medicine approach for identifying diagnostic and prognostic biomarkers and exploring drug repurposing in human cancer

Esearch3D: propagating gene expression in chromatin networks to illuminate active enhancers

Inferring time-aware models of cancer progression using Timed Hazard Networks

Tumour Genetic Heterogeneity in Relation to Oral Squamous Cell Carcinoma and Anti-Cancer Treatment

References

Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data

Circos: An information aesthetic for comparative genomics

Mutational heterogeneity in cancer and the search for new cancer-associated genes

Related Papers (5)

A comparative analysis of network mutation burdens across 21 tumor types augments discovery from cancer genomes

Network-Based Coverage of Mutational Profiles Reveals Cancer Genes

Discovery of mutated subnetworks associated with clinical data in cancer.

Algorithms for detecting significantly mutated pathways in cancer

Functional genomics of cancer.
