A Network Propagation Approach to Prioritize Long Tail Genes in Cancer

Abstract
Introduction: The diversity of genomic alterations in cancer poses challenges to fully understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that reside in the "long tail" of the mutational distribution, uncovered new genes with significant implication in cancer development. The study of these genes often requires integrative approaches with multiple types of biological data. Network propagation methods have demonstrated high efficacy in uncovering genomic patterns underlying cancer using biological interaction networks. Yet, the majority of these analyses have focused their assessment on detecting known cancer genes or identifying altered subnetworks. In this paper, we introduce a network propagation approach that focuses on long tail genes with potential functional impact on cancer development. Results: We identify sets of often overlooked, rarely to moderately mutated genes whose biological interactions significantly propel their mutation frequency-based rank upwards during propagation in 17 cancer types. We call these sets "upward mobility genes" (UMGs, 42-81 genes per cancer type) and hypothesize that their significant rank improvement indicates functional importance. We validate UMGs' role in cancer cell survival in vitro using genome-wide RNAi and CRISPR databases and report new cancer-pathway associations based on UMGs that were not previously identified using driver genes alone. Conclusion: Our analysis extends the spectrum of cancer-relevant genes and identifies novel potential therapeutic targets.

Network Propagation-based Prioritization of Long Tail
Genes in 17 Cancer Types
Hussein Mohsen 1,*, Vignesh Gunasekharan 2, Tao Qing 2, Montrell Seay 3, Yulia Surovtseva 3,
Sahand Negahban 4, Zoltan Szallasi 5, Lajos Pusztai 2,*, Mark B. Gerstein 1,6,7,4,*
1 Computational Biology & Bioinformatics Program, Yale University, New Haven, CT 06511, USA
2 Breast Medical Oncology, Yale School of Medicine, New Haven, CT 06511, USA
3 Yale Center for Molecular Discovery, Yale University, West Haven, CT 06516, USA
4 Department of Statistics & Data Science, Yale University, New Haven, CT 06511, USA
5 Children’s Hospital Informatics Program, Harvard-MIT Division of Health Sciences and
Technology, Harvard Medical School, Boston, MA 02115, USA
6 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06511, USA
7 Department of Computer Science, Yale University, New Haven, CT 06511, USA
* Co-corresponding author
bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.05.429983; this version posted April 1, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Abstract
Introduction. The diversity of genomic alterations in cancer poses challenges to fully
understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that
reside in the “long tail” of the mutational distribution, uncovered new genes with significant
implication in cancer development. The study of these genes often requires integrative approaches
with multiple types of biological data. Network propagation methods have demonstrated high
efficacy in uncovering genomic patterns underlying cancer using biological interaction networks.
Yet, the majority of these analyses have focused their assessment on detecting known cancer genes
or identifying altered subnetworks. In this paper, we introduce a network propagation approach
that entirely focuses on long tail genes with potential functional impact on cancer development.
Results. We identify sets of often overlooked, rarely to moderately mutated genes whose
biological interactions significantly propel their mutation-frequency-based rank upwards during
propagation in 17 cancer types. We call these sets “upward mobility genes” (UMGs, 28-83 genes
per cancer type) and hypothesize that their significant rank improvement indicates functional
importance. We report new cancer-pathway associations based on UMGs that were not previously
identified using driver genes alone, validate UMGs’ role in cancer cell survival in vitro (alone
and compared to other network methods) using extensive genome-wide RNAi and CRISPR data
repositories, and further conduct in vitro functional screenings resulting in the validation of 8
previously unreported genes.
Conclusion. Our analysis extends the spectrum of cancer-relevant genes and identifies novel
potential therapeutic targets.
1. Background
Rapid developments in sequencing technologies allowed comprehensive cataloguing of somatic
mutations in cancer. Early mutation-frequency-based methods identified highly recurrent
mutations in different cancer types, many of which were experimentally validated as functionally
important in the transformation process and are commonly referred to as cancer driver mutations.
However, the biological hypothesis that recurrent mutations in a few driver genes account fully
for malignant transformation turned out to be overly simplistic. Recent studies indicate that some
cancers do not harbor any known cancer driver mutations, and all cancers carry a large number of
rarely recurrent mutations in unique combinations in hundreds of potentially cancer relevant genes
[1-7]. These genes are part of a long tail in mutation frequency distributions and are referred to as
long tail genes.
Many long tail mutations demonstrated functional importance in laboratory experiments, but
studying them all and assessing their combined impact is a daunting task for experimentalists. This
creates a need for new ways to estimate the functional importance and to prioritize long tail
mutations for functional studies. A central theme in finding new associations between genes and
diseases relies on the integration of multiple data types derived from gene expression analysis,
transcription factor binding, chromatin conformation, or genome sequencing and mechanistic
laboratory experiments. Protein-protein interaction (PPI) networks are comprehensive and readily
available repositories of biological data that capture interactions between gene products and can
be useful to identify novel gene-disease associations or to prioritize genes for functional studies.
In this paper, we rely on a framework that iteratively propagates information signals (i.e. mutation
scores or other quantitative metrics) between each network node (i.e. gene product) and its
neighbors.
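The iterative propagation described above can be illustrated with a minimal sketch. It assumes a common random-walk-with-restart style update with symmetric degree normalization; the exact update rule, the value of `alpha`, and the convergence tolerance used by the authors are not stated in this section, so all of them are assumptions here. The gene names and scores in the toy network are hypothetical.

```python
import math

def propagate(neighbors, initial_scores, alpha=0.8, tol=1e-9, max_iter=1000):
    """Iteratively spread scores over an undirected network.

    neighbors: dict mapping each node to the set of its adjacent nodes.
    initial_scores: dict mapping each node to its starting score
                    (e.g. a mutation-frequency-based score).
    alpha: assumed weight given to neighbors versus the node's own
           initial score (not taken from the paper).
    """
    # Degree of each node; isolated nodes get 1 to avoid division by zero.
    degree = {n: max(len(adj), 1) for n, adj in neighbors.items()}
    scores = dict(initial_scores)
    for _ in range(max_iter):
        updated = {}
        for node, adj in neighbors.items():
            # Each neighbor's score is down-weighted by both degrees,
            # so hubs do not dominate simply by having many edges.
            incoming = sum(
                scores[m] / math.sqrt(degree[node] * degree[m]) for m in adj
            )
            updated[node] = alpha * incoming + (1 - alpha) * initial_scores[node]
        change = sum(abs(updated[n] - scores[n]) for n in neighbors)
        scores = updated
        if change < tol:  # stop once scores have stabilized
            break
    return scores

# Hypothetical toy network: "quiet" has no mutation score of its own
# but interacts with three mutated genes; "lone" has a small score
# and a single unmutated neighbor.
toy_network = {
    "d1": {"quiet"}, "d2": {"quiet"}, "d3": {"quiet"},
    "quiet": {"d1", "d2", "d3"},
    "lone": {"leaf"}, "leaf": {"lone"},
}
toy_scores = {"d1": 1.0, "d2": 0.8, "d3": 0.6,
              "quiet": 0.0, "lone": 0.1, "leaf": 0.0}

final_scores = propagate(toy_network, toy_scores)
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
```

In this toy case the unmutated `quiet` gene ends with a higher score than `lone`, which is the kind of rank change a propagation-based prioritization looks for.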
Propagation methods have often leveraged information from genomic variation, biological
interactions derived from functional experiments, and pathway associations derived from the
biomedical literature. Studies consistently demonstrate the effectiveness of methods of this type in
uncovering new gene-disease and gene-drug associations using different network and score types.
Nitsch et al. [8] provided one of the early examples, using differential expression-based scores to
suggest genes implicated in disease phenotypes of transgenic mice. A study by Lee et al. shortly
followed to suggest candidate genes using similar propagation algorithms in Crohn’s disease and
type 2 diabetes [9]. Other early works that use propagation account for network properties such as
degree distributions [10] and topological similarity between genes [11-13] to predict protein
function or to suggest new candidate genes.
Cancer has been the focus of numerous network propagation studies. We divide these studies into
two broad categories: (A) methods that initially introduced network propagation into the study of
cancer, often requiring several data types, and (B) recent methods that utilize genomic variation,
often focusing on patient stratification and gene module detection (for a complete list, see [14]).
Köhler et al. [15] used random walks and diffusion kernels to highlight the efficacy of propagation
in suggesting gene-disease associations in multiple disease families including cancer. The authors
made comprehensive suggestions and had to choose a relatively low threshold (0.4) for edge
quality filtering to retain a large number of edges given the limitations in PPI data availability in
2008. Shortly afterwards, Vanunu et al. [16] introduced PRINCE, a propagation approach that
leverages disease similarity information, known disease-gene associations, and PPI networks to
infer relationships between complex traits (including prostate cancer) and genes. Propagation-
based studies in cancer rapidly cascaded to connect gene sequence variations to gene expression
changes using multiple diffusions [17], to generate features used to train machine learning models
that predict gene-disease associations in breast cancer, glioblastoma multiforme, and other cancer
types [18, 19], or to suggest drug targets in acute myeloid leukemia by estimating gene knockout
effects in silico [20].
Hofree et al. introduced network-based stratification (NBS) [21], an approach that runs
propagation over a PPI network to smooth somatic mutation signals in a cohort of patients before
clustering samples into subtypes using non-negative matrix factorization. Hierarchical HotNet [22]
is another approach that detects significantly altered subnetworks in PPI networks. It utilizes
propagation and scores derived from somatic mutation profiles as its first step to build a similarity
matrix between network nodes, constructs a threshold-based hierarchy of strongly connected
components, then selects the most significant hierarchy cutoff according to which mutated
subnetworks are returned. Hierarchical HotNet makes better gene selections than its counterparts
with respect to simultaneously considering known and candidate cancer genes, and it builds on
two earlier versions of HotNet (HotNet [23] and HotNet2 [24]).
These studies have addressed varying biological questions towards a better understanding of
cancer, and they have faced limitations with respect to (i) relying on multiple data types that might
not be readily available [17, 18], (ii) limited scope of biological analysis that often focused on a
single cancer type [17, 20], (iii) suggesting too many [20] or too few [19] candidate genes, or (iv)
being focused on finding connected subnetworks, which despite its demonstrated strength as an
approach to study cancer at a systems level might miss lone players or understudied genes [17, 22-
Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Network propagation-based prioritization of long tail genes in 17 cancer types"?

The study of these genes often requires integrative approaches.

For instance, a 𝛽 value of 0.1 in a PPI network with 10,000 nodes requires a gene’s position to improve by a minimum of 1,000 ranks. 

UMGs connected to driver subsets (i) and (ii) (olive and orange edges) and ones with no mutation score (e.g. POLR2E) are likely to be drug targets. 

For a node to rank higher, the best case scenario involves having near exclusive connections with multiple neighbors (k ≥ 1 steps) whose initial score is high. 

In this paper, the authors rely on a framework that iteratively propagates information signals (i.e. mutation scores or other quantitative metrics) between each network node (i.e. gene product) and its neighbors.

In the propagation framework the authors use, two of the most important factors that determine a node’s score after convergence are the number of high scoring nodes within its neighborhood and the connectivity of these neighbors. 

Biological enrichment analysis of UMGs, separately and in combination with known drivers, confirms already known functional importance of the UMGs and suggests new associations between cancer types and biological pathway alterations. 

Signal-to-background (S/B), coefficient of variation (CV) and Z prime factor (Z’) were calculated for each screening plate using mean and standard deviation values of the positive and negative controls to monitor assay performance. 
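As a rough illustration of these per-plate quality metrics, the sketch below uses the standard textbook definitions: CV as standard deviation over mean, and Z' as 1 - 3(SD_pos + SD_neg) / |mean_pos - mean_neg|. It assumes the negative control is the high-signal condition when forming S/B; the source does not state which control serves as signal, and the function name and well values are hypothetical.

```python
import statistics

def plate_qc(positive_wells, negative_wells):
    """Compute assay-quality metrics for one screening plate.

    positive_wells / negative_wells: raw readouts of the control wells.
    Assumption: negative controls give the high signal and positive
    controls the low background; swap the ratio if the assay is reversed.
    """
    mean_pos = statistics.mean(positive_wells)
    mean_neg = statistics.mean(negative_wells)
    sd_pos = statistics.stdev(positive_wells)  # sample standard deviation
    sd_neg = statistics.stdev(negative_wells)
    return {
        "signal_to_background": mean_neg / mean_pos,
        "cv_percent_positive": 100 * sd_pos / mean_pos,
        "cv_percent_negative": 100 * sd_neg / mean_neg,
        "z_prime": 1 - 3 * (sd_pos + sd_neg) / abs(mean_neg - mean_pos),
    }

# Hypothetical control readouts for a single plate.
qc = plate_qc(positive_wells=[10, 12, 11, 9],
              negative_wells=[100, 104, 98, 98])
```

A Z' close to 1 indicates well-separated controls with little spread; acceptance cutoffs vary by assay and are not given in this excerpt.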

Manual curation of literature further validates UMGs’ connection to cancer which could be overlooked by automated literature mining alone.

The authors made comprehensive suggestions and had to choose a relatively low threshold (0.4) for edge quality filtering to retain a large number of edges given the limitations in PPI data availability in 2008. 

While most UMGs are designated either potential drug targets or weak drivers, others are connected to multiple types of driver genes and accordingly might be considered for both (e.g. RBBP5 with multi-colored edges in Figure 5). 

These values ensure that to be considered a UMG, a gene has to jump hundreds to thousands of ranks during propagation depending on the PPI network and cancer type under study. 
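The rank-jump requirement can be made concrete with a small sketch. Following the β example given earlier (β = 0.1 with 10,000 nodes requiring an improvement of at least 1,000 ranks), it assumes the minimum improvement is β times the number of network nodes. The function names are hypothetical, and this covers only the rank-jump condition; any further UMG criteria used by the authors are not described in this excerpt.

```python
import math

def min_rank_improvement(beta, n_nodes):
    """Minimum number of ranks a gene must climb, assumed to be beta * n_nodes."""
    # The small epsilon guards against floating-point noise before rounding up.
    return math.ceil(beta * n_nodes - 1e-9)

def passes_rank_jump(rank_before, rank_after, beta, n_nodes):
    """Check the rank-jump condition; ranks are 1-based and smaller is better."""
    return (rank_before - rank_after) >= min_rank_improvement(beta, n_nodes)

threshold = min_rank_improvement(0.1, 10_000)

# Hypothetical gene that moves from rank 5000 to rank 3900 during propagation.
big_jump = passes_rank_jump(5000, 3900, beta=0.1, n_nodes=10_000)
# Hypothetical gene that moves from rank 5000 to rank 4500.
small_jump = passes_rank_jump(5000, 4500, beta=0.1, n_nodes=10_000)
```

Because the threshold scales with network size, the same β translates into a jump of hundreds of ranks in a small network and thousands in a large one.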
