
pDriver: A novel method for unravelling personalised coding and miRNA cancer drivers

TL;DR: pDriver, a novel method for discovering personalised cancer drivers, is reported to be more effective than other methods on five TCGA datasets. It can also detect miRNA cancer drivers, most of which have been confirmed by the literature to be associated with cancer.
Abstract: Motivation Unravelling cancer driver genes is important in cancer research. Although computational methods have been developed to identify cancer drivers, most of them detect cancer drivers at population level. However, two patients who have the same cancer type and receive the same treatment may have different outcomes because each patient has a different genome and their disease might be driven by different driver genes. Therefore new methods are being developed for discovering cancer drivers at individual level, but existing personalised methods only focus on coding drivers while microRNAs (miRNAs) have been shown to drive cancer progression as well. Thus, novel methods are required to discover both coding and miRNA cancer drivers at individual level. Results We propose the novel method, pDriver, to discover personalised cancer drivers. pDriver includes two stages: (1) Constructing gene networks for each cancer patient and (2) Discovering cancer drivers for each patient based on the constructed gene networks. To demonstrate the effectiveness of pDriver, we have applied it to five TCGA cancer datasets and compared it with the state-of-the-art methods. The result indicates that pDriver is more effective than other methods. Furthermore, pDriver can also detect miRNA cancer drivers and most of them have been confirmed to be associated with cancer by literature. We further analyse the predicted personalised drivers for breast cancer patients and the result shows that they are significantly enriched in many GO processes and KEGG pathways involved in breast cancer. Availability and implementation pDriver is available at https://github.com/pvvhoang/pDriver Contact Thuc.Le@unisa.edu.au Supplementary information Supplementary data are available at Bioinformatics online.

Summary

1 Introduction

  • Cancer driver genes play a vital role in cancer initialisation and development.
  • Unravelling cancer drivers and their regulatory mechanisms is critical to the understanding of cancer and the design of effective cancer treatments.
  • Many computational methods have been developed to identify cancer drivers, mainly including mutation-based methods and network-based methods.

2.1 Datasets

  • The authors use five TCGA datasets (The Cancer Genome Atlas Research et al., 2013): breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), kidney renal clear cell carcinoma (KIRC), head and neck squamous cell carcinoma (HNSC).
  • These datasets contain the gene expression data of cancer patients and they are used to construct gene networks.
  • The TF list is obtained from Lizio et al. (2017) and the authors use this list to distinguish TF genes from other coding genes.
  • Following the idea of LIONESS (Kuijjer et al., 2019) of building sample specific gene regulatory networks (details in Section 2.2.2), the authors build the miRNA-TF-mRNA network for each patient based on the matched miRNA, TF, and mRNA expression data.
  • Furthermore, to keep edges with strong interactions, the authors remove an edge from the patient-specific network if the absolute value of its weight is less than a threshold, which is the mean of the absolute weights of all edges in the network.

2.2.2 Constructing personalised gene regulation networks

  • pDriver applies LIONESS (Kuijjer et al., 2019) to build gene regulatory networks for single patients.
  • LIONESS considers that the network estimated from all samples is the average of the sample-specific networks which are estimated from individual samples.
  • Let $e^{(\alpha)}_{ij}$ be the weight of the edge between node i and node j in the network obtained from all N samples (the authors denote the network as α).
  • Since the output network from LIONESS contains all edges among nodes (i.e. a fully connected network), it may include untrue edges.
  • To overcome this limitation, in pDriver, the authors refine the network obtained by LIONESS by using the existing databases, including protein protein interactions, miRNA-TF/mRNA interactions, and TF-miRNA interactions.

2.2.3 Discovering personalised cancer drivers

  • Based on Kalman’s controllability rank condition (Kalman, 1963), the state and state transition of a system/network is fully controlled by a subset of nodes, but finding such a subset of nodes is computationally expensive or prohibitive for large networks like gene regulation networks.
  • Recently Liu et al. (2011) has provided an analytical method (called Network Control method) to detect the driver node set in a complex system modelled by a weighted directed network.
  • Nevertheless, the authors only focus on the driver node set which has the smallest number of driver nodes, called the Minimum Driver Node Set (MDNS), a minimum subset of nodes which can fully control the network (Liu et al., 2011) (The details are discussed in Section 1 of the Supplement).
  • After the authors have the MDNS, they detect critical nodes of the network by removing nodes one by one out of the network.
  • Because removing a critical node increases the size of the MDNS (that is, more nodes must be acted on to control the whole network), the critical nodes play the central role in controlling the network and they are considered as candidate cancer drivers.

3 Results

3.1 pDriver is robust in identifying coding cancer drivers

  • ActiveDriver, DriverML, MutSigCV, and OncodriveFM are mutation-based methods while DawnRank, DriverNet, PNC, and SCS are network-based methods.
  • The authors use CGC as the ground truth for predicted coding driver genes.
  • Furthermore, the authors use F1Score to measure the performance of the methods.
  • Furthermore, to see if pDriver discovers similar cancer driver genes as the other methods, the authors compare the driver genes discovered by pDriver with those discovered by the top 5 performing methods (i.e. PNC, ActiveDriver, DawnRank, MutSigCV, and DriverML) among the eight.
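
The evaluation described in the list above can be sketched as a set comparison. This is a minimal illustration, assuming the F1 score is computed from the overlap between a predicted driver set and a reference set such as CGC; the function name `f1_against_reference` and the gene labels are hypothetical, and the summary does not state how many top-ranked predictions are scored.

```python
def f1_against_reference(predicted, reference):
    """Precision, recall and F1 of a predicted gene set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)      # predicted genes found in the reference
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy sets (placeholders, not actual CGC content).
predicted_drivers = {"geneA", "geneB", "geneC", "geneD"}
reference_drivers = {"geneA", "geneB", "geneX", "geneY", "geneZ"}
precision, recall, f1 = f1_against_reference(predicted_drivers, reference_drivers)
```

With these toy sets two of the four predictions are in the reference, so precision is 0.5 and recall is 0.4.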

3.2 Detecting miRNA cancer drivers

  • In addition to identifying coding cancer drivers, pDriver can also discover miRNA cancer drivers.
  • The percentages of the predicted miRNA drivers in OncomiR for the five cancer types (BRCA, LUAD, LUSC, KIRC, and HNSC) are shown in Fig.
  • Similarly, Luo et al. (2017) considers hsa-miR-1293 as a prognostic biomarker for kidney cancer.

3.3 Discovering personalised cancer drivers

  • The authors discuss the ability of pDriver to discover personalised coding and miRNA cancer drivers, which differentiates pDriver from other existing methods.
  • Among the predicted rare coding drivers, there are some significant genes such as JUN, CREB1, and ID2, which are enriched in numerous biological processes and pathways.
  • The authors further analyse all predicted cancer drivers of the breast cancer patient TCGA-AC-A62Y and they see that this patient has 91 coding cancer drivers with 37 driver genes in CGC.
  • The detected miRNA driver hsa-miR-935 is novel and it can be used as a candidate cancer driver in wet-lab experiments to confirm its role in the cancer development of the patient TCGA-AC-A62Y.

3.4 Distribution of personalised rare cancer drivers in breast cancer subtypes

  • Breast cancer includes numerous subtypes and each subtype has specific morphologies as well as clinical outcomes.
  • Thus, to elucidate the difference of breast cancer subtypes, the authors analyse the distribution of personalised rare cancer drivers in patients across the four major breast cancer subtypes, including Basal, Her2, Luminal A (LumA), and Luminal B (LumB).
  • The distribution results are shown in Fig. 5. As can be seen from the figure, the distribution of rare drivers in Basal patients is different from the distributions of patients in other subtypes.
  • In other words, they are driven by different rare drivers.
  • On the other hand, patients of other subtypes such as LumA show less heterogeneity and have a good prognosis.

3.5 The effectiveness of the Network Control method for finding influential nodes in a network

  • The authors set the weight of a node (gene) as the absolute difference between average expression of the gene in normal state and its average expression in tumour state.
  • Intuitively, if the total effect from active neighbouring nodes on an inactive node is strong enough, they can change the state of that node from inactive to active.
  • The authors use the BRCA dataset to evaluate these methods in identifying driver genes in patient-specific networks.
  • The reason for this result may come from the fact that gene regulation is related to the control mechanism, which has been captured by the Network Control method while the other two methods discover influential nodes from the perspective of information propagation.

4 Conclusion

  • Because each cancer patient possesses a different genome, their disease may be driven by different cancer driver genes.
  • As a result, two cancer patients who have the same cancer type and experience the same treatment may have different outcomes.
  • Thus, it is necessary to develop novel methods to identify personalised cancer drivers, including both coding and non-coding drivers, to elucidate their regulatory mechanism in cancer patients.
  • The authors have assessed the performance of pDriver with different experiments.
  • pDriver can discover miRNA cancer drivers and most of them are confirmed to be involved in cancer by literature.


Bioinformatics
Original Paper
Systems Biology
pDriver: A novel method for unravelling personalised coding and miRNA cancer drivers
Vu VH Pham (1), Lin Liu (1), Cameron P Bracken (2,3), Thin Nguyen (4), Gregory J Goodall (2,3), Jiuyong Li (1) and Thuc D Le (1,*)
(1) UniSA STEM, University of South Australia, Mawson Lakes, SA 5095, Australia
(2) Centre for Cancer Biology, an alliance of SA Pathology and University of South Australia, Adelaide, SA 5000, Australia
(3) Department of Medicine, The University of Adelaide, Adelaide, SA 5005, Australia
(4) Applied Artificial Intelligence Institute, Deakin University, Australia
(*) To whom correspondence should be addressed.
Abstract
Motivation: Unravelling cancer driver genes is important in cancer research. Although computational
methods have been developed to identify cancer drivers, most of them detect cancer drivers at population
level. However, two patients who have the same cancer type and receive the same treatment may have
different outcomes because each patient has a different genome and their disease might be driven by
different driver genes. Therefore new methods are being developed for discovering cancer drivers at
individual level, but existing personalised methods only focus on coding drivers while microRNAs (miRNAs)
have been shown to drive cancer progression as well. Thus, novel methods are required to discover both
coding and miRNA cancer drivers at individual level.
Results: We propose the novel method, pDriver, to discover personalised cancer drivers. pDriver includes
two stages: (1) Constructing gene networks for each cancer patient and (2) Discovering cancer drivers
for each patient based on the constructed gene networks. To demonstrate the effectiveness of pDriver,
we have applied it to five TCGA cancer datasets and compared it with the state-of-the-art methods. The
result indicates that pDriver is more effective than other methods. Furthermore, pDriver can also detect
miRNA cancer drivers and most of them have been confirmed to be associated with cancer by literature.
We further analyse the predicted personalised drivers for breast cancer patients and the result shows that
they are significantly enriched in many GO processes and KEGG pathways involved in breast cancer.
Availability and implementation: pDriver is available at https://github.com/pvvhoang/pDriver
Contact: Thuc.Le@unisa.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Cancer driver genes play a vital role in cancer initialisation and deve-
lopment. Unravelling cancer drivers and their regulatory mechanisms
is critical to the understanding of cancer and the design of effective
cancer treatments. Many computational methods have been developed
to identify cancer drivers, mainly including mutation-based methods and
network-based methods. Mutation-based methods discover cancer drivers
by investigating the characteristics of mutations. For instance, MutSi-
gCV (Lawrence et al., 2013) evaluates the significance of mutations in
genes, OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012) and Dri-
verML (Han et al., 2019) examine the functional impact of mutations,
OncodriveCLUST (Tamborero et al., 2013) uses recurrence of mutations,
ActiveDriver (Reimand and Bader, 2013) looks at enrichment in externally
defined regions, and CoMEt (Leiserson et al., 2015) uses mutual exclusi-
vity. Network-based methods detect cancer drivers by evaluating the role
of genes in biological networks like DriverNet (Bashashati et al., 2012),
MEMo (Ciriello et al., 2012), HotNet (Reyna et al., 2018), NetSig (Horn
et al., 2018), and CBNA (Pham et al., 2019). All these methods detect
cancer drivers at the population level. However, different patients have
different genomes and their diseases may be driven by different genes,
and therefore, two patients having the same cancer type and receiving
the same treatment may have different outcomes. Thus, there is a need
to study cancer drivers specific to individual patients, called personalised
cancer drivers in this paper.

© The Author 2020. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.23.058727; this version posted April 25, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Recently, some methods have been developed to identify personalised
cancer drivers such as DawnRank (Hou and Ma, 2014), SCS (Guo et al.,
2018), and PNC (Guo et al., 2019). DawnRank assumes that mutated genes
with higher connectivity in the gene regulatory network are more impactful
and identifies such genes by applying PageRank (Page et al., 1998; Brin
and Page, 1998) to the gene network. To evaluate the influence of genes,
DawnRank uses each individual patient’s gene expression data but the
same gene network is used for all patients. On the other hand, SCS builds
a gene network for each patient using the patient’s gene expression data
and the gene expression data of the neighbour (i.e. a normal sample). SCS
identifies cancer drivers of a patient as the minimal set of mutated genes
which controls the maximum number of differentially expressed genes in
that patient’s network. Similarly, PNC uses the gene expression data of a
tumour and its neighbour to build a personalised network. However, PNC
only keeps edges for which the co-expression p-value of the two nodes is less
than 0.05 in one state and greater than 0.05 in the other state. Then it
converts the gene network to a bipartite graph. In the bipartite graph, the
nodes on the top represent genes and the nodes on the bottom represent
edges. The cancer drivers predicted by PNC are the minimum set of genes
on the top which covers all the edges on the bottom in the bipartite graph.
Although these methods can be used to identify personalised cancer
driver genes, they have their own limitations. DawnRank uses the same
gene regulatory network for all patients and ignores the network informa-
tion which is specific to each patient (Hou and Ma, 2014). Although SCS
and PNC integrate the genetic data of each patient to build personalised
gene networks, they both require the gene expression data of a tumour and
its normal neighbour. However, identifying the neighbour of a tumour is
not easy and in some cases, the normal neighbour does not exist. Further-
more, these methods only uncover coding driver genes while cancer drivers
might be non-coding genes (e.g. miRNAs). Because protein-coding regi-
ons only account for about two percent of the human genome (Yang et al.,
2016a), a large percentage of mutations might exist in non-coding regions,
and thus non-coding genes may act as cancer drivers too (Puente et al.,
2015; Weinhold et al., 2014). Consequently, novel and effective methods
are needed for both coding and non-coding personalised cancer drivers.
In this paper, we develop a novel method, pDriver, to discover personalised
coding and miRNA cancer drivers. pDriver addresses the problems of
existing methods because it is designed with personalised gene regulation
in mind and treats the gene regulation network as a control system whose
functioning is driven by some key components. This view has led us to adopt
and combine techniques from different disciplines to develop pDriver.
Similar to existing methods such as SCS and PNC, pDriver takes advantage
of gene regulatory networks and constructs a gene regulatory network
for each patient. Unlike SCS and PNC, however, pDriver does not require
the gene expression data of the neighbour of a patient, which greatly
enhances its usability since in practice it is often difficult to find a matching
"neighbour" for a patient. Furthermore, pDriver provides more comprehensive
coverage of personalised driver discovery by considering both
coding and non-coding drivers.
In particular, to make use of gene network information specific to a
patient, we firstly build gene regulatory networks for each cancer patient
based on the matched mRNA, Transcription Factor (TF), and miRNA
expression data of the patient. Applying LIONESS, a method to estimate
sample-specific regulatory networks (Kuijjer et al., 2019), we build the
gene network for a patient based on the difference between the network
built from the data of all patients and the network built from the data of
all patients except the patient under consideration. Then based on the dire-
cted PPI network (Vinayagam et al., 2011) and the existing gene interaction
databases we refine each patient’s gene network by removing interactions
not supported by the PPIs and the gene interaction databases. We further
remove edges with a low weight to keep only edges which have a strong
connection for each patient. To predict cancer drivers of a patient, we need
to identify genes which play the critical role in controlling the whole gene
network of the patient. System controllability has been a central topic
studied for decades, especially in the engineering discipline. However, it is
computationally prohibitive to apply classic control theory, such as
Kalman’s controllability theory (Kalman, 1963), to a gene regulation network
due to its high complexity. Fortunately, recent work by Liu et al. (2011)
has provided an analytical method to identify the set of driver nodes
in a complex system modelled by a weighted directed network. According to
the Network Control method of Liu et al. (2011), a system contains a set
of critical nodes: removing any one of them means that more nodes are
required to control the network, which nicely mimics the role of a cancer
driver in a gene regulatory network. Thus we adopt this Network Control
method to identify cancer drivers as such critical nodes. As we will show in
Section 3.5, the Network Control method outperforms other methods for
finding influential nodes in a network, since it captures the control
mechanism of gene regulation while the others discover influential
nodes only from the information propagation perspective.
We apply pDriver to five TCGA cancer datasets and validate the results
with the Cancer Gene Census (CGC), in comparison with the state-of-
the-art cancer driver prediction methods, including 3 personalised driver
prediction methods (DawnRank (Hou and Ma, 2014), PNC (Guo et al.,
2019), and SCS (Guo et al., 2018)) and 5 population level driver predi-
ction methods (ActiveDriver (Reimand and Bader, 2013), DriverML (Han
et al., 2019), DriverNet (Bashashati et al., 2012), MutSigCV (Lawre-
nce et al., 2013), and OncodriveFM (Gonzalez-Perez and Lopez-Bigas,
2012)). Since there is no ground truth available for individual patients’
cancer drivers, following the same approach in existing literature on per-
sonalised driver prediction, we aggregate the results of personalised coding
drivers discovered by pDriver (and the other 3 personalised driver predi-
ction methods) for the comparison. In validating the miRNA cancer drivers
discovered by pDriver, we use OncomiR (Wong et al., 2018), a database of
miRNA dysregulation in pan-cancer. Over 50% of the discovered miRNA
drivers are in OncomiR and several predicted miRNA drivers are confirmed
to be involved in different cancer types by literature.
In addition, the personalised drivers found by pDriver are significantly
enriched in various GO biological processes and KEGG pathways related
to cancer. We focus on personalised rare coding drivers, which may be
drivers of many patients but have a low mutation frequency, as these
drivers are usually missed by other cancer driver identification methods,
especially methods based on gene mutations. The analysis of the distri-
bution of the personalised rare coding drivers found by pDriver in breast
cancer subtypes reveals that Basal cancer patients are driven by different
rare coding drivers while other subtypes such as Luminal A cancer patients
are usually driven by the same rare coding drivers. This finding may be
associated with the great heterogeneity of Basal breast cancer.
2 Datasets and methods
2.1 Datasets
In this paper, we use five TCGA datasets (The Cancer Genome Atlas Resea-
rch et al., 2013): breast invasive carcinoma (BRCA), lung adenocarcinoma
(LUAD), lung squamous cell carcinoma (LUSC), kidney renal clear cell
carcinoma (KIRC), head and neck squamous cell carcinoma (HNSC).

[Figure 1: workflow diagram. Panel (1), "Construct personalised regulation network": a samples-by-genes expression matrix yields the network from all samples and the network from all samples except q; (*) obtaining network of sample q (refer to Eq. 10); (**) refining network using protein protein interactions, miRNA-TF/mRNA interactions, and TF-miRNA interactions; (***) removing edges with low Pearson correlation coefficients, giving the refined network of sample q. Panel (2), "Identify personalised cancer drivers": legend entries are gene, driver node, removed node, and cancer driver; MDNS: Minimum Driver Node Set (i.e. nodes with a dash arrow); the example networks are labelled |MDNS| = 3 and |MDNS| = 4.]
Fig. 1. An illustration of pDriver. (1) Building the gene network for each cancer patient based on gene expression data and refine these patient-specific networks using existing gene
interaction databases (including protein protein interactions, miRNA-TF/mRNA interactions, and TF-miRNA interactions) to remove unreal interactions and using Pearson correlation
coefficients between genes to keep only edges which have a strong correlation in each patient, and (2) Identifying coding and miRNA cancer drivers for each patient by evaluating the role
of genes in the personalised network.

These datasets contain the gene expression data of cancer patients and
they are used to construct gene networks. The TF list is obtained from
Lizio et al. (2017) and we use this list to distinguish TF genes from
other coding genes. Several gene interaction datasets are used to refine
the constructed gene networks, including PPIs (Vinayagam et al., 2011),
TransmiR 2.0 (Wang et al., 2010), TargetScan 7.0 (Agarwal et al., 2015),
miRTarBase 6.1 (Chou et al., 2016), TarBase 7.0 (Vlachos et al., 2015),
and miRWalk 2.0 (Dweep and Gretz, 2015). The details of these data-
sets will be discussed in the sections below and they are available at
https://github.com/pvvhoang/pDriver.
2.2 pDriver
2.2.1 The workflow of pDriver
As shown in Fig. 1, pDriver contains two stages to identify personalised
coding and miRNA cancer drivers: (1) Constructing the miRNA-TF-
mRNA network for each cancer patient, and (2) Identifying coding and
miRNA cancer drivers for each patient based on the personalised network
built in the first stage. The details of the two stages are described below.
Stage (1) Constructing personalised miRNA-TF-mRNA networks
Step 1. Prepare gene expression data. We obtain the expression data of
coding genes and miRNAs of matched samples from the TCGA datasets
(BRCA, LUAD, LUSC, KIRC, and HNSC) by keeping the samples which
have both coding expression data and miRNA expression data for the five
datasets respectively. We firstly take the intersection of the coding genes
in a TCGA dataset with the genes in the PPI network (Vinayagam et al.,
2011) to get the list of coding genes with respect to that TCGA dataset.
Then in the list of coding genes obtained, we use the TF list to distinguish
TFs and mRNAs (i.e. the other coding genes excluding TF genes). We keep
all obtained TFs since TFs play the key role in cell function (Vaquerizas
et al., 2009) and can be potential cancer drivers. However, for mRNAs
and miRNAs, to reduce the number of genes, we firstly calculate the mean
expression value for each gene and the standard deviation of each gene’s
expression. Then we select the top 100 genes with the highest standard
deviations and remove the other genes from the dataset. After the sample
and gene selection, we have the expression data of 747 samples for BRCA,
445 samples for LUAD, 336 samples for LUSC, 240 samples for KIRC, and
485 samples for HNSC, for 100 miRNAs and 939 coding genes (including
both TFs and mRNAs).
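
The variability filter in Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_most_variable`, the dictionary layout, and the toy values are hypothetical, and the sketch does not reproduce how the filter is applied separately to mRNAs and miRNAs.

```python
from statistics import stdev

def select_most_variable(expr, top_k=100):
    """Return the names of the top_k genes with the highest standard deviation.

    expr maps a gene name to its list of expression values across samples.
    """
    # Sample standard deviation of each gene's expression across patients.
    sd = {gene: stdev(values) for gene, values in expr.items()}
    # Sort by decreasing standard deviation; ties are broken by gene name.
    ranked = sorted(sd, key=lambda gene: (-sd[gene], gene))
    return ranked[:top_k]

# Toy data (hypothetical values): three miRNAs measured in four patients.
toy = {
    "miR-a": [1.0, 1.1, 0.9, 1.0],
    "miR-b": [0.0, 5.0, 10.0, 15.0],
    "miR-c": [2.0, 4.0, 2.0, 4.0],
}
selected = select_most_variable(toy, top_k=2)
```

Here `miR-a` barely varies across the four patients, so it is the gene that is dropped when two genes are kept.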
Step 2. Build miRNA-TF-mRNA network for each cancer patient.
Following the idea of LIONESS (Kuijjer et al., 2019) of building sam-
ple specific gene regulatory networks (details in Section 2.2.2), we build
the miRNA-TF-mRNA network for each patient based on the matched
miRNA, TF, and mRNA expression data. We firstly build a miRNA-TF-
mRNA network using all patients’ expression data. Then for each patient,
we build a miRNA-TF-mRNA network using the expression data of all
patients excluding the current patient. In the two networks, nodes repre-
sent miRNAs/TFs/mRNAs. The edge weight is the Pearson correlation
between the two nodes. Next, based on the two networks, we infer the
network for the patient using the method in LIONESS.
Step 3. Refine patient-specific gene regulatory networks. For the refi-
nement, we remove from each patient-specific network obtained in the
previous Step the following: (1) the TF-TF, TF-mRNA, and mRNA-mRNA
interactions which are not in the PPI network (Vinayagam et al., 2011)
since the interactions in this network are important for cellular informa-
tion processing; (2) the miRNA-TF and miRNA-mRNA interactions which
are not in TargetScan, miRTarBase, TarBase, or miRWalk; and (3) the
TF-miRNA interactions which are not in TransmiR. Furthermore, to keep
edges with strong interactions, we remove an edge from the patient-specific
network if the absolute value of its weight is less than a threshold, which
is the mean of the absolute weights of all edges in the network. Because
the obtained networks are constructed from the gene expression data of
a cancer type and gene interaction databases, they are more reliable and
specific to the corresponding cancer type. In addition, although we use the
gene interaction databases to refine the patient-specific gene networks, the
personalised information of the networks is not lost, because the weights of
the remaining edges differ from network to network. Moreover, after the
weak edges are removed, the networks differ in structure and are specific
to each patient.
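
The refinement in Step 3 can be sketched as two filters applied in sequence. This is a minimal sketch, not the authors' implementation: the function name `refine_network` and the toy edges are hypothetical, edges are assumed to be stored as directed (regulator, target) pairs, and the mean absolute weight is assumed to be taken over the edges that survive the database filter, which the text does not state explicitly.

```python
def refine_network(weights, known_interactions):
    """Filter a patient-specific network.

    weights: dict mapping a directed edge (regulator, target) to its weight.
    known_interactions: set of directed edges supported by the databases.
    """
    # Keep only edges that are supported by prior knowledge.
    supported = {e: w for e, w in weights.items() if e in known_interactions}
    if not supported:
        return {}
    # Threshold: mean of the absolute weights of the supported edges (assumption).
    threshold = sum(abs(w) for w in supported.values()) / len(supported)
    # Remove an edge if its absolute weight is below the threshold.
    return {e: w for e, w in supported.items() if abs(w) >= threshold}

# Hypothetical toy network for one patient.
patient_edges = {
    ("TF1", "G1"): 0.9,
    ("TF1", "G2"): -0.1,
    ("miR1", "TF1"): -0.8,
    ("G1", "G2"): 0.95,      # strong, but not supported by any database
}
database_edges = {("TF1", "G1"), ("TF1", "G2"), ("miR1", "TF1")}
refined = refine_network(patient_edges, database_edges)
```

In this toy case the threshold is about 0.6, so the weak TF1 to G2 edge is removed along with the unsupported G1 to G2 edge.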
Stage (2) Identifying personalised cancer drivers
According to the Network Control method (Liu et al., 2011) (details
in Section 2.2.3), a network can be fully controlled by a minimum sub-
set of nodes of the network, called Minimum Driver Node Set (MDNS).
To capture the control mechanism of gene regulation, we apply the
Network Control method to detect the MDNS of each sample-specific
miRNA-TF-mRNA network constructed in Stage (1) above. Then we try
to remove nodes of the network one by one and re-evaluate the MDNS.
If the size of the MDNS increases, the removed node is a critical node
in the network. In other words, when a critical node is removed from the
network, a bigger MDNS is required to control the whole network. This
indicates that critical nodes play the central controlling or driving role in
the gene network and thus we consider them as candidate cancer drivers.
Because these drivers are discovered based on the network for a specific
patient, they are the predicted personalised cancer drivers for that patient.
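
The node-removal procedure of Stage (2) can be sketched as below. This is a sketch under stated assumptions rather than the pDriver implementation: it relies on the result of Liu et al. (2011) that the size of the MDNS of a directed network with N nodes equals max(N − |M|, 1), where M is a maximum matching, and it finds the matching with a simple augmenting-path search that is suitable only for small graphs. All function names are hypothetical, and edge weights are ignored because only the network structure enters this computation.

```python
def max_matching_size(nodes, edges):
    """Size of a maximum matching in the bipartite representation of a digraph.

    Each directed edge (u, v) links the "out" copy of u to the "in" copy of v.
    """
    succ = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
    matched_to = {}      # "in" copy v -> "out" copy u currently matched to it

    def augment(u, seen):
        # Try to match u, re-routing earlier matches along an augmenting path.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_to or augment(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in nodes)

def mdns_size(nodes, edges):
    """Minimum number of driver nodes: max(N - |maximum matching|, 1)."""
    return max(len(nodes) - max_matching_size(nodes, edges), 1)

def critical_nodes(nodes, edges):
    """Nodes whose removal increases the size of the MDNS."""
    base = mdns_size(nodes, edges)
    critical = []
    for x in nodes:
        rest = [n for n in nodes if n != x]
        rest_edges = [(u, v) for u, v in edges if x not in (u, v)]
        if rest and mdns_size(rest, rest_edges) > base:
            critical.append(x)
    return critical

# Toy chain a -> b -> c: one driver node suffices, and b is the critical node.
chain_nodes = ["a", "b", "c"]
chain_edges = [("a", "b"), ("b", "c")]
chain_critical = critical_nodes(chain_nodes, chain_edges)
```

Removing `b` splits the chain into two isolated nodes that must each be driven directly, which is why it is the only node reported as critical.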
2.2.2 Constructing personalised gene regulation networks
pDriver applies LIONESS (Kuijjer et al., 2019) to build gene regulatory
networks for single patients. LIONESS considers that the network estima-
ted from all samples is the average of the sample-specific networks which
are estimated from individual samples. Based on this idea, LIONESS uses
a linear framework to estimate sample-specific networks as follows.
We use the similar notation as that in Kuijjer et al. (2019). Let e
(α)
ij
be
the weight of the edge between node i and node j in the network obtained
from all N samples (we denote the network as α). LIONESS assumes that
e
(α)
ij
is the linear combination of the weights of the edge between nodes i
and j in the N sample-specific networks:
e
(α)
ij
=
N
X
s=1
w
(α)
s
e
(s)
ij
, (1)
N
X
s=1
w
(α)
s
= 1, (2)
where e
(s)
ij
is the edge weight between node i and node j of sample (s),
w
(α)
s
is the contribution of sample (s) to the aggregated network.
Similarly, the weight of the edge between node i and node j in the
network from all samples except a sample q can be modelled as:
e
(αq)
ij
=
N
X
s6=q
w
(αq)
s
e
(s)
ij
, (3)
N
X
s6=q
w
(αq)
s
= 1. (4)
Suppose that the ratio of the contributions of a sample to the two
networks (i.e. $w_s^{(\alpha)} / w_s^{(\alpha-q)}$) is the same for any sample
$s \in \{1, ..., N\}$ and $s \neq q$, from (2) and (4), we have:
$$w_q^{(\alpha)} = 1 - \sum_{s \neq q}^{N} w_s^{(\alpha)} = 1 - \frac{\sum_{s \neq q}^{N} w_s^{(\alpha)}}{\sum_{s \neq q}^{N} w_s^{(\alpha-q)}} = 1 - \frac{w_s^{(\alpha)}}{w_s^{(\alpha-q)}}. \tag{5}$$
Subtracting Eq. 3 from Eq. 1, we find:
$$e_{ij}^{(\alpha)} - e_{ij}^{(\alpha-q)} = w_q^{(\alpha)} e_{ij}^{(q)} + \sum_{s \neq q}^{N} \left(w_s^{(\alpha)} - w_s^{(\alpha-q)}\right) e_{ij}^{(s)}. \tag{6}$$
Substituting Eq. 5 into Eq. 6, we obtain:
$$e_{ij}^{(\alpha)} - e_{ij}^{(\alpha-q)} = w_q^{(\alpha)} e_{ij}^{(q)} - w_q^{(\alpha)} \sum_{s \neq q}^{N} w_s^{(\alpha-q)} e_{ij}^{(s)}. \tag{7}$$
Thus, the weight of an edge of the network of sample q is computed:
$$e_{ij}^{(q)} = \left(e_{ij}^{(\alpha)} - e_{ij}^{(\alpha-q)}\right) / w_q^{(\alpha)} + e_{ij}^{(\alpha-q)}. \tag{8}$$
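The algebra between Eq. 5 and Eq. 8 is compact, so the intermediate steps may
help. The following is a reader's sketch that uses only Eqs. 3, 5, 6, and 7
and introduces no new assumptions:

```latex
\begin{align*}
w_s^{(\alpha)} &= \bigl(1 - w_q^{(\alpha)}\bigr)\, w_s^{(\alpha-q)}
  && \text{(rearranging Eq. 5)} \\
w_s^{(\alpha)} - w_s^{(\alpha-q)} &= -\, w_q^{(\alpha)}\, w_s^{(\alpha-q)}
  && \text{(the term inside the sum of Eq. 6)} \\
e_{ij}^{(\alpha)} - e_{ij}^{(\alpha-q)}
  &= w_q^{(\alpha)} e_{ij}^{(q)}
     - w_q^{(\alpha)} \sum_{s \neq q}^{N} w_s^{(\alpha-q)} e_{ij}^{(s)}
  && \text{(Eq. 7)} \\
  &= w_q^{(\alpha)} \bigl( e_{ij}^{(q)} - e_{ij}^{(\alpha-q)} \bigr)
  && \text{(the sum equals } e_{ij}^{(\alpha-q)} \text{ by Eq. 3)}
\end{align*}
```

Dividing both sides by $w_q^{(\alpha)}$ and adding $e_{ij}^{(\alpha-q)}$ gives Eq. 8.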
From Eq. 8, we can build the network for a specific sample based on
the network from all samples and the network from all samples except the
sample of interest, as illustrated in Fig. 1.
In pDriver, we assume that all samples make the same contribution to
the network obtained from all N samples, which implies:
$$w_q^{(\alpha)} = 1/N. \tag{9}$$
Substituting Eq. 9 into Eq. 8, we have:
$$e_{ij}^{(q)} = N \left(e_{ij}^{(\alpha)} - e_{ij}^{(\alpha-q)}\right) + e_{ij}^{(\alpha-q)}. \tag{10}$$
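As an illustration only (this is not the pDriver R implementation), Eq. 10 can
be sketched in Python for a single gene pair, with Pearson correlation as the
edge weight. The function names and the expression values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def lioness_edge(x, y, q):
    """Edge weight between two genes in the network of sample q (Eq. 10).

    x, y: lists with the expression of the two genes across all N samples.
    q: index of the sample of interest.
    """
    n = len(x)
    e_all = pearson(x, y)                                      # all N samples
    e_minus_q = pearson(x[:q] + x[q + 1:], y[:q] + y[q + 1:])  # N - 1 samples
    return n * (e_all - e_minus_q) + e_minus_q

# Hypothetical expression values of two genes across N = 5 samples.
gene_i = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_j = [2.1, 3.9, 6.2, 8.1, 9.8]
edges = [lioness_edge(gene_i, gene_j, q) for q in range(len(gene_i))]
```

Each entry of `edges` is the weight of the same edge in a different
sample-specific network. Because the difference in Eq. 10 is multiplied by N,
the resulting weights are not restricted to the range of a correlation.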
As Eq. 10 shows, the mathematical framework is independent of the method
used to compute the edge weights of the aggregated networks. We use the
Pearson correlation, a common measure of the association between variables,
to construct the network from all N samples and the network from N − 1
samples. Since the output network from LIONESS contains all edges among
nodes (i.e. a fully connected network), it may include untrue edges. To
overcome this limitation, pDriver refines the network obtained by LIONESS
using existing databases of protein-protein interactions, miRNA-TF/mRNA
interactions, and TF-miRNA interactions. We then further refine the
resulting network by removing edges whose weight is smaller than a
threshold, so that the final network contains only reliable edges.
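The two refinement steps can be sketched as a single filter. This is a minimal
illustration under stated assumptions: the function name, the edge
representation, and the toy values are hypothetical, and the raw weight is
compared with the threshold because the text does not say whether absolute
values are used.

```python
def refine_network(weights, known_interactions, threshold):
    """Keep edges that appear in prior-knowledge databases and pass the threshold.

    weights: dict mapping (source, target) -> edge weight from LIONESS.
    known_interactions: set of (source, target) pairs collected from databases.
    threshold: edges with a weight smaller than this value are removed.
    """
    return {
        edge: w
        for edge, w in weights.items()
        if edge in known_interactions and w >= threshold
    }

# Hypothetical fully connected output for three nodes A, B, C.
weights = {("A", "B"): 0.9, ("B", "C"): 0.2, ("A", "C"): 0.8}
known = {("A", "B"), ("B", "C")}  # hypothetical database content
refined = refine_network(weights, known, threshold=0.5)
```

Here the edge A-C is dropped because no database supports it, and B-C is
dropped because its weight is below the threshold.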
2.2.3 Discovering personalised cancer drivers
Based on Kalman’s controllability rank condition (Kalman, 1963), the state
and state transition of a system/network can be fully controlled by a subset
of nodes, but finding such a subset is computationally expensive or
prohibitive for large networks such as gene regulation networks. However,
Liu et al. (2011) have provided an analytical method (called the Network
Control method) to detect the driver node set in a complex system
modelled by a weighted directed network.
In this paper, we adopt this Network Control method to identify the driver
node sets. We focus only on the driver node set with the smallest number of
driver nodes, called the Minimum Driver Node Set (MDNS), a minimum subset of
nodes which can fully control the network (Liu et al., 2011) (the details
are discussed in Section 1 of the Supplement). Once we have the MDNS, we
detect critical nodes of the network by removing nodes from the network one
by one. If the MDNS of the network with a node removed becomes larger, the
removed node is a critical node. Without a critical node, the size of the
MDNS increases, i.e. we need to act on more nodes to control the whole
network. The critical nodes therefore play a central role in controlling
the network and are considered as candidate cancer drivers.
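The critical-node search can be sketched as follows. The sketch assumes the
maximum-matching characterisation of Liu et al. (2011), in which the minimum
number of driver nodes of a directed network with N nodes is
max(N − |M|, 1) for a maximum matching M, and it ignores edge weights. The
function names and the toy network are hypothetical, and the code is not taken
from the pDriver implementation.

```python
def max_matching_size(nodes, edges):
    """Size of a maximum matching in the bipartite (out-copy, in-copy) view."""
    succ = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
    matched_to = {}  # in-copy v -> out-copy u currently matched to it

    def augment(u, seen):
        # Try to match u, re-routing earlier matches along an augmenting path.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in matched_to or augment(matched_to[v], seen):
                matched_to[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in nodes)

def mdns_size(nodes, edges):
    """Minimum number of driver nodes: max(N - |maximum matching|, 1)."""
    return max(len(nodes) - max_matching_size(nodes, edges), 1)

def critical_nodes(nodes, edges):
    """Nodes whose removal increases the size of the MDNS."""
    base = mdns_size(nodes, edges)
    critical = []
    for x in nodes:
        rest = [n for n in nodes if n != x]
        rest_edges = [(u, v) for u, v in edges if x not in (u, v)]
        if mdns_size(rest, rest_edges) > base:
            critical.append(x)
    return critical

# Hypothetical toy network: a directed path a -> b -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
found = critical_nodes(nodes, edges)
```

In this path, one driver node suffices for the intact network. Removing `b` or
`c` splits the path and raises the MDNS size to two, so both are critical,
whereas removing an end node leaves a shorter path that one driver node still
controls.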
2.2.4 Implementation
The R source code of pDriver and the scripts to reproduce the experiment
results in this paper are available at https://github.com/pvvhoang/pDriver.

3 Results
3.1 pDriver is robust in identifying coding cancer drivers
In this section, we compare the performance of pDriver with eight existing
methods that take different approaches to discovering cancer driver genes:
three methods for identifying personalised cancer drivers (DawnRank (Hou
and Ma, 2014), PNC (Guo et al., 2019), and SCS (Guo et al., 2018)) and five
methods for identifying cancer drivers at the population level
(ActiveDriver (Reimand and Bader, 2013), DriverML (Han et al., 2019),
DriverNet (Bashashati et al., 2012), MutSigCV (Lawrence et al., 2013), and
OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012)). ActiveDriver, DriverML,
MutSigCV, and OncodriveFM are mutation-based methods, while DawnRank,
DriverNet, PNC, and SCS are network-based methods. Since these methods are
developed to identify only coding driver genes, we compare them with
pDriver in unravelling coding driver genes.
As there is no ground truth for personalised drivers, we follow the
approach of existing methods for predicting personalised drivers and
aggregate the results of individuals to compare the performance of pDriver
with the other methods. We first apply pDriver to detect patient-specific
cancer driver genes. Based on the results, we compute the frequency of
predicted cancer driver genes in the population. The more frequent a
predicted cancer driver gene is, the higher it ranks in the list of
candidate cancer drivers at the population level.
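The aggregation step can be sketched as a frequency count. The patient and
gene labels are hypothetical, and breaking ties alphabetically is an
assumption of this sketch, since the text does not specify how ties are
ordered.

```python
from collections import Counter

def rank_by_frequency(drivers_per_patient):
    """Rank genes by the number of patients in which they are predicted drivers.

    drivers_per_patient: dict mapping patient id -> iterable of predicted genes.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(
        gene
        for drivers in drivers_per_patient.values()
        for gene in set(drivers)  # count each gene once per patient
    )
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, _ in ordered]

# Hypothetical personalised predictions for three patients.
patients = {
    "P1": ["G1", "G2"],
    "P2": ["G2", "G3"],
    "P3": ["G2", "G1"],
}
ranking = rank_by_frequency(patients)
```

In this toy input, G2 is predicted in all three patients, G1 in two, and G3 in
one, which fixes the order of the population-level list.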
From the five TCGA cancer datasets, BRCA, LUAD, LUSC, KIRC, and HNSC, we
obtain the gene expression data of the respective cancer types. For the
eight methods above, we directly use the results reported for these cancer
types in the PNC paper (Guo et al., 2019) in the comparison.
We use CGC as the ground truth for predicted coding driver genes.
CGC is from the COSMIC database (Forbes et al., 2015) and is a popular
cancer gene dataset commonly used to validate cancer driver genes
discovered by computational methods in cancer research. Furthermore, we use
the $F_1$ Score to measure the performance of the methods. The $F_1$ Score
assesses the enrichment of predicted cancer driver genes in the ground
truth (i.e. CGC). It combines Precision and Recall and is computed as
$F_1\ \mathrm{Score} = 2 \cdot \frac{P \cdot R}{P + R}$, where P (Precision)
indicates the fraction of correctly predicted driver genes among the
predicted driver genes and R (Recall) represents the fraction of correctly
predicted driver genes among the driver genes in the gold standard (i.e. CGC).
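As a worked illustration of this definition, the following sketch computes the
F1 Score of a predicted driver set against a gold-standard set. The gene
labels are hypothetical placeholders and do not come from CGC.

```python
def f1_score(predicted, gold):
    """F1 Score of a predicted driver set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)  # correctly predicted driver genes
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 4 predicted genes, 6 gold-standard genes, 2 in common.
predicted = {"G1", "G2", "G3", "G4"}
gold = {"G2", "G4", "G5", "G6", "G7", "G8"}
score = f1_score(predicted, gold)
```

With P = 2/4 and R = 2/6, the score is 2 · (1/2 · 1/3) / (1/2 + 1/3) = 0.4.
Because the gold standard is much larger than any predicted list, Recall stays
small, which keeps F1 Scores low in absolute terms.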
Fig. 2. Comparison of F1 Score of the results by ActiveDriver, DawnRank, DriverML,
DriverNet, MutSigCV, OncodriveFM, pDriver, PNC, and SCS. The x-axis shows the
9 methods and the y-axis is for $F_1$ Score. The results are computed based on
5 TCGA cancer datasets: BRCA, LUAD, LUSC, KIRC, and HNSC.
The results are shown in Fig. 2 (the $F_1$ Scores for all the methods are
given in Section 2 of the Supplementary materials and the list of coding
cancer drivers predicted by pDriver is given in Section 5 of the
Supplementary materials). The $F_1$ Score of pDriver is higher than those
of the other eight methods, indicating the effectiveness of pDriver
in identifying coding cancer drivers at the population level.
Furthermore, to see if pDriver discovers cancer driver genes similar to
those of the other methods, we compare the driver genes discovered by
pDriver with those discovered by the top 5 performing methods (i.e. PNC,
ActiveDriver, DawnRank, MutSigCV, and DriverML) among the eight. The
findings are shown in Fig. 3. In the figure, the cancer drivers for the five
cancer types (i.e. BRCA, LUAD, LUSC, KIRC, and HNSC) discovered by these
methods are validated with the CGC and intersected to find the overlaps.
Although some validated cancer drivers are found by several methods,
pDriver uncovers a large number of cancer driver genes which are not
discovered by the others. This shows that the results of pDriver and the
other methods are complementary, and they can be used together to improve
the overall performance in cancer driver detection.
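The overlap analysis amounts to set operations on the validated driver lists.
The sketch below is illustrative only: the method labels, gene labels, and
function name are hypothetical and do not reproduce the results in Fig. 3.

```python
def unique_validated(drivers_by_method, gold, method):
    """Validated drivers of `method` that no other method reports."""
    validated = {m: set(d) & set(gold) for m, d in drivers_by_method.items()}
    others = set().union(*(v for m, v in validated.items() if m != method))
    return validated[method] - others

# Hypothetical predictions of three methods and a hypothetical gold standard.
drivers = {
    "method_A": {"G1", "G2", "G3"},
    "method_B": {"G2", "G4"},
    "method_C": {"G3", "G5"},
}
gold = {"G1", "G2", "G3", "G4"}
only_a = unique_validated(drivers, gold, "method_A")
```

A non-empty result for one method means it contributes validated drivers that
the others miss, which is the sense in which the methods are complementary.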
3.2 Detecting miRNA cancer drivers
In addition to identifying coding cancer drivers, pDriver can also discover
miRNA cancer drivers. As there is no ground truth for miRNA drivers,
we use OncomiR (Wong et al., 2018), a database of miRNA dysregulation
in pan-cancer, to analyse the miRNA drivers predicted by pDriver. The
percentages of the predicted miRNA drivers in OncomiR for the five cancer
types (BRCA, LUAD, LUSC, KIRC, and HNSC) are shown in Fig. 4.
From Fig. 4, the percentages of predicted miRNA drivers found in OncomiR
are over 50% in all five cancer types. In particular, out of the 18 miRNAs
identified by pDriver as BRCA drivers, 13 are recorded in OncomiR as
involved in the tumorigenesis of BRCA. The p-value is 9.856e-06 based on
the following hypergeometric test:
$$p = 1 - \sum_{x=0}^{n-1} \frac{\binom{K}{x} \binom{N-K}{M-x}}{\binom{N}{M}}, \tag{11}$$
where N is the number of miRNAs of interest, K indicates the number of
miRNAs in OncomiR, M denotes the number of predicted miRNA drivers,
and n indicates the number of predicted miRNA drivers in OncomiR.
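Eq. 11 can be evaluated directly with binomial coefficients. The sketch below
is illustrative: the function name is hypothetical, and the input values are
toy numbers, because the values of N and K used for the reported p-values are
not stated in this section.

```python
from math import comb

def hypergeom_pvalue(N, K, M, n):
    """Probability of at least n hits, following Eq. 11.

    N: number of miRNAs of interest, K: number of miRNAs in the database,
    M: number of predicted drivers, n: number of predicted drivers in the database.
    """
    total = comb(N, M)
    below = sum(comb(K, x) * comb(N - K, M - x) for x in range(n))
    return 1 - below / total

# Hypothetical toy values: 10 miRNAs, 4 in the database, 3 predicted, 2 hits.
p = hypergeom_pvalue(N=10, K=4, M=3, n=2)
```

For the toy values, the sum covers x = 0 and x = 1, giving (20 + 60) / 120,
so p = 1/3. For very small p-values, subtracting from 1 loses floating-point
precision, and summing the upper tail directly is the safer choice.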
The figures for LUAD, LUSC, KIRC, and HNSC are 10/16, 14/21, 8/14, and
13/20 respectively. The corresponding p-values are 7.783e-05, 7.192e-06,
0.002, and 4.890e-06.
Among the miRNA cancer drivers predicted by pDriver, many have been
confirmed by other works to be related to different cancer types. For
example, hsa-miR-375 shows high constitutive expression in breast cancer
cells (Frank et al., 2019) and it is bound to estrogen receptor α to trigger
the transcription of the receptor in breast cancer (Yan et al., 2014). The
overexpression of hsa-miR-940 in MDA-MB-231 breast cancer cells induces
extensive osteoblastic lesions (Hashimoto et al., 2018) and it is considered
a diagnostic and prognostic biomarker for breast cancer patients (Liu et al.,
2018). Other identified driver miRNAs involved in breast cancer include
hsa-miR-760 (Hu et al., 2016), hsa-miR-326 (Du et al., 2019; Ghaemi et al.,
2019; Liang et al., 2010), hsa-miR-577 (Yin et al., 2018), and hsa-miR-429
(Pham et al., 2019). In addition, hsa-miR-326 targets phox2a to regulate
cell proliferation and migration in lung cancer (Wang et al., 2016).
hsa-miR-1293, hsa-miR-1269a, and hsa-miR-1269b are associated with the
overall survival of kidney cancer patients (Liang et al., 2017). Similarly,
Luo et al. (2017) consider hsa-miR-1293 a prognostic biomarker for kidney
cancer. The expression level of hsa-miR-375 is used to predict head and
neck cancer in the work of Avissar et al. (2009).
