Using probabilistic generative models for ranking risks of Android apps

TLDR
In this paper, the authors introduce the notion of risk scoring and risk ranking for Android apps, to improve risk communication for Android applications, and identify three desiderata for an effective risk scoring scheme.
Abstract
One of Android's main defense mechanisms against malicious apps is a risk communication mechanism which, before a user installs an app, warns the user about the permissions the app requires, trusting that the user will make the right decision. This approach has been shown to be ineffective as it presents the risk information of each app in a "stand-alone" fashion and in a way that requires too much technical knowledge and time to distill useful information. We introduce the notion of risk scoring and risk ranking for Android apps, to improve risk communication for Android apps, and identify three desiderata for an effective risk scoring scheme. We propose to use probabilistic generative models for risk scoring schemes, and identify several such models, ranging from the simple Naive Bayes, to advanced hierarchical mixture models. Experimental results conducted using real-world datasets show that probabilistic generative models significantly outperform existing approaches, and that Naive Bayes models give a promising risk scoring approach.


Using Probabilistic Generative Models for Ranking Risks
of Android Apps
Hao Peng
Purdue University
pengh@cs.purdue.edu
Chris Gates
Purdue University
gates2@cs.purdue.edu
Bhaskar Sarma
Purdue University
bsarma@cs.purdue.edu
Ninghui Li
Purdue University
ninghui@cs.purdue.edu
Alan Qi
Purdue University
alanqi@cs.purdue.edu
Rahul Potharaju
Purdue University
rpothara@cs.purdue.edu
Cristina Nita-Rotaru
Purdue University
crisn@cs.purdue.edu
Ian Molloy
IBM Research
molloyim@us.ibm.com
ABSTRACT
One of Android’s main defense mechanisms against malicious apps
is a risk communication mechanism which, before a user installs an
app, warns the user about the permissions the app requires, trusting
that the user will make the right decision. This approach has been
shown to be ineffective as it presents the risk information of each
app in a “stand-alone” fashion and in a way that requires too much
technical knowledge and time to distill useful information.
We introduce the notion of risk scoring and risk ranking for
Android apps, to improve risk communication for Android apps,
and identify three desiderata for an effective risk scoring scheme.
We propose to use probabilistic generative models for risk scor-
ing schemes, and identify several such models, ranging from the
simple Naive Bayes, to advanced hierarchical mixture models. Ex-
perimental results conducted using real-world datasets show that
probabilistic generative models significantly outperform existing
approaches, and that Naive Bayes models give a promising risk
scoring approach.
Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive software
General Terms
Security
Keywords
mobile, malware, data mining, risk
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CCS’12, October 16–18, 2012, Raleigh, North Carolina, USA.
Copyright 2012 ACM 978-1-4503-1651-4/12/10 ...$15.00.
1. INTRODUCTION
As mobile devices become increasingly popular for personal and
business use, they are increasingly targeted by malware. Mobile
devices are becoming ubiquitous, and they provide access to personal
and sensitive information such as phone numbers, contact lists,
geolocation, and SMS messages, making their security an especially
important challenge. Compared with desktop and laptop comput-
ers, mobile devices have a different paradigm for installing new
applications. For computers, a typical user installs relatively few
applications, most of which are from reputable vendors with niche
applications increasingly being replaced by web-based or cloud ser-
vices. For mobile devices, one often downloads and uses many
applications (or apps) with limited functionality from multiple
unknown vendors. Therefore, the defense against malware must
depend to a large degree on decisions made by the users. Indeed,
whether an app is malware or not may depend on the user’s
privacy preference. Therefore, an important part of malware defense
on mobile devices is to communicate the risk of installing an app
to users, and to help them make the right decision about whether to
choose and install certain apps.
In this paper we study how to conduct effective risk communi-
cation for mobile devices. We focus on the Android platform. The
Android platform has emerged as one of the fastest growing oper-
ating systems. In June 2012, Google announced that 400 million
Android devices have been activated, with 1 million devices be-
ing activated daily. An increasing number of apps are available for
Android. Google Play (formerly known as Android Market)
crossed 15 billion downloads in May of 2012, and was
adding about 1 billion downloads per month from Dec 2011 to May
2012. Such a wide user base, coupled with the ease of developing and
sharing applications, makes Android an attractive target for
malicious application developers that seek personal gain while costing
users money and invading users’ privacy. Examples of malware
activities performed by malicious apps include stealing users’ pri-
vate data and sending SMS messages to premium rate numbers.
One of Android’s main defense mechanisms against malicious
apps is a risk communication mechanism which warns the user
about permissions an app requires before being installed, trusting
that the user will make the right decision. Google has made the fol-
lowing comment on malicious apps: “When installing an applica-
tion, users see a screen that explains clearly what information and
system resources the application has permission to access, such as
a phone’s GPS location. Users must explicitly approve this access
in order to continue with the installation, and they may uninstall
applications at any time. They can also view ratings and reviews to
help decide which applications they choose to install. We
consistently advise users to only install apps they trust.” This approach,
however, has been shown to be ineffective. The majority of
Android apps request multiple permissions. When a user sees what
appears to be the same warning message for almost every app,
warnings quickly lose any effectiveness as the users are conditioned to
ignore such warnings.
Recently, risk signals based on the set of permissions an app re-
quests have been proposed as a mechanism to improve the existing
warning mechanism for apps. In [11], requesting a certain permission
or a combination of two or three permissions triggers a warning
that the app is risky. In [24], requesting a critical permission that is
rarely requested is viewed as a signal that the app is risky.
Rather than using a binary risk signal that marks an app as ei-
ther risky or not risky, we propose to develop risk scoring schemes
for Android apps based on the permissions that they request. We
believe that the main reason for the failure of the current Android
warning approach is that it presents the risk information of each app
in a “stand-alone” fashion and in a way that requires too much tech-
nical knowledge and time to distill useful information. We believe
a more effective approach is to present “comparative” risk infor-
mation, i.e., each app’s risk is presented in a context of comparing
it with other apps. We propose to use a risk scoring function that
assigns to each app a real number score so that apps with higher
risks have a higher score. Given this function, one can derive a
risk ranking for each app, identifying the percentile of the app in
terms of its risk score. This number has a well-defined and easy-to-
understand meaning. Users can appreciate the difference between
an app ranked in the top 1% group versus one in the bottom 50%.
This ranking can be presented in a more user-friendly fashion, e.g.,
translated into categorical values such as high risk, average risk,
and low risk. An important feature of the mobile app ecosystem
is that users often have choices and alternatives when choosing a
mobile app. If the user knows that one app is significantly more
risky than another for the same functionality, then that may cause
the user to choose the less risky one.
To be most effective, we propose the following desiderata for
the risk scoring function. First, it should be monotonic, in the sense
that for any app, removing a permission from its set of requested
permissions should reduce the risk score. This way, a developer
can reduce the risk score of an app by following the least-privilege
principle. Second, apps that are known to be malicious should in
general have high risk scores. Third, it is desired that the risk scor-
ing function is simple and relatively easy to understand.
We propose to use probabilistic generative models for risk scor-
ing. Probabilistic generative models [7] have been used exten-
sively in a variety of applications in machine learning, computer
vision, and computational biology, to model complex data. The
main strength is to model features in a large amount of unlabeled
data. Using these models, we assume that some parameterized ran-
dom process generates the app data and learn the model parameter
based on the data. Then we can compute the probability of each
app generated by the model. The risk score can be any function
that is inversely related to the probability, so that lower probability
translates into a higher score.
More specifically, we consider the following models in this paper. In the Basic Naive Bayes (BNB) model, we use only the permission information of the apps, and assume that each app is generated by $M$ independent Bernoulli random variables, where $M$ is the number of permissions. Let $\theta_m$ be the probability that the $m$'th permission is requested (which can be estimated by computing the fraction of apps requesting that permission); then the probability of an app's requested permission set is computed by multiplying $\theta_i$ for each permission $i$ that it requests and $(1 - \theta_i)$ for each permission $i$ that it does not request. If $\theta_m < 0.5$ for every $m$, the model has the monotonicity property. The BNB model treats all permissions equally; however, some permissions are more critical than others. To model this semantic knowledge about permissions, we also consider Naive Bayes with informative Priors, which we denote by PNB. The effect of the PNB model is to reduce $\theta_i$ when the $i$'th permission is considered critical. While PNB is slightly more complex than BNB, it has the advantage that requesting a more critical permission results in higher risk than requesting a similarly rare but less critical permission, making it more difficult for a malicious app to reduce its risk by removing unnecessary permissions.
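The monotonicity claim for BNB can be checked with a one-line calculation. As a sketch, let $x$ and $x'$ be two permission vectors that differ only in permission $k$, with $x_k = 1$ and $x'_k = 0$ (the symbols $x$, $x'$, and $k$ are introduced here purely for illustration). All other factors of the product cancel, leaving:

```latex
\frac{p(x')}{p(x)} \;=\; \frac{1 - \theta_k}{\theta_k} \;>\; 1
\quad \text{whenever } \theta_k < 0.5 .
```

Dropping permission $k$ therefore raises the probability of the app under the model, and any risk score that is inversely related to that probability goes down.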
We also investigate several sophisticated generative models. In
the Mixture of Naive Bayes (MNB) model, we assume that the
dataset is generated by a number of hidden classes, each
parameterized by M independent Bernoulli random variables; these
hidden classes are shared among all categories. Each category has a
different multinomial distribution describing how likely an app in
this category is from a given hidden class. We also develop a Hier-
archical Bayesian model, which we call the Hierarchical Mixture of
Naive Bayes (HMNB) model. This is a novel extension to the influ-
ential Latent Dirichlet Allocation (LDA) [8] model to binary obser-
vations that integrates categorical information with hidden classes
and allows permission information to be shared between categories.
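To make the MNB description concrete, the following is a minimal sketch of how the probability of one app could be evaluated once the parameters are known. The function name, argument names, and data layout are assumptions made for illustration, not the authors' implementation; parameter learning is not shown, and all weights and Bernoulli parameters are assumed to lie strictly between 0 and 1.

```python
import math


def mnb_log_prob(x, class_weights, thetas):
    """Return log p(x | category) under a mixture of Naive Bayes components.

    x             -- list of 0/1 permission indicators, length M
    class_weights -- the category's mixing weights over K hidden classes
                     (assumed positive and summing to 1)
    thetas        -- K lists of M Bernoulli parameters, shared by all categories
                     (assumed strictly between 0 and 1)
    """
    log_terms = []
    for weight, theta in zip(class_weights, thetas):
        log_p = math.log(weight)
        for bit, t in zip(x, theta):
            log_p += math.log(t) if bit else math.log(1.0 - t)
        log_terms.append(log_p)
    # Combine the per-class terms with log-sum-exp for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(v - peak) for v in log_terms))
```

With a single hidden class of weight 1, this reduces to the plain Naive Bayes probability, which is one way to sanity-check the sketch.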
We have conducted extensive experiments using three datasets:
Market2011, Market2012, and Malware. Market2011 consists of
157,856 apps available at Android Market in February 2011.
Market2012 consists of 324,658 apps available at Google Play in
February/March 2012. Malware consists of 378 known malware samples. Our
experiments show that in terms of assigning high risk scores to mal-
ware apps, all generative models significantly outperform existing
approaches [11, 24]. Furthermore, while PNB is simpler than MNB
and HMNB, its performance is almost the same as MNB, and very
close to the best-performing HMNB model. Based on these results,
we conclude that PNB is a good risk scoring scheme.
In summary, the contributions of this paper are as follows:
• We introduce the notion of risk scoring and risk ranking for Android apps, to improve risk communication for Android apps, and identify three desiderata for an effective risk scoring scheme.
• We propose to use probabilistic generative models for risk scoring schemes, and identify several such models, ranging from the simple Basic Naive Bayes (BNB), to advanced hierarchical mixture models.
• We conduct extensive evaluations using real-world datasets. Our experimental results show that probabilistic generative models significantly outperform existing approaches, and PNB makes a promising risk scoring approach.
The rest of the paper is organized as follows. We present a de-
scription of the Android platform and the current warning mech-
anism in Section 2. Section 3 discusses the datasets that we have
collected. In Section 4 we discuss different generative models for
risk scoring. We then present experimental results in Section 5, and
discuss other findings in Section 6. We finish by discussing related
work in Section 7 and concluding in Section 8.
2. ANDROID PLATFORM
In this section we provide an overview of the current defense
mechanism provided by the Android platform and discuss its limi-
tations.

2.1 Platform Ecosystem
Android is an open source software stack for mobile devices that
includes an operating system, an application framework, and core
applications. The operating system relies on a kernel derived from
Linux. The application framework uses the Dalvik Virtual Ma-
chine. Applications are written in Java using the Android SDK,
compiled into Dalvik Executable files, and packaged into .apk
(Android package) archives for installation.
The app store hosted by Google is called Google Play (previ-
ously called Android Market). In order to submit applications to
Google Play, an Android developer first needs to obtain a publisher
account. After submission, each .apk file gets an entry on the mar-
ket in the form of a webpage, accessible to users through either the
Google Play homepage or the search interface. This webpage con-
tains meta-information that keeps track of information pertaining
to the application (e.g., name, category, version, size, prices) and
its usage statistics (e.g., rating, number of installs, user reviews).
This information is used by users when they are deciding to install
a new application.
Google recently started the Bouncer [3] service, which provides
automated scanning of applications on Google Play for potential
malware. Once an application is uploaded, the service immedi-
ately [3] starts analyzing it for known malware, spyware and tro-
jans. It also looks for behaviors that indicate an application might
be misbehaving, and compares it against previously analyzed apps
to detect possible red flags. Bouncer runs every application on their
cloud in an attempt to detect hidden, malicious behavior, and ana-
lyzes developer accounts to block malicious developers.
Bouncer does not fully solve the security and privacy prob-
lems of Android. First, the line between malicious apps and non-
malicious apps is very blurred. The behavior of many apps cannot
be classified as malicious, yet many users will find them risky and
intrusive. Bouncer has to be conservative when identifying apps
as malicious to prevent legitimate complaints from developers and
backlash from users for instrumenting a walled garden. Second,
details about Bouncer are fairly unknown to the security commu-
nity. At the time of writing this paper, except for the official blog
post by Google [3], there are no details about how Bouncer works
nor what algorithms it uses to detect malicious apps. Third, re-
searchers have found multiple ways to bypass Bouncer and upload
malware on Google Play. For example, a malicious app can try
to detect that it is running on Bouncer’s emulated Android device,
and refrain from performing any malicious activity, or malware can
perform malicious activities only when triggered by certain condi-
tions, such as time.
Other third party app websites exist, e.g., Amazon Appstore for
Android, GetJar, SlideMe Market, etc. Currently, these third-party
app stores have varying degrees of security associated with them.
2.2 In-Place Security and its Limitations
The Android system’s in-place defense against malware consists
of two parts: sandboxing each application and warning the user
about the permissions that the application is requesting. Specifi-
cally, each application runs with a separate user ID, as a separate
process in a virtual machine of its own, and by default does not
have permissions to carry out actions or access resources which
might have an adverse effect on the system or on other apps; it
has to explicitly request these privileges through permissions.
In tandem with the sandboxing approach is a risk communica-
tion mechanism that communicates the risks of installing an app to
a user, hoping/trusting that the user will make the right decision.
When a user downloads an app through the Google Play website,
the user is shown a screen that displays the permissions requested
by the application and the warnings about the potential damages
when these permissions are misused. These warnings are worded
with a high degree of seriousness (See Table 1 for Android’s warn-
ings of some permissions). This provides a final chance to verify
that the user is allowing the application access to the requested re-
sources. Installing the application means granting the application
all the requested permissions. A similar interface exists when a
user is browsing applications from a mobile device.
Despite its serious wording, Android’s current permission warning approach has been largely ineffective. In [15], Felt et al. analyzed 100 paid and 856 free Android applications, and found that “Nearly all applications (93% of free and 82% of paid) ask for at least one ‘Dangerous’ permission, which indicates that users are accustomed to installing applications with Dangerous permissions. The INTERNET permission is so widely requested that users cannot consider its warning anomalous. Security guidelines or anti-virus programs that warn against installing applications with access to both the Internet and personal information are likely to fail because almost all applications with personal information also have INTERNET.”
Felt et al. argued that “Warning science literature indicates that frequent warnings de-sensitize users, especially if most warnings do not lead to negative consequences [29, 17]. Users are therefore not likely to pay attention to or gain information from install-time permission prompts in these systems. Changes to these permission systems are necessary to reduce the number of permission warnings shown to users.”
While such ineffectiveness has been identified and criti-
cized [15, 29, 17], no alternative has been proposed. We argue
that a promising alternative is to present relative or comparative
risk information. This way, users can select apps based on
easy-to-consume risk information. Hopefully this will provide incentives
to developers to better follow the least-privilege principle and
request only necessary permissions.
Comparison with UAC: There is a parallel between Android’s
permission warning and Windows’ User Account Control (UAC).
Both are designed to inform the user of some potentially harmful
action that is about to occur. In UAC’s case, this happens when a
process is trying to elevate its privileges in some way, and in
Android’s case, this happens when a user is about to install an app that
will have all the requested permissions.
Recent research [19] suggests the ineffectiveness of UAC in en-
forcing security. Motiee et al. [19] reported that 69% of the sur-
vey participants ignored the UAC dialog and proceeded directly to
use the administrator account. Microsoft itself concedes that about
90% of the prompts are answered as “yes”, suggesting that “users
are responding out of habit due to the large number of prompts
rather than focusing on the critical prompts and making confident
decisions” [12].
According to [12], in the first several months after Vista was
available for use, people were experiencing a UAC prompt in 50%
of their “sessions” (a session is everything that happens from
logon to logoff or within 24 hours). With Vista SP1 and over time,
this number has been reduced to about 30% of the sessions. This
suggests that UAC has been effective in incentivizing application
developers to write programs without elevated privileges unless
necessary. An effective risk communication approach for Android
could have similar effects.

3. DATASETS
In this section, we describe the two types of datasets we used
in our study of Android app permissions. Below we describe the
datasets and their characteristics.
3.1 Datasets Description
Market Datasets: We have collected two datasets from Google
Play spaced one year apart. Market2011, the first dataset, consists
of 157,856 apps available on Google Play in February 2011. Market2012, the second dataset, consists of 324,658 apps and was collected in February 2012. For each app, we have the application meta-information consisting of the developer name, its category, and the set of permissions that the app requests. We assume that apps in these two datasets are mostly benign. While we believe that a small number of malicious apps may be present in them, we assume that these datasets are dominated by benign ones. We leverage the Market2011 dataset for our model generation and testing, and use the Market2012 dataset for validation and market evolution analysis.
Malware Dataset: Our malware dataset consists of 378 unique
.apk files that are known to be malicious. We obtained this dataset
from the authors of [31]. For each malware sample, we extract the
permissions requested using the AndroidManifest.xml file present
inside the package file. For these malicious apps we do not have
their category information.
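As an illustration of the permission extraction step, the sketch below reads the `<uses-permission>` entries from a manifest. It assumes the manifest has already been decoded to plain-text XML (inside an .apk it is stored in a binary XML format, so a separate decoding step is needed first); the function name is hypothetical and this is not necessarily how the authors' tooling worked.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in a manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def requested_permissions(manifest_xml):
    """Return the set of permission names declared via <uses-permission>.

    manifest_xml -- manifest contents as a plain-text XML string
                    (assumed to be already decoded from the binary format)
    """
    root = ET.fromstring(manifest_xml)
    name_attr = "{%s}name" % ANDROID_NS
    return {
        elem.attrib[name_attr]
        for elem in root.iter("uses-permission")
        if name_attr in elem.attrib
    }
```

Returning a set rather than a list drops repeated declarations, which matches the view of an app as a set of requested permissions.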
3.2 Data Cleansing
In the two market datasets, we have observed the presence of
thousands of apps that have similar characteristics. This kind of
“duplication” can occur due to the following reasons:
• Slight Variations (R1): One developer may release hundreds or even thousands of nearly identical apps that provide the same functionality with slight variation. A few examples include wallpaper apps, city or country specific travel apps, weather apps, or themed apps (i.e., a new app with essentially the same functionalities can be written for any celebrity, interest group, etc.) such as the one presented in Table 1 in Section 6.
• App Maker Tools (R2): There are a number of tools [1, 2] that enable non-programmers to create Android apps. Oftentimes many apps that are generated by these tools have similar app names and the same set of permissions. This occurs when the developer just uses the default settings in the tool.
We decided to consolidate duplicate apps from the same devel-
oper (R1) into a single instance in the dataset to prevent any single
developer from having a large impact on the generated probabilistic
model. We detect apps due to R1 by looking for instances where
apps belonging to the same developer have the same set of per-
missions. This is a likely indication that developers are uploading
many applications with minor variations in the app content.
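The R1 consolidation rule described above (same developer, same permission set) can be sketched as follows. The record layout with `name`, `developer`, and `permissions` keys is a hypothetical one chosen for illustration; the rule itself is the one stated in the text.

```python
def consolidate_r1(apps):
    """Keep one app per (developer, permission set) pair.

    apps -- iterable of dicts with hypothetical keys
            'name', 'developer', and 'permissions' (an iterable of strings)
    """
    seen = set()
    kept = []
    for app in apps:
        # Order of permissions does not matter, so compare them as a set.
        key = (app["developer"], frozenset(app["permissions"]))
        if key not in seen:
            seen.add(key)
            kept.append(app)
    return kept
```

The first app seen for each pair is the one retained; which instance survives does not matter for the models, since they only use the permission vector.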
We decided to keep apps due to R2 unchanged in the datasets.
We do this because: (1) we observed instances where apps due to
R2 have different functionality and many developers using these
tools do modify the permissions given to their app and (2) the
line between such apps and all apps that use a specific ad-network
which require a certain set of permissions is blurry.
After cleansing is complete we have 71,331 apps in the 2011
market dataset, and 136,534 apps in the 2012 market dataset. This
represents a reduction of around 55%, and demonstrates the prevalence of apps that are slight variations of other apps, justifying our decision to combine these so as not to allow one developer to overly influence any model.

[Figure 1: Permission information for various data sets.
(a) The top 20 most used permissions in the datasets as a percent of apps that request those permissions (x-axis: Percent of Apps Requesting Permission, 0 to 100; series: Market2011, Market2012, Malware). Due to overlap in the most used permissions, we need to show 26 permissions to cover the most used in all datasets: the 21st for Market2012, and the last 5 for Malware. Permissions shown, in order: INTERNET, ACCESS_NETWORK_STATE, WRITE_EXTERNAL_STORAGE, READ_PHONE_STATE, ACCESS_FINE_LOCATION, ACCESS_COARSE_LOCATION, VIBRATE, WAKE_LOCK, READ_CONTACTS, ACCESS_WIFI_STATE, CALL_PHONE, CAMERA, RECEIVE_BOOT_COMPLETED, SEND_SMS, WRITE_SETTINGS, RECEIVE_SMS, WRITE_CONTACTS, GET_TASKS, RECORD_AUDIO, READ_SMS, ACCESS_LOCATION_EXTRA, WRITE_SMS, INSTALL_PACKAGES, CHANGE_WIFI_STATE, READ_HISTORY_BOOKMARKS, WRITE_HISTORY_BOOKMARKS.
(b) The percent of apps that request a specific number of permissions for each dataset (x-axis: Number of Permissions; y-axis: Percent of Apps Requesting X permissions; series: Market2011, Market2012, Malware).
(c) The percent of apps that request a specific number of permissions in the market datasets: apps that only appear in 2011, only in 2012, and the intersection of those two datasets (x-axis: Number of Permissions; y-axis: Percent of Apps Requesting X permissions; series: 2011-NoOverlap, 2012-NoOverlap, Overlap).]

For some experiments, we break up the market datasets into three
sets. The intersection of the 2011 and 2012 data is called
‘overlap’; it contains 38,024 apps which have the same name and
permissions in the two datasets. Then we have 2011-NoOverlap, the
2011 dataset with this overlap removed, containing 33,307 apps,
and 2012-NoOverlap, the 2012 dataset with this overlap removed,
containing 98,510 apps.
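A minimal sketch of this three-way split is shown below, assuming each app is represented as a hypothetical `(name, permissions)` pair and that two apps match when both the name and the permission set agree, as described above.

```python
def split_by_overlap(market_2011, market_2012):
    """Partition two app collections into (only_2011, only_2012, overlap).

    Each app is a (name, permissions) pair, where permissions is an
    iterable of strings. Apps match when name and permission set agree.
    """
    def keys(apps):
        return {(name, frozenset(perms)) for name, perms in apps}

    k11, k12 = keys(market_2011), keys(market_2012)
    return k11 - k12, k12 - k11, k11 & k12
```

The reported sizes are consistent with such a partition: 38,024 + 33,307 = 71,331 apps for 2011 and 38,024 + 98,510 = 136,534 apps for 2012.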
3.3 Dataset Discussion
The top 20 most frequently requested permissions in each dataset are presented in Figure 1(a). The figure lists 26 permissions, which together cover the top 20 for all 3 datasets: ACCESS_LOCATION_EXTRA_COMMANDS was added for Market2012, and the last 5 were added for the malware dataset. For some permissions, the percentage of malware apps requesting the permission is much higher than the percentage in the market datasets. For example, READ_SMS is requested by 59.78% of the malicious apps, but by only 2.33% of the apps in Market2011 and 1.98% of those in Market2012. This might be due to the fact that a class of malware apps attempts to intercept messages between a mobile phone and a bank for out-of-band authentication.
Another observation from Figure 1(a) is that for almost every
permission a higher percent of apps in Market2012 request it when
compared to the Market2011 dataset. This shows a trend that pro-
portionally more applications are requesting sensitive permissions.
The one notable exception to this is related to SMS, where Mar-
ket2012 actually saw a slight decrease for all permissions related
to SMS.
Figure 1(b) shows the percent of apps that request different numbers of permissions. From this graph, we observe that, in general, malicious apps request more permissions than the ones in the market datasets. However, many apps in the market datasets request many permissions as well. Between Market2011 and Market2012, we also see a confirmation that apps are requesting a greater number of permissions on average: proportionally fewer apps request 0 or 1 permissions in Market2012, while for two or more permissions we see slight gains in the percent of apps over Market2011. Overall, this information is an indication that malicious apps request permissions in different ways than normal apps, and leads us to believe that looking at permission information is in fact promising. It also shows that there may be a slow evolution in the market dataset.
Figure 1(c) shows a similar graph when we divide the datasets
into the overlap dataset and the two datasets with overlapping apps
removed. Interestingly, apps in the overlap dataset, which are the
“long-living” and stable apps, generally request fewer permissions
than other apps.
4. MODELS
We aim at coming up with a risk score for apps based on their requested permission sets and categories. Let the $i$'th app in the dataset be represented by $a_i = (c_i, x_i = [x_{i,1}, \ldots, x_{i,M}])$, where $c_i \in C$ is the category of the $i$'th app, $M$ is the number of permissions, and $x_{i,m} \in \{0, 1\}$ indicates whether the $i$'th app has the $m$'th permission. Our goal is to come up with a risk function $\mathrm{rscore} : C \times \{0, 1\}^M \to \mathbb{R}$ such that it satisfies the following three desiderata. First, the risk function should be monotonic. This condition requires that removing a permission always reduces the risk value of an app, formalized by the following definition.
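DEFINITION 1 (MONOTONICITY). We say that a risk scoring function $\mathrm{rscore}$ is monotonic if and only if for any $c_i \in C$ and any $x_i, x_j$ such that
$$\exists k \, \bigl( x_{i,k} = 0 \wedge x_{j,k} = 1 \wedge \forall m \, ( m \neq k \Rightarrow x_{i,m} = x_{j,m} ) \bigr),$$
we have
$$\mathrm{rscore}(c_i, x_i) < \mathrm{rscore}(c_i, x_j).$$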
The second desideratum is that malicious apps generally have
high risk scores. And the third is that the risk scoring function is
simple to understand.
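For a small number of permissions, the monotonicity condition can be tested by brute force. The sketch below is an illustration only; the function name and the callable interface for `rscore` are assumptions, and the exhaustive enumeration is not practical for the full permission set.

```python
from itertools import product


def is_monotonic(rscore, category, num_permissions):
    """Exhaustively test the monotonicity definition for one category.

    rscore -- callable (category, tuple of 0/1 bits) -> float
    Enumerates all 2**num_permissions vectors, so keep num_permissions small.
    """
    for x in product((0, 1), repeat=num_permissions):
        for k in range(num_permissions):
            if x[k] == 0:
                # x_plus differs from x only by additionally requesting k.
                x_plus = x[:k] + (1,) + x[k + 1:]
                if not rscore(category, x) < rscore(category, x_plus):
                    return False
    return True
```

For example, a score that simply counts requested permissions passes the check, while a constant score fails it because the inequality in the definition is strict.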
Given any risk function, we can assign a risk ranking for each
app relative to a set A of reference apps, which can be, e.g., the set
of all apps available in Google Play:
$$\mathrm{rrank}(a_i) = \frac{|\{a \in A \mid \mathrm{rscore}(a) \geq \mathrm{rscore}(a_i)\}|}{|A|}$$
If an app has a risk ranking of 1%, this means that the app’s risk
score is among the highest 1 percent.
The above gives a risk ranking relative to all apps in all cate-
gories. An alternative is to rank apps in each category separately,
so that one has a risk ranking for an app relative to other apps in the
same category.
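The ranking computation can be sketched in a few lines. This assumes the comparison in the definition is "greater than or equal to", which matches the statement that a ranking of 1% means the score is among the highest 1 percent; the function and argument names are illustrative.

```python
def risk_rank(score, reference_scores):
    """Fraction of reference apps whose risk score is >= the given score.

    reference_scores -- non-empty list of risk scores for the reference set A
    """
    at_least = sum(1 for s in reference_scores if s >= score)
    return at_least / len(reference_scores)
```

Passing only the scores of apps from one category as `reference_scores` gives the per-category variant mentioned above.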
Probabilistic generative models. We propose to use probabilistic generative models for risk scoring. That is, we assume that some parameterized random process generates the app datasets and learn the parameter value $\theta$ that best explains the data. Next, for each app we compute $p(a_i \mid \theta)$, the probability that the app's data is generated by the model.
The risk score of an app can be any function that is monotonically decreasing with respect to the probability of an app being generated, such that a lower probability means a higher risk score. For example, using $\mathrm{rscore}(a_i) = -\ln p(a_i \mid \theta)$ satisfies the condition.
In the rest of this section we describe three generative models—
from simple Naive Bayesian models, to mixture of Naive Bayes
models and to novel hierarchical Bayesian models. We present es-
timation methods to learn the parameters for these models from the
data, and evaluate whether they satisfy our desiderata.
4.1 Naive Bayes Models
In the Naive Bayes models, we ignore the category information
$c_i$; thus each app is given by $x_i = [x_{i,1}, \ldots, x_{i,M}]$. We assume
that each $x_i$ is generated by M independent Bernoulli random variables,
where M is the number of permissions:
$$p(x_i) = \prod_{m=1}^{M} p(x_{i,m}) = \prod_{m=1}^{M} \theta_m^{x_{i,m}} (1 - \theta_m)^{(1 - x_{i,m})} \qquad (1)$$
where $\theta_m \equiv p(x_{i,m} = 1)$ is the Bernoulli parameter.
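Equation (1) can be evaluated in log space, which also gives the risk score -ln p(x_i) directly. The following is a minimal sketch; the function names and the parameter values in `theta` are assumptions made for illustration, not values from the paper.

```python
import math


def nb_log_prob(x, theta):
    """ln p(x) under Equation (1).

    Sums ln(theta_m) for requested permissions (x_m = 1) and
    ln(1 - theta_m) for permissions that are not requested (x_m = 0).
    """
    return sum(
        math.log(t) if bit else math.log(1.0 - t)
        for bit, t in zip(x, theta)
    )


def nb_rscore(x, theta):
    """Risk score as the negative log-probability of the permission vector."""
    return -nb_log_prob(x, theta)


# Hypothetical Bernoulli parameters for three permissions:
# one very common, one requested by half of the apps, one rarely requested.
theta = [0.9, 0.5, 0.01]

common_app = [1, 0, 0]  # requests only the common permission
rare_app = [1, 0, 1]    # additionally requests the rare permission

print(nb_rscore(common_app, theta))
print(nb_rscore(rare_app, theta))  # larger: the rare permission raises the risk
```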
To avoid overfitting in our estimation (i.e., fitting the model to
noise), we use a Beta prior $\mathrm{Beta}(\theta_m \mid a_0, b_0)$ over each Bernoulli
parameter $\theta_m$. Using this prior, the Maximum a posteriori (MAP)
estimation is
$$\hat{\theta}_m = \frac{\sum_{i}^{N} x_{i,m} + a_0}{N + a_0 + b_0} \qquad (2)$$
where N is the total number of apps for this Naive Bayes model
estimation.
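The estimator in Equation (2) can be sketched as follows, implemented exactly as the formula is written above. The function name `estimate_theta`, the default values a0 = b0 = 1, and the toy dataset are assumptions for illustration.

```python
def estimate_theta(apps, a0=1.0, b0=1.0):
    """Per-permission estimate following Equation (2).

    apps: list of N binary permission vectors, each of length M.
    Returns a list of M values (count_m + a0) / (N + a0 + b0).
    """
    n = len(apps)
    num_permissions = len(apps[0])
    return [
        (sum(x[m] for x in apps) + a0) / (n + a0 + b0)
        for m in range(num_permissions)
    ]


# Hypothetical dataset: 4 apps, 3 permissions.
apps = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
theta_hat = estimate_theta(apps)
print(theta_hat)  # approximately [0.833, 0.5, 0.167]
```

Note that the estimates for the first and third permissions are neither 1 nor 0 even though every app (respectively, no app) requests them. The prior counts keep each estimate strictly inside (0, 1), so an app requesting a previously unseen permission still has a finite risk score.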
The Basic Naive Bayes Model (BNB). In the Basic Naive Bayes
(BNB) model, we use an uninformative prior and set $a_0 = b_0 = 1$, so
that the Beta prior becomes a uniform distribution on [0,1]. With
