Targeted Online Password Guessing:
An Underestimated Threat
Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, Xinyi Huang
School of EECS, Peking University, Beijing 100871, China
School of Computing and Communications, Lancaster University, United Kingdom
School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China
{wangdingg, zhangzj, pwang}@pku.edu.cn; jeff.yan@lancaster.ac.uk; xyhuang81@gmail.com
ABSTRACT
While trawling online/offline password guessing has been intensively
studied, only a few studies have examined targeted online
guessing, where an attacker guesses a specific victim’s password
for a service by exploiting the victim’s personal information, such
as one sister password leaked from another of her accounts and some
personally identifiable information (PII). A key challenge for
targeted online guessing is to choose the most effective password
candidates, while the number of guess attempts allowed by a server’s
lockout or throttling mechanisms is typically very small.
We propose TarGuess, a framework that systematically characterizes
typical targeted guessing scenarios with seven sound mathematical
models, each of which is based on varied kinds of data
available to an attacker. These models allow us to design novel and
efficient guessing algorithms. Extensive experiments on 10 large
real-world password datasets show the effectiveness of TarGuess.
Particularly, TarGuess-I~IV capture the four most representative
scenarios and within 100 guesses: (1) TarGuess-I outperforms its
foremost counterpart by 142% against security-savvy users and by
46% against normal users; (2) TarGuess-II outperforms its foremost
counterpart by 169% against security-savvy users and by 72%
against normal users; and (3) both TarGuess-III and IV gain success
rates over 73% against normal users and over 32% against
security-savvy users. TarGuess-III and IV, for the first time, address
the issue of cross-site online guessing when given one of the
victim’s sister passwords and some PII.
Keywords
Password authentication; Targeted online guessing; Personal infor-
mation; Password reuse; Probabilistic model.
1. INTRODUCTION
Passwords firmly remain the most prevalent mechanism for user
authentication in various computer systems. To understand password
security, a number of probabilistic guessing models, e.g.,
Markov n-grams [21, 25] and probabilistic context-free grammars
(PCFG) [31, 35], have been successively proposed. A common
feature of these guessing models is that they characterize a trawling
offline guessing attacker who mainly works against leaked
password files and aims to crack as many accounts as possible.
As highlighted in [16], offline guessing attacks, whether trawling
or targeted, pose a real concern only in a very limited
circumstance: the server’s password file is leaked, the leakage
goes undetected, and the passwords are also properly hashed and
salted. Recent research [7, 16] has recognized that it should be the
role of websites to protect user passwords from offline guessing by
securely storing password files, while normal users only need to
choose passwords that can survive online guessing.
CCS’16, October 24-28, 2016, Vienna, Austria
© 2016 ACM. ISBN 978-1-4503-4139-4/16/10...$15.00
DOI: http://dx.doi.org/10.1145/2976749.2978339
Online guessing can be launched against the publicly facing
server by anyone using a browser at any time, with the primary
constraint being the number of guesses allowed. Trawling online
guessing mainly exploits users’ behavior of choosing popular
passwords [22, 34], and it can be well addressed by various security
mechanisms at the server (e.g., suspicious login detection [14],
rate-limiting and lockout [18]). However, targeted online guessing
(see Fig. 1) can exploit not only weak popular passwords, but also
passwords reused across sites and passwords containing personal
information. This is a serious security concern, since various
Personally Identifiable Information (PII) and leaked passwords have
become readily available due to unending data breaches [2, 3, 17].
For instance, the most recent large-scale PII data breach in April
2016 [3] involves 50 million Turkish citizens, accounting for 64%
of the population. According to the CNNIC 2015 report [1], over
78.2% of the 668 million Chinese netizens have suffered PII data
leakage. In a series of recent breaches, over 253 million American
netizens have become victims of PII and password leakage [27].
This indicates that the existing password creation rules (e.g., [15,
28]) and strength meters (e.g., [24, 32]) grounded on these trawling
guessing models [21, 25, 31, 35] mainly address the
limited offline guessing threat, taking no account of the targeted
online guessing threat, which is increasingly damaging and
realistic. This misplaced research focus is largely attributable to the
failure (see [7, 33]) of the academic world to identify the crux
of current practices and to suggest convincingly better password
solutions that could lead the industrial world.
The main challenge for targeted online password guessing is to
effectively characterize an attacker A’s guessing model, with
multiple dimensions of available information (see Fig. 2) well captured,
while the number of guesses allowed to A is small: the NIST
Authentication Guideline [18] requires Level 1 and 2 systems to
keep login failures to fewer than 100 per user account in any 30-day
period. The following explains why this is a challenge.
First, people’s password choices vary greatly from one another.
When creating a password, some people reuse an existing password,
and some modify an existing one; some incorporate
PII into their passwords, yet others do not; some favor digits,
some favor letters, and so on. Thus, a user population’s passwords
created for a given web service can differ greatly. Therefore, the
trawling guessing models [21, 25, 31, 35], which aim to produce
a single guess list for all users, are not suitable for characterizing
targeted online guessing.
Figure 1: Targeted online guessing.
Figure 2: Multiple info for A.
Second, users’ PII is highly heterogeneous. Some kinds of
PII (e.g., name and hobby) are composed of letters, some (e.g.,
birthday and phone number) are composed of digits, and some
(e.g., user name) are a mixture of letters, digits and symbols. Some
PII (e.g., name, birthday and hobby), as shown in Fig. 2, can be
directly used as password components, while others (e.g., gender
and education) cannot. As we will show, most of them have an
impact on people’s password choice. Thus, it is challenging to
automatically incorporate such heterogeneous PII into guessing
models at a large scale when the number of guess attempts allowed
is limited.
Third, users employ a diversified set of transformation rules to
modify passwords for cross-site reuse. As shown in [12, 32], when
given a password, there are over a dozen transformation rules,
such as insert, delete, capitalization and leet (e.g., password →
passw0rd) and the synthesized ones (e.g., password →
Passw0rd1), that a user can utilize to create a new password.
Prioritizing these rules for each individual user is not easy.
Moreover, which transformation rules users will apply for password
reuse is often context-dependent. Suppose attacker A targets
Alice’s eBay account, which requires passwords of length 8+, and
knows that Alice is in her 30s. With access to a sister password
Alice1978Yahoo leaked from Alice’s Yahoo account, A
will have a higher chance by guessing Alice1978eBay than by
guessing Alice1978, due to the inertia of human behaviors. Yet, when
Alice’s leaked password is 123456, A would more likely succeed
by guessing Alice1978 than by guessing Alice1978eBay. When site
password policies are also considered, the situation may further
vary. Such context dependence necessitates an adaptive,
semantics-aware cross-site guessing model.
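The kinds of transformation rules named above (capitalization, leet, a synthesized rule, and a site-name swap as in the Alice example) can be sketched in a few lines of Python. This is an illustration only: the function names, the single leet substitution and the fixed candidate order are assumptions made for the example, not the rule set or the prioritization used by TarGuess. A fixed order is precisely the simplification that the text argues is inadequate for real users.

```python
# Illustrative sketch only: rule names, the single leet substitution and
# the fixed ordering are assumptions, not the rules or priorities of TarGuess.

def capitalize_first(pw: str) -> str:
    """Capitalization rule: upper-case the first character."""
    return pw[:1].upper() + pw[1:]

def leet_o(pw: str) -> str:
    """A single leet rule: replace every 'o' with '0'."""
    return pw.replace("o", "0")

def append_digit(pw: str, digit: str = "1") -> str:
    """Insert rule: append one digit at the end."""
    return pw + digit

def replace_site_suffix(pw: str, old_site: str, new_site: str) -> str:
    """Swap a trailing site name (e.g., Yahoo -> eBay), if present."""
    if old_site and pw.lower().endswith(old_site.lower()):
        return pw[: len(pw) - len(old_site)] + new_site
    return pw

def candidates(sister_pw: str, old_site: str, new_site: str) -> list:
    """Return de-duplicated guesses in a fixed, hand-picked order."""
    guesses = [
        sister_pw,                                          # direct reuse
        replace_site_suffix(sister_pw, old_site, new_site), # site-name swap
        capitalize_first(sister_pw),
        leet_o(sister_pw),
        append_digit(capitalize_first(leet_o(sister_pw))),  # synthesized rule
    ]
    seen, ordered = set(), []
    for g in guesses:
        if g not in seen:
            seen.add(g)
            ordered.append(g)
    return ordered

print(candidates("password", "Yahoo", "eBay"))
print(candidates("Alice1978Yahoo", "Yahoo", "eBay"))
```

With the sister password `password`, the sketch yields the two examples from the text (`passw0rd` and `Passw0rd1`) among its guesses; with `Alice1978Yahoo` it yields `Alice1978eBay`.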
1.1 Related work
Zhang et al. [37] suggested an algorithm for predicting a user’s
future password from previous ones for the same account. Das et
al. [12] studied the password reuse issue, and proposed a cross-site
cracking algorithm. However, their algorithm is not optimal for
targeted online guessing for four reasons. First, it does not consider
common popular passwords (e.g., iloveyou and pa$$w0rd),
which do not involve reuse behaviors or user PII. Second, it
assumes that all users employ the transformation rules in a fixed
priority. Yet, as we observe, this priority is actually dynamic
and context-dependent. Third, their algorithm does not consider
various synthesized rules. Fourth, it is heuristics-based.
Li et al. [20] examined how users’ PII may impact password
security, and found that 60.1% of users incorporate at least one
kind of PII into their passwords. They proposed a semantics-rich
algorithm, Personal-PCFG, which considers six types of personal
information: name, birthdate, phone number, National ID, email
address and user name. However, as we will show, its length-based
PII matching and substitution approach captures user PII usages
inaccurately, greatly hindering the cracking efficiency.
Our TarGuess-I manages to overcome this issue by using a type-based
PII matching approach and gains drastic improvements.
1.2 Our contributions
In this work, we make the following key contributions:
• A practical framework. To overcome the challenges
  discussed above, we propose TarGuess, a framework to
  characterize typical targeted online guessing attacks, with
  sound probabilistic models (rather than ad hoc models or
  heuristics). TarGuess captures seven typical scenarios,
  each based on a different combination of the
  information available to the attacker.
• Four probabilistic algorithms. To model the most
  representative targeted guessing scenarios, we propose four
  algorithms by leveraging probabilistic techniques including
  PCFG, Markov and Bayesian theory. Our algorithms all
  significantly outperform prior art. We further show how
  they can be readily employed to deal with the three
  remaining attacking scenarios.
• An extensive evaluation. We perform a series of experiments
  to demonstrate both the efficacy and the general
  applicability of our algorithms. Our empirical results show that
  an overwhelming fraction of users’ passwords are vulnerable
  to our targeted online guessing. This suggests that the danger
  of this threat has been significantly underestimated.
• New insights. For example, type-based PII tags are more
  effective than length-based PII tags in targeted guessing.
  Simply incorporating many kinds of PII into algorithms will
  not increase success rates, which is counter-intuitive. The
  success rate of a guess decreases following Zipf’s law as the
  rank of this guess in the guess list increases.
2. PRELIMINARIES
We now explicate what kinds of user personal information are
considered in this work and elaborate on the security model.
2.1 Explication of personal information
The most prominent feature that differentiates a targeted guessing
attack from a trawling one is that the former involves
user-specific data, or so-called “personal info”. This term is sometimes
used interchangeably with the term “personally identifiable info”
[10, 20], while sometimes their definitions vary greatly across
situations, laws and regulations [23, 29]. Generally, a user’s personal
info is “any info relating to” this user [29], and it is broader than
PII. For better comprehension, in Table 1 we provide the first
classification of personal info in the case of password cracking,
making a systematic investigation of targeted guessing possible.
We divide user personal information into three kinds, with each
kind having a varied degree of secrecy, different roles in passwords
and various types of specific elements. The first kind is user PII
(e.g., name and gender), which is natively semipublic: public to
friends, colleagues, acquaintances, etc., yet private to strangers.
The second kind is user identification credentials, and parts of them
(e.g., user name) are public, while parts of them (e.g., password)
are exclusively private. The remaining user personal data falls into
the third kind and is irrelevant to this work. We further divide user
PII into two types: Type-1 and Type-2. Type-1 PII (e.g., name and
birthday) can be the building blocks of passwords, while Type-2
PII (e.g., gender and education [22]) may impact user behavior of
password creation yet cannot be directly used in passwords. Each
type of PII shapes our guessing algorithms quite distinctly.
Here we highlight a special kind of user personal information:
a user’s passwords at various web services. As shown in [12, 32],
users tend to reuse or modify their existing passwords at other sites
(called sister passwords) for new accounts. However, such sister
passwords are becoming more and more easily available due to the
unending catastrophic password file leakages (see [2, 4, 27]).

Table 1: Explication of user personal info (NID stands for National identification number, e.g., SSN; PW for password)
| Kind of personal info | Sub-kind | Degree of secrecy | Role in PWs | Considered in this work (✓) | Not considered in this work (×) |
|---|---|---|---|---|---|
| Personally identifiable information (PII) | Type-1 | Semipublic | Explicit | Name, Birthday, Phone number, NID | Place of birth, Likes, Hobbies, etc. |
| Personally identifiable information (PII) | Type-2 | Semipublic | Implicit | Gender, Age, Language | Faith, Disposition, Education, etc. |
| User identification credentials | | Private | Explicit | Passwords, Personal Identification Numbers | Finger prints, Private keys, etc. |
| User identification credentials | | Public | Explicit | User name, Email address | Debit card number, Health IDs, etc. |
| Other kinds of personal data | | | | | Employment, Financial records, etc. |
Table 2: A summary of the four most representative scenarios of targeted online guessing
| Attacking scenario | Public information (e.g., datasets and policies) | One sister password | Type-1 PII | Type-2 PII | Existing literature | Our model |
|---|---|---|---|---|---|---|
| Trawling #1 | ✓ | | | | Ref. [21, 25, 35] | |
| Targeted #1 | ✓ | | ✓ | | Ref. [20] | TarGuess-I |
| Targeted #2 | | ✓ | | | Ref. [12] | |
| Targeted #2 | ✓ | ✓ | | | None | TarGuess-II |
| Targeted #3 | ✓ | ✓ | ✓ | | None | TarGuess-III |
| Targeted #4 | ✓ | ✓ | ✓ | ✓ | None | TarGuess-IV |
As public password datasets are readily available, TarGuess-II and [12] are comparable because they exploit the same type of user PII.
A total of $7\ (= C_3^1 + C_3^2 + C_3^3)$ scenarios result from combining the three types of personal info. With TarGuess-I~IV, all 7 cases will be tackled in Sec. 4.
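The count of seven scenarios is simply the number of non-empty subsets of the three kinds of personal info listed in Table 2. A minimal Python sketch of that enumeration follows; the string labels are taken from the table headers, while the variable and function names are chosen here for illustration.

```python
from itertools import combinations

# The three kinds of personal info that an attacker may hold (Table 2).
INFO_TYPES = ["one sister password", "Type-1 PII", "Type-2 PII"]

def scenarios(info_types):
    """Enumerate every non-empty combination of the given info types."""
    result = []
    for r in range(1, len(info_types) + 1):
        result.extend(combinations(info_types, r))
    return result

all_scenarios = scenarios(INFO_TYPES)
for combo in all_scenarios:
    print(" + ".join(combo))
print(len(all_scenarios))  # C(3,1) + C(3,2) + C(3,3) = 3 + 3 + 1
```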
2.2 Security model
Without loss of generality, in this work we mainly focus on
the client-server architecture, the most common case of user au-
thentication, as shown in the right of Fig. 1. There are three
entities involved in a targeted online guessing attack: a user U,
an authentication server S and an attacker A.
User U has registered a password account at the server S. This
password is only known to S, though U’s passwords at other sites
may have already been publicly disclosed. S may be remote (e.g.,
an e-commerce site) or local (e.g., a password-protected mobile
device). To be realistic, we assume that S enforces some security
mechanisms such as suspicious login detection and lockout [14, 18],
and thus the number of guesses allowed to A is limited (e.g., $10^2$
[8, 18]). A knows some amount of personal info about U, and may
be a curious friend, a jealous wife, a blackmailer, or even an evil
hacker group that buys personal info from the underground market.
As there is a messy mixture of multiple dimensions of info (see
Fig. 2) potentially available to the attacker A, it is challenging to
characterize A. We tackle this issue by assuming that all the public
info (e.g., leaked PW lists and site policies) is available to
A, and then by defining a series of attacking scenarios (see Table
2) based on the varied types of U’s personal info given to A. This is
reasonable: (1) A is smart and likely to exploit the readily available
public info to increase her chance; and (2) A would use different
attacking strategies when given different personal info. Once A
has successfully guessed the password, the victim’s sensitive info
can be disclosed, her reputation could be ruined (see [36]), her password
account may be hijacked and money might be lost (see [26]).
Note that here we only consider scenarios where A has at
most one sister password of user U. The underlying reason is
that, among the 547.56M leaked password accounts that we
have collected over a period of six years, less than 1.02% (resp.
1.73%) of them have more than one match by email (resp. user
name). Similarly, among the 7.96M accounts collected by Das et
al. in 2014 [12], only 152 (0.00191%) of them have more than one
match by email. Therefore, it is realistic to assume that most users
have leaked one sister password, and that A can exploit this sister
password of U for attacking.
3. HUMAN BEHAVIORS OF PASSWORD CREATION
Here we report a large-scale empirical study of human behaviors
in creating passwords, in particular, how often users choose popular
passwords, how often they reuse passwords, and how often they make
use of their own PII.
Table 3: Basic information about our 10 password datasets
| Dataset | Web service | Language | When leaked | Total PWs | With PII |
|---|---|---|---|---|---|
| Dodonew | E-commerce | Chinese | Dec., 2011 | 16,258,891 | |
| CSDN | Programmer | Chinese | Dec., 2011 | 6,428,277 | |
| 126 | Email | Chinese | Dec., 2011 | 6,392,568 | |
| 12306 | Train ticketing | Chinese | Dec., 2014 | 129,303 | ✓ |
| Rockyou | Social forum | English | Dec., 2009 | 32,581,870 | |
| 000webhost | Web hosting | English | Oct., 2015 | 15,251,073 | |
| Yahoo | Web portal | English | July, 2012 | 442,834 | |
| Rootkit | Hacker forum | English | Feb., 2011 | 69,418 | ✓ |
| Xiaomi* | Mobile, cloud | Chinese | May, 2014 | 8,281,385 | |
| Xato | Synthesised | English | Feb., 2015 | 9,997,772 | |
* Xiaomi passwords are in salted-hash and will be used as real targets.
Table 4: Basic information about our personal-info datasets
| Dataset | Language | Number of items | Types of PII useful for this work |
|---|---|---|---|
| Hotel | Chinese | 20,051,426 | Name, Gender, Birthday, Phone, NID |
| 51job | Chinese | 2,327,571 | Email, Name, Gender, Birthday, Phone |
| 12306 | Chinese | 129,303 | Email, User name, Name, Gender, Birthday, Phone, NID |
| Rootkit | English | 69,324 | Email, User name, Name, Age, Birthday |
3.1 Our datasets
Our evaluation builds on ten large real-world password datasets
(see Table 3), including five from English sites and five from
Chinese sites. They were hacked by attackers or leaked by insiders,
and disclosed publicly on the Internet, and some of them have been
used in trawling password models [13, 19, 21]. Rootkit initially
contains 71,228 passwords hashed in MD5, and we recover 97.46%
of them by using our TarGuess-IV and various trawling guessing
models [21, 30] in one week. In total, these datasets consist of
95.83 million plain-text passwords and cover various popular web
services. The role of each dataset will be specified in Sec. 5.
In particular, two of these ten password datasets contain various
types of PII, as shown in Table 4. Besides, we further employ two
auxiliary PII datasets, aiming to augment the password datasets
by matching the email address to facilitate a more comprehensive
understanding of the role of PII in user-chosen passwords. While
most of the PII attributes in Chinese PII-associated datasets are
available, 17.90% of names and 54.04% of birthdays in Rootkit
are null. These missing attributes may hinder the effectiveness of
targeted attacks against Rootkit users. To the best of our knowledge,
our corpus is the largest and most diversified ever collected for
evaluating the security threat of targeted online guessing.
3.2 Popular passwords
Table 5 shows how often users from different services choose
popular passwords. It is disturbing that 0.79%~10.44% of
user-chosen passwords can be guessed by just using the top 10
passwords. Generally, top Chinese passwords are more concentrated
than English ones [34], which may imply that the former would be
more prone to online guessing. While most of the top Chinese
passwords are only made of simple digits, popular English ones tend to
be meaningful letter strings or keyboard patterns. Love plays an
important role: iloveyou and princess are among the top-10
lists of two English sites, while 5201314 and woaini1314,
both of which sound like “I love you forever and ever” in Chinese,
are among the top-10 lists of three Chinese sites. Other factors
such as culture (see 18881888) and site name (see rockyou and
rootkit) also show their impacts on password creation.
Table 5: Top-10 most popular passwords of each service
| Rank | Dodonew | CSDN | 126 | 12306 | Rockyou | 000webhost | Xato | Yahoo | Rootkit |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 123456 | 123456789 | 123456 | 123456 | 123456 | abc123 | 123456 | 123456 | 123456 |
| 2 | a123456 | 12345678 | 123456789 | a123456 | 12345 | 123456a | password | password | password |
| 3 | 123456789 | 11111111 | 111111 | 5201314 | 123456789 | 12qw23we | 12345678 | welcome | rootkit |
| 4 | 111111 | dearbook | password | 123456a | password | 123abc | qwerty | ninja | 111111 |
| 5 | 5201314 | 00000000 | 000000 | 111111 | iloveyou | a123456 | 123456789 | abc123 | 12345678 |
| 6 | 123123 | 123123123 | 123123 | woaini1314 | princess | 123qwe | 12345 | 123456789 | qwerty |
| 7 | a321654 | 1234567890 | 12345678 | 123123 | 1234567 | secret666 | 1234 | 12345678 | 123456789 |
| 8 | 12345 | 88888888 | 5201314 | 000000 | rockyou | YfDbUfNjH10305070* | 111111 | sunshine | 123123 |
| 9 | 000000 | 111111111 | 18881888 | qq123456 | 12345678 | asd123 | 1234567 | princess | qwertyui |
| 10 | 123456a | 147258369 | 1234567 | 1qaz2wsx | abc123 | qwerty123 | dragon | qwerty | 12345 |
| % of top-10 | 3.28% | 10.44% | 3.52% | 1.28% | 2.05% | 0.79% | 1.46% | 1.01% | 3.94% |
* The letter-part (i.e., YfDbUfNjH) can be mapped to a Russian word which means “navigator”. Why it is so popular is beyond our comprehension.
Figure 3: Fraction of PWs shared between two sites.
Fig. 3 illustrates the fraction of top-k passwords shared between
two different services with varied thresholds of k. Generally, the
fraction of shared passwords from the same language is substantial-
ly higher than that of shared passwords from different languages. In
addition, the fraction of shared passwords between any two services
is less than 60% at any threshold k larger than 10. This implies that
both language and service play an important role in shaping users’
top popular passwords.
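One way to compute the kind of quantity plotted in Fig. 3 is sketched below. The text does not spell out the exact definition, so this sketch assumes that the shared fraction at threshold k is the size of the intersection of the two top-k lists divided by k. The function names and the toy password lists are hypothetical.

```python
from collections import Counter

def top_k(passwords, k):
    """Return the k most frequent passwords of one service."""
    return [pw for pw, _ in Counter(passwords).most_common(k)]

def shared_fraction(pws_a, pws_b, k):
    """Assumed definition: |top-k(A) & top-k(B)| / k."""
    shared = set(top_k(pws_a, k)) & set(top_k(pws_b, k))
    return len(shared) / k

# Hypothetical toy data, not drawn from any of the datasets in Table 3.
site_a = ["123456"] * 3 + ["password"] * 2 + ["iloveyou"]
site_b = ["123456"] * 4 + ["5201314"] * 2 + ["password"]

for k in (1, 2, 3):
    print(k, shared_fraction(site_a, site_b, k))
```

On real datasets the same function would be evaluated over a range of k values for every pair of services.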
Rockyou and 000webhost share significantly fewer common
passwords than other pairs do. We examine these two datasets and
find that 99.29% of 000webhost passwords include both letters and
digits, indicating that this site enforces a password creation policy
that requires passwords to include both letters and digits. This can
also be corroborated by Table 5 where all top-10 000webhost
passwords are composed of both letters and digits. Similarly, we
find that CSDN requires passwords to be of length 8+.
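The policy inference described above amounts to measuring what fraction of a site's passwords satisfy a candidate rule. The sketch below applies two such checks to the top-10 lists of 000webhost and CSDN copied from Table 5; the function names are chosen for illustration, and a real analysis would of course run over the full datasets rather than ten entries.

```python
def has_letter_and_digit(pw):
    """True if the password contains at least one letter and one digit."""
    return any(c.isalpha() for c in pw) and any(c.isdigit() for c in pw)

def composition_stats(passwords, min_len=8):
    """Fractions of passwords satisfying two candidate policy rules."""
    n = len(passwords)
    return {
        "letters_and_digits": sum(has_letter_and_digit(p) for p in passwords) / n,
        "length_at_least_min": sum(len(p) >= min_len for p in passwords) / n,
    }

# Top-10 lists as given in Table 5.
webhost_top10 = ["abc123", "123456a", "12qw23we", "123abc", "a123456",
                 "123qwe", "secret666", "YfDbUfNjH10305070", "asd123",
                 "qwerty123"]
csdn_top10 = ["123456789", "12345678", "11111111", "dearbook", "00000000",
              "123123123", "1234567890", "88888888", "111111111", "147258369"]

print(composition_stats(webhost_top10))
print(composition_stats(csdn_top10))
```

A fraction close to 1 for a rule on the full dataset is what suggests that the site enforces it.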
3.3 Password reuse
While users have to maintain probably several times as many
password accounts as they did 10 years ago, human-memory ca-
pacity remains stable. As a result, users tend to cope by reusing
passwords across different services [16,32]. Several empirical stud-
ies [5, 12] have explored the password reuse behaviors of English
and European users, yet as far as we know, no empirical results
have been reported about Chinese users, who reached 668 million
by Dec., 2015 [11] and account for about 25% (and the largest
fraction) of the world’s Internet population.
To fill this gap, we intersect 12306 with Dodonew by matching
email, and further eliminate the users with identical password pairs.
This produces a new list 12306&Dodonew with two non-identical
sister passwords for each user. Similarly, we obtain two more
intersected Chinese password lists and three intersected English
lists as shown in Fig. 4. During the matching process, we find
that 34.02%~71.11% of Chinese users’ sister password pairs are
identical (and thus are eliminated), while these figures for English
users are 6.25%~21.96% (see Sec. 5.1). This suggests that our
English users reuse less.
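The intersection step can be sketched as a join on email address. The code below assumes each dataset has already been loaded into a dictionary that maps an email to a password; that representation, the function name and the toy records (using example.com addresses) are assumptions made for the illustration.

```python
def intersect_by_email(site_a, site_b):
    """Join two {email: password} dicts on email.

    Returns the non-identical sister-password pairs and the fraction of
    matched users whose two passwords are identical (and are dropped).
    """
    common = site_a.keys() & site_b.keys()
    identical = [e for e in common if site_a[e] == site_b[e]]
    pairs = {e: (site_a[e], site_b[e])
             for e in common if site_a[e] != site_b[e]}
    frac_identical = len(identical) / len(common) if common else 0.0
    return pairs, frac_identical

# Hypothetical toy records.
site_a = {"u1@example.com": "abc123",
          "u2@example.com": "wanglei1",
          "u3@example.com": "qwerty"}
site_b = {"u1@example.com": "abc123",
          "u2@example.com": "wanglei1982",
          "u4@example.com": "dragon"}

pairs, frac = intersect_by_email(site_a, site_b)
print(pairs)
print(frac)
```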
Figure 4: Using the Levenshtein-distance similarity metric to
measure the similarity of two passwords chosen by the same
user across different services. Results suggest that most users
modify passwords in a non-trivial way.
We employ the widely accepted Levenshtein-distance metric
to measure the similarity between two different passwords of a
given user. Fig. 4 shows that sister passwords of Chinese users
generally have higher similarity than those of English users,
implying that Chinese users modify passwords in less complex ways.
About 30% of the non-identical Chinese password pairs have
similarity scores in [0.7, 1.0], while this figure for our English
password pairs is less than 20%. We also employ the
longest-common-subsequence metric for measurement. Both metrics
show similar results. Our results imply that the majority of users
modify passwords in a non-trivial way, and it would be challenging
to model such users’ modification behaviors.
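A minimal sketch of a Levenshtein-based similarity score is given below. The edit distance itself is the standard dynamic-programming algorithm; the normalization to [0, 1] as one minus the distance divided by the longer length is an assumption made here, since the text does not state which normalization is used.

```python
def levenshtein(a, b):
    """Standard edit distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(similarity("password", "passw0rd"))   # one substitution out of 8
print(similarity("Alice1978Yahoo", "Alice1978eBay"))
```

Under this normalization, a score in [0.7, 1.0] means that at most 30% of the longer password's length had to be edited.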
We have observed that our English users reuse less and modify
passwords in more complex ways. A plausible reason for this observation
is that the two English sites are not normal: Rootkit is a hacker
forum and 000webhost is mainly used by web administrators.
Therefore, the users of both sites are likely to be more security-savvy
than normal users. Thus, the lists Rootkit&000webhost,
Rootkit&Yahoo and 000webhost&Yahoo will show more
secure reuse behaviors than those of normal English/Chinese users.
In 2014, Das et al. [12] found that the fraction of identical sister
PW pairs of normal English users is 43%, which roughly accords
with our Chinese users yet is 2~6 times higher than for our English users.
They also showed that about 30% of their non-identical English PW
pairs have similarity scores in [0.7, 1.0], well in accord with that of
our Chinese users. Moreover, the survey results on password reuse
behaviors of normal Chinese users [32] are largely consistent with
the survey results on normal English users [12]. Both empirical and
survey results suggest that normal Chinese and English users have
similar reuse behaviors, while our English users would be good
representatives of security-savvy users.

Table 6: Percentages of users building passwords with (and only with) their own heterogeneous personal information. Each cell gives "with / only with".
| Typical usages of personal information (examples) | PII-Dodonew (161,510) | PII-126 (30,741) | PII-CSDN (77,439) | PII-12306 (129,303) | PII-Rootkit (69,330) | PII-Yahoo (214) | PII-000webhost (2,950) |
|---|---|---|---|---|---|---|---|
| Full_name (lei wang, john smith) | 4.68 / 0.82 | 3.00 / 1.32 | 4.85 / 1.81 | 5.02 / 1.13 | 1.38 / 0.75 | 2.34 / 1.87 | 2.44 / 1.32 |
| Family_name (wang, smith) | 11.15 / 0.01 | 6.16 / 0.00 | 9.75 / 0.00 | 11.23 / 0.00 | 2.28 / 0.78 | 4.67 / 1.87 | 3.73 / 1.46 |
| Given_name (lei, john) | 6.49 / 0.07 | 4.10 / 0.12 | 6.26 / 0.08 | 6.61 / 0.07 | 0.49 / 0.07 | 0.93 / 0.00 | 0.75 / 0.20 |
| Abbr. full_name (wl, lwang, js, jsmith) | 13.64 / 0.02 | 6.36 / 0.00 | 9.42 / 0.00 | 13.13 / 0.00 | 0.15 / 0.01 | 0.00 / 0.00 | 0.20 / 0.00 |
| Birthday (19820607, 06071982, 07061982) | 3.12 / 1.00 | 3.70 / 2.77 | 6.29 / 5.16 | 4.33 / 1.77 | 0.08 / 0.06 | 0.47 / 0.00 | 0.10 / 0.07 |
| Year of birthday (1982) | 8.92 / 0.00 | 8.84 / 0.01 | 11.37 / 0.00 | 10.78 / 0.00 | 0.75 / 0.01 | 1.40 / 0.00 | 1.12 / 0.00 |
| Date of birthday (0607, 0706) | 8.32 / 0.00 | 10.48 / 0.02 | 11.84 / 0.00 | 10.03 / 0.00 | 0.44 / 0.01 | 0.47 / 0.00 | 0.58 / 0.00 |
| Abbr. birthday (198267, 671982, 761982, 820607, 060782) | 2.37 / 0.59 | 2.60 / 1.71 | 2.89 / 1.45 | 3.31 / 1.12 | 0.10 / 0.05 | 0.00 / 0.00 | 0.20 / 0.14 |
| Family_name+birthday (wang19820607, smith06071982) | 0.08 / 0.08 | 0.05 / 0.05 | 0.03 / 0.03 | 0.14 / 0.14 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| Family_name+Abbr. birthday ① (wang198267, smith671982) | 0.11 / 0.11 | 0.03 / 0.02 | 0.05 / 0.05 | 0.15 / 0.14 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| Family_name+Abbr. birthday ② (wang820607, smith060782) | 0.17 / 0.17 | 0.07 / 0.07 | 0.13 / 0.11 | 0.17 / 0.16 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| Family_name+year of birth (wang1982, smith1982) | 0.55 / 0.22 | 0.20 / 0.07 | 0.22 / 0.07 | 0.64 / 0.25 | 0.01 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| Family_name+date of birth (wang0607, smith0607) | 0.12 / 0.09 | 0.05 / 0.03 | 0.08 / 0.04 | 0.16 / 0.12 | 0.01 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| User name (icemoon12, bluebirdz) | 1.54 / 1.14 | 0.54 / 0.38 | 0.61 / 0.43 | 1.96 / 1.32 | 1.59 / 0.92 | 2.34 / 1.40 | 2.20 / 1.32 |
| Email_prefix (l0veu4ever@example.com) | 5.07 / 3.07 | 2.52 / 1.60 | 4.35 / 2.48 | 3.03 / 1.82 | 0.77 / 0.44 | 4.21 / 1.87 | 1.32 / 0.78 |
| Phone number (11-digit Chinese mobile number 13511336677) | 0.10 / 0.10 | 0.48 / 0.45 | 0.50 / 0.45 | 0.07 / 0.01 | | | |
| ‘a’+birthday (a19820607, a06071982, a07061982) | 0.16 / 0.13 | 0.04 / 0.02 | 0.03 / 0.02 | 0.16 / 0.12 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| Full_name+1 (wanglei1, johnsmith1) | 1.49 / 0.22 | 0.51 / 0.03 | 0.84 / 0.03 | 1.65 / 0.17 | 0.06 / 0.01 | 0.00 / 0.00 | 0.03 / 0.00 |
All the decimals in the table use ‘%’ as the unit. For instance, 4.68 in the top left corner means that 4.68% of the 161,510 PII-associated Dodonew users
employ their full name to build passwords; 0.82 means that 0.82% of these 161,510 Dodonew users’ passwords are just their full names.
(a) Gender on freq. distribution.
(b) Age on length distribution.
Figure 5: Impact of type-2 PII on user password creation. Both
gender and age show tangible impacts.
3.4 Password containing personal info
We show in Table 6 how often users employ their own PII to
build passwords. Since some password lists have no PII (see Table
3), we correlate them with the PII datasets of the same language
in Table 4 by matching email. As a result, seven PII-associated
password lists are produced, and they are much more diversified
than those in [20]. The sample size of each PII-associated dataset
is shown in the first row of Table 4. As expected, highly
heterogeneous PII becomes components of passwords, and users like
to use names, birthdays and their variations. Particularly, a
non-negligible fraction of users employ just their full names
(0.75%∼1.87%) as passwords, and 1.00%∼5.16% of Chinese users use
just their birthdays as passwords. Surprisingly, email and user name
prevail in passwords of both user groups, ranging from 0.77% to 5.07%
and from 0.54% to 2.34%, respectively. In comparison, English
users exhibit more secure behavior in PII usage, since our English
users represent security-savvy ones.
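The measurement described above (correlating a password list with a PII dataset by email, then counting PII usage) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `pii_usage_rates`, the dictionary layout and the toy records are all hypothetical, and only the full-name row of the table is covered.

```python
from typing import Dict, Tuple


def pii_usage_rates(passwords_by_email: Dict[str, str],
                    full_name_by_email: Dict[str, str]) -> Tuple[int, float, float]:
    """Return (sample size, % of passwords containing the full name,
    % of passwords that are exactly the full name)."""
    # Correlate the two datasets by email: keep only accounts found in both.
    matched = [(pw.lower(), full_name_by_email[email].lower())
               for email, pw in passwords_by_email.items()
               if email in full_name_by_email]
    n = len(matched)
    if n == 0:
        return 0, 0.0, 0.0
    contains = sum(1 for pw, name in matched if name in pw)
    equals = sum(1 for pw, name in matched if pw == name)
    return n, 100.0 * contains / n, 100.0 * equals / n


# Toy records, invented purely for illustration.
passwords = {
    "a@example.com": "wanglei1",
    "b@example.com": "johnsmith",
    "c@example.com": "123456",
    "d@example.com": "iloveyou",
    "e@example.com": "qwerty",  # no PII record, so this account is dropped
}
names = {
    "a@example.com": "wanglei",
    "b@example.com": "johnsmith",
    "c@example.com": "lina",
    "d@example.com": "zhangwei",
}

size, pct_contains, pct_equals = pii_usage_rates(passwords, names)
print(size, pct_contains, pct_equals)  # 4 50.0 25.0
```

The two percentages correspond to the two columns reported per dataset in Table 6: passwords built from a PII item, and passwords that are exactly that item.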
Fig. 5 illustrates the impact of type-2 PII: (1) passwords of
Dodonew female users are more concentrated; (2) passwords of
Dodonew users in age ≤24 and age ≥46 have quite similar length
distributions (pairwise $\chi^2$ test, p-value = 0.009), while users
in age 25∼45 are significantly different in length distributions
(p-values < $10^{-6}$). Similar results are found in all other datasets.
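As a rough illustration of the pairwise $\chi^2$ comparison of length distributions mentioned above, the sketch below computes Pearson's statistic for two password-length histograms. The function name `chi_square_homogeneity` and the sample counts are hypothetical, and the paper does not state how its test was implemented.

```python
from typing import List, Tuple


def chi_square_homogeneity(a: List[int], b: List[int]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for two
    histograms over the same bins (e.g., counts per password length)."""
    if len(a) != len(b):
        raise ValueError("histograms must use the same bins")
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each histogram needs at least one observation")
    total = total_a + total_b
    stat = 0.0
    used_bins = 0
    for count_a, count_b in zip(a, b):
        column = count_a + count_b
        if column == 0:
            continue  # a bin that is empty in both groups carries no information
        used_bins += 1
        # Expected counts if both groups shared one length distribution.
        expected_a = total_a * column / total
        expected_b = total_b * column / total
        stat += (count_a - expected_a) ** 2 / expected_a
        stat += (count_b - expected_b) ** 2 / expected_b
    return stat, used_bins - 1


# Invented counts for two age groups over two length bins.
stat, dof = chi_square_homogeneity([10, 30], [30, 10])
print(stat, dof)  # 20.0 1
```

The p-value would then be the upper-tail probability of a chi-square distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available.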
Type-based PII matching. To achieve accuracy in PII recognition,
we propose a type-based PII segment matching method: besides
the traditional PCFG-based L, D, S tags [35], we employ a few
kinds of PII tags (e.g., N for name and B for birthday), and each
subscript number of our PII tags stands for a particular sub-type of
one kind of PII considered. For instance, $N_1$ denotes the usage of
family name (e.g., li), $B_5$ denotes the usage of year in birthday
(e.g., 1982) and so on. More details will be given in Sec. 4.1.
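The idea of type-based tags can be sketched with a naive greedy matcher. This is only an assumption-laden illustration: the tag names N1 (family name) and B5 (birth year) follow the text, but the function `tag_password`, the greedy longest-match strategy and the way non-PII runs are labeled with L/D/S plus their length are hypothetical choices, not the algorithm of Sec. 4.1.

```python
from typing import Dict, List


def char_class(ch: str) -> str:
    """PCFG-style character class: L(etter), D(igit) or S(ymbol)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    return "S"


def tag_password(password: str, pii: Dict[str, str]) -> List[str]:
    """Tag a password with PII type tags where a PII value matches,
    and with L/D/S run tags (class + run length) elsewhere."""
    tags: List[str] = []
    lowered = password.lower()
    values = {tag: value.lower() for tag, value in pii.items() if value}
    i = 0
    while i < len(lowered):
        # Prefer the longest PII value that starts at position i.
        best_tag, best_len = None, 0
        for tag, value in values.items():
            if lowered.startswith(value, i) and len(value) > best_len:
                best_tag, best_len = tag, len(value)
        if best_tag is not None:
            tags.append(best_tag)
            i += best_len
            continue
        # Otherwise extend a run of same-class characters, stopping
        # early if a PII value begins inside the run.
        cls = char_class(lowered[i])
        j = i + 1
        while (j < len(lowered) and char_class(lowered[j]) == cls
               and not any(lowered.startswith(v, j) for v in values.values())):
            j += 1
        tags.append(f"{cls}{j - i}")
        i = j
    return tags


print(tag_password("li.520", {"N1": "li", "B5": "1952"}))      # ['N1', 'S1', 'D3']
print(tag_password("wang1982", {"N1": "wang", "B5": "1982"}))  # ['N1', 'B5']
```

Because each PII value is matched as a whole by type, a two-letter family name such as li is recognized, while a digit fragment such as 520 stays a plain digit segment. The greedy matcher does not resolve ambiguous matches (a short name occurring inside a longer word would still be tagged), which a real implementation would need to handle.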
This type-based method is inherently different from the length-based
PII matching method given in an independent study [20]. To avoid
mismatching, only PII segments with len ≥ 3 are considered in [20].
For instance, a match with any substring of length ≥ 3 (e.g., 195,
952, 520) of a birthday 19520123 will be considered as a birthday
match. However, this introduces both under-estimations and
over-estimations in PII matching. For example, the password li.520
of a user named “Wei Li” with birthday 19520123 will be tagged
as $L_2S_1\mathrm{Birth}_3$, because the family name li is of
length <3. As 20% of the top-50 Chinese family names are with
length <3 (e.g., li, wu and he), a large fraction of users’ name
usages may be
under-estimated by [20]. For instance, 30,926 (23.9%) of the roughly
129K 12306 users have a family name with len ≤ 2, and 4,346 of these
30,926 users indeed use their family name in passwords, yet this
fact cannot be captured in [20].
On the other hand, the segments (e.g., 123, 520 and 201) in
the most popular all-digit passwords (e.g., 123456, 123456789,
5201314) would often coincide with user birthdays and phone
numbers, leading to over-estimations of their usages in passwords.
As we will show in Sec. 4.1, this length-based matching method
also introduces a weakness in the guess generation process when
performing cracking, while either increasing or decreasing the
length threshold will not eliminate the problem.
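The under- and over-estimation described for length-based matching can be reproduced with a small sketch. The function `longest_shared_substring` is a hypothetical stand-in for the matching rule attributed to [20] (any shared substring of length at least 3 counts as a PII match); it is not code from either study.

```python
from typing import Optional


def longest_shared_substring(password: str, pii_value: str,
                             min_len: int = 3) -> Optional[str]:
    """Return the longest substring of pii_value (length >= min_len)
    that also occurs in password, or None if there is no such substring."""
    pw, value = password.lower(), pii_value.lower()
    # Try the longest candidate substrings first.
    for size in range(len(value), min_len - 1, -1):
        for start in range(len(value) - size + 1):
            piece = value[start:start + size]
            if piece in pw:
                return piece
    return None


birthday = "19520123"

# Under-estimation: the family name "li" is shorter than the threshold,
# so its use in the password is never detected.
print(longest_shared_substring("li.520", "li"))       # None

# Over-estimation: fragments of popular digit strings coincide with
# fragments of the birthday and are counted as birthday usage.
print(longest_shared_substring("li.520", birthday))   # 520
print(longest_shared_substring("5201314", birthday))  # 5201
```

Lowering `min_len` to 2 would detect li but admit even more coincidental digit matches, while raising it would miss more short names, which mirrors the remark that moving the threshold does not eliminate the problem.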
Summary. Our PII-associated password corpus is so far the largest
and most diversified ever collected for evaluating targeted online
guessing. Particularly, it is the first to cover (security-savvy)
English users. While users’ three vulnerable behaviors might be
potentially exploited to improve cracking, our results show that
varied circumstances (e.g., language, service and policy),
non-trivial transformation rules and highly heterogeneous PII all
make it a challenging task to automate this process, especially
when given a limited guessing number (e.g., 100 by NIST [8, 18]).
4. TARGUESS: A FRAMEWORK FOR TARGETED ONLINE GUESSING
We now propose TarGuess, a practical framework that effectively
addresses the realistic yet challenging problem of modeling
various targeted online guessing scenarios.
As shown in Fig. 6, TarGuess consists of three phases (i.e.,
preparing, training and guessing). The design of the first and third
phases is straightforward, and the main task lies in the second one.
TarGuess captures four types of the most representative targeted
online guessing scenarios, with each type based on varied kinds
of personal information available to A (see Table 2): (i) only

References

The Science of Guessing: Analyzing an Anonymized Corpus of 70 Million Passwords
TL;DR: It is estimated that passwords provide fewer than 10 bits of security against an online, trawling attack, and only about 20 bits of security against an optimal offline dictionary attack, when compared with a uniform distribution which would provide equivalent security against different forms of guessing attack.

Password Cracking Using Probabilistic Context-Free Grammars
TL;DR: This paper discusses a new method that generates password structures in highest probability order by automatically creating a probabilistic context-free grammar based upon a training set of previously disclosed passwords, and then generating word-mangling rules to be used in password cracking.

The Tangled Web of Password Reuse
TL;DR: This paper investigates for the first time how an attacker can leverage a known password from one site to more easily guess that user's password at other sites and develops the first cross-site password-guessing algorithm, able to guess 30% of transformed passwords within 100 attempts.

Fast dictionary attacks on passwords using time-space tradeoff
TL;DR: It is demonstrated that as long as passwords remain human-memorable, they are vulnerable to "smart-dictionary" attacks even when the space of potential passwords is large, calling into question viability of human-memorable character-sequence passwords as an authentication mechanism.
Frequently Asked Questions (13)
Q1. What are the contributions mentioned in the paper "Targeted online password guessing: an underestimated threat" ?

While trawling online/offline password guessing has been intensively studied, only a few studies have examined targeted online guessing, where an attacker guesses a specific victim’s password for a service, by exploiting the victim’s personal information such as one sister password leaked from her another account and some personally identifiable information (PII). The authors propose TarGuess, a framework that systematically characterizes typical targeted guessing scenarios with seven sound mathematical models, each of which is based on varied kinds of data available to an attacker. 

The authors believe that the new algorithms and knowledge of effectiveness of targeted guessing models can shed light on both existing password practice and future password research. 

The authors employ the widely accepted Levenshtein-distance metric to measure the similarity between two different passwords of a given user. 

Trawling online guessing mainly exploits users’ behavior of choosing popular passwords [22, 34], and it can be well addressed by various security mechanisms at the server (e.g., suspicious login detection [14], rate-limiting and lockout [18]). 

Since Rockyou does not contain email or user name, the authors further match Xato with Rootkit to obtain 15,304 PII-associated Xato passwords to supplement Rockyou. 

Recent research [7, 16] has realized that it should be the role of websites to protect user passwords from offline guessing by securely storing password files, while normal users only need to choose passwords that can survive online guessing. 

As shown in [12, 32], when given a password, there are over a dozen transformation rules, such as insert, delete, capitalization and leet (e.g., password→ passw0rd) and the synthesized ones (e.g., password→ Passw0rd1), that a user can utilize to create a new password. 

The main challenge for targeted online password guessing is to effectively characterize an attacker A’s guessing model, with multiple dimensions of available information (see Fig. 2) well captured, while the number of guesses allowed to A is small – the NIST Authentication Guideline [18] requires Level 1 and 2 systems to keep login failures less than 100 per user account in any 30-day period. 

In 2014, Das et al. [12] found that the fraction of identical sister PW pairs of normal English users is 43%, which roughly accords with the figure for the authors’ Chinese users yet is 2∼6 times higher than that for the authors’ English users. 

To model these four scenarios, the authors suggest four guessing models (I∼IV) by leveraging a number of probabilistic techniques such as PCFG, Markov and Bayesian theory. 

Since some password lists have no PII (see Table 3), the authors correlate them with the PII datasets of the same language in Table 4 by matching email. 

Online guessing can be launched against the publicly facing server by anyone using a browser at anytime, with the primary constraint being the number of guesses allowed. 

4. During the matching process, the authors find that 34.02%∼71.11% of Chinese users’ sister password pairs are identical (and thus are eliminated), while these figures for English users are 6.25%∼21.96% (see Sec. 5.1).