A framework for detection and measurement of phishing attacks

doi:10.1145/1314389.1314391

Proceedings Article•DOI•

Sujata Garera¹, Niels Provos², Monica Chew², Aviel D. Rubin¹

Johns Hopkins University¹, Google²

02 Nov 2007-pp 1-8

TL;DR: It is found that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data.

Abstract: Phishing is a form of identity theft that combines social engineering techniques and sophisticated attack vectors to harvest financial information from unsuspecting consumers. Often a phisher tries to lure her victim into clicking a URL pointing to a rogue page. In this paper, we focus on studying the structure of URLs employed in various phishing attacks. We find that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data. We describe several features that can be used to distinguish a phishing URL from a benign one. These features are used to model a logistic regression filter that is efficient and has a high accuracy. We use this filter to perform thorough measurements on several million URLs and quantify the prevalence of phishing on the Internet today.
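
The idea of scoring a URL from its structure alone can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the five features, the token list, the toy URLs, and the plain stochastic-gradient training loop are hypothetical and are not the feature set or training procedure of the paper, which the abstract does not spell out.

```python
import math
import re
from urllib.parse import urlparse

# Hypothetical token list, chosen only for this illustration.
SUSPICIOUS_TOKENS = ("login", "signin", "account", "verify", "secure", "update")


def url_features(url):
    """Map a URL string to a small numeric vector without fetching the page."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    rest = (parsed.path + "?" + parsed.query).lower()
    return [
        1.0,  # bias term
        1.0 if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) else 0.0,  # raw IPv4 host
        float(host.count(".")),  # number of dots in the host name
        1.0 if "@" in parsed.netloc else 0.0,  # userinfo trick in the authority
        1.0 if any(tok in rest for tok in SUSPICIOUS_TOKENS) else 0.0,
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(urls, labels, epochs=500, lr=0.1):
    """Fit logistic regression weights by stochastic gradient ascent."""
    rows = [url_features(u) for u in urls]
    w = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                w[j] += lr * (y - p) * xj  # gradient of the log-likelihood
    return w


def phishing_probability(w, url):
    """Return the model's estimated probability that the URL is phishing."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, url_features(url))))


# Toy, made-up training data (1 = phishing, 0 = benign).
TOY_URLS = [
    "http://192.0.2.7/bank/login.php",
    "http://secure.example-bank.com.evil.example/verify",
    "http://www.example.com@203.0.113.9/account/update",
    "http://www.example.org/",
    "https://docs.example.com/guide/intro",
    "https://example.net/news/2007/11",
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    weights = train(TOY_URLS, TOY_LABELS)
    for candidate in ("http://198.51.100.4/signin", "https://www.example.edu/about"):
        print(candidate, round(phishing_probability(weights, candidate), 3))
```

A filter of this shape is cheap to evaluate because it only parses the URL string, which is what makes measurements over millions of URLs plausible; a real deployment would need far more data and carefully validated features.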

Citations

Proceedings Article•DOI•

Beyond blacklists: learning to detect malicious web sites from suspicious URLs

Justin Ma¹, Lawrence K. Saul¹, Stefan Savage¹, Geoffrey M. Voelker¹

University of California, San Diego¹

28 Jun 2009

TL;DR: This paper describes an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs.

Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.

806 citations

Cites background from "A framework for detection and measu..."

...is the most closely related to our study [9]....

Proceedings Article•DOI•

Identifying suspicious URLs: an application of large-scale online learning

Justin Ma¹, Lawrence K. Saul¹, Stefan Savage¹, Geoffrey M. Voelker¹

University of California, San Diego¹

14 Jun 2009

TL;DR: It is demonstrated that recently-developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.

Abstract: This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms as the size of the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently-developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.
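
The online setting described above, where each labeled URL is seen once and the model is updated immediately, can be sketched with a mistake-driven perceptron over lexical tokens. This is an illustrative assumption: the abstract does not name the specific online algorithms evaluated, and the tokenizer, class name, and example stream here are hypothetical.

```python
import re


def lexical_tokens(url):
    """Split a URL into lowercase tokens on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlinePerceptron:
    """Sparse perceptron: label +1 means malicious, -1 means benign."""

    def __init__(self):
        self.w = {}        # token -> weight; grows as new tokens appear
        self.mistakes = 0  # running count of wrong predictions on the stream

    def score(self, url):
        return sum(self.w.get(t, 0.0) for t in set(lexical_tokens(url)))

    def predict(self, url):
        return 1 if self.score(url) > 0 else -1

    def update(self, url, label):
        """Predict first, then adjust weights only if the prediction was wrong."""
        pred = self.predict(url)
        if pred != label:
            self.mistakes += 1
            for t in set(lexical_tokens(url)):
                self.w[t] = self.w.get(t, 0.0) + label
        return pred


# Made-up stream of (URL, label) pairs standing in for a real-time labeled feed.
STREAM = [
    ("http://paypal.example.tk/login/verify", 1),
    ("https://www.example.org/docs/index", -1),
    ("http://bank.example.tk/verify/account", 1),
    ("https://www.example.org/news/today", -1),
]

if __name__ == "__main__":
    model = OnlinePerceptron()
    for url, label in STREAM:
        model.update(url, label)
    print("mistakes on stream:", model.mistakes)
    print(model.predict("http://shop.example.tk/verify"))
```

Because the weight table is a dictionary, new tokens can enter the model at any point in the stream, which is one way an online learner can follow a feature distribution that keeps changing.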

567 citations

Cites methods from "A framework for detection and measu..."

...Related Work. The most direct comparison to our work comes from Garera et al. (2007), who classify phishing URLs using logistic regression over 18 hand-selected features....

Proceedings Article•DOI•

Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas¹, Chris Grier¹, Justin Ma¹, Vern Paxson¹, Dawn Song¹

University of California, Berkeley¹

22 May 2011

TL;DR: It is shown that Monarch can provide accurate, real-time protection, but that the underlying characteristics of spam do not generalize across web services, and the distinctions between email and Twitter spam are explored.

Abstract: On the heels of the widespread adoption of web services such as social networks and URL shorteners, scams, phishing, and malware have become regular threats. Despite extensive research, email-based spam filtering techniques generally fall short for protecting other web services. To better address this need, we present Monarch, a real-time system that crawls URLs as they are submitted to web services and determines whether the URLs direct to spam. We evaluate the viability of Monarch and the fundamental challenges that arise due to the diversity of web service spam. We show that Monarch can provide accurate, real-time protection, but that the underlying characteristics of spam do not generalize across web services. In particular, we find that spam targeting email qualitatively differs in significant ways from spam campaigns targeting Twitter. We explore the distinctions between email and Twitter spam, including the abuse of public web hosting and redirector services. Finally, we demonstrate Monarch's scalability, showing our system could protect a service such as Twitter -- which needs to process 15 million URLs/day -- for a bit under $800/day.

508 citations

Journal Article•DOI•

CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

Guang Xiang¹, Jason Hong¹, Carolyn Penstein Rosé¹, Lorrie Faith Cranor¹

Carnegie Mellon University¹

01 Sep 2011-ACM Transactions on Information and System Security

TL;DR: A layered anti-phishing solution that aims at exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and limiting the FP to a low level via filtering algorithms.

Abstract: Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is frail in terms of new phish, and the latter suffers from the scarcity of effective features and the high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms. Specifically, we proposed CANTINA+, the most comprehensive feature-based approach in the literature including eight novel features, which exploits the HTML Document Object Model (DOM), search engines and third party services with machine learning techniques to detect phish. Moreover, we designed two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate. We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish and over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, our CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
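
The two pre-filters mentioned in this abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the CANTINA+ implementation: the normalization step, the use of SHA-256, the "password input" test for a login form, and all function names are hypothetical choices made here.

```python
import hashlib
import re
from html.parser import HTMLParser


def page_fingerprint(html):
    """Hash the page after lowercasing and collapsing whitespace runs.

    Assumption: this only catches copies that differ in case or spacing;
    a production near-duplicate detector would tolerate larger edits.
    """
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class _PasswordFieldFinder(HTMLParser):
    """Records whether any <input type="password"> element occurs."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None for valueless attributes.
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def has_login_form(html):
    finder = _PasswordFieldFinder()
    finder.feed(html)
    return finder.found


def triage(html, known_phish_hashes):
    """Apply both filters before any (more expensive) classifier would run."""
    if page_fingerprint(html) in known_phish_hashes:
        return "phish (near-duplicate)"
    if not has_login_form(html):
        return "legitimate (no login form)"
    return "needs classifier"


if __name__ == "__main__":
    known = {page_fingerprint('<html> <body> <input type="password"> </body> </html>')}
    print(triage('<HTML>\n<body>\n  <input type="password">\n</body>\n</html>', known))
    print(triage("<html><body><p>Hello</p></body></html>", known))
```

Both checks are inexpensive, so running them first is one way a layered system could reduce false positives and save classifier time on pages that are trivially decided.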

462 citations

Journal Article•DOI•

The state of phishing attacks

Jason Hong¹

Carnegie Mellon University¹

01 Jan 2012-Communications of The ACM

TL;DR: Looking past the systems people use, they target the people using the systems.

Abstract: Looking past the systems people use, they target the people using the systems.

457 citations

References

Book•

Applied Logistic Regression

David W. Hosmer, Stanley Lemeshow

01 Jan 1989

TL;DR: Hosmer and Lemeshow provide an accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets.

Abstract: From the reviews of the First Edition. "An interesting, useful, and well-written book on logistic regression models... Hosmer and Lemeshow have used very little mathematics, have presented difficult concepts heuristically and through illustrative examples, and have included references."- Choice "Well written, clearly organized, and comprehensive... the authors carefully walk the reader through the estimation of interpretation of coefficients from a wide variety of logistic regression models . . . their careful explication of the quantitative re-expression of coefficients from these various models is excellent." - Contemporary Sociology "An extremely well-written book that will certainly prove an invaluable acquisition to the practicing statistician who finds other literature on analysis of discrete data hard to follow or heavily theoretical."-The Statistician In this revised and updated edition of their popular book, David Hosmer and Stanley Lemeshow continue to provide an amazingly accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets. Hosmer and Lemeshow extend the discussion from biostatistics and epidemiology to cutting-edge applications in data mining and machine learning, guiding readers step-by-step through the use of modeling techniques for dichotomous data in diverse fields. Ample new topics and expanded discussions of existing material are accompanied by a wealth of real-world examples-with extensive data sets available over the Internet.

35,847 citations

Journal Article•DOI•

Applied Logistic Regression.

A. J. Scott, David W. Hosmer, Stanley Lemeshow

01 Dec 1991-Biometrics

TL;DR: Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.

Abstract: A new edition of the definitive guide to logistic regression modeling for health science and other applications. This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-the-art techniques for building, interpreting, and assessing the performance of LR models. New and updated features include: a chapter on the analysis of correlated outcome data; a wealth of additional material for topics ranging from Bayesian methods to assessing model fit; rich data sets from real-world studies that demonstrate each method under discussion; and detailed examples and interpretation of the presented results as well as exercises throughout. Applied Logistic Regression, Third Edition is a must-have guide for professionals and researchers who need to model nominal or ordinal scaled outcome variables in public health, medicine, and the social sciences as well as a wide range of other fields and disciplines.

30,190 citations

"A framework for detection and measu..." refers background in this paper

...Consequently anyone surfing the web can now get their hands on these kits and launch their own phishing attack....

Book•

Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten, Eibe Frank, Mark Hall

25 Oct 1999

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.

Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations

Proceedings Article•

The PageRank Citation Ranking : Bringing Order to the Web

Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd

11 Nov 1999

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.

Abstract: The importance of a Web page is an inherently subjective matter, which depends on the readers' interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
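
The random-surfer picture in this abstract corresponds to a simple power iteration, sketched below on a tiny made-up link graph. The damping factor of 0.85, the fixed iteration count, the uniform handling of pages without out-links, and the function and variable names are assumptions for illustration; the abstract itself does not specify them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: "d" has no incoming links, "c" has three.
TOY_LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(TOY_LINKS).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

In this toy graph the page with the most incoming links ends up ranked highest and the page nobody links to ends up lowest, which matches the intuition of a surfer who mostly follows links and occasionally jumps to a random page.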

14,400 citations

Book•

Data Mining

Ian Witten

01 Jan 2008

TL;DR: In this paper, generalized estimating equations (GEE) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC are discussed.

Abstract: tic regression, and it concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEE’s) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEE’s and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.

9,995 citations