scispace - formally typeset
Search or ask a question
Proceedings Article•DOI•

Cantina: a content-based approach to detecting phishing web sites

TL;DR: The design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF information retrieval algorithm, are presented.
Abstract: Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users into revealing private information. In this paper, we present the design, implementation, and evaluation of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF information retrieval algorithm. We also discuss the design and evaluation of several heuristics we developed to reduce false positives. Our experiments show that CANTINA is good at detecting phishing sites, correctly labeling approximately 95% of phishing sites.

Content maybe subject to copyright    Report

Citations
More filters
Book•
11 Jan 2013
TL;DR: Outlier Analysis is a comprehensive exposition, as understood by data mining experts, statisticians and computer scientists, and emphasis was placed on simplifying the content, so that students and practitioners can also benefit.
Abstract: With the increasing advances in hardware technology for data collection, and advances in software technology (databases) for data organization, computer scientists have increasingly participated in the latest advancements of the outlier analysis field. Computer scientists, specifically, approach this field based on their practical experiences in managing large amounts of data, and with far fewer assumptions the data can be of any type, structured or unstructured, and may be extremely large. Outlier Analysisis a comprehensive exposition, as understood by data mining experts, statisticians and computer scientists. The book has been organized carefully, and emphasis was placed on simplifying the content, so that students and practitioners can also benefit. Chapters will typically cover one of three areas: methods and techniques commonly used in outlier analysis, such as linear methods, proximity-based methods, subspace methods, and supervised methods; data domains, such as, text, categorical, mixed-attribute, time-series, streaming, discrete sequence, spatial and network data; and key applications of these methods as applied to diverse domains such as credit card fraud detection, intrusion detection, medical diagnosis, earth science, web log analytics, and social network analysis are covered.

1,278 citations


Cites background from "Cantina: a content-based approach t..."

  • ...Numerous other applications of outlier analysis may be found in the domains of aviation safety [128], internet routing updates [373], malicious URL detection [319, 509], disease outbreaks [465], spatial linking of criminal incidents [311], failure management of large computer clusters [390], astronomical data [147, 213, 477], disturbance events in terrestrial ecosystems [370] and biological sequences [308]....

    [...]

Proceedings Article•DOI•
28 Jun 2009
TL;DR: This paper describes an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs.
Abstract: Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.

806 citations

Proceedings Article•DOI•
08 May 2007
TL;DR: This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites, and correctly identify over 96% of the phishing emails while only mis-classifying on the order of 0.1%" of the legitimate emails.
Abstract: Each month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. This attack method, commonly known as "phishing," is most commonly initiated by sending out emails with links to spoofed websites that harvest information. We present a method for detecting these attacks, which in its most general form is an application of machine learning on a feature set designed to highlight user-targeted deception in electronic communication. This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites. We evaluate this method on a set of approximately 860 such phishing emails, and 6950 non-phishing emails, and correctly identify over 96% of the phishing emails while only mis-classifying on the order of 0.1% of the legitimate emails. We conclude with thoughts on the future for such techniques to specifically identify deception, specifically with respect to the evolutionary nature of the attacks and information available.

641 citations

Proceedings Article•DOI•
18 Jul 2007
TL;DR: The design and evaluation of Anti-Phishing Phil, an online game that teaches users good habits to help them avoid phishing attacks are described and it is confirmed that games can be an effective way of educating people about phishing and other security attacks.
Abstract: In this paper we describe the design and evaluation of Anti-Phishing Phil, an online game that teaches users good habits to help them avoid phishing attacks We used learning science principles to design and iteratively refine the game We evaluated the game through a user study: participants were tested on their ability to identify fraudulent web sites before and after spending 15 minutes engaged in one of three anti-phishing training activities (playing the game, reading an anti-phishing tutorial we created based on the game, or reading existing online training materials) We found that the participants who played the game were better able to identify fraudulent web sites compared to the participants in other conditions We attribute these effects to both the content of the training messages presented in the game as well as the presentation of these materials in an interactive game format Our results confirm that games can be an effective way of educating people about phishing and other security attacks

531 citations

Proceedings Article•DOI•
Kurt Thomas1, Chris Grier1, Justin Ma1, Vern Paxson1, Dawn Song1 •
22 May 2011
TL;DR: It is shown that Monarch can provide accurate, real-time protection, but that the underlying characteristics of spam do not generalize across web services, and the distinctions between email and Twitter spam are explored.
Abstract: On the heels of the widespread adoption of web services such as social networks and URL shorteners, scams, phishing, and malware have become regular threats. Despite extensive research, email-based spam filtering techniques generally fall short for protecting other web services. To better address this need, we present Monarch, a real-time system that crawls URLs as they are submitted to web services and determines whether the URLs direct to spam. We evaluate the viability of Monarch and the fundamental challenges that arise due to the diversity of web service spam. We show that Monarch can provide accurate, real-time protection, but that the underlying characteristics of spam do not generalize across web services. In particular, we find that spam targeting email qualitatively differs in significant ways from spam campaigns targeting Twitter. We explore the distinctions between email and Twitter spam, including the abuse of public web hosting and redirector services. Finally, we demonstrate Monarch's scalability, showing our system could protect a service such as Twitter -- which needs to process 15 million URLs/day -- for a bit under $800/day.

508 citations


Cites background from "Cantina: a content-based approach t..."

  • ...Each of these features provides a means for potentially identifying common hosting infrastructure across spam....

    [...]

References
More filters
Book•
01 Jan 1983
TL;DR: Reading is a need and a hobby at once and this condition is the on that will make you feel that you must read.
Abstract: Some people may be laughing when looking at you reading in your spare time. Some may be admired of you. And some may want be like you who have reading hobby. What about your own feel? Have you felt right? Reading is a need and a hobby at once. This condition is the on that will make you feel that you must read. If you know are looking for the book enPDFd introduction to modern information retrieval as the choice of reading, you can find here.

12,059 citations


"Cantina: a content-based approach t..." refers background in this paper

  • ...It is unknown precisely how much phishing costs each year since impacted industries are reluctant to release figures; estimates range from $1 billion [24] to 2.8 billion [27] per year....

    [...]

Proceedings Article•DOI•
22 Apr 2006
TL;DR: This paper provides the first empirical evidence about which malicious strategies are successful at deceiving general users by analyzing a large set of captured phishing attacks and developing a set of hypotheses about why these strategies might work.
Abstract: To build systems shielding users from fraudulent (or phishing) websites, designers need to know which attack strategies work and why. This paper provides the first empirical evidence about which malicious strategies are successful at deceiving general users. We first analyzed a large set of captured phishing attacks and developed a set of hypotheses about why these strategies might work. We then assessed these hypotheses with a usability study in which 22 participants were shown 20 web sites and asked to determine which ones were fraudulent. We found that 23% of the participants did not look at browser-based cues such as the address bar, status bar and the security indicators, leading to incorrect choices 40% of the time. We also found that some visual deception attacks can fool even the most sophisticated users. These results illustrate that standard security indicators are not effective for a substantial fraction of users, and suggest that alternative approaches are needed.

1,368 citations

Journal Article•DOI•
TL;DR: Sometimes a "friendly" email message tempts recipients to reveal more online than they otherwise would, playing right into the sender's hand.
Abstract: Sometimes a "friendly" email message tempts recipients to reveal more online than they otherwise would, playing right into the sender's hand.

995 citations

Proceedings Article•DOI•
08 May 2007
TL;DR: This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites, and correctly identify over 96% of the phishing emails while only mis-classifying on the order of 0.1%" of the legitimate emails.
Abstract: Each month, more attacks are launched with the aim of making web users believe that they are communicating with a trusted entity for the purpose of stealing account information, logon credentials, and identity information in general. This attack method, commonly known as "phishing," is most commonly initiated by sending out emails with links to spoofed websites that harvest information. We present a method for detecting these attacks, which in its most general form is an application of machine learning on a feature set designed to highlight user-targeted deception in electronic communication. This method is applicable, with slight modification, to detection of phishing websites, or the emails used to direct victims to these sites. We evaluate this method on a set of approximately 860 such phishing emails, and 6950 non-phishing emails, and correctly identify over 96% of the phishing emails while only mis-classifying on the order of 0.1% of the legitimate emails. We conclude with thoughts on the future for such techniques to specifically identify deception, specifically with respect to the evolutionary nature of the attacks and information available.

641 citations

Proceedings Article•DOI•
22 Apr 2006
TL;DR: It is found that many subjects do not understand phishing attacks or realize how sophisticated such attacks can be, and security toolbars are found to be ineffective at preventingPhishing attacks.
Abstract: Security toolbars in a web browser show security-related information about a website to help users detect phishing attacks. Because the toolbars are designed for humans to use, they should be evaluated for usability -- that is, whether these toolbars really prevent users from being tricked into providing personal information. We conducted two user studies of three security toolbars and other browser security indicators and found them all ineffective at preventing phishing attacks. Even though subjects were asked to pay attention to the toolbar, many failed to look at it; others disregarded or explained away the toolbars' warnings if the content of web pages looked legitimate. We found that many subjects do not understand phishing attacks or realize how sophisticated such attacks can be.

604 citations

Trending Questions (1)
CANTINA : A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

The paper does not mention CANTINA as a feature-rich machine learning framework for detecting phishing web sites.