Book ChapterDOI

Towards Proactive Spam Filtering (Extended Abstract)

29 Jun 2009, pp. 38-47
TL;DR: This paper introduces a more proactive approach that allows spam messages to be collected directly by interacting with the spam botnet controllers, and generates templates that represent a concise summary of a spam run.
Abstract: With increasing security measures in network services, remote exploitation is getting harder. As a result, attackers concentrate on more reliable attack vectors like email: victims are infected using either malicious attachments or links leading to malicious websites. Therefore, efficient filtering and blocking methods for spam messages are needed. Unfortunately, most spam filtering solutions proposed so far are reactive: they require a large amount of both ham and spam messages to efficiently generate rules to differentiate between the two. In this paper, we introduce a more proactive approach that allows us to directly collect spam messages by interacting with the spam botnet controllers. We are able to observe current spam runs and obtain a copy of the latest spam messages in a fast and efficient way. Based on the collected information, we are able to generate templates that represent a concise summary of a spam run. The collected data can then be used to improve current spam filtering techniques and to develop new avenues for efficiently filtering mails.
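To make the template idea concrete, the following toy sketch (not the authors' actual algorithm, and using invented example messages) derives a rough template for one observed spam run by keeping the tokens that are identical across all collected messages and masking the positions that vary, such as recipient names and URLs.

```python
# Minimal sketch (not the paper's algorithm): summarize a spam run by keeping
# tokens that are identical across all collected messages of that run and
# masking positions that vary (e.g., names, URLs, salutations).

def derive_template(messages):
    """messages: list of spam bodies collected from one observed spam run."""
    token_lists = [m.split() for m in messages]
    length = min(len(t) for t in token_lists)   # compare the common prefix only
    template = []
    for i in range(length):
        tokens_at_i = {t[i] for t in token_lists}
        # invariant position -> literal token; varying position -> wildcard
        template.append(tokens_at_i.pop() if len(tokens_at_i) == 1 else "<VAR>")
    return " ".join(template)

run = [
    "Dear Alice buy cheap pills at http://example1.test now",
    "Dear Bob buy cheap pills at http://example2.test now",
]
print(derive_template(run))
# Dear <VAR> buy cheap pills at <VAR> now
```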
Citations
Proceedings ArticleDOI
20 Apr 2010
TL;DR: An automated supervised machine learning solution which utilises web navigation behaviour to detect web spambots is described, and a new feature set (referred to as an action set) is proposed as a representation of user behaviour to differentiate web spambots from human users.
Abstract: Web robots have been widely used for various beneficial and malicious activities. Web spambots are a type of web robot that spreads spam content throughout the web by typically targeting Web 2.0 applications. They are intelligently designed to replicate human behaviour in order to bypass system checks. Spam content not only wastes valuable resources but can also mislead users to unsolicited websites and award undeserved search engine rankings to spammers’ campaign websites. While most of the research in anti-spam filtering focuses on the identification of spam content on the web, only a few studies have investigated the origin of spam content; hence, the identification and detection of web spambots remains an open area of research. In this paper, we describe an automated supervised machine learning solution which utilises web navigation behaviour to detect web spambots. We propose a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users. Our experimental results show that our solution achieves a 96.24% accuracy in classifying web spambots.
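As a rough illustration of the behaviour-based detection described above, and not the paper's exact feature set or classifier, the sketch below reduces each session to a string of action tokens, builds bag-of-actions features, and trains a standard supervised classifier. The action names, training sessions, and labels are hypothetical.

```python
# Hedged sketch of the general idea: sessions become sequences of page-action
# tokens, which are vectorised and fed to a standard supervised classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical training data: each session is a space-separated action string;
# labels: 1 = spambot, 0 = human.
sessions = [
    "login post_comment post_comment post_comment logout",
    "search view_page view_page add_comment logout",
]
labels = [1, 0]

vectorizer = CountVectorizer(ngram_range=(1, 2))   # single actions and action pairs
X = vectorizer.fit_transform(sessions)
clf = SVC(kernel="linear").fit(X, labels)

new_session = ["register post_comment post_comment post_comment"]
print(clf.predict(vectorizer.transform(new_session)))  # e.g., [1]
```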

41 citations

Proceedings ArticleDOI
14 Jun 2010
TL;DR: To the authors' knowledge, SCA is the first unsupervised spam filtering scheme that achieves accuracy comparable to the de-facto supervised spam filters by explicitly exploiting online campaign identification.
Abstract: Traditional content-based spam filtering systems rely on supervised machine learning techniques. In the training phase, labeled email instances are used to build a learning model (e.g., a Naive Bayes classifier or support vector machine), which is then applied to future incoming emails in the detection phase. However, the critical reliance on the training data becomes one of the major limitations of supervised spam filters. Preparing labeled training data is often labor-intensive and can delay the learning-detection cycle. Furthermore, any mislabeling of the training corpus (e.g., due to spammers’ obfuscations) can severely affect the detection accuracy. Supervised learning schemes share one common mechanism regardless of their algorithm details: learning is performed on an individual email basis. This is the fundamental reason for requiring training data for supervised spam filters. In other words, in the learning phase these classifiers can never tell whether an email is spam or ham because they examine one email instance at a time. We investigate the feasibility of a completely unsupervised-learning-based spam filtering scheme which requires no training data. Our study is motivated by three key observations of the spam in today’s Internet. (1) The vast majority of emails are spam. (2) A spam email should always belong to some campaign [2, 3]. (3) The spam from the same campaign are generated from templates that obfuscate some parts of the spam, e.g., sensitive terms, leaving the other parts unmodified [3]. These observations suggest that in principle we can achieve unsupervised spam detection by examining emails at the campaign level. In particular, we need robust spam identification algorithms to find common terms shared by spam belonging to the same campaign. These common terms form signatures that can be used to detect future spam of the same campaign. This paper presents SpamCampaignAssassin (SCA), an online unsupervised spam learning and detection scheme. SCA performs accurate spam campaign identification, campaign signature generation, and spam detection using campaign signatures. To our knowledge, SCA is the first unsupervised spam filtering scheme that achieves accuracy comparable to the de-facto supervised spam filters by explicitly exploiting online campaign identification. The full paper describing SCA is available as a technical report [4].
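The campaign-level intuition above can be sketched as follows. This is not SCA's actual algorithm, only an assumed minimal version: terms that appear in nearly every email of one campaign form a signature, and a new email is flagged when it contains most of those signature terms.

```python
# Minimal sketch of the campaign-signature idea (not SCA itself).

def campaign_signature(campaign_emails, min_support=0.9):
    """Return terms that appear in at least min_support of the campaign's emails."""
    n = len(campaign_emails)
    counts = {}
    for mail in campaign_emails:
        for term in set(mail.lower().split()):
            counts[term] = counts.get(term, 0) + 1
    return {t for t, c in counts.items() if c / n >= min_support}

def matches(signature, email, threshold=0.8):
    """Flag an email that contains most of the signature terms."""
    terms = set(email.lower().split())
    return len(signature & terms) >= threshold * len(signature)

campaign = [
    "cheap meds at http://a.test for you alice",
    "cheap meds at http://b.test for you bob",
]
sig = campaign_signature(campaign)
print(matches(sig, "cheap meds at http://c.test for you carol"))  # True
```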

27 citations


Cites methods from "Towards Proactive Spam Filtering (Extended Abstract)"

  • ...(4) The final category includes automated systems like [46, 12, 31]. AutoRE [46] is a technique that automatically extracts regular expressions from URLs that satisfy distributed and burstiness criteria as signatures (e....


Journal ArticleDOI
TL;DR: It is concluded that finding pitfalls in the usage of tools by cybercriminals has the potential to increase the efficiency of disruption, interception, and prevention approaches.
Abstract: This work presents an overview of some of the tools that cybercriminals employ to trade securely. It will look at the weaknesses of these tools and how the behavior of cybercriminals will sometimes...

24 citations

01 Jan 2010
TL;DR: In this article, a rule-based web usage behavior action string, analyzed using Trie data structures, is proposed to detect web spambots and eliminate spam in Web 2.0 applications.
Abstract: Web spambots are a new type of internet robot that spread spam content through Web 2.0 applications like online discussion boards, blogs, wikis, social networking platforms, etc. These robots are intelligently designed to act like humans in order to fool safeguards and other users. Such spam content not only wastes valuable resources and time but may also mislead users with unsolicited content. Spam content typically intends to misinform users (scams), generate traffic, make sales (marketing/advertising), and occasionally compromise parties, people or systems by spreading spyware or malware. Current countermeasures do not effectively identify and prevent web spambots. Proactive measures to deter spambots from entering a site are limited to question/response scenarios, and the remaining efforts focus on spam content identification as a passive activity. Spammers have evolved their techniques to bypass existing anti-spam filters. In this paper, we describe a rule-based web usage behaviour action string that can be analysed using Trie data structures to detect web spambots. Our experimental results show the proposed system is successful at on-the-fly classification of web spambots, hence eliminating spam in Web 2.0 applications.
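A minimal sketch of the trie idea follows, under assumptions of our own since the paper's rules are not reproduced here: known spambot action sequences are stored in a trie, and an incoming session is flagged as soon as a stored sequence is a prefix of its observed actions.

```python
# Hedged sketch: a trie of known spambot action sequences used for
# on-the-fly prefix matching against an observed session.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_rule_end = False

class ActionTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, actions):
        node = self.root
        for a in actions:
            node = node.children.setdefault(a, TrieNode())
        node.is_rule_end = True

    def matches_prefix(self, actions):
        """True if some stored rule is a prefix of the observed actions."""
        node = self.root
        for a in actions:
            if node.is_rule_end:
                return True
            node = node.children.get(a)
            if node is None:
                return False
        return node.is_rule_end

trie = ActionTrie()
trie.insert(["register", "post", "post", "post"])   # hypothetical rule
print(trie.matches_prefix(["register", "post", "post", "post", "logout"]))  # True
```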

11 citations

Book
12 Jul 2011
TL;DR: This thesis introduces two new malware detection sensors that make use of network traffic analysis and so-called honeypots, and uses the collected data to study the change in exploit behavior and derive predictions about preferred targets of today's malware.
Abstract: Many different network and host-based security solutions have been developed in the past to counter the threat of autonomously spreading malware. Among the most common detection techniques for such attacks are network traffic analysis and so-called honeypots. In this thesis, we introduce two new malware detection sensors that make use of the above mentioned techniques. The first sensor, called Rishi, passively monitors network traffic to automatically detect bot infected machines. The second sensor, called Amun, follows the concept of honeypots and detects malware through the emulation of vulnerabilities in network services that are commonly exploited. Both sensors were operated for two years and collected valuable data on autonomously spreading malware in the Internet. From this data we were able to, for example, study the change in exploit behavior and derive predictions about preferred targets of today's malware.

5 citations

References
Proceedings ArticleDOI
25 Oct 2004
TL;DR: In this paper, the authors present quantitative data about SMTP traffic to MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) based on packet traces taken in December 2000 and February 2004, and show that the volume of email has increased by 866% between 2000 and 2004.
Abstract: This paper presents quantitative data about SMTP traffic to MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) based on packet traces taken in December 2000 and February 2004. These traces show that the volume of email has increased by 866% between 2000 and 2004. Local mail hosts utilizing black lists generated over 470,000 DNS lookups, which accounts for 14% of all DNS lookups that were observed on the border gateway of CSAIL on a given day in 2004. In comparison, DNS black list lookups accounted for merely 0.4% of lookups in December 2000. The distribution of the number of connections per remote spam source is Zipf-like in 2004, but not so in 2000. This suggests that black lists may be ineffective at fully stemming the tide of spam. We examined seven popular black lists and found that 80% of spam sources we identified are listed in some DNS black list. Some DNS black lists appear to be well-correlated with others, which should be considered when estimating the likelihood that a host is a spam source.
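For readers unfamiliar with how the DNS black list lookups mentioned above work on the wire: a client reverses the IPv4 octets, appends the blacklist zone, and issues an ordinary A-record query; any answer means the address is listed, while NXDOMAIN means it is not. The sketch below uses a placeholder zone name rather than any specific DNSBL.

```python
# Illustrative DNSBL lookup: reverse the IPv4 octets, append the blacklist
# zone, and perform a normal A-record query. The zone below is a placeholder.
import socket

def is_listed(ip, zone="dnsbl.example.org"):
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)   # any A record => the IP is listed
        return True
    except socket.gaierror:           # NXDOMAIN (or lookup failure) => not listed
        return False

print(is_listed("192.0.2.1"))
```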

205 citations

Proceedings Article
21 Apr 2009
TL;DR: For each spam campaign, spammers must gather and target a particular set of recipients, construct enticing message content, ensure sufficient IP address diversity to evade blacklists, and maintain sufficient content diversity to evade spam filters.
Abstract: Over the last decade, unsolicited bulk email (spam) has evolved dramatically in its volume, its delivery infrastructure and its content. Multiple reports indicate that more than 90% of all email traversing the Internet today is considered spam. This growth is partially driven by a multi-billion dollar anti-spam industry whose dedication to filtering spam in turn requires spammers to recruit botnets to send ever greater volumes to maintain profits. While we all bear witness to this evolution via the contents of our inboxes, far less is understood about the spammer’s viewpoint. In particular, for each spam campaign, spammers must gather and target a particular set of recipients, construct enticing message content, ensure sufficient IP address diversity to evade blacklists, and maintain sufficient content diversity to evade spam filters.

103 citations

15 Apr 2008
TL;DR: A new methodology, distribution infiltration, for measuring spam campaigns from the inside is explored, motivated by the observation that as spammers have migrated from open relays and open proxies to more complex malware-based "botnet" email distribution, they have unavoidably opened their infrastructure to outside observation.
Abstract: Over the last decade, unsolicited bulk email, or spam, has transitioned from a minor nuisance to a major scourge, adversely affecting virtually every Internet user. Industry estimates suggest that the total daily volume of spam now exceeds 120 billion messages per day [10]; even if the actual figure is 10 times smaller, this means thousands of unwanted messages annually for every Internet user on the planet. Moreover, spam is used not only to shill for cheap pharmaceuticals, but has also become the de facto delivery mechanism for a range of criminal endeavors, including phishing, securities manipulation, identity theft and malware distribution. This problem has spawned a multi-billion dollar anti-spam industry that in turn drives spammers to ever greater sophistication and scale. Today, even a single spam campaign may target hundreds of millions of email addresses, sent in turn via hundreds of thousands of compromised “bot” hosts, with polymorphic “message templates” carefully crafted to evade widely used filters. However, while there is a considerable body of research focused on spam from the recipient’s point of view, we understand considerably less about the sender’s perspective: how spammers test, target, distribute and deliver a large spam campaign in practice. At the heart of this discrepancy is the limited vantage point available to most research efforts. While it is straightforward to collect individual spam messages at a site (e.g., via a “spam trap”), short of infiltrating a spammer organization it is difficult to observe a campaign being orchestrated in its full measure. We believe ours is the first study to approach the problem from this direction. In this paper, we explore a new methodology, distribution infiltration, for measuring spam campaigns from the inside. This approach is motivated by the observation that as spammers have migrated from open relays and open proxies to more complex malware-based “botnet” email distribution, they have unavoidably opened their infrastructure to outside observation. By hooking into a botnet’s command-and-control (C&C) protocol, one can infiltrate a spammer’s distribution platform and measure spam campaigns as they occur. In particular, we present an initial analysis of spam campaigns conducted by the well-known Storm botnet, based on data we captured by infiltrating its distribution platform. We first look at the system components used to support spam campaigns. These include a work queue model for distributing load across the botnet, a modular campaign framework, a template language for introducing per-message polymorphism, delivery feedback for target list pruning, per-bot address harvesting for acquiring new targets, and special test campaigns and email accounts used to validate that new spam templates can bypass filters. We then also look at the dynamics of how such campaigns unfold. We analyze the address lists to characterize the targeting of different campaigns, delivery failure rates (a metric of address list “quality”), and estimated total campaign sizes as extrapolated from a set of samples. From these estimates, one such campaign, focused on perpetuating the botnet itself, spewed email to around 400 million email addresses during a three-week period.
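The "message template" mechanism mentioned above can be illustrated with a deliberately simplified sketch; it is not Storm's template language, only a stand-in showing how macro expansion over small token pools yields polymorphic copies of one campaign message. All subjects and URLs below are invented.

```python
# Simplified stand-in for a polymorphic message template: per-recipient macro
# expansion produces many distinct copies of the same campaign message.
import random

TEMPLATE = "Subject: {subject}\n\nHello {name},\nvisit {url} today!"

SUBJECTS = ["You won", "Account notice", "Special offer"]     # hypothetical pools
URLS = ["http://example-a.test", "http://example-b.test"]

def instantiate(name):
    return TEMPLATE.format(subject=random.choice(SUBJECTS),
                           name=name,
                           url=random.choice(URLS))

for recipient in ["alice", "bob"]:
    print(instantiate(recipient))
    print("---")
```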

97 citations

Proceedings Article
06 Aug 2007
TL;DR: It is demonstrated that the history and the structure of the IP addresses can reduce the adverse impact of mail server overload, by increasing the number of legitimate e-mails accepted by a factor of 3.
Abstract: E-mail has become indispensable in today's networked society. However, the huge and ever-growing volume of spam has become a serious threat to this important communication medium. It not only affects e-mail recipients, but also causes a significant overload to the mail servers that handle e-mail transmission. We perform an extensive analysis of IP addresses and IP aggregates given by network-aware clusters in order to investigate properties that can distinguish the bulk of the legitimate mail from spam. Our analysis indicates that the bulk of the legitimate mail comes from long-lived IP addresses. We also find that the bulk of the spam comes from network clusters that are relatively long-lived. Our analysis suggests that network-aware clusters may provide a good aggregation scheme for exploiting the history and structure of IP addresses. We then consider the implications of this analysis for prioritizing legitimate mail. We focus on the situation when the mail server is overloaded, and the goal is to maximize the legitimate mail that it accepts. We demonstrate that the history and the structure of the IP addresses can reduce the adverse impact of mail server overload by increasing the number of legitimate e-mails accepted by a factor of 3.
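A rough sketch of the prioritization idea, under simplifying assumptions: /24 prefixes stand in for the paper's network-aware (BGP-based) clusters, and senders whose prefix has a longer legitimate-mail history are accepted first when the server is overloaded.

```python
# Simplified sketch: /24 prefixes approximate network-aware clusters, and a
# prefix's legitimate-mail history is used to rank senders under overload.
from collections import defaultdict

history_days = defaultdict(int)   # prefix -> days on which it sent legitimate mail

def prefix24(ip):
    return ".".join(ip.split(".")[:3]) + ".0/24"

def record_legitimate(ip):
    history_days[prefix24(ip)] += 1

def priority(ip):
    """Higher value = accept earlier when the mail server is overloaded."""
    return history_days[prefix24(ip)]

record_legitimate("198.51.100.7")
print(sorted(["203.0.113.9", "198.51.100.20"], key=priority, reverse=True))
# ['198.51.100.20', '203.0.113.9']
```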

64 citations

15 Apr 2008
TL;DR: A spam analysis study using sinkholes based on open relays reveals how LVS appear to coordinate with each other to share the spamming workload among themselves, and holds the promise of reverse engineering the workload distribution strategies used by the LVS coordinator.
Abstract: Understanding spammer behavior is a critical step in the long-lasting battle against email spam. Previous studies have focused on setting up honeypots or email sinkholes containing destination mailboxes for spam collection. A spam trace collected this way offers only the limited viewpoint of a single organizational domain and hence falls short of reflecting the global behavior of spammers. In this paper, we present a spam analysis study using sinkholes based on open relays. A relay sinkhole offers a unique vantage point for spam collection: it has a broader view of spam originating from multiple spam origins and destined for mailboxes belonging to multiple organizational domains. The trace collected using this methodology opens the door to studying spammer behaviors that were difficult to study using spam collected from a single organization. Seeing the aggregate behavior of spammers allows us to systematically separate High-Volume Spammers (HVS, e.g. direct spammers) from Low-Volume Spammers (LVS, e.g. low-volume bots in a botnet). Such a separation in turn gives rise to the notion of "spam campaigns", which reveals how LVS appear to coordinate with each other to share the spamming workload among themselves. A detailed spam campaign analysis holds the promise of finally reverse engineering the workload distribution strategies used by the LVS coordinator.
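The HVS/LVS separation can be sketched as a simple per-source volume split; the threshold below is arbitrary and the paper's actual criteria may differ.

```python
# Minimal sketch: split spam sources by the number of messages observed from
# each source IP at the relay sinkhole (threshold chosen for illustration only).
from collections import Counter

def split_spammers(source_ips, threshold=1000):
    """source_ips: one entry per observed spam message's source IP."""
    volume = Counter(source_ips)
    hvs = {ip for ip, n in volume.items() if n >= threshold}
    lvs = set(volume) - hvs
    return hvs, lvs
```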

43 citations