
Book Chapter

Towards Proactive Spam Filtering (Extended Abstract)

29 Jun 2009, pp. 38-47

TL;DR: This paper introduces a more proactive approach that directly collects spam messages by interacting with spam botnet controllers and generates templates that represent a concise summary of a spam run.
Abstract: With increasing security measures in network services, remote exploitation is getting harder. As a result, attackers concentrate on more reliable attack vectors like email: victims are infected using either malicious attachments or links leading to malicious websites. Therefore, efficient filtering and blocking methods for spam messages are needed. Unfortunately, most spam filtering solutions proposed so far are reactive: they require a large amount of both ham and spam messages to efficiently generate rules that differentiate between the two. In this paper, we introduce a more proactive approach that allows us to directly collect spam messages by interacting with the spam botnet controllers. We are able to observe current spam runs and obtain copies of the latest spam messages in a fast and efficient way. Based on the collected information, we are able to generate templates that represent a concise summary of a spam run. The collected data can then be used to improve current spam filtering techniques and open new avenues to efficiently filter mails.
Topics: Forum spam (71%), Spambot (69%), Srizbi botnet (68%), Spam and Open Relay Blocking System (66%), Botnet (60%)
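
As a rough illustration of the template idea described in the abstract, the sketch below collapses messages from one spam run into a template by keeping tokens that are invariant across the run and masking the rest. This is not the paper's actual algorithm; the token-wise alignment and the `<VAR>` wildcard marker are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): derive a rough template for a
# spam run by keeping tokens shared by every message and masking the rest.
# The "<VAR>" wildcard and the position-wise comparison are assumptions.

def spam_run_template(messages):
    """Collapse messages from one spam run into a concise template."""
    token_lists = [m.split() for m in messages]
    length = min(len(t) for t in token_lists)
    template = []
    for i in range(length):
        tokens_at_i = {t[i] for t in token_lists}
        # Keep the token if it is invariant across the run, otherwise mask it.
        template.append(tokens_at_i.pop() if len(tokens_at_i) == 1 else "<VAR>")
    return " ".join(template)

if __name__ == "__main__":
    run = [
        "Buy cheap meds at http://example-a.invalid now",
        "Buy cheap meds at http://example-b.invalid now",
    ]
    print(spam_run_template(run))
    # -> "Buy cheap meds at <VAR> now"
```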
Citations

Proceedings Article
20 Apr 2010
TL;DR: An automated supervised machine learning solution that utilises web navigation behaviour to detect web spambots, proposing a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users.
Abstract: Web robots have been widely used for various beneficial and malicious activities. Web spambots are a type of web robot that spreads spam content throughout the web by typically targeting Web 2.0 applications. They are intelligently designed to replicate human behaviour in order to bypass system checks. Spam content not only wastes valuable resources but can also mislead users to unsolicited websites and award undeserved search engine rankings to spammers’ campaign websites. While most of the research in anti-spam filtering focuses on the identification of spam content on the web, only a few have investigated the origin of spam content, hence identification and detection of web spambots still remains an open area of research. In this paper, we describe an automated supervised machine learning solution which utilises web navigation behaviour to detect web spambots. We propose a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users. Our experimental results show that our solution achieves a 96.24% accuracy in classifying web spambots.
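
To make the action-set idea concrete, here is a minimal sketch that encodes each session as a bag of navigation actions and trains an off-the-shelf classifier. The action names, toy data, and choice of a random forest are assumptions for illustration and are not the feature set or classifier used in the paper.

```python
# Hedged sketch of the "action set" idea: represent each web session as a bag
# of navigation actions and train an off-the-shelf classifier. Action names,
# toy data, and the random-forest choice are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

sessions = [
    {"view_thread": 1, "post_content": 5, "follow_link": 0},   # bot-like
    {"view_thread": 8, "post_content": 1, "follow_link": 4},   # human-like
]
labels = [1, 0]  # 1 = spambot, 0 = human

vec = DictVectorizer()
X = vec.fit_transform(sessions)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

new_session = {"view_thread": 0, "post_content": 7, "follow_link": 0}
# Post-heavy, browse-free behaviour: expected to be flagged on this toy data.
print(clf.predict(vec.transform([new_session])))
```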

40 citations


Proceedings Article
Feng Qian, Abhinav Pathak, Yu Charlie Hu, Zhuoqing Morley Mao, et al.
14 Jun 2010
TL;DR: To the authors' knowledge, SCA is the first unsupervised spam filtering scheme that achieves accuracy comparable to the de facto supervised spam filters by explicitly exploiting online campaign identification.
Abstract: Traditional content-based spam filtering systems rely on supervised machine learning techniques. In the training phase, labeled email instances are used to build a learning model (e.g., a Naive Bayes classifier or support vector machine), which is then applied to future incoming emails in the detection phase. However, the critical reliance on the training data becomes one of the major limitations of supervised spam filters. Preparing labeled training data is often labor-intensive and can delay the learning-detection cycle. Furthermore, any mislabeling of the training corpus (e.g., due to spammers' obfuscations) can severely affect the detection accuracy. Supervised learning schemes share one common mechanism regardless of their algorithm details: learning is performed on an individual email basis. This is the fundamental reason for requiring training data for supervised spam filters. In other words, in the learning phase these classifiers can never tell whether an email is spam or ham because they examine one email instance at a time. We investigate the feasibility of a completely unsupervised-learning-based spam filtering scheme which requires no training data. Our study is motivated by three key observations of the spam in today's Internet. (1) The vast majority of emails are spam. (2) A spam email should always belong to some campaign [2, 3]. (3) Spam messages from the same campaign are generated from templates that obfuscate some parts of the spam, e.g., sensitive terms, leaving the other parts unmodified [3]. These observations suggest that in principle we can achieve unsupervised spam detection by examining emails at the campaign level. In particular, we need robust spam identification algorithms to find common terms shared by spam belonging to the same campaign. These common terms form signatures that can be used to detect future spam of the same campaign. This paper presents SpamCampaignAssassin (SCA), an online unsupervised spam learning and detection scheme. SCA performs accurate spam campaign identification, campaign signature generation, and spam detection using campaign signatures. To our knowledge, SCA is the first unsupervised spam filtering scheme that achieves accuracy comparable to the de facto supervised spam filters by explicitly exploiting online campaign identification. The full paper describing SCA is available as a technical report [4].
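
A toy sketch of the campaign-level idea described above (not SCA itself): group messages whose term sets overlap strongly, take the terms shared by a whole group as the campaign signature, and flag future messages that contain a signature. The Jaccard threshold of 0.5 is an arbitrary assumption.

```python
# Rough sketch of campaign-signature detection (not SCA's actual algorithm):
# cluster messages by term-set overlap, use the terms common to a whole
# cluster as the campaign signature, and match new messages against it.

def terms(msg):
    return set(msg.lower().split())

def group_into_campaigns(messages, threshold=0.5):
    campaigns = []  # each campaign is a list of term sets
    for msg in messages:
        t = terms(msg)
        for campaign in campaigns:
            rep = campaign[0]
            if len(t & rep) / len(t | rep) >= threshold:  # Jaccard similarity
                campaign.append(t)
                break
        else:
            campaigns.append([t])
    # A campaign signature is the set of terms shared by every member.
    return [set.intersection(*c) for c in campaigns]

def is_spam(msg, signatures):
    t = terms(msg)
    return any(sig and sig <= t for sig in signatures)
```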

26 citations


Cites methods from "Towards Proactive Spam Filtering (E..."

  • ...(4) The final category includes automated systems like [46, 12, 31]. AutoRE [46] is a technique that automatically extracts regular expressions from URLs that satisfy distributed and burstiness criteria as signatures (e....

    [...]


Journal Article
TL;DR: It is concluded that finding pitfalls in the usage of tools by cybercriminals has the potential to increase the efficiency of disruption, interception, and prevention approaches.
Abstract: This work presents an overview of some of the tools that cybercriminals employ to trade securely. It will look at the weaknesses of these tools and how the behavior of cybercriminals will sometimes...

17 citations


01 Jan 2010
Abstract: Spambots are a new type of Internet robot that spread spam content through Web 2.0 applications like online discussion boards, blogs, wikis, social networking platforms, etc. These robots are intelligently designed to act like humans in order to fool safeguards and other users. Such spam content not only wastes valuable resources and time but may also mislead users with unsolicited content. Spam content typically intends to misinform users (scams), generate traffic, make sales (marketing/advertising), and occasionally compromise parties, people, or systems by spreading spyware or malware. Current countermeasures do not effectively identify and prevent web spambots. Proactive measures to deter spambots from entering a site are limited to question/response scenarios. The remaining efforts then focus on spam content identification as a passive activity. Spammers have evolved their techniques to bypass existing anti-spam filters. In this paper, we describe a rule-based web usage behaviour action string that can be analysed using Trie data structures to detect web spambots. Our experimental results show the proposed system is successful for on-the-fly classification of web spambots, hence eliminating spam in Web 2.0 applications.
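
The following sketch illustrates the trie-based matching idea under an assumed single-character action encoding (e.g. "p" = post, "v" = view); it is not the rule set or implementation from the paper.

```python
# Illustrative sketch only: store known spambot action strings in a trie and
# flag sessions whose action sequence begins with a stored pattern. The
# single-character action encoding is an assumption.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_spambot_pattern = False

class ActionTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, action_string):
        node = self.root
        for action in action_string:
            node = node.children.setdefault(action, TrieNode())
        node.is_spambot_pattern = True

    def matches(self, action_string):
        """True if any prefix of the session equals a known spambot pattern."""
        node = self.root
        for action in action_string:
            if action not in node.children:
                return False
            node = node.children[action]
            if node.is_spambot_pattern:
                return True
        return False

trie = ActionTrie()
trie.insert("ppp")            # e.g. three consecutive posts with no browsing
print(trie.matches("pppv"))   # -> True
```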

11 citations


Book
12 Jul 2011
TL;DR: This thesis introduces two new malware detection sensors that make use of network traffic analysis and so-called honeypots, studies the change in exploit behavior, and derives predictions about the preferred targets of today's malware.
Abstract: Many different network and host-based security solutions have been developed in the past to counter the threat of autonomously spreading malware. Among the most common detection techniques for such attacks are network traffic analysis and so-called honeypots. In this thesis, we introduce two new malware detection sensors that make use of the above-mentioned techniques. The first sensor, called Rishi, passively monitors network traffic to automatically detect bot-infected machines. The second sensor, called Amun, follows the concept of honeypots and detects malware through the emulation of vulnerabilities in network services that are commonly exploited. Both sensors were operated for two years and collected valuable data on autonomously spreading malware in the Internet. From this data we were able, for example, to study the change in exploit behavior and derive predictions about the preferred targets of today's malware.
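
For flavour, a Rishi-style passive check is often described as scoring observed IRC nicknames against bot-like patterns; the regexes, weights, and threshold below are illustrative assumptions and not Rishi's actual rule set.

```python
# Hedged sketch of passive bot detection by nickname scoring (assumed rules,
# not Rishi's real signature set).
import re

SUSPICIOUS_PATTERNS = [
    (re.compile(r"\d{4,}"), 2),          # long digit runs, e.g. "DEU|8231941"
    (re.compile(r"\[[A-Z]{2,3}\]"), 1),  # country tags like "[DE]"
    (re.compile(r"^[A-Z]{3}\|"), 1),     # "XXX|" style prefixes
]

def nickname_score(nick):
    return sum(w for pattern, w in SUSPICIOUS_PATTERNS if pattern.search(nick))

def looks_infected(nick, threshold=3):
    return nickname_score(nick) >= threshold

print(looks_infected("DEU|8231941"))  # -> True  (digit run + "XXX|" prefix)
print(looks_infected("alice"))        # -> False
```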

5 citations


References

Proceedings Article
01 Jul 1998
TL;DR: This work examines methods for the automated construction of filters to eliminate unwanted junk email messages from a user's mail stream, and shows the efficacy of such filters in a real-world usage scenario, arguing that this technology is mature enough for deployment.
Abstract: In addressing the growing problem of junk E-mail on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. By casting this problem in a decision theoretic framework, we are able to make use of probabilistic learning methods in conjunction with a notion of differential misclassification cost to produce filters which are especially appropriate for the nuances of this task. While this may appear, at first, to be a straightforward text classification problem, we show that by considering domain-specific features of this problem in addition to the raw text of E-mail messages, we can produce much more accurate filters. Finally, we show the efficacy of such filters in a real world usage scenario, arguing that this technology is mature enough for deployment.
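
A minimal sketch of the decision-theoretic idea: estimate P(spam | message) with a probabilistic model and only filter a message when the probability exceeds a cost-derived threshold, so misclassifying legitimate mail is penalised more heavily than letting spam through. The toy data, the Naive Bayes choice, and the 0.999 threshold are assumptions for illustration rather than values taken from this abstract.

```python
# Sketch: cost-sensitive filtering with a Naive Bayes probability estimate.
# Only block a message when the model is very confident it is spam.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = [
    "buy cheap meds now", "limited offer buy now",   # spam
    "meeting agenda attached", "lunch tomorrow?",    # ham
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

def is_spam(message, threshold=0.999):
    """Filter only when the classifier is very sure the message is spam."""
    p_spam = clf.predict_proba(vec.transform([message]))[0, 1]
    return p_spam > threshold
```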

1,515 citations


Journal Article
H. Drucker, Donghui Wu, Vladimir Vapnik
TL;DR: The use of support vector machines (SVMs) in classifying e-mail as spam or nonspam is studied by comparing them to three other classification algorithms: Ripper, Rocchio, and boosting decision trees; SVMs performed best when using binary features.
Abstract: We study the use of support vector machines (SVM) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM performed best when using binary features. For both data sets, boosting trees and SVM had acceptable test performance in terms of accuracy and speed. However, SVM had significantly less training time.
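
The "binary features" finding can be illustrated with a presence/absence vectorizer feeding a linear SVM; the toy corpus and the LinearSVC choice are assumptions for illustration, not the paper's experimental setup.

```python
# Sketch: binary (presence/absence) term features with a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

msgs = [
    "win cash now", "cheap cash offer",              # spam
    "project status update", "see attached report",  # ham
]
labels = [1, 1, 0, 0]

vec = CountVectorizer(binary=True)   # 1 if a term occurs, regardless of count
clf = LinearSVC().fit(vec.fit_transform(msgs), labels)

# Likely classified as spam on this toy data.
print(clf.predict(vec.transform(["cash offer now"])))
```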

1,463 citations


Journal Article
01 Mar 2007
TL;DR: The design and implementation of CWSandbox is described, a malware analysis tool that fulfills the three design criteria of automation, effectiveness, and correctness for the Win32 family of operating systems.
Abstract: Malware is notoriously difficult to combat because it appears and spreads so quickly. In this article, we describe the design and implementation of CWSandbox, a malware analysis tool that fulfills our three design criteria of automation, effectiveness, and correctness for the Win32 family of operating systems.

742 citations


Posted Content
TL;DR: It is concluded that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
Abstract: It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (“spam”). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter’s performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
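
The cost-sensitive criterion and weighted evaluation measure referred to above are usually stated along the following lines; the exact notation is reconstructed from the common formulation and should be treated as an assumption rather than a quotation from the paper.

```latex
% Cost-sensitive decision rule: misclassifying ham is taken to be lambda
% times worse than misclassifying spam, so a message x is blocked only when
\[
  \frac{P(\text{spam}\mid \vec{x})}{P(\text{ham}\mid \vec{x})} > \lambda ,
  \qquad\text{equivalently}\qquad
  P(\text{spam}\mid \vec{x}) > \frac{\lambda}{1+\lambda}.
\]
% Weighted accuracy, with n_h, n_s the numbers of ham and spam messages and
% n_{h\to h}, n_{s\to s} the correctly classified ones:
\[
  \mathrm{WAcc} \;=\; \frac{\lambda\, n_{h\to h} + n_{s\to s}}{\lambda\, n_{h} + n_{s}} .
\]
```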

631 citations


Journal Article
Anirudh Ramachandran, Nick Feamster
11 Aug 2006
TL;DR: It is found that most spam is being sent from a few regions of IP address space, and that spammers appear to be using transient "bots" that send only a few pieces of email over very short periods of time.
Abstract: This paper studies the network-level behavior of spammers, including: IP address ranges that send the most spam, common spamming modes (e.g., BGP route hijacking, bots), how persistent across time each spamming host is, and characteristics of spamming botnets. We try to answer these questions by analyzing a 17-month trace of over 10 million spam messages collected at an Internet "spam sinkhole", and by correlating this data with the results of IP-based blacklist lookups, passive TCP fingerprinting information, routing information, and botnet "command and control" traces. We find that most spam is being sent from a few regions of IP address space, and that spammers appear to be using transient "bots" that send only a few pieces of email over very short periods of time. Finally, a small, yet non-negligible, amount of spam is received from IP addresses that correspond to short-lived BGP routes, typically for hijacked prefixes. These trends suggest that developing algorithms to identify botnet membership, filtering email messages based on network-level properties (which are less variable than email content), and improving the security of the Internet routing infrastructure, may prove to be extremely effective for combating spam.
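
As a toy illustration of filtering on network-level properties rather than message content, one could check whether a sender's IP address falls inside prefixes observed to emit mostly spam; the prefixes below are documentation placeholders, not ranges reported in the paper.

```python
# Toy network-level check: does the sending IP fall in a known spammy prefix?
import ipaddress

SPAMMY_PREFIXES = [
    ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")
]

def from_spammy_prefix(sender_ip):
    ip = ipaddress.ip_address(sender_ip)
    return any(ip in prefix for prefix in SPAMMY_PREFIXES)

print(from_spammy_prefix("192.0.2.45"))   # -> True
print(from_spammy_prefix("203.0.113.7"))  # -> False
```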

582 citations