What would make BGP spectrum agility attacks more difficult to mount?

A routing infrastructure that instead provided protection against route hijacking (specifically, unauthorized announcement of IP address blocks) would make BGP spectrum agility attacks more difficult to mount.

How many blacklists were used to test this hypothesis?

To test this hypothesis, the authors used the results of real-time DNSBL lookups that Mail Avenger performed against eight different blacklists at the time the mail was received.

What is the common way of sending spam?

A small portion of spam is sent by sophisticated spammers, who briefly advertise IP prefixes, establish a connection to the victim’s mail relay, and withdraw the route to that IP address space after spam is sent.

What is the common reason for the large fraction of spam coming from Windows hosts?

Because a very large fraction of spam comes from Windows hosts, their hypothesis is that many of these machines are infected hosts that are bots.

How many spamming bots persist in the trace?

The persistence of Bobax-infected hosts appears to be mildly bimodal: although roughly 75% of Bobax drones persist for less than two minutes, the remainder persist for a day or longer, about 50 persist for about six months, and 10 persist for the entire length of the trace.

Why are the authors interested in measuring the persistence of IP addresses?

Since one of their objectives is to study the effectiveness of IP-based filtering (rather than, say, count the total number of hosts), the authors are interested more in measuring the persistence of IP addresses, not hosts.

What are the properties of network-level spam?

Network-level properties may be observable in the middle of the network, or closer to the source of the spam, which may allow spam to be quarantined or disposed of before it ever reaches a destination mail server.

What is the effect of the 'short-lived' routing announcements?

As an added benefit, route announcements for shorter IP prefixes (i.e., larger blocks of IP addresses) are less likely to be blocked by ISPs’ route filters than route announcements or hijacks for longer prefixes.

How many ASes appear among the top 10 persistent and voluminous spammers?

Only two ASes, AS 4788 (Telekom Malaysia) and AS 4678 (Canon Network Communications, in Japan), appear among both the top-10 most persistent and most voluminous spammers using short-lived BGP routing announcements.

How many hosts are responsible for the amount of spam the authors receive?

More striking is that, while only about 4% of the hosts from which the authors receive spam are running operating systems other than Windows, this small set of hosts appears to be responsible for at least 8% of the spam the authors receive.

What are the main characteristics of mail headers?

Although many aspects of mail headers can be forged, the authors base their analysis strictly on properties of the sender that are difficult to forge (e.g., IP addresses that made connections to their mail servers, passive TCP fingerprints, corresponding route announcements, etc.).

How did the authors determine that the spam was particularly prevalent?

Given the sophistication required to send spam under the protection of short-lived routing announcements (especially compared with the relative simplicity of purchasing access to a botnet), the authors doubted that it was particularly prevalent.

How do you explain the behavior of the spammers using this technique?

The authors are at a loss to explain certain aspects of this behavior, such as why some of the machines appear to have IP addresses from allocated space, when it would be simpler to “step around” the allocated prefix blocks, but, needless to say, the spammers using this technique appear to be very sophisticated.

(Open Access) Understanding the network-level behavior of spammers (2006) | Anirudh Ramachandran

What is the main reason for the skewed distribution of spam?

This heavily skewed distribution suggests that spam filtering efforts might better focus on identifying high-volume, persistent groups of spammers (e.g., by AS number), rather than on blacklisting individual IP addresses, many of which are transient.

Understanding the Network-Level Behavior of Spammers

Anirudh Ramachandran and Nick Feamster

College of Computing, Georgia Tech

{avr, feamster}@cc.gatech.edu

ABSTRACT

This paper studies the network-level behavior of spammers, including: IP address ranges that send the most spam, common spamming modes (e.g., BGP route hijacking, bots), how persistent across time each spamming host is, and characteristics of spamming botnets. We try to answer these questions by analyzing a 17-month trace of over 10 million spam messages collected at an Internet “spam sinkhole”, and by correlating this data with the results of IP-based blacklist lookups, passive TCP fingerprinting information, routing information, and botnet “command and control” traces.

We find that most spam is being sent from a few regions of IP address space, and that spammers appear to be using transient “bots” that send only a few pieces of email over very short periods of time. Finally, a small, yet non-negligible, amount of spam is received from IP addresses that correspond to short-lived BGP routes, typically for hijacked prefixes. These trends suggest that developing algorithms to identify botnet membership, filtering email messages based on network-level properties (which are less variable than email content), and improving the security of the Internet routing infrastructure, may prove to be extremely effective for combating spam.

Categories and Subject Descriptors

C.2.0 [Computer Communication Networks]: Security and protection; C.2.3 [Computer Communication Networks]: Network operations – network management

General Terms

Design, Management, Reliability, Security

Keywords

spam, botnet, BGP, network management, security

1. Introduction

This paper presents a study of the network-level characteristics of unsolicited commercial email (“spam”). Much attention has been devoted to studying the content of spam, but comparatively little attention has been paid to spam’s network-level properties. Conventional wisdom often asserts that most of today’s spam comes from botnets, and that a large fraction of spam comes from Asia; a few studies have attempted to quantify some of these characteristics [5].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SIGCOMM’06, September 11-15, 2006, Pisa, Italy.

Unfortunately, little is known about how much spam comes from botnets versus other techniques (e.g., short-lived route announcements, open relays, etc.), the geographic and topological distribution of where most spam originates (in terms of Internet Service Providers, countries, and IP address space), the extent to which different spammers use the same network resources, the stationarity of these properties over time, and so forth. A primary goal of this paper is to shed some light on these relatively unstudied questions.

Beyond merely exposing spammers’ behavior, gathering information about the network-level behavior of spam could be a major asset for designing spam filters that are based on spammers’ network-level behavior (presuming that the network-level characteristics of spam are sufficiently different from those of legitimate mail, a question we explore further in Section 4). Whereas spammers have the flexibility to alter the content of emails, both per-recipient and over time as users update spam filters, they have far less flexibility when it comes to altering the network-level properties of the spam they send. It is far easier for a spammer to alter the content of email messages to evade spam filters than it is for that spammer to change the ISP, IP address space, or botnet from which spam is sent.

Towards the goal of developing techniques that will help in the design of more robust network-level spam filters, this paper characterizes the network-level behavior of spammers as observed at a large spam sinkhole domain, which stores complete logs of all spam received from August 2004 through December 2005. We perform a joint analysis of the data collected at this sinkhole with an archive of BGP route advertisements as heard from the receiving network, traces from the “command and control” of a Bobax botnet, and traces of legitimate email from the mail server logs of a large email service provider. Although many aspects of mail headers can be forged, we base our analysis strictly on properties of the sender that are difficult to forge (e.g., IP addresses that made connections to our mail servers, passive TCP fingerprints, corresponding route announcements, etc.).

We draw the following surprising conclusions from our study:

• The vast majority of received spam arrives from a few concentrated portions of IP address space (Section 4). Spam filtering techniques currently make no assumptions about the distribution of spam across IP address space. In a related area, many worm propagation models assume a uniform distribution of vulnerable hosts across IP address space (e.g., [29]). In contrast, we find that the vast majority of spamming hosts (and, perhaps not coincidentally, most Bobax-infected hosts) lie within a small number of IP address space regions. Unfortunately, with a few exceptions (e.g., 60.* – 70.*), most legitimate email comes from the same regions of IP address space, which suggests that, in general, effective filtering based on network-level properties may require determining second-order characteristics (e.g., botnet membership).

• Most received spam is sent from Windows hosts, each of which sends a relatively small volume of spam to our domain (Section 5). Most bots send a relatively small volume of spam to our sinkhole (i.e., fewer than 100 pieces of spam over 17 months), and about three-quarters of them are only active for a single time period of less than two minutes (65% of them send all spam in a “single shot”).

• A small set of spammers continually use short-lived route announcements to remain untraceable (Section 6). A small portion of spam is sent by sophisticated spammers, who briefly advertise IP prefixes, establish a connection to the victim’s mail relay, and withdraw the route to that IP address space after spam is sent. Anecdotal evidence has suggested that spammers might be exploiting the routing infrastructure to remain untraceable [1, 30]; this paper quantifies and documents this activity for the first time. To our surprise, we discovered a new class of attack, where spammers attempt to evade detection by hijacking large IP address blocks (e.g., /8s) and sending spam from widely dispersed “dark” (i.e., unused or unallocated) IP addresses within this space.
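The persistence figures in the second finding above (for example, the share of bots active for less than two minutes) can be illustrated with a minimal sketch. It assumes the sinkhole log has been reduced to (IP address, timestamp in seconds) pairs; the function names and the record layout are hypothetical and are not taken from the paper. Persistence is measured per IP address rather than per host, since only addresses are visible to the receiver.

```python
from typing import Dict, Iterable, Tuple


def persistence_by_ip(events: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return, for each IP address, the time between its first and last spam (seconds)."""
    first_seen: Dict[str, float] = {}
    last_seen: Dict[str, float] = {}
    for ip, timestamp in events:
        first_seen[ip] = min(timestamp, first_seen.get(ip, timestamp))
        last_seen[ip] = max(timestamp, last_seen.get(ip, timestamp))
    return {ip: last_seen[ip] - first_seen[ip] for ip in first_seen}


def fraction_shorter_than(persistence: Dict[str, float], limit_s: float) -> float:
    """Fraction of IP addresses whose persistence is strictly below limit_s seconds."""
    if not persistence:
        return 0.0
    short = sum(1 for duration in persistence.values() if duration < limit_s)
    return short / len(persistence)
```

Calling fraction_shorter_than(persistence_by_ip(events), 120.0) would give the share of addresses active for less than two minutes under these assumptions. An address seen only once has persistence zero, which is one possible reading of a “single shot”.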

Beyond these findings, this paper’s joint analysis of several datasets provides a unique window into the network-level characteristics of spam. To our knowledge, this paper presents the first study that examines the interplay between spam, botnets, and the Internet routing infrastructure.

We acknowledge that our spam corpus represents only a single vantage point, and, as such, drawing general conclusions about Internet-wide spam is not possible. Our goal is not to present conclusive figures about Internet-wide characteristics of spam. Indeed, the data we have collected is a small, localized sample of all spam traffic, and our statistics may not be reflective of Internet-wide characteristics. However, the spam we have collected represents an interesting dataset as it reflects the complete set of spam emails received by a single Internet domain. This dataset exposes spamming as a typical network operator for some Internet domain might also witness it. This unique view can help us better understand whether the features of spam that any single network operator observes could be useful in developing more effective filtering techniques.

With these goals in mind and an understanding of the context of our data, we offer the following additional observations on the implications of our results for the design of more effective techniques for spam mitigation, which we revisit in more detail in Section 7. First, the ability to trace the identities of spammers hinges on securing the routing infrastructure. Second, the distribution of spam and botnet activity across IP space suggests that, for some IP address ranges and networks, spam filters might monitor network-wide spam arrival patterns and attribute higher levels of suspicion to spam originating from networks with higher spam activity. Given the highly variable nature of the content of spam messages, incorporating general network-level properties of spam into filters may ultimately provide significant gains over more traditional methods (e.g., content-based filtering), both through increased robustness and the ability to stop spam closer to its source.

The rest of this paper is organized as follows. Section 2 provides background on spamming and an overview of previous related work. In Section 3, we describe our data collection techniques and the datasets we used in our analysis. In Section 4, we study the distribution of spammers, spamming botnets, and legitimate mail senders across IP address space. Section 5 presents our findings regarding the relationship between the spam received at our sinkholes and known spamming bots. Section 6 examines the extent to which spammers use IP addresses that are generally unreachable (e.g., using short-lived BGP route announcements) to send spam untraceably. Based on our findings, Section 7 offers positive recommendations for designing more effective mitigation techniques. We conclude in Section 8.

2. Background and Related Work

This section provides an overview of techniques both for sending and for mitigating spam and discusses related work in these areas.

2.1 Spam: Methods and Mitigation

In this section, we offer background on the main techniques used by spammers to send email, as well as some of the more commonly used mitigation techniques.

2.1.1 Spamming methods

Spammers use various techniques to send large volumes of mail while attempting to remain untraceable. We describe several of these techniques, beginning with “conventional” methods and progressing to more intricate techniques.

Direct spamming. Spammers may purchase upstream connectivity from “spam-friendly ISPs”, which turn a blind eye to the activity. Occasionally, spammers buy connectivity and send spam from ISPs that do not condone this activity and are forced to change ISPs. Ordinarily, changing from one ISP to another would require a spammer to renumber the IP addresses of their mail relays. To remain untraceable and avoid renumbering headaches, spammers sometimes obtain a pool of dispensable dialup IP addresses, send outgoing traffic from a high-bandwidth connection with the IP address spoofed to appear as if it came from the dialup connection, and proxy the reverse traffic through the dialup connection back to the spamming hosts [25].

Open relays and proxies. Open relays are mail servers that allow unauthenticated Internet hosts to connect and relay email through them. Originally intended for user convenience (e.g., to let users send mail from a particular relay while they are traveling or otherwise in a different network), open relays have been exploited by spammers due to the anonymity and amplification offered by the extra level of indirection. It appears that the widespread deployment and use of blacklisting techniques have all but extinguished the use of open relays and proxies to send spam [21, 26].

Botnets. Conventional wisdom suggests that the majority of spam on the Internet today is sent by botnets, collections of machines acting under one centralized controller [3, 4, 31]. The W32/Bobax (“Bobax”) worm (of which there are many variants) exploits the DCOM and LSASS vulnerabilities on Windows systems [18], allows infected hosts to be used as a mail relay, and attempts to spread itself to other machines affected by the above vulnerabilities, as well as over email. This paper studies the network-level properties of spam sent by Bobax drones. Agobot and SDBot are two other bots purported to send spam [12].

BGP spectrum agility. This study has discovered a new type of cloaking mechanism, BGP “spectrum agility”, whereby spammers briefly announce (often hijacked) IP address space from which they send spam and withdraw the routes to that IP address space once the spam has been sent. Although we observed this behavior informally several years ago [6] and subsequent anecdotal evidence has suggested that spammers may use this technique [1], our study thoroughly documents this activity, and further finds that spammers may be using spectrum agility to complement spamming by other methods.

2.1.2 Mitigation techniques

Techniques for mitigating spam are as varied as techniques to send spam, and most existing techniques have significant drawbacks. One of the most widely used anti-spam techniques is filtering, which typically classifies email based on its content; content-based filtering uses features of the contents of an email’s headers or body to determine whether it is likely to be spam. Content-based filters, such as those incorporated by popular spam filters like SpamAssassin [27], successfully reduce the amount of spam that actually reaches a user’s inbox. On the other hand, content-based filtering has drawbacks. Users and system administrators must continually update their filtering rules and use large corpuses of spam for training; in response, spammers devise new ways of altering the contents of an email to circumvent these filters. The cost of evading content-based filters for spammers is negligible, since spammers can easily alter content to attempt to evade these filters.

In addition to performing content-based checks, many mail filters, including SpamAssassin, also perform lookups to determine whether the sending IP address is in a “blacklist”. Blacklists of known spammers, open relays and open proxies remain one of today’s predominant spam filtering techniques. There are more than 30 blacklists in wide use today; each of these lists is separately maintained, and insertion into these lists is based on many different types of observations (e.g., operating an open relay, sending mail to a spam trap, etc.). The results in this paper (in particular, that IP address space is often “stolen” to send spam and that many bot IP addresses are short-lived) indicate that this long-standing method for filtering spam could become much less effective as spammers adopt these more sophisticated techniques.

2.2 Related Work

In this section, we first review previous work that has studied various spamming and spam-mitigation techniques, as well as the behavior of various worms and botnets. We then briefly discuss previous studies of unorthodox routing announcements. Previous work has studied each of these phenomena to some degree in isolation, but this study is the first to perform a joint analysis of spamming behavior, botnet characteristics, and Internet routing to better understand the characteristics and network-level behavior of spammers.

2.2.1 Spam and botnets

Previous studies have investigated the behavior and properties of worms, botnets, and other spam sources. Casado et al. used passive measurements of packet traces captured from about 2,500 spam sources to estimate the bottleneck bandwidths of roughly 25,000 TCP flows from spam sources and found peaks at common bandwidths (e.g., modem speeds) [2]. Kumar et al. deconstructed the source code of the “Witty” worm to estimate various properties about Internet hosts (e.g., host uptime) as well as about the propagation of the worm itself (e.g., who infected whom) [14]. In contrast, our work explores the behavior of spammers in depth, although we also peripherally study malware whose exclusive purpose is to send spam (i.e., the “Bobax” drone).

Several previous and ongoing projects are studying spammers’ attempts to harvest email addresses for the purposes of spamming. For instance, Project Honeypot sinks email traffic for unused MX records and hands out “trap” email addresses to investigate harvesting behavior and to help identify spammers [23]. A previous study has used the data from Project Honeypot to analyze the methods employed by spammers, the time that elapses between the harvesting of an email address and the first spam that address receives, the countries where most harvesting infrastructure is located, and the persistence (across time) of various harvesters [22]. We present preliminary results from a similar study in a technical report version of this paper [24].

In Section 5, we correlate spam arrivals with traces of hosts known to be infected with malware. Moore et al. found that the majority of hosts (and more than 80% of the hosts in Asia) did not patch the relevant vulnerability until well after the actual outbreak [19], which makes it more reasonable to assume that IP addresses of Bobax drones remain infected for the duration of our spam trace.

2.2.2 Mitigation

A recent presentation from the SpamAssassin project discusses several techniques that the SpamAssassin spam filtering tool has incorporated to detect forged X-Mailer headers, weak “hash-busting” schemes, etc. [17]. Although their work also involves reverse engineering, the project focuses on analyzing mail contents to reverse-engineer spamming tools and techniques (with the goal of using this analysis to incorporate better content-filtering rules into SpamAssassin). Though our paper also studies such properties of spam, our analysis hinges on network-level properties (for instance, the IP address of the last remote mail relay, which previous work has also observed as one of the few parts of the SMTP header that cannot be forged [10]) rather than the artifacts of spamming software that appear in email content.

Jung et al. performed a study of DNS blacklist (DNSBL) traffic and the effectiveness of blacklists [13] and observed that 80% of the IP addresses that were sending spam were listed in DNSBLs two months after the collection of the traffic trace. Our study also measures the effectiveness of DNSBLs, albeit in real time: we examine whether a host IP is listed in a set of DNSBLs at the time the host spammed our domain. While we also find that about 80% of the received spam was sent from hosts listed in at least one of eight blacklists, hosts that employ spamming techniques such as BGP spectrum agility tend to be listed in far fewer blacklists. We also find that even the most aggressive blacklist has a false negative rate of about 50%.
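For readers unfamiliar with how a DNSBL lookup is formed, the following minimal sketch builds the DNS name that is queried for an IPv4 address under the common DNSBL convention (octets reversed and prepended to the blacklist zone). The zone name dnsbl.example.org is a placeholder, the function name is hypothetical, and the sketch deliberately stops short of issuing a DNS query; it is not a description of how Mail Avenger implements its lookups.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to check an IPv4 address against a DNSBL zone.

    By convention, an address a.b.c.d is looked up as d.c.b.a.<zone>;
    an answer to that query indicates that the address is listed.
    """
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


if __name__ == "__main__":
    # Placeholder zone; a real deployment would use the zone of an actual blacklist.
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Checking an address at the time of mail receipt then amounts to resolving one such name per blacklist and recording which zones returned an answer.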

2.2.3 Unorthodox route announcements

Feamster et al. studied route advertisements for “bogon” IP address space (i.e., private address space or unassigned addresses) [8]. However, since bogus or reserved address ranges are well-known, transit ISPs often filter them, resulting in little or no spam from such ranges. Cursory studies have suggested that spammers advertise routes to hijacked IP prefixes for short amounts of time to send spam [6, 28, 30]. In Section 6, we quantify the extent to which the sending of spam coincides with short-lived BGP route announcements for IP prefixes containing the mail relays that send spam.

3. Data Collection

This section describes the datasets that we use in our analysis. Our primary dataset consists of the actual spam email messages collected at a large spam sinkhole. To study the specific characteristics of certain subsets of spammers, we augment this dataset with three other data sources. First, to compare the network-level characteristics of spam received at our sinkhole with similar characteristics of legitimate email traffic, we obtain a corpus of email logs from a large email provider who automatically rejects email likely to be spam (thus allowing us to distinguish legitimate mail from spam). Second, we intercept the “command and control” traffic from a Bobax botnet at a sinkhole to identify IP addresses that were infected with the Bobax worm (and, hence, are likely members of botnets that are used for the sole purpose of sending spam). Third, we collect BGP routing data at the upstream border router of the same network where we are receiving spam and monitor the routing activity for the IP prefixes corresponding to the IP addresses from which spam was sent.

Figure 1: The amount of spam received per day at our sinkhole from August 2004 through December 2005. [Plot of Count versus Day, with two series: Spam and Distinct IPs.]

3.1 Spam Email Traces

To obtain a sample of spam, we registered a domain with no legitimate email addresses and established a DNS Mail Exchange (MX) record for it. Hence, all mail received by this server is spam. The “sinkhole” has been capturing spam since August 5, 2004. Figure 1 shows the amount of spam that this sinkhole received per day through January 6, 2006 (the period of time over which we conduct our analysis). Although the total amount of spam received on any given day is rather erratic, the data indicates two unsettling trends. First, the amount of spam that the sinkhole is receiving generally appears to be increasing. Second, and perhaps more troubling, the number of distinct IP addresses from which we see spam on any given day also appears to be on the rise.

In addition to simply collecting spam traces, the sinkhole runs Mail Avenger [16], a customizable Simple Mail Transfer Protocol (SMTP) server that allows us to take specific actions upon receiving email from a mail relay (e.g., running traceroute to the mail relay sending the mail, performing DNSBL lookups for the relay’s IP address, performing a passive TCP fingerprint of the relay). We have configured Mail Avenger to (1) accept all mail, regardless of the username for which the mail was destined and (2) gather network-level properties about the mail relay from which spam is received. In particular, the mail server collects the following information about the mail relay when the spam is received:

• the IP address of the relay that established the SMTP connection to the sinkhole

• a traceroute to that IP address, to help us estimate the network location of the mail relay

• a passive “p0f” TCP fingerprint, based on properties of the TCP stack, to allow us to determine the operating system of the mail relay

• the result of DNS blacklist (DNSBL) lookups for that mail relay at eight different DNSBLs.

Note that, unlike many features of the SMTP header, these features are not easily forged.
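The four items collected for each received message can be pictured as one record per SMTP connection. The sketch below is only an illustration of such a record; the class name, field names, and types are assumptions made for this example and do not reflect Mail Avenger’s actual log format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class RelayObservation:
    """Hypothetical per-message record of the network-level properties listed above."""

    received_at: datetime                  # time the SMTP connection was accepted
    relay_ip: str                          # IP address that connected to the sinkhole
    traceroute_hops: List[str] = field(default_factory=list)   # hop addresses toward the relay
    p0f_os_guess: str = "unknown"          # operating system inferred from the TCP fingerprint
    dnsbl_listed: Dict[str, bool] = field(default_factory=dict)  # blacklist zone -> listed?

    def blacklists_listing(self) -> int:
        """Number of blacklists that listed the relay when the mail arrived."""
        return sum(1 for listed in self.dnsbl_listed.values() if listed)
```

Keeping the lookup results with the message, rather than querying later, matches the real-time measurement described above: a relay’s listing status can change after the spam has been sent.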

3.2 Legitimate Email Traces

One of the motivations for our study was to determine whether the network-level characteristics of spam differ markedly from those of legitimate email. To perform this comparison, we obtained a corpus of mail logs from a large email provider that runs a Postfix mail server. Because this provider manages millions of mailboxes, it performs extensive spam filtering at its incoming SMTP servers. Accordingly, the logs for this mail server record, for each SMTP connection attempt, the time at which the connection attempt was made, the IP address of the connecting host, whether the mail was accepted or rejected, and, if the email was rejected, the reason for rejection. Using these logs, we can estimate the network-level properties of email that this domain deems to be legitimate. We performed our analysis over approximately 700,000 pieces of legitimate mail, as received at this provider’s mail server on June 13, 2006. Although the corpus of legitimate mail is from a different domain than our sinkhole, both the spam sinkhole and the domain for legitimate email constitute large, domain-wide data sources for spam and legitimate mail, respectively, and are representative samples of spam and legitimate email that could be expected at any Internet domain.

3.3 Botnet Command and Control Data

To identify a set of hosts that are sending email from botnets, we used a trace of hosts infected by the W32/Bobax (“Bobax”) worm from April 28-29, 2005. This trace was captured by hijacking the authoritative DNS server for the domain running the command and control of the botnet and redirecting it to a machine at a large campus network. This method was only possible because (1) the Bobax drones contacted a centralized controller using a domain name, and (2) the researchers who obtained the trace were able to obtain the trust of the network operators hosting the authoritative DNS for that domain name. This technique directs control of the botnet to the honeypot, which effectively disables it for spamming for this period. On the upside, because all Bobax drones now attempt to contact our command-and-control sinkhole rather than the intended command-and-control host, we can collect a packet trace to determine the members of the botnet.

To obtain a sample of spamming behavior from known botnets, we correlate Bobax botnet membership from the 1.5-day trace of Bobax drones with the IP addresses from which we receive spam in the sinkhole trace. This technique, of course, is not perfect: over the course of our spam trace, hosts may be patched. Although we cannot precisely determine the extent to which the transience of bots affects our analysis, previous work suggests that, even for highly publicized worms, the rate at which vulnerable hosts are patched is slow enough to expect that many of these infected hosts remain unpatched [19]. We also acknowledge another shortcoming of our approach: if hosts use dynamic addressing, different hosts (some of which may be Bobax-infected and some of which may not be) may use one of the IP addresses observed in the Bobax trace. However, we believe that the resulting inaccuracies are small: we observe a significantly higher percentage of Windows hosts in the subset of spam messages sent by IP addresses in our Bobax trace than in the complete spam dataset, which indirectly suggests that the hosts with IP addresses from the Bobax trace were indeed part of a spamming botnet when they spammed our sinkhole.
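The correlation and the sanity check described above can be sketched as follows. The sketch assumes spam records have been reduced to (relay IP, operating system guess) pairs and that the Bobax trace is available as a set of IP addresses; the names, the record layout, and the string test for Windows are illustrative assumptions rather than the procedure actually used in the study.

```python
from typing import Iterable, List, Set, Tuple

# (relay_ip, os_guess) pairs; this record layout is an assumption for the sketch.
SpamRecord = Tuple[str, str]


def split_by_bot_membership(
    records: Iterable[SpamRecord], bobax_ips: Set[str]
) -> Tuple[List[SpamRecord], List[SpamRecord]]:
    """Split spam records into those sent from IPs seen in the Bobax trace and the rest."""
    from_bots: List[SpamRecord] = []
    from_others: List[SpamRecord] = []
    for record in records:
        relay_ip = record[0]
        (from_bots if relay_ip in bobax_ips else from_others).append(record)
    return from_bots, from_others


def windows_fraction(records: List[SpamRecord]) -> float:
    """Fraction of records whose fingerprinted operating system is a Windows variant."""
    if not records:
        return 0.0
    windows = sum(1 for _, os_guess in records if os_guess.lower().startswith("windows"))
    return windows / len(records)
```

A noticeably higher windows_fraction for the Bobax subset than for the complete dataset would be the indirect evidence the paragraph above refers to.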

3.4 BGP Routing Measurements

In this paper, we study whether an IP address of the mail relay from which we receive spam is reachable and how long it remains reachable. We are particularly interested in cases where a route for an IP address is reachable for only a short period of time, coinciding with the time at which spam was sent. To measure network-layer reachability from the network where spam was received, we co-located a “BGP monitor” in the same network as our spam sinkhole, similar to that in our previous work [7]. The monitor receives BGP updates from the border router, and our analysis includes a BGP update stream that overlaps with our spam trace. Since the monitor has an internal BGP session to the network’s border router, it will see only those BGP updates that cause a change in the border router’s choice of best route to a prefix. Despite not observing all BGP updates, the monitor receives enough information to allow us to study the properties of short-lived BGP route announcements: the monitor will have no route to the prefix at all if the prefix is unreachable.
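The reachability check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used for the measurements: the `(timestamp, kind, prefix)` event tuples, the function names `reachable_intervals` and `short_lived_matches`, the sample prefixes, and the one-day threshold for "short-lived" are all hypothetical choices made for the example.

```python
import ipaddress
from collections import defaultdict

def reachable_intervals(events, end_of_trace):
    """Turn a time-ordered stream of ("A" announce / "W" withdraw) events
    into per-prefix (start, end) intervals during which a route existed."""
    open_since = {}                 # prefix -> time the current route appeared
    intervals = defaultdict(list)   # prefix -> list of (start, end)
    for ts, kind, prefix in events:
        net = ipaddress.ip_network(prefix)
        if kind == "A":
            # A re-announcement while a route exists only changes the best
            # route; the prefix stays reachable, so keep the original start.
            open_since.setdefault(net, ts)
        elif kind == "W" and net in open_since:
            intervals[net].append((open_since.pop(net), ts))
    # Routes still present at the end of the trace stay open until then.
    for net, start in open_since.items():
        intervals[net].append((start, end_of_trace))
    return intervals

def short_lived_matches(intervals, spam_ip, spam_ts, max_lifetime):
    """Return (prefix, start, end) for intervals that cover spam_ip at time
    spam_ts and lasted no longer than max_lifetime seconds."""
    addr = ipaddress.ip_address(spam_ip)
    hits = []
    for net, spans in intervals.items():
        if addr not in net:
            continue
        for start, end in spans:
            if start <= spam_ts < end and end - start <= max_lifetime:
                hits.append((str(net), start, end))
    return hits

# Hypothetical update stream (timestamps in seconds since trace start).
events = [
    (0,   "A", "10.0.0.0/8"),     # long-lived route
    (100, "A", "192.0.2.0/24"),   # appears briefly...
    (400, "W", "192.0.2.0/24"),   # ...and is withdrawn five minutes later
]
intervals = reachable_intervals(events, end_of_trace=100_000)
print(short_lived_matches(intervals, "192.0.2.25", 250, max_lifetime=86_400))
```

Because the monitor only sees best-route changes, the sketch treats a repeated announcement as "still reachable" rather than as a new interval; only a withdrawal closes an interval.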

4. Network-level Characteristics of Spammers

In this section, we study some first-order network-level characteristics of spam sources. We survey the portions of IP address space from which our sinkhole received spam and the ASes that sent spam to the sinkhole. We also observe the persistence of these characteristics over time. To determine whether these network-level characteristics could be suitable for filtering spam, we compare the network-level characteristics of spam to the same characteristics for legitimate email, as received at a large domain that manages approximately 40 million mailboxes.

We find that the distribution of spam across IP address space is (1) nearly identical to the legitimate mail distributions (with a few exceptions), and (2) quite persistent over time. Still, the distribution of spam senders across IP address space is far from uniform, and spam arrival by IP address range is much more pronounced, persistent, and concentrated than similar characteristics by IP address. Additionally, we find that a large fraction of spam is received from just a handful of ASes: nearly 12% of all received spam originates from mail relays in just two ASes (from Korea and China, respectively), and the top 20 ASes are responsible for sending nearly 37% of all spam. This distribution (as well as the main perpetrators) is also persistent over time. This heavily skewed distribution suggests that spam filtering efforts might better focus on identifying high-volume, persistent groups of spammers (e.g., by AS number), rather than on blacklisting individual IP addresses, many of which are transient.
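A concentration figure such as "the top 20 ASes send nearly 37% of all spam" amounts to a top-k share over per-AS message counts. The following Python sketch shows one way to compute it. It assumes each message has already been mapped to an origin AS (for example, by a longest-prefix match against a routing table, which is not shown); the function name `top_k_share` and the AS numbers, taken from the range reserved for documentation, are hypothetical.

```python
from collections import Counter

def top_k_share(as_per_message, k):
    """Fraction of messages that originate from the k most active ASes.

    as_per_message holds one origin AS number per received message.
    """
    counts = Counter(as_per_message)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total

# Hypothetical data: ten messages from three ASes.
messages = [64496] * 6 + [64497] * 3 + [64498] * 1
print(top_k_share(messages, 2))  # share of the two most active ASes
```

Recomputing the same share over successive time windows would give a simple check of whether the skew, and the set of top ASes, persists over time.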

4.1 Distribution Across Networks

To determine the address space from which spam was arriving (“prevalence”) and whether the distribution across IP addresses changes over time (“persistence”), we tabulated the spam in our trace by IP address space. We find that spam arrivals across IP space are far from uniform.

Finding 4.1 (Distribution across IP address space) The majority of spam is sent from a relatively small fraction of IP address space.

Figure 2 shows the number of spam email messages received over the course of the entire trace, as a function of IP address space. Several ranges of IP address space originate a large amount of email traffic (both spam and legitimate), including space allocated to cable modem providers (e.g., 24.*) and the address space allocated to the Asia Pacific Network Information Center (APNIC) regional Internet registry (e.g., 61.*). Although most IP address ranges that originate a significant amount of spam also originate a lot of legitimate mail traffic, a few IP address ranges have significantly more spam than legitimate mail (e.g., 80.*–90.*), and vice versa (e.g., 60.*–70.*). This characteristic suggests that it may be possible to use IP address ranges to distinguish spam from legitimate email.
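Tabulating mail by position in IP address space can be sketched as follows. This is a minimal illustration rather than the analysis code behind Figure 2: it bins sender addresses by /24 and accumulates a CDF in address order, and the helper names `slash24` and `cdf_over_ip_space` and the sample addresses are made up for the example.

```python
import ipaddress
from collections import Counter

def slash24(ip):
    """Map an IPv4 address to the integer base address of its /24."""
    return int(ipaddress.IPv4Address(ip)) & 0xFFFFFF00

def cdf_over_ip_space(sender_ips):
    """Return [(prefix, cumulative_fraction)] ordered by position in IP space.

    sender_ips holds one sender address per received message.
    """
    counts = Counter(slash24(ip) for ip in sender_ips)
    total = sum(counts.values())
    running, points = 0, []
    for base in sorted(counts):
        running += counts[base]
        points.append((f"{ipaddress.IPv4Address(base)}/24", running / total))
    return points

# Hypothetical sender addresses, one per message.
spam_senders = ["24.1.2.3", "24.1.2.9", "61.5.5.5", "200.7.7.7"]
for prefix, frac in cdf_over_ip_space(spam_senders):
    print(prefix, frac)
```

Running the same function over a legitimate-mail trace and comparing the two curves would highlight address ranges where one curve rises much faster than the other.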

Figure 2: Fraction of spam email messages and comparison with legitimate email received (as a function of IP address space); also, fraction of client IP addresses that sent spam, binned by /24. [Plot: CDF over IP space from 0.0.0.0 to 240.0.0.0; series: legitimate email, spam, spamming IPs.]

Figure 3: The number of distinct times that each client IP sent mail to our sinkhole (regardless of the number of emails sent in each batch). [Plot: fraction of clients versus number of appearances, from 1 to 100000.]

We repeated the analysis of the network-level characteristics of spam per day across months, per month across years, and so forth. We also compared the distribution of spam collected at our sinkhole to the distribution of rejected SMTP connections at the domain where we performed our analysis of legitimate email and found that the distribution of these connections across IP address space is similar to that shown in Figure 2. All of these distributions have remained roughly constant over time (i.e., the results look similar to those shown in Figure 2). In contrast, individual IP addresses are far more transient. Figure 3 shows that even though a few IP addresses sent more than 10,000 emails, about 85% of client IP addresses sent fewer than 10 emails to the sinkhole, indicating that targeting an individual IP address might not help mitigate spam without sharing information across domains. This finding has an important implication for spam filter design: though the individual IP addresses from which spam is received change from day to day, the fact that spam continually comes from the same IP address space suggests that incorporating these more persistent features may be more effective, particularly in portions of the IP address space that send either mostly spam or mostly legitimate email.
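The transience of individual addresses can be quantified with a per-address tally. The sketch below is illustrative only: it assumes a log with one entry per appearance of a client IP, and the function name `fraction_below`, the threshold, and the sample addresses (drawn from documentation ranges) are hypothetical.

```python
from collections import Counter

def fraction_below(appearances, threshold):
    """Fraction of distinct client IPs seen fewer than `threshold` times.

    appearances holds one client IP string per observed appearance.
    """
    counts = Counter(appearances)   # ip -> number of appearances
    if not counts:
        return 0.0
    rare = sum(1 for n in counts.values() if n < threshold)
    return rare / len(counts)

# Hypothetical log: one heavy sender and three rarely seen addresses.
log = (["198.51.100.1"] * 12 + ["198.51.100.2"] * 2
       + ["203.0.113.7"] + ["203.0.113.8"] * 3)
print(fraction_below(log, 10))
```

A high value for a small threshold indicates that most addresses are seen too rarely for a per-address blacklist built from a single vantage point to be useful.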

In many cases, IP address ranges are not adequate for distinguishing spam from legitimate email. To determine whether other network-level properties, such as the AS from which the email was sent, could serve as better classifiers, we examined the distribution of spam across ASes and compared this feature to the distribution of legitimate email across ASes.

Finding 4.2 (Distribution across ASes) More than 10% of spam received at our sinkhole originated from mail relays in two ASes,


