
Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

TL;DR: The main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively.
Abstract: In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

Summary

Introduction

  • In this paper the authors set out to examine the differences between the intrusion detection domain and other areas where machine learning is used with more success.
  • In addition, the authors identify further characteristics that their domain exhibits that are not well aligned with the requirements of machine-learning.
  • By “machine-learning” the authors mean algorithms that are first trained with reference input to “learn” its specifics (either supervised or unsupervised), to then be deployed on previously unseen input for the actual detection process.

II. MACHINE LEARNING IN INTRUSION DETECTION

  • Anomaly detection systems find deviations from expected behavior.
  • To capture normal activity, IDES (and its successor NIDES [10]) used a combination of statistical metrics and profiles.
  • Since then, many more approaches have been pursued.
  • Often, they borrow schemes from the machine learning community, such as information theory [11], neural networks [12], support vector machines [13], genetic algorithms [14], artificial immune systems [15], and many more.
  • The authors' discussion in this paper aims to develop a different general point: that much of the difficulty with anomaly detection systems stems from using tools borrowed from the machine learning community in inappropriate ways.

III. CHALLENGES OF USING MACHINE LEARNING

  • It can be surprising at first to realize that despite extensive academic research efforts on anomaly detection, the success of such systems in operational environments has been very limited.
  • The authors believe that this “success discrepancy” arises because the intrusion detection domain exhibits particular characteristics that make the effective deployment of machine learning approaches fundamentally harder than in many other contexts.
  • In the following the authors identify these differences, with an aim of raising the community’s awareness of the unique challenges anomaly detection faces when operating on network traffic.
  • The authors note that their examples from other domains are primarily for illustration, as there is of course a continuous spectrum for many of the properties discussed (e.g., spam detection faces a similarly adversarial environment as intrusion detection does).
  • Based on discussions with colleagues who work with machine learning on a daily basis, the authors believe these intuitive arguments match well with what a more formal analysis would yield.

A. Outlier Detection

  • Fundamentally, machine-learning algorithms excel much better at finding similarities than at identifying activity that does not belong there: the classic machine learning application is a classification problem, rather than discovering meaningful outliers as required by an anomaly detection system [21].
  • Amazon's product recommendation system [3] employs collaborative filtering, matching each of a user's purchased (or positively rated) items with other similar products, where similarity is determined by products that tend to be bought together.
  • The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption.
  • Originally proposed by Graham [8], Bayesian frameworks trained with large corpora of both spam and ham have evolved into a standard tool for reliably identifying unsolicited mail.
  • The observation that machine learning works much better for such true classification problems then leads to the conclusion that anomaly detection is likely in fact better suited for finding variations of known attacks, rather than previously unknown malicious activity.

B. High Cost of Errors

  • In intrusion detection, the relative cost of any misclassification is extremely high compared to many other machine learning applications.
  • While for the seller a good recommendation has the potential to increase sales, a bad choice rarely hurts beyond a lost opportunity to have made a more enticing recommendation.
  • Spelling and grammar checkers are commonly employed to clean up results, weeding out the obvious mistakes.
  • Spam detection faces a highly unbalanced cost model: false positives (i.e., ham declared as spam) can prove very expensive, but false negatives (spam not identified as such) do not have a significant impact.
  • Overall, an anomaly detection system faces a much more stringent limit on the number of errors that it can tolerate.

C. Semantic Gap

  • Anomaly detection systems face a key challenge of transferring their results into actionable reports for the network operator.
  • Unfortunately, in the intrusion detection community the authors find a tendency to limit the evaluation of anomaly detection systems to an assessment of a system’s capability to reliably identify deviations from the normal profile.
  • When addressing the semantic gap, one consideration is the incorporation of local security policies.
  • Returning to the P2P example, when examining only NetFlow records, it is hard to imagine how one might spot inappropriate content.
  • As another example, consider exfiltration of personally identifying information (PII).

D. Diversity of Network Traffic

  • Network traffic often exhibits much more diversity than people intuitively expect, which leads to misconceptions about what anomaly detection technology can realistically achieve in operational environments.
  • Wright et al. [27] infer the language spoken on encrypted VOIP sessions.
  • However these examples all demonstrate the power of exploiting structural knowledge informed by very careful examination of the particular domain of study—results not obtainable by simply expecting an anomaly detection system to develop inferences about “peculiar” activity.
  • While highly variable over small-to-medium time intervals, traffic properties tend to greater stability when observed over longer time periods (hours to days, sometimes weeks).
  • Finally, the authors note that traffic diversity is not restricted to packet-level features, but extends to application-layer information as well, both in terms of syntactic and semantic variability.

E. Difficulties with Evaluation

  • For an anomaly detection system, a thorough evaluation is particularly crucial to perform, as experience shows that many promising approaches turn out in practice to fall short of one’s expectations.
  • The two publicly available datasets that have provided something of a standardized setting in the past—the DARPA/Lincoln Labs packet traces [41], [42] and the KDD Cup dataset derived from them [43]—are now a decade old, and no longer adequate for any current study.
  • It is understandable that in the face of such high risks, researchers frequently encounter insurmountable organizational and legal barriers when they attempt to provide datasets to the community.
  • The authors argue that when evaluating an anomaly detection system, understanding the system’s semantic properties— the operationally relevant activity that it can detect, as well as the blind spots every system will necessarily have— is much more valuable than identifying a concrete set of parameters for which the system happens to work best for a particular input.
  • Exploiting the specifics of a machine learning implementation requires significant effort, time, and expertise on the attacker’s side.

IV. RECOMMENDATIONS FOR USING MACHINE LEARNING

  • The authors note that they view these guidelines as touchstones rather than as firm rules; there is certainly room for further discussion within the wider intrusion detection community.
  • If the authors could give only one recommendation on how to improve the state of anomaly detection research, it would be: Understand what the system is doing.
  • The nature of their domain is such that one can always find a variation that works slightly better than anything else in a particular setting.
  • The point the authors wish to convey however is that they are working in an area where insight matters much more than just numerical results.

A. Understanding the Threat Model

  • Before starting to develop an anomaly detector, one needs to consider the anticipated threat model, as that establishes the framework for choosing trade-offs.
  • Operation in a small network faces very different challenges than for a large enterprise or backbone network; academic environments impose different requirements than commercial enterprises.
  • Possible answers range from “very little” to “lethal.”
  • The degree to which attackers might analyze defense techniques and seek to circumvent them determines the robustness requirements for any detector.

B. Keeping The Scope Narrow

  • A common pitfall is starting with the premise of using machine learning (or, worse, a particular machine-learning approach) and then looking for a problem to solve.
  • A key question is identifying the feature set the detector will work with: insight into the features’ significance (in terms of the domain) and capabilities (in terms of revealing the targeted activity) goes a long way towards reliable detection.
  • Laying out the land like this sets the stage for a well-grounded study.

C. Reducing the Costs

  • Per the discussion in Section III-B, it follows that one obtains enormous benefit from reducing the costs associated with using an anomaly detection system.
  • As the authors have seen, an anomaly detection system does not necessarily make more mistakes than machine learning systems deployed in other domains—yet the high cost associated with each error often conflicts with effective operation.
  • Likely the most important step towards fewer mistakes is reducing the system’s scope, as discussed in Section IV-B.
  • The setup of the underlying machine-learning problem also has a direct impact on the number of false positives.
  • As a simple flow-level example, the set of destination ports a particular internal host contacts will likely fluctuate quite a bit for typical client systems; but the authors might often find the set of ports on which it accepts incoming connections to be stable over extended periods of time (see the sketch below).
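
A minimal sketch of that flow-level idea (ours, not from the paper; the host/port values and the alert policy are illustrative assumptions): learn which ports a host accepts connections on during a training window, then alert only on a previously unseen listening port, rather than on the naturally fluctuating set of outbound destination ports.

```python
from collections import defaultdict

# host -> set of ports on which it has been seen accepting connections
listening_ports: dict[str, set[int]] = defaultdict(set)

def train(host: str, port: int) -> None:
    """Record an accepted incoming connection during the training window."""
    listening_ports[host].add(port)

def check(host: str, port: int) -> bool:
    """Return True (alert) if the host accepts a connection on a port not
    seen during training -- a feature the text expects to stay stable,
    unlike the outbound destination ports of typical client machines."""
    return port not in listening_ports[host]

train("10.0.0.5", 22)
train("10.0.0.5", 443)
print(check("10.0.0.5", 443))   # False: known service
print(check("10.0.0.5", 6667))  # True: new listening port -> alert
```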

D. Evaluation

  • When evaluating an anomaly detection system, the primary objective should be to develop insight into the system’s capabilities.
  • The authors discuss evaluation separately in terms of working with data, and interpreting results.
  • No dataset is perfect: measurements often include artifacts that can impact the results (such as filtering or unintended loss), or unrelated noise that one can safely filter out if readily identified (e.g., an internal vulnerability scan run by the security department).
  • Thus, machine learning can sometimes serve very effectively to “point the way” to how to develop detectors that are themselves based on different principles.
  • The successful operation of an anomaly detection system typically requires significant experience with the particular system, as it needs to be tuned to the local setting—experience that can prove cumbersome to collect if the underlying objective is instead to understand the new system.

V. CONCLUSION

  • The authors’ work examines the surprising imbalance between the extensive amount of research on machine learning-based anomaly detection pursued in the academic intrusion detection community, versus the lack of operational deployments of such systems.
  • The authors argue that this discrepancy stems in large part from specifics of the problem domain that make it significantly harder to apply machine learning effectively than in many other areas of computer science where such schemes are used with greater success.
  • It is crucial to acknowledge that the nature of the domain is such that one can always find schemes that yield marginally better ROC curves than anything else has for a specific given setting.
  • Such results, however, do not contribute to the progress of the field without a semantic understanding of the gain.


Outside the Closed World:
On Using Machine Learning For Network Intrusion Detection

Robin Sommer
International Computer Science Institute, and
Lawrence Berkeley National Laboratory

Vern Paxson
International Computer Science Institute, and
University of California, Berkeley
Abstract—In network intrusion detection research, one popular strategy for finding attacks is monitoring a network’s activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational “real world” settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

Keywords—anomaly detection; machine learning; intrusion detection; network security.
I. INTRODUCTION

Traditionally, network intrusion detection systems (NIDS) are broadly classified based on the style of detection they are using: systems relying on misuse-detection monitor activity with precise descriptions of known malicious behavior, while anomaly-detection systems have a notion of normal activity and flag deviations from that profile.^1 Both approaches have been extensively studied by the research community for many years. However, in terms of actual deployments, we observe a striking imbalance: in operational settings, of these two main classes we find almost exclusively only misuse detectors in use—most commonly in the form of signature systems that scan network traffic for characteristic byte sequences.

This situation is somewhat striking when considering the success that machine-learning—which frequently forms the basis for anomaly-detection—sees in many other areas of computer science, where it often results in large-scale deployments in the commercial world. Examples from other domains include product recommendation systems such as used by Amazon [3] and Netflix [4]; optical character recognition systems (e.g., [5], [6]); natural language translation [7]; and also spam detection, as an example closer to home [8].

In this paper we set out to examine the differences between the intrusion detection domain and other areas where machine learning is used with more success. Our main claim is that the task of finding attacks is fundamentally different from other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We believe that a significant part of the problem already originates in the premise, found in virtually any relevant textbook, that anomaly detection is suitable for finding novel attacks; we argue that this premise does not hold with the generality commonly implied. Rather, the strength of machine-learning tools is finding activity that is similar to something previously seen, without the need however to precisely describe that activity up front (as misuse detection must).

In addition, we identify further characteristics that our domain exhibits that are not well aligned with the requirements of machine-learning. These include: (i) a very high cost of errors; (ii) lack of training data; (iii) a semantic gap between results and their operational interpretation; (iv) enormous variability in input data; and (v) fundamental difficulties for conducting sound evaluation. While these challenges may not be surprising for those who have been working in the domain for some time, they can be easily lost on newcomers. To address them, we deem it crucial for any effective deployment to acquire deep, semantic insight into a system’s capabilities and limitations, rather than treating the system as a black box as unfortunately often seen.

We stress that we do not consider machine-learning an inappropriate tool for intrusion detection. Its use requires care, however: the more crisply one can define the context in which it operates, the better promise the results may hold. Likewise, the better we understand the semantics of the detection process, the more operationally relevant the system will be. Consequently, we also present a set of guidelines meant to strengthen future intrusion detection research.

Throughout the discussion, we frame our mindset around the goal of using an anomaly detection system effectively in the “real world”, i.e., in large-scale, operational environments. We focus on network intrusion detection as that is our main area of expertise, though we believe that similar arguments hold for host-based systems. For ease of exposition we will use the term anomaly detection somewhat narrowly to refer to detection approaches that rely primarily on machine-learning. By “machine-learning” we mean algorithms that are first trained with reference input to “learn” its specifics (either supervised or unsupervised), to then be deployed on previously unseen input for the actual detection process. While our terminology is deliberately a bit vague, we believe it captures what many in the field intuitively associate with the term “anomaly detection”.

We structure the remainder of the paper as follows. In Section II, we begin with a brief discussion of machine learning as it has been applied to intrusion detection in the past. We then in Section III identify the specific challenges machine learning faces in our domain. In Section IV we present recommendations that we hope will help to strengthen future research, and we briefly summarize in Section V.

^1 Other styles include specification-based [1] and behavioral detection [2]. These approaches focus respectively on defining allowed types of activity in order to flag any other activity as forbidden, and analyzing patterns of activity and surrounding context to find secondary evidence of attacks.
II. MACHINE LEARNING IN INTRUSION DETECTION
Anomaly detection systems find deviations from expected behavior. Based on a notion of normal activity, they report deviations from that profile as alerts. The basic assumption underlying any anomaly detection system—malicious activity exhibits characteristics not observed for normal usage—was first introduced by Denning in her seminal work on the host-based IDES system [9] in 1987. To capture normal activity, IDES (and its successor NIDES [10]) used a combination of statistical metrics and profiles. Since then, many more approaches have been pursued. Often, they borrow schemes from the machine learning community, such as information theory [11], neural networks [12], support vector machines [13], genetic algorithms [14], artificial immune systems [15], and many more. In our discussion, we focus on anomaly detection systems that utilize such machine learning approaches.

Chandola et al. provide a survey of anomaly detection in [16], including other areas where similar approaches are used, such as monitoring credit card spending patterns for fraudulent activity. While in such applications one is also looking for outliers, the data tends to be much more structured. For example, the space for representing credit card transactions is of relatively low dimensionality and semantically much more well-defined than network traffic [17].

Anomaly detection approaches must grapple with a set of well-recognized problems [18]: the detectors tend to generate numerous false positives; attack-free data for training is hard to find; and attackers can evade detection by gradually teaching a system to accept malicious activity as benign. Our discussion in this paper aims to develop a different general point: that much of the difficulty with anomaly detection systems stems from using tools borrowed from the machine learning community in inappropriate ways.

Compared to the extensive body of research, anomaly detection has not obtained much traction in the “real world”. Those systems found in operational deployment are most commonly based on statistical profiles of heavily aggregated traffic, such as Arbor’s Peakflow [19] and Lancope’s StealthWatch [20]. While highly helpful, such devices operate with a much more specific focus than with the generality that research papers often envision.^2 We see this situation as suggestive that many anomaly detection systems from the academic world do not live up to the requirements of operational settings.

^2 We note that for commercial solutions it is always hard to say what they do exactly, as specifics of their internals are rarely publicly available.
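
To make the notion of a statistical profile concrete, the following sketch shows the flavor of per-metric profiling that Denning-style systems perform. It is our illustration, not code from the paper or from IDES/NIDES; the metric, the class name `MetricProfile`, and the 3-sigma threshold are all illustrative assumptions.

```python
import math

class MetricProfile:
    """Minimal Denning-style statistical profile for one metric.

    Maintains a running mean/variance (Welford's algorithm) and flags
    observations deviating by more than `k` standard deviations.
    All parameter choices here are illustrative, not from the paper.
    """
    def __init__(self, k=3.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        """Fold a new observation into the profile (training)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomalous(self, x):
        """Return True if x lies outside mean +/- k * stddev."""
        if self.n < 2:
            return False  # not enough data to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) > self.k * std

# Example: profile logins-per-hour for a host, then test new values.
profile = MetricProfile(k=3.0)
for logins in [4, 6, 5, 7, 5, 6, 4, 5]:   # "normal" training data
    profile.update(logins)
print(profile.is_anomalous(5))    # False: within profile
print(profile.is_anomalous(40))   # True: large deviation -> alert
```

Even this toy version exhibits the core weakness discussed throughout the paper: anything outside the learned profile gets flagged, whether malicious or merely unseen.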
III. CHALLENGES OF USING MACHINE LEARNING
It can be surprising at first to realize that despite extensive academic research efforts on anomaly detection, the success of such systems in operational environments has been very limited. In other domains, the very same machine learning tools that form the basis of anomaly detection systems have proven to work with great success, and are regularly used in commercial settings where large quantities of data render manual inspection infeasible. We believe that this “success discrepancy” arises because the intrusion detection domain exhibits particular characteristics that make the effective deployment of machine learning approaches fundamentally harder than in many other contexts.

In the following we identify these differences, with an aim of raising the community’s awareness of the unique challenges anomaly detection faces when operating on network traffic. We note that our examples from other domains are primarily for illustration, as there is of course a continuous spectrum for many of the properties discussed (e.g., spam detection faces a similarly adversarial environment as intrusion detection does). We also note that we are network security researchers, not experts on machine-learning, and thus we argue mostly at an intuitive level rather than attempting to frame our statements in the formalisms employed for machine learning. However, based on discussions with colleagues who work with machine learning on a daily basis, we believe these intuitive arguments match well with what a more formal analysis would yield.
A. Outlier Detection
Fundamentally, machine-learning algorithms excel much better at finding similarities than at identifying activity that does not belong there: the classic machine learning application is a classification problem, rather than discovering meaningful outliers as required by an anomaly detection system [21]. Consider product recommendation systems such as that used by Amazon [3]: it employs collaborative filtering, matching each of a user’s purchased (or positively rated) items with other similar products, where similarity is determined by products that tend to be bought together. If the system instead operated like an anomaly detection system, it would look for items that are typically not bought together—a different kind of question with a much less clear answer, as, according to [3], many product pairs have no common customers.

In some sense, outlier detection is also a classification problem: there are two classes, “normal” and “not normal”, and the objective is determining which of the two more likely matches an observation. However, a basic rule of machine-learning is that one needs to train a system with specimens of all classes, and, crucially, the number of representatives found in the training set for each class should be large [22]. Yet for anomaly detection aiming to find novel attacks, by definition one cannot train on the attacks of interest, but only on normal traffic, and thus has only one category to compare new activity against.

In other words, one often winds up training an anomaly detection system with the opposite of what it is supposed to find—a setting certainly not ideal, as it requires having a perfect model of normality for any reliable decision. If, on the other hand, one had a classification problem with multiple alternatives to choose from, then it would suffice to have a model just crisp enough to separate the classes. To quote from Witten et al. [21]: “The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption. . . . [The assumption] is not of much practical use in real-life problems because they rarely involve ‘closed’ worlds in which you can be certain that all cases are covered.”

Spam detection is an example from the security domain of successfully applying machine learning to a classification problem. Originally proposed by Graham [8], Bayesian frameworks trained with large corpora of both spam and ham have evolved into a standard tool for reliably identifying unsolicited mail.

The observation that machine learning works much better for such true classification problems then leads to the conclusion that anomaly detection is likely in fact better suited for finding variations of known attacks, rather than previously unknown malicious activity. In such settings, one can train the system with specimens of the attacks as they are known and with normal background traffic, and thus achieve a much more reliable decision process.
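
To make the closed-world point concrete, here is a small sketch, ours rather than the authors’, contrasting the two setups on synthetic data. The use of scikit-learn, the feature distributions, and all parameter choices are illustrative assumptions: a one-class model sees only “normal” specimens and must flag everything else, while a two-class classifier gets specimens of both classes and only needs a boundary crisp enough to separate them.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: "normal" traffic features cluster near the origin,
# "attack" features are shifted. Purely illustrative data.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
attacks = rng.normal(loc=4.0, scale=1.0, size=(50, 2))

# Anomaly-detection setup: train on normal data only (closed world).
# predict() returns +1 for "looks normal", -1 for "outlier".
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(normal)

# Classification setup: train on labeled specimens of *both* classes.
X = np.vstack([normal, attacks])
y = np.array([0] * len(normal) + [1] * len(attacks))
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

test = np.array([[0.5, -0.3],   # normal-looking point
                 [4.2,  3.9]])  # attack-looking point
print("one-class SVM:", ocsvm.predict(test))  # e.g. [ 1 -1]
print("classifier:   ", clf.predict(test))    # e.g. [0 1]
```

The one-class model will also report benign-but-previously-unseen behavior as an outlier, which is exactly the false-positive mode the section describes.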
B. High Cost of Errors
In intrusion detection, the relative cost of any misclassification is extremely high compared to many other machine learning applications. A false positive requires spending expensive analyst time examining the reported incident only to eventually determine that it reflects benign underlying activity. As argued by Axelsson, even a very small rate of false positives can quickly render an NIDS unusable [23]. False negatives, on the other hand, have the potential to cause serious damage to an organization: even a single compromised system can seriously undermine the integrity of the IT infrastructure. It is illuminating to compare such high costs with the impact of misclassifications in other domains:

  • Product recommendation systems can readily tolerate errors as these do not have a direct negative impact. While for the seller a good recommendation has the potential to increase sales, a bad choice rarely hurts beyond a lost opportunity to have made a more enticing recommendation. (In fact, one might imagine such systems deliberately making more unlikely guesses on occasion, with the hope of pointing customers to products they would not have otherwise considered.) If recommendations do not align well with the customers’ interest, they will most likely just continue shopping, rather than take a damaging step such as switching to a different seller. As Greg Linden (author of the recommendation engine behind Amazon) said: “Recommendations involve a lot of guesswork. Our error rate will always be high.” [24]

  • OCR technology can likewise tolerate errors much more readily than an anomaly detection system. Spelling and grammar checkers are commonly employed to clean up results, weeding out the obvious mistakes. More generally, statistical language models associate probabilities with results, allowing for postprocessing of a system’s initial output [25]. In addition, users have been trained not to expect perfect documents but to proofread where accuracy is important. While this corresponds to verifying NIDS alerts manually, it is much quicker for a human eye to check the spelling of a word than to validate a report of, say, a web server compromise. Similar to OCR, contemporary automated language translation operates at relatively large error rates [7], and while recent progress has been impressive, nobody would expect more than a rough translation.

  • Spam detection faces a highly unbalanced cost model: false positives (i.e., ham declared as spam) can prove very expensive, but false negatives (spam not identified as such) do not have a significant impact. This discrepancy can allow for “lopsided” tuning, leading to systems that emphasize finding obvious spam fairly reliably, yet exhibiting less reliability for new variations hitherto unseen. For an anomaly detection system that primarily aims to find novel attacks, such performance on new variations rarely constitutes an appropriate trade-off.

Overall, an anomaly detection system faces a much more stringent limit on the number of errors that it can tolerate. However, the intrusion detection-specific challenges that we discuss here all tend to increase error rates—even above the levels for other domains. We deem this unfortunate combination the primary reason for the lack of success in operational settings.
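
Axelsson’s base-rate observation [23] is easy to reproduce with a back-of-the-envelope calculation. The sketch below is ours, and every rate in it is an invented, illustrative number: even a detector with a 0.1% false-positive rate drowns its true alerts when attacks are rare.

```python
# Bayes: P(attack | alert) = TPR * p / (TPR * p + FPR * (1 - p))
# All numbers below are illustrative assumptions, not measurements.
tpr = 0.99   # detector catches 99% of attacks
fpr = 0.001  # flags 0.1% of benign events
p = 1e-5     # 1 in 100,000 events is actually an attack

precision = (tpr * p) / (tpr * p + fpr * (1 - p))
print(f"P(attack | alert) = {precision:.4f}")  # ~0.0098, i.e. ~1%

# With 10 million events per day, the analyst workload looks like:
events = 10_000_000
false_alarms = fpr * (1 - p) * events
true_alarms = tpr * p * events
print(f"~{false_alarms:.0f} false alarms vs ~{true_alarms:.0f} real ones/day")
```

Under these assumed rates, only about one alert in a hundred reflects a real attack, which illustrates why even “small” false-positive rates exhaust analyst time.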
C. Semantic Gap
Anomaly detection systems face a key challenge of transferring their results into actionable reports for the network operator. In many studies, we observe a lack of this crucial final step, which we term the semantic gap. Unfortunately, in the intrusion detection community we find a tendency to limit the evaluation of anomaly detection systems to an assessment of a system’s capability to reliably identify deviations from the normal profile. While doing so indeed comprises an important ingredient for a sound study, the next step then needs to interpret the results from an operator’s point of view—“What does it mean?”

Answering this question goes to the heart of the difference between finding “abnormal activity” and “attacks”. Those familiar with anomaly detection are usually the first to acknowledge that such systems do not target the identification of malicious behavior but just report what has not been seen before, whether benign or not. We argue however that one cannot stop at that point. After all, the objective of deploying an intrusion detection system is to find attacks, and thus a detector that does not allow for bridging this gap is unlikely to meet operational expectations. The common experience with anomaly detection systems producing too many false positives supports this view: by definition, a machine learning algorithm does not make any mistakes within its model of normality; yet for the operator it is the results’ interpretation that matters.

When addressing the semantic gap, one consideration is the incorporation of local security policies. While often neglected in academic research, a fundamental observation about operational networks is the degree to which they differ: many security constraints are a site-specific property. Activity that is fine in an academic setting can be banned in an enterprise network; and even inside a single organization, department policies can differ widely. Thus, it is crucial for a NIDS to accommodate such differences.

For an anomaly detection system, the natural strategy to address site-specifics is having the system “learn” them during training with normal traffic. However, one cannot simply assert this as the solution to the question of adapting to different sites; one needs to explicitly demonstrate it, since the core issue is that such variations can prove diverse and easy to overlook.

Unfortunately, more often than not security policies are not defined crisply on a technical level. For example, an environment might tolerate peer-to-peer traffic as long as it is not used for distributing inappropriate content and remains “below the radar” in terms of volume. To report a violation of such a policy, the anomaly detection system would need to have a notion of what is deemed “appropriate” or “egregiously large” in that particular environment; a decision out of reach for any of today’s systems. Reporting just the usage of P2P applications is likely not particularly useful, unless the environment flat-out bans such usage. In our experience, such vague guidelines are actually common in many environments, and sometimes originate in the imprecise legal language found in the “terms of service” to which users must agree [26].

The basic challenge with regard to the semantic gap is understanding how the features the anomaly detection system operates on relate to the semantics of the network environment. In particular, for any given choice of features there will be a fundamental limit to the kind of determinations a NIDS can develop from them. Returning to the P2P example, when examining only NetFlow records, it is hard to imagine how one might spot inappropriate content.^3 As another example, consider exfiltration of personally identifying information (PII). In many threat models, loss of PII ranks quite high, as it has the potential for causing major damage (either directly, in financial terms, or due to publicity or political fallout). On a technical level, some forms of PII are not that hard to describe; e.g., social security numbers as well as bank account numbers follow specific schemes that one can verify automatically.^4 But an anomaly detection system developed in the absence of such descriptions has little hope of finding PII, and even given examples of PII and non-PII will likely have difficulty distilling rules for accurately distinguishing one from the other.

^3 We note that in fact the literature holds some fairly amazing demonstrations of how much more information a dataset can provide than what we might intuitively expect: Wright et al. [27] infer the language spoken on encrypted VOIP sessions; Yen et al. [28] identify the particular web browser a client uses from flow-level data; Narayanan et al. [29] identify users in the anonymized Netflix datasets via correlation with their public reviews in a separate database; and Kumar et al. [30] determine from lossy and remote packet traces the number of disks attached to systems infected by the “Witty” worm, as well as their uptime to millisecond precision. However these examples all demonstrate the power of exploiting structural knowledge informed by very careful examination of the particular domain of study—results not obtainable by simply expecting an anomaly detection system to develop inferences about “peculiar” activity.

^4 With limitations of course. As it turns out, Japanese phone numbers look a lot like US social security numbers, as the Lawrence Berkeley National Laboratory noticed when monitoring for them in email [31].
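
As one concrete reading of “specific schemes that one can verify automatically,” here is a minimal sketch of scheme-based matching for U.S. social security numbers. The pattern and helper function are ours and purely illustrative; as the footnote warns, syntactically similar strings (such as some Japanese phone numbers) will match too, so this is a starting point rather than a detector.

```python
import re

# SSNs follow the scheme AAA-GG-SSSS; this pattern also excludes a few
# never-issued forms (area 000/666/9xx, group 00, serial 0000).
# Illustrative only: syntactically similar strings will still match.
SSN_RE = re.compile(
    r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"
)

def find_ssn_candidates(text: str) -> list[str]:
    """Return substrings that look like SSNs under the scheme above."""
    return SSN_RE.findall(text)

print(find_ssn_candidates("ticket 123-45-6789 filed"))  # ['123-45-6789']
print(find_ssn_candidates("invalid: 000-12-3456"))      # []
```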
D. Diversity of Network Traffic
Network traffic often exhibits much more diversity than people intuitively expect, which leads to misconceptions about what anomaly detection technology can realistically achieve in operational environments. Even within a single network, the network’s most basic characteristics—such as bandwidth, duration of connections, and application mix—can exhibit immense variability, rendering them unpredictable over short time intervals (seconds to hours). The widespread prevalence of strong correlations and “heavy-tailed” data transfers [32], [33] regularly leads to large bursts of activity. It is crucial to acknowledge that in networking such variability occurs regularly; it does not represent anything unusual. For an anomaly detection system, however, such variability can prove hard to deal with, as it makes it difficult to find a stable notion of “normality”.

One way to reduce the diversity of Internet traffic is to employ aggregation. While highly variable over small-to-medium time intervals, traffic properties tend to greater stability when observed over longer time periods (hours to days, sometimes weeks). For example, in most networks time-of-day and day-of-week effects exhibit reliable patterns: if during today’s lunch break, the traffic volume is twice as large as during the corresponding time slots last week, that likely reflects something unusual occurring. Not coincidentally, one form of anomaly detection system we do find in operational deployment is those that operate on highly aggregated information, such as “volume per hour” or “connections per source”. On the other hand, incidents found by these systems tend to be rather noisy anyway—and often straightforward to find with other approaches (e.g., simple threshold schemes). This last observation goes to the heart of what can often undermine anomaly detection research efforts: a failure to examine whether simpler, non-machine-learning approaches might work equally well. (A sketch of such a time-of-day scheme appears at the end of this subsection.)

Finally, we note that traffic diversity is not restricted to packet-level features, but extends to application-layer information as well, both in terms of syntactic and semantic variability. Syntactically, protocol specifications often purposefully leave room for interpretation, and in heterogeneous traffic streams there is ample opportunity for corner-case situations to manifest (see the discussion of “crud” in [34]). Semantically, features derived from application protocols can be just as fluctuating as network-layer packets (see, e.g., [35], [36]).
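
The time-of-day comparison above can be captured in a few lines without any machine learning, which is consistent with the observation that simple threshold schemes often suffice at this level of aggregation. The sketch below is ours; the 2x factor and the data layout are illustrative assumptions echoing the lunch-break example.

```python
from collections import defaultdict

# history[(weekday, hour)] -> list of byte counts seen in past weeks.
history: dict[tuple[int, int], list[float]] = defaultdict(list)

def observe(weekday: int, hour: int, volume: float) -> None:
    """Record one hour's aggregate traffic volume (training)."""
    history[(weekday, hour)].append(volume)

def is_anomalous(weekday: int, hour: int, volume: float,
                 factor: float = 2.0) -> bool:
    """Flag if this hour's volume exceeds `factor` times the average for
    the same weekday/hour slot in past weeks. The 2x threshold is an
    illustrative choice (cf. the lunch-break example in the text)."""
    past = history[(weekday, hour)]
    if not past:
        return False  # no baseline yet
    baseline = sum(past) / len(past)
    return volume > factor * baseline

# Mondays at 12:00 typically see ~10 GB; 25 GB would be flagged.
for gb in [9.5, 10.2, 10.0, 9.8]:
    observe(0, 12, gb)
print(is_anomalous(0, 12, 11.0))  # False
print(is_anomalous(0, 12, 25.0))  # True
```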
E. Difficulties with Evaluation
For an anomaly detection system, a thorough evaluation is particularly crucial to perform, as experience shows that many promising approaches turn out in practice to fall short of one’s expectations. That said, devising sound evaluation schemes is not easy, and in fact turns out to be more difficult than building the detector itself. Due to the opacity of the detection process, the results of an anomaly detection system are harder to predict than for a misuse detector. We discuss evaluation challenges in terms of the difficulties for (i) finding the right data, and then (ii) interpreting results.

1) Difficulties of Data: Arguably the most significant challenge an evaluation faces is the lack of appropriate public datasets for assessing anomaly detection systems. In other domains, we often find either standardized test suites available, or the possibility to collect an appropriate corpus, or both. For example, for automatic language translation “a large training set of the input-output behavior that we seek to automate is available to us in the wild” [37]. For spam detectors, dedicated “spam feeds” [38] provide large collections of spam free of privacy concerns. Getting suitable collections of “ham” is more difficult, however even a small number of private mail archives can already yield a large corpus [39]. For OCR, sophisticated methods have been devised to generate ground truth automatically [40]. In our domain, however, we often have neither standardized test sets, nor any appropriate, readily available data.

The two publicly available datasets that have provided something of a standardized setting in the past—the DARPA/Lincoln Labs packet traces [41], [42] and the KDD Cup dataset derived from them [43]—are now a decade old, and no longer adequate for any current study. The DARPA dataset contains multiple weeks of network activity from a simulated Air Force network, generated in 1998 and refined in 1999. Not only is this data synthetic, and no longer even close to reflecting contemporary attacks, but it also has been so extensively studied over the years that most members of the intrusion detection community deem it wholly uninteresting if a NIDS now reliably detects the attacks it contains. (Indeed, the DARPA data faced pointed criticisms not long after its release [44], particularly regarding the degree to which simulated data can be appropriate for the evaluation of a NIDS.) The KDD dataset represents a distillation of the DARPA traces into features for machine learning. Not only does it inherit the shortcomings of the DARPA data, but the features have also turned out to exhibit unfortunate artifacts [45].

Given the lack of publicly available data, it is natural to ask why we find such a striking gap in our community.^5 The primary reason clearly arises from the data’s sensitive nature: the inspection of network traffic can reveal highly sensitive information, including confidential or personal communications, an organization’s business secrets, or its users’ network access patterns. Any breach of such information can prove catastrophic not only for the organization itself, but also for affected third parties. It is understandable that in the face of such high risks, researchers frequently encounter insurmountable organizational and legal barriers when they attempt to provide datasets to the community.

Given this difficulty, researchers have pursued two alternative routes in the past: simulation and anonymization. As demonstrated by the DARPA dataset, network traffic generated by simulation can have the major benefit of being free of sensitivity concerns. However, Internet traffic

^5 We note that the lack of public network data is not limited to the intrusion detection domain. We see effects similar to the overuse of the DARPA dataset in empirical network research: the ClarkNet-HTTP [46] dataset contains two weeks’ worth of HTTP requests to ClarkNet’s web server, recorded in 1995. While researchers at ClarkNet stopped using these logs for their own studies in 1997, in total researchers have used the traces for evaluations in more than 90 papers published between 1995 and 2007—13 of these in 2007 [47]!


References (cited above)

[3] G. Linden, B. Smith, and J. York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7(1), 2003.
[16] V. Chandola, A. Banerjee, and V. Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 2009.
[21] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Frequently Asked Questions

Q1. What are the contributions in "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection"?

The authors examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. They support their claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection. The authors hope for this discussion to contribute to strengthening future research on anomaly detection by pinpointing the fundamental challenges it faces.

Other key points extracted from the paper:

  • Mediated trace access can be a viable strategy [64]: rather than bringing the data to the experimenter, bring the experiment to the data, i.e., researchers send their analysis programs to data providers who then run them on their behalf and return the output.
  • For an anomaly detection system, the natural strategy to address site-specifics is having the system “learn” them during training with normal traffic.
  • Due to the opacity of the detection process, the results of an anomaly detection system are harder to predict than for a misuse detector.
  • The basic challenge with regard to the semantic gap is understanding how the features the anomaly detection system operates on relate to the semantics of the network environment.
  • Arguably the most significant challenge an evaluation faces is the lack of appropriate public datasets for assessing anomaly detection systems.
  • The most convincing real-world test of any anomaly detection system is to solicit feedback from operators who run the system in their network.
  • Despite intensive efforts [52], [53], publishing such datasets has garnered little traction to date, mostly one suspects for the fear that information can still leak.