
Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

TL;DR: The main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively.
Abstract: In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

Summary

Introduction

  • In this paper the authors set out to examine the differences between the intrusion detection domain and other areas where machine learning is used with more success.
  • In addition, the authors identify further characteristics that their domain exhibits that are not well aligned with the requirements of machine-learning.
  • By “machine-learning” the authors mean algorithms that are first trained with reference input to “learn” its specifics (either supervised or unsupervised), to then be deployed on previously unseen input for the actual detection process.

II. MACHINE LEARNING IN INTRUSION DETECTION

  • Anomaly detection systems find deviations from expected behavior.
  • To capture normal activity, IDES (and its successor NIDES [10]) used a combination of statistical metrics and profiles.
  • Since then, many more approaches have been pursued.
  • Often, they borrow schemes from the machine learning community, such as information theory [11], neural networks [12], support vector machines [13], genetic algorithms [14], artificial immune systems [15], and many more.
  • The authors' discussion in this paper aims to develop a different general point: that much of the difficulty with anomaly detection systems stems from using tools borrowed from the machine learning community in inappropriate ways.

III. CHALLENGES OF USING MACHINE LEARNING

  • It can be surprising at first to realize that despite extensive academic research efforts on anomaly detection, the success of such systems in operational environments has been very limited.
  • The authors believe that this “success discrepancy” arises because the intrusion detection domain exhibits particular characteristics that make the effective deployment of machine learning approaches fundamentally harder than in many other contexts.
  • In the following the authors identify these differences, with an aim of raising the community’s awareness of the unique challenges anomaly detection faces when operating on network traffic.
  • The authors note that their examples from other domains are primarily for illustration, as there is of course a continuous spectrum for many of the properties discussed (e.g., spam detection faces a similarly adversarial environment as intrusion detection does).
  • Based on discussions with colleagues who work with machine learning on a daily basis, the authors believe these intuitive arguments match well with what a more formal analysis would yield.

A. Outlier Detection

  • Fundamentally, machine-learning algorithms excel much better at finding similarities than at identifying activity that does not belong there: the classic machine learning application is a classification problem, rather than discovering meaningful outliers as required by an anomaly detection system [21].
  • Amazon's product recommendation system [3] employs collaborative filtering, matching each of a user's purchased (or positively rated) items with other similar products, where similarity is determined by products that tend to be bought together.
  • The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption.
  • Originally proposed by Graham [8], Bayesian frameworks trained with large corpora of both spam and ham have evolved into a standard tool for reliably identifying unsolicited mail.
  • The observation that machine learning works much better for such true classification problems then leads to the conclusion that anomaly detection is likely in fact better suited for finding variations of known attacks, rather than previously unknown malicious activity.

B. High Cost of Errors

  • In intrusion detection, the relative cost of any misclassification is extremely high compared to many other machine learning applications.
  • While for the seller a good recommendation has the potential to increase sales, a bad choice rarely hurts beyond a lost opportunity to have made a more enticing recommendation.
  • Spelling and grammar checkers are commonly employed to clean up results, weeding out the obvious mistakes.
  • Spam detection faces a highly unbalanced cost model: false positives (i.e., ham declared as spam) can prove very expensive, but false negatives (spam not identified as such) do not have a significant impact.
  • Overall, an anomaly detection system faces a much more stringent limit on the number of errors that it can tolerate.

C. Semantic Gap

  • Anomaly detection systems face a key challenge of transferring their results into actionable reports for the network operator.
  • Unfortunately, in the intrusion detection community the authors find a tendency to limit the evaluation of anomaly detection systems to an assessment of a system’s capability to reliably identify deviations from the normal profile.
  • When addressing the semantic gap, one consideration is the incorporation of local security policies.
  • Returning to the P2P example, when examining only NetFlow records, it is hard to imagine how one might spot inappropriate content.
  • As another example, consider exfiltration of personally identifying information (PII).

D. Diversity of Network Traffic

  • Network traffic often exhibits much more diversity than people intuitively expect, which leads to misconceptions about what anomaly detection technology can realistically achieve in operational environments.
  • Wright et al. [27] infer the language spoken on encrypted VOIP sessions.
  • However these examples all demonstrate the power of exploiting structural knowledge informed by very careful examination of the particular domain of study—results not obtainable by simply expecting an anomaly detection system to develop inferences about “peculiar” activity.
  • While highly variable over small-to-medium time intervals, traffic properties tend to greater stability when observed over longer time periods (hours to days, sometimes weeks).
  • Finally, the authors note that traffic diversity is not restricted to packet-level features, but extends to application-layer information as well, both in terms of syntactic and semantic variability.

E. Difficulties with Evaluation

  • For an anomaly detection system, a thorough evaluation is particularly crucial to perform, as experience shows that many promising approaches turn out in practice to fall short of one’s expectations.
  • The two publicly available datasets that have provided something of a standardized setting in the past—the DARPA/Lincoln Labs packet traces [41], [42] and the KDD Cup dataset derived from them [43]—are now a decade old, and no longer adequate for any current study.
  • It is understandable that in the face of such high risks, researchers frequently encounter insurmountable organizational and legal barriers when they attempt to provide datasets to the community.
  • The authors argue that when evaluating an anomaly detection system, understanding the system’s semantic properties— the operationally relevant activity that it can detect, as well as the blind spots every system will necessarily have— is much more valuable than identifying a concrete set of parameters for which the system happens to work best for a particular input.
  • Exploiting the specifics of a machine learning implementation requires significant effort, time, and expertise on the attacker’s side.

IV. RECOMMENDATIONS FOR USING MACHINE LEARNING

  • The authors note that they view these guidelines as touchstones rather than as firm rules; there is certainly room for further discussion within the wider intrusion detection community.
  • If the authors could give only one recommendation on how to improve the state of anomaly detection research, it would be: Understand what the system is doing.
  • The nature of their domain is such that one can always find a variation that works slightly better than anything else in a particular setting.
  • The point the authors wish to convey however is that they are working in an area where insight matters much more than just numerical results.

A. Understanding the Threat Model

  • Before starting to develop an anomaly detector, one needs to consider the anticipated threat model, as that establishes the framework for choosing trade-offs.
  • Operation in a small network faces very different challenges than for a large enterprise or backbone network; academic environments impose different requirements than commercial enterprises.
  • Possible answers range from “very little” to “lethal.”
  • The degree to which attackers might analyze defense techniques and seek to circumvent them determines the robustness requirements for any detector.

B. Keeping The Scope Narrow

  • A common pitfall is starting with the premise of using machine learning (or, worse, a particular machine-learning approach) and then looking for a problem to solve.
  • A key question is identifying the feature set the detector will work with: insight into the features’ significance (in terms of the domain) and capabilities (in terms of revealing the targeted activity) goes a long way towards reliable detection.
  • Laying out the land like this sets the stage for a well-grounded study.

C. Reducing the Costs

  • Per the discussion in Section III-B, it follows that one obtains enormous benefit from reducing the costs associated with using an anomaly detection system.
  • As the authors have seen, an anomaly detection system does not necessarily make more mistakes than machine learning systems deployed in other domains—yet the high cost associated with each error often conflicts with effective operation.
  • Likely the most important step towards fewer mistakes is reducing the system’s scope, as discussed in Section IV-B.
  • The setup of the underlying machine-learning problem also has a direct impact on the number of false positives.
  • As a simple flow-level example, the set of destination ports a particular internal host contacts will likely fluctuate quite a bit for typical client systems; but the authors might often find the set of ports on which it accepts incoming connections to be stable over extended periods of time (see the sketch below).
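
A minimal sketch of that flow-level idea (ours, not from the paper; the host/port values and the alert policy are illustrative assumptions): learn which ports a host accepts connections on during a training window, then alert only on a previously unseen listening port, rather than on the naturally fluctuating set of outbound destination ports.

```python
from collections import defaultdict

# host -> set of ports on which it has been seen accepting connections
listening_ports: dict[str, set[int]] = defaultdict(set)

def train(host: str, port: int) -> None:
    """Record an accepted incoming connection during the training window."""
    listening_ports[host].add(port)

def check(host: str, port: int) -> bool:
    """Return True (alert) if the host accepts a connection on a port not
    seen during training -- a feature the text expects to stay stable,
    unlike the outbound destination ports of typical client machines."""
    return port not in listening_ports[host]

train("10.0.0.5", 22)
train("10.0.0.5", 443)
print(check("10.0.0.5", 443))   # False: known service
print(check("10.0.0.5", 6667))  # True: new listening port -> alert
```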

D. Evaluation

  • When evaluating an anomaly detection system, the primary objective should be to develop insight into the system’s capabilities.
  • The authors discuss evaluation separately in terms of working with data, and interpreting results.
  • No dataset is perfect: measurements often include artifacts that can impact the results (such as filtering or unintended loss), or unrelated noise that one can safely filter out if readily identified (e.g., an internal vulnerability scan run by the security department).
  • Thus, machine learning can sometimes serve very effectively to “point the way” to how to develop detectors that are themselves based on different principles.
  • The successful operation of an anomaly detection system typically requires significant experience with the particular system, as it needs to be tuned to the local setting—experience that can prove cumbersome to collect if the underlying objective is instead to understand the new system.

V. CONCLUSION

  • The authors’ work examines the surprising imbalance between the extensive amount of research on machine learning-based anomaly detection pursued in the academic intrusion detection community, versus the lack of operational deployments of such systems.
  • The authors argue that this discrepancy stems in large part from specifics of the problem domain that make it significantly harder to apply machine learning effectively than in many other areas of computer science where such schemes are used with greater success.
  • It is crucial to acknowledge that the nature of the domain is such that one can always find schemes that yield marginally better ROC curves than anything else has for a specific given setting.
  • Such results, however, do not contribute to the progress of the field without a semantic understanding of the gain.


Outside the Closed World:
On Using Machine Learning For Network Intrusion Detection

Robin Sommer
International Computer Science Institute, and
Lawrence Berkeley National Laboratory

Vern Paxson
International Computer Science Institute, and
University of California, Berkeley
Abstract—In network intrusion detection research, one popular strategy for finding attacks is monitoring a network’s activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational “real world” settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

Keywords—anomaly detection; machine learning; intrusion detection; network security.
I. INTRODUCTION

Traditionally, network intrusion detection systems (NIDS) are broadly classified based on the style of detection they are using: systems relying on misuse-detection monitor activity with precise descriptions of known malicious behavior, while anomaly-detection systems have a notion of normal activity and flag deviations from that profile.^1 Both approaches have been extensively studied by the research community for many years. However, in terms of actual deployments, we observe a striking imbalance: in operational settings, of these two main classes we find almost exclusively only misuse detectors in use—most commonly in the form of signature systems that scan network traffic for characteristic byte sequences.

This situation is somewhat striking when considering the success that machine-learning—which frequently forms the basis for anomaly-detection—sees in many other areas of computer science, where it often results in large-scale deployments in the commercial world. Examples from other domains include product recommendation systems such as used by Amazon [3] and Netflix [4]; optical character recognition systems (e.g., [5], [6]); natural language translation [7]; and also spam detection, as an example closer to home [8].

In this paper we set out to examine the differences between the intrusion detection domain and other areas where machine learning is used with more success. Our main claim is that the task of finding attacks is fundamentally different from other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We believe that a significant part of the problem already originates in the premise, found in virtually any relevant textbook, that anomaly detection is suitable for finding novel attacks; we argue that this premise does not hold with the generality commonly implied. Rather, the strength of machine-learning tools is finding activity that is similar to something previously seen, without the need however to precisely describe that activity up front (as misuse detection must).

In addition, we identify further characteristics that our domain exhibits that are not well aligned with the requirements of machine-learning. These include: (i) a very high cost of errors; (ii) lack of training data; (iii) a semantic gap between results and their operational interpretation; (iv) enormous variability in input data; and (v) fundamental difficulties for conducting sound evaluation. While these challenges may not be surprising for those who have been working in the domain for some time, they can be easily lost on newcomers. To address them, we deem it crucial for any effective deployment to acquire deep, semantic insight into a system’s capabilities and limitations, rather than treating the system as a black box as unfortunately often seen.

We stress that we do not consider machine-learning an inappropriate tool for intrusion detection. Its use requires care, however: the more crisply one can define the context in which it operates, the better promise the results may hold. Likewise, the better we understand the semantics of the detection process, the more operationally relevant the system will be. Consequently, we also present a set of guidelines meant to strengthen future intrusion detection research.

Throughout the discussion, we frame our mindset around the goal of using an anomaly detection system effectively in the “real world”, i.e., in large-scale, operational environments. We focus on network intrusion detection as that is our main area of expertise, though we believe that similar arguments hold for host-based systems. For ease of exposition we will use the term anomaly detection somewhat narrowly to refer to detection approaches that rely primarily on machine-learning. By “machine-learning” we mean algorithms that are first trained with reference input to “learn” its specifics (either supervised or unsupervised), to then be deployed on previously unseen input for the actual detection process. While our terminology is deliberately a bit vague, we believe it captures what many in the field intuitively associate with the term “anomaly detection”.

We structure the remainder of the paper as follows. In Section II, we begin with a brief discussion of machine learning as it has been applied to intrusion detection in the past. We then in Section III identify the specific challenges machine learning faces in our domain. In Section IV we present recommendations that we hope will help to strengthen future research, and we briefly summarize in Section V.

^1 Other styles include specification-based [1] and behavioral detection [2]. These approaches focus respectively on defining allowed types of activity in order to flag any other activity as forbidden, and analyzing patterns of activity and surrounding context to find secondary evidence of attacks.
II. MACHINE LEARNING IN INTRUSION DETECTION
Anomaly detection systems find deviations from expected behavior. Based on a notion of normal activity, they report deviations from that profile as alerts. The basic assumption underlying any anomaly detection system—malicious activity exhibits characteristics not observed for normal usage—was first introduced by Denning in her seminal work on the host-based IDES system [9] in 1987. To capture normal activity, IDES (and its successor NIDES [10]) used a combination of statistical metrics and profiles. Since then, many more approaches have been pursued. Often, they borrow schemes from the machine learning community, such as information theory [11], neural networks [12], support vector machines [13], genetic algorithms [14], artificial immune systems [15], and many more. In our discussion, we focus on anomaly detection systems that utilize such machine learning approaches.

Chandola et al. provide a survey of anomaly detection in [16], including other areas where similar approaches are used, such as monitoring credit card spending patterns for fraudulent activity. While in such applications one is also looking for outliers, the data tends to be much more structured. For example, the space for representing credit card transactions is of relatively low dimensionality and semantically much more well-defined than network traffic [17].

Anomaly detection approaches must grapple with a set of well-recognized problems [18]: the detectors tend to generate numerous false positives; attack-free data for training is hard to find; and attackers can evade detection by gradually teaching a system to accept malicious activity as benign. Our discussion in this paper aims to develop a different general point: that much of the difficulty with anomaly detection systems stems from using tools borrowed from the machine learning community in inappropriate ways.

Compared to the extensive body of research, anomaly detection has not obtained much traction in the “real world”. Those systems found in operational deployment are most commonly based on statistical profiles of heavily aggregated traffic, such as Arbor’s Peakflow [19] and Lancope’s StealthWatch [20]. While highly helpful, such devices operate with a much more specific focus than with the generality that research papers often envision.^2 We see this situation as suggestive that many anomaly detection systems from the academic world do not live up to the requirements of operational settings.

^2 We note that for commercial solutions it is always hard to say what they do exactly, as specifics of their internals are rarely publicly available.
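
To make the notion of a statistical profile concrete, the following sketch shows the flavor of per-metric profiling that Denning-style systems perform. It is our illustration, not code from the paper or from IDES/NIDES; the metric, the class name `MetricProfile`, and the 3-sigma threshold are all illustrative assumptions.

```python
import math

class MetricProfile:
    """Minimal Denning-style statistical profile for one metric.

    Maintains a running mean/variance (Welford's algorithm) and flags
    observations deviating by more than `k` standard deviations.
    All parameter choices here are illustrative, not from the paper.
    """
    def __init__(self, k=3.0):
        self.k = k
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        """Fold a new observation into the profile (training)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomalous(self, x):
        """Return True if x lies outside mean +/- k * stddev."""
        if self.n < 2:
            return False  # not enough data to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) > self.k * std

# Example: profile logins-per-hour for a host, then test new values.
profile = MetricProfile(k=3.0)
for logins in [4, 6, 5, 7, 5, 6, 4, 5]:   # "normal" training data
    profile.update(logins)
print(profile.is_anomalous(5))    # False: within profile
print(profile.is_anomalous(40))   # True: large deviation -> alert
```

Even this toy version exhibits the core weakness discussed throughout the paper: anything outside the learned profile gets flagged, whether malicious or merely unseen.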
III. CHALLENGES OF USING MACHINE LEARNING
It can be surprising at first to realize that despite extensive academic research efforts on anomaly detection, the success of such systems in operational environments has been very limited. In other domains, the very same machine learning tools that form the basis of anomaly detection systems have proven to work with great success, and are regularly used in commercial settings where large quantities of data render manual inspection infeasible. We believe that this “success discrepancy” arises because the intrusion detection domain exhibits particular characteristics that make the effective deployment of machine learning approaches fundamentally harder than in many other contexts.

In the following we identify these differences, with an aim of raising the community’s awareness of the unique challenges anomaly detection faces when operating on network traffic. We note that our examples from other domains are primarily for illustration, as there is of course a continuous spectrum for many of the properties discussed (e.g., spam detection faces a similarly adversarial environment as intrusion detection does). We also note that we are network security researchers, not experts on machine-learning, and thus we argue mostly at an intuitive level rather than attempting to frame our statements in the formalisms employed for machine learning. However, based on discussions with colleagues who work with machine learning on a daily basis, we believe these intuitive arguments match well with what a more formal analysis would yield.
A. Outlier Detection
Fundamentally, machine-learning algorithms excel much better at finding similarities than at identifying activity that does not belong there: the classic machine learning application is a classification problem, rather than discovering meaningful outliers as required by an anomaly detection system [21]. Consider product recommendation systems such as that used by Amazon [3]: it employs collaborative filtering, matching each of a user’s purchased (or positively rated) items with other similar products, where similarity is determined by products that tend to be bought together. If the system instead operated like an anomaly detection system, it would look for items that are typically not bought together—a different kind of question with a much less clear answer, as, according to [3], many product pairs have no common customers.

In some sense, outlier detection is also a classification problem: there are two classes, “normal” and “not normal”, and the objective is determining which of the two more likely matches an observation. However, a basic rule of machine-learning is that one needs to train a system with specimens of all classes, and, crucially, the number of representatives found in the training set for each class should be large [22]. Yet for anomaly detection aiming to find novel attacks, by definition one cannot train on the attacks of interest, but only on normal traffic, and thus has only one category to compare new activity against.

In other words, one often winds up training an anomaly detection system with the opposite of what it is supposed to find—a setting certainly not ideal, as it requires having a perfect model of normality for any reliable decision. If, on the other hand, one had a classification problem with multiple alternatives to choose from, then it would suffice to have a model just crisp enough to separate the classes. To quote from Witten et al. [21]: “The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption. . . . [The assumption] is not of much practical use in real-life problems because they rarely involve ‘closed’ worlds in which you can be certain that all cases are covered.”

Spam detection is an example from the security domain of successfully applying machine learning to a classification problem. Originally proposed by Graham [8], Bayesian frameworks trained with large corpora of both spam and ham have evolved into a standard tool for reliably identifying unsolicited mail.

The observation that machine learning works much better for such true classification problems then leads to the conclusion that anomaly detection is likely in fact better suited for finding variations of known attacks, rather than previously unknown malicious activity. In such settings, one can train the system with specimens of the attacks as they are known and with normal background traffic, and thus achieve a much more reliable decision process.
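
To make the closed-world point concrete, here is a small sketch, ours rather than the authors’, contrasting the two setups on synthetic data. The use of scikit-learn, the feature distributions, and all parameter choices are illustrative assumptions: a one-class model sees only “normal” specimens and must flag everything else, while a two-class classifier gets specimens of both classes and only needs a boundary crisp enough to separate them.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: "normal" traffic features cluster near the origin,
# "attack" features are shifted. Purely illustrative data.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
attacks = rng.normal(loc=4.0, scale=1.0, size=(50, 2))

# Anomaly-detection setup: train on normal data only (closed world).
# predict() returns +1 for "looks normal", -1 for "outlier".
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(normal)

# Classification setup: train on labeled specimens of *both* classes.
X = np.vstack([normal, attacks])
y = np.array([0] * len(normal) + [1] * len(attacks))
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

test = np.array([[0.5, -0.3],   # normal-looking point
                 [4.2,  3.9]])  # attack-looking point
print("one-class SVM:", ocsvm.predict(test))  # e.g. [ 1 -1]
print("classifier:   ", clf.predict(test))    # e.g. [0 1]
```

The one-class model will also report benign-but-previously-unseen behavior as an outlier, which is exactly the false-positive mode the section describes.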
B. High Cost of Errors
In intrusion detection, the relative cost of any misclassification is extremely high compared to many other machine learning applications. A false positive requires spending expensive analyst time examining the reported incident only to eventually determine that it reflects benign underlying activity. As argued by Axelsson, even a very small rate of false positives can quickly render an NIDS unusable [23]. False negatives, on the other hand, have the potential to cause serious damage to an organization: even a single compromised system can seriously undermine the integrity of the IT infrastructure. It is illuminating to compare such high costs with the impact of misclassifications in other domains:

  • Product recommendation systems can readily tolerate errors as these do not have a direct negative impact. While for the seller a good recommendation has the potential to increase sales, a bad choice rarely hurts beyond a lost opportunity to have made a more enticing recommendation. (In fact, one might imagine such systems deliberately making more unlikely guesses on occasion, with the hope of pointing customers to products they would not have otherwise considered.) If recommendations do not align well with the customers’ interest, they will most likely just continue shopping, rather than take a damaging step such as switching to a different seller. As Greg Linden (author of the recommendation engine behind Amazon) said: “Recommendations involve a lot of guesswork. Our error rate will always be high.” [24]

  • OCR technology can likewise tolerate errors much more readily than an anomaly detection system. Spelling and grammar checkers are commonly employed to clean up results, weeding out the obvious mistakes. More generally, statistical language models associate probabilities with results, allowing for postprocessing of a system’s initial output [25]. In addition, users have been trained not to expect perfect documents but to proofread where accuracy is important. While this corresponds to verifying NIDS alerts manually, it is much quicker for a human eye to check the spelling of a word than to validate a report of, say, a web server compromise. Similar to OCR, contemporary automated language translation operates at relatively large error rates [7], and while recent progress has been impressive, nobody would expect more than a rough translation.

  • Spam detection faces a highly unbalanced cost model: false positives (i.e., ham declared as spam) can prove very expensive, but false negatives (spam not identified as such) do not have a significant impact. This discrepancy can allow for “lopsided” tuning, leading to systems that emphasize finding obvious spam fairly reliably, yet exhibiting less reliability for new variations hitherto unseen. For an anomaly detection system that primarily aims to find novel attacks, such performance on new variations rarely constitutes an appropriate trade-off.

Overall, an anomaly detection system faces a much more stringent limit on the number of errors that it can tolerate. However, the intrusion detection-specific challenges that we discuss here all tend to increase error rates—even above the levels for other domains. We deem this unfortunate combination the primary reason for the lack of success in operational settings.
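
Axelsson’s base-rate observation [23] is easy to reproduce with a back-of-the-envelope calculation. The sketch below is ours, and every rate in it is an invented, illustrative number: even a detector with a 0.1% false-positive rate drowns its true alerts when attacks are rare.

```python
# Bayes: P(attack | alert) = TPR * p / (TPR * p + FPR * (1 - p))
# All numbers below are illustrative assumptions, not measurements.
tpr = 0.99   # detector catches 99% of attacks
fpr = 0.001  # flags 0.1% of benign events
p = 1e-5     # 1 in 100,000 events is actually an attack

precision = (tpr * p) / (tpr * p + fpr * (1 - p))
print(f"P(attack | alert) = {precision:.4f}")  # ~0.0098, i.e. ~1%

# With 10 million events per day, the analyst workload looks like:
events = 10_000_000
false_alarms = fpr * (1 - p) * events
true_alarms = tpr * p * events
print(f"~{false_alarms:.0f} false alarms vs ~{true_alarms:.0f} real ones/day")
```

Under these assumed rates, only about one alert in a hundred reflects a real attack, which illustrates why even “small” false-positive rates exhaust analyst time.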
C. Semantic Gap
Anomaly detection systems face a key challenge of transferring their results into actionable reports for the network operator. In many studies, we observe a lack of this crucial final step, which we term the semantic gap. Unfortunately, in the intrusion detection community we find a tendency to limit the evaluation of anomaly detection systems to an assessment of a system’s capability to reliably identify deviations from the normal profile. While doing so indeed comprises an important ingredient for a sound study, the next step then needs to interpret the results from an operator’s point of view—“What does it mean?”

Answering this question goes to the heart of the difference between finding “abnormal activity” and “attacks”. Those familiar with anomaly detection are usually the first to acknowledge that such systems do not target the identification of malicious behavior but just report what has not been seen before, whether benign or not. We argue however that one cannot stop at that point. After all, the objective of deploying an intrusion detection system is to find attacks, and thus a detector that does not allow for bridging this gap is unlikely to meet operational expectations. The common experience with anomaly detection systems producing too many false positives supports this view: by definition, a machine learning algorithm does not make any mistakes within its model of normality; yet for the operator it is the results’ interpretation that matters.

When addressing the semantic gap, one consideration is the incorporation of local security policies. While often neglected in academic research, a fundamental observation about operational networks is the degree to which they differ: many security constraints are a site-specific property. Activity that is fine in an academic setting can be banned in an enterprise network; and even inside a single organization, department policies can differ widely. Thus, it is crucial for a NIDS to accommodate such differences.

For an anomaly detection system, the natural strategy to address site-specifics is having the system “learn” them during training with normal traffic. However, one cannot simply assert this as the solution to the question of adapting to different sites; one needs to explicitly demonstrate it, since the core issue is that such variations can prove diverse and easy to overlook.

Unfortunately, more often than not security policies are not defined crisply on a technical level. For example, an environment might tolerate peer-to-peer traffic as long as it is not used for distributing inappropriate content and remains “below the radar” in terms of volume. To report a violation of such a policy, the anomaly detection system would need to have a notion of what is deemed “appropriate” or “egregiously large” in that particular environment; a decision out of reach for any of today’s systems. Reporting just the usage of P2P applications is likely not particularly useful, unless the environment flat-out bans such usage. In our experience, such vague guidelines are actually common in many environments, and sometimes originate in the imprecise legal language found in the “terms of service” to which users must agree [26].

The basic challenge with regard to the semantic gap is understanding how the features the anomaly detection system operates on relate to the semantics of the network environment. In particular, for any given choice of features there will be a fundamental limit to the kind of determinations a NIDS can develop from them. Returning to the P2P example, when examining only NetFlow records, it is hard to imagine how one might spot inappropriate content.^3 As another example, consider exfiltration of personally identifying information (PII). In many threat models, loss of PII ranks quite high, as it has the potential for causing major damage (either directly, in financial terms, or due to publicity or political fallout). On a technical level, some forms of PII are not that hard to describe; e.g., social security numbers as well as bank account numbers follow specific schemes that one can verify automatically.^4 But an anomaly detection system developed in the absence of such descriptions has little hope of finding PII, and even given examples of PII and non-PII will likely have difficulty distilling rules for accurately distinguishing one from the other.

^3 We note that in fact the literature holds some fairly amazing demonstrations of how much more information a dataset can provide than what we might intuitively expect: Wright et al. [27] infer the language spoken on encrypted VOIP sessions; Yen et al. [28] identify the particular web browser a client uses from flow-level data; Narayanan et al. [29] identify users in the anonymized Netflix datasets via correlation with their public reviews in a separate database; and Kumar et al. [30] determine from lossy and remote packet traces the number of disks attached to systems infected by the “Witty” worm, as well as their uptime to millisecond precision. However these examples all demonstrate the power of exploiting structural knowledge informed by very careful examination of the particular domain of study—results not obtainable by simply expecting an anomaly detection system to develop inferences about “peculiar” activity.

^4 With limitations of course. As it turns out, Japanese phone numbers look a lot like US social security numbers, as the Lawrence Berkeley National Laboratory noticed when monitoring for them in email [31].
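
As one concrete reading of “specific schemes that one can verify automatically,” here is a minimal sketch of scheme-based matching for U.S. social security numbers. The pattern and helper function are ours and purely illustrative; as the footnote warns, syntactically similar strings (such as some Japanese phone numbers) will match too, so this is a starting point rather than a detector.

```python
import re

# SSNs follow the scheme AAA-GG-SSSS; this pattern also excludes a few
# never-issued forms (area 000/666/9xx, group 00, serial 0000).
# Illustrative only: syntactically similar strings will still match.
SSN_RE = re.compile(
    r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"
)

def find_ssn_candidates(text: str) -> list[str]:
    """Return substrings that look like SSNs under the scheme above."""
    return SSN_RE.findall(text)

print(find_ssn_candidates("ticket 123-45-6789 filed"))  # ['123-45-6789']
print(find_ssn_candidates("invalid: 000-12-3456"))      # []
```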
D. Diversity of Network Traffic
Network traffic often exhibits much more diversity than people intuitively expect, which leads to misconceptions about what anomaly detection technology can realistically achieve in operational environments. Even within a single network, the network’s most basic characteristics—such as bandwidth, duration of connections, and application mix—can exhibit immense variability, rendering them unpredictable over short time intervals (seconds to hours). The widespread prevalence of strong correlations and “heavy-tailed” data transfers [32], [33] regularly leads to large bursts of activity. It is crucial to acknowledge that in networking such variability occurs regularly; it does not represent anything unusual. For an anomaly detection system, however, such variability can prove hard to deal with, as it makes it difficult to find a stable notion of “normality”.

One way to reduce the diversity of Internet traffic is to employ aggregation. While highly variable over small-to-medium time intervals, traffic properties tend to greater stability when observed over longer time periods (hours to days, sometimes weeks). For example, in most networks time-of-day and day-of-week effects exhibit reliable patterns: if during today’s lunch break, the traffic volume is twice as large as during the corresponding time slots last week, that likely reflects something unusual occurring. Not coincidentally, one form of anomaly detection system we do find in operational deployment is those that operate on highly aggregated information, such as “volume per hour” or “connections per source”. On the other hand, incidents found by these systems tend to be rather noisy anyway—and often straightforward to find with other approaches (e.g., simple threshold schemes). This last observation goes to the heart of what can often undermine anomaly detection research efforts: a failure to examine whether simpler, non-machine-learning approaches might work equally well. (A sketch of such a time-of-day scheme appears at the end of this subsection.)

Finally, we note that traffic diversity is not restricted to packet-level features, but extends to application-layer information as well, both in terms of syntactic and semantic variability. Syntactically, protocol specifications often purposefully leave room for interpretation, and in heterogeneous traffic streams there is ample opportunity for corner-case situations to manifest (see the discussion of “crud” in [34]). Semantically, features derived from application protocols can be just as fluctuating as network-layer packets (see, e.g., [35], [36]).
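
The time-of-day comparison above can be captured in a few lines without any machine learning, which is consistent with the observation that simple threshold schemes often suffice at this level of aggregation. The sketch below is ours; the 2x factor and the data layout are illustrative assumptions echoing the lunch-break example.

```python
from collections import defaultdict

# history[(weekday, hour)] -> list of byte counts seen in past weeks.
history: dict[tuple[int, int], list[float]] = defaultdict(list)

def observe(weekday: int, hour: int, volume: float) -> None:
    """Record one hour's aggregate traffic volume (training)."""
    history[(weekday, hour)].append(volume)

def is_anomalous(weekday: int, hour: int, volume: float,
                 factor: float = 2.0) -> bool:
    """Flag if this hour's volume exceeds `factor` times the average for
    the same weekday/hour slot in past weeks. The 2x threshold is an
    illustrative choice (cf. the lunch-break example in the text)."""
    past = history[(weekday, hour)]
    if not past:
        return False  # no baseline yet
    baseline = sum(past) / len(past)
    return volume > factor * baseline

# Mondays at 12:00 typically see ~10 GB; 25 GB would be flagged.
for gb in [9.5, 10.2, 10.0, 9.8]:
    observe(0, 12, gb)
print(is_anomalous(0, 12, 11.0))  # False
print(is_anomalous(0, 12, 25.0))  # True
```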
E. Difficulties with Evaluation
For an anomaly detection system, a thorough evaluation is particularly crucial to perform, as experience shows that many promising approaches turn out in practice to fall short of one’s expectations. That said, devising sound evaluation schemes is not easy, and in fact turns out to be more difficult than building the detector itself. Due to the opacity of the detection process, the results of an anomaly detection system are harder to predict than for a misuse detector. We discuss evaluation challenges in terms of the difficulties for (i) finding the right data, and then (ii) interpreting results.

1) Difficulties of Data: Arguably the most significant challenge an evaluation faces is the lack of appropriate public datasets for assessing anomaly detection systems. In other domains, we often find either standardized test suites available, or the possibility to collect an appropriate corpus, or both. For example, for automatic language translation “a large training set of the input-output behavior that we seek to automate is available to us in the wild” [37]. For spam detectors, dedicated “spam feeds” [38] provide large collections of spam free of privacy concerns. Getting suitable collections of “ham” is more difficult, however even a small number of private mail archives can already yield a large corpus [39]. For OCR, sophisticated methods have been devised to generate ground truth automatically [40]. In our domain, however, we often have neither standardized test sets, nor any appropriate, readily available data.

The two publicly available datasets that have provided something of a standardized setting in the past—the DARPA/Lincoln Labs packet traces [41], [42] and the KDD Cup dataset derived from them [43]—are now a decade old, and no longer adequate for any current study. The DARPA dataset contains multiple weeks of network activity from a simulated Air Force network, generated in 1998 and refined in 1999. Not only is this data synthetic, and no longer even close to reflecting contemporary attacks, but it also has been so extensively studied over the years that most members of the intrusion detection community deem it wholly uninteresting if a NIDS now reliably detects the attacks it contains. (Indeed, the DARPA data faced pointed criticisms not long after its release [44], particularly regarding the degree to which simulated data can be appropriate for the evaluation of a NIDS.) The KDD dataset represents a distillation of the DARPA traces into features for machine learning. Not only does it inherit the shortcomings of the DARPA data, but the features have also turned out to exhibit unfortunate artifacts [45].

Given the lack of publicly available data, it is natural to ask why we find such a striking gap in our community.^5 The primary reason clearly arises from the data’s sensitive nature: the inspection of network traffic can reveal highly sensitive information, including confidential or personal communications, an organization’s business secrets, or its users’ network access patterns. Any breach of such information can prove catastrophic not only for the organization itself, but also for affected third parties. It is understandable that in the face of such high risks, researchers frequently encounter insurmountable organizational and legal barriers when they attempt to provide datasets to the community.

Given this difficulty, researchers have pursued two alternative routes in the past: simulation and anonymization. As demonstrated by the DARPA dataset, network traffic generated by simulation can have the major benefit of being free of sensitivity concerns. However, Internet traffic

^5 We note that the lack of public network data is not limited to the intrusion detection domain. We see effects similar to the overuse of the DARPA dataset in empirical network research: the ClarkNet-HTTP [46] dataset contains two weeks’ worth of HTTP requests to ClarkNet’s web server, recorded in 1995. While researchers at ClarkNet stopped using these logs for their own studies in 1997, in total researchers have used the traces for evaluations in more than 90 papers published between 1995 and 2007—13 of these in 2007 [47]!


References (cited above)

[3] G. Linden, B. Smith, and J. York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7(1), 2003.
[16] V. Chandola, A. Banerjee, and V. Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 2009.
[21] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Frequently Asked Questions

Q1. What are the contributions in "Outside the Closed World: On Using Machine Learning for Network Intrusion Detection"?

The authors examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. They support their claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection. The authors hope for this discussion to contribute to strengthening future research on anomaly detection by pinpointing the fundamental challenges it faces.

Other key points extracted from the paper:

  • Mediated trace access can be a viable strategy [64]: rather than bringing the data to the experimenter, bring the experiment to the data, i.e., researchers send their analysis programs to data providers who then run them on their behalf and return the output.
  • For an anomaly detection system, the natural strategy to address site-specifics is having the system “learn” them during training with normal traffic.
  • Due to the opacity of the detection process, the results of an anomaly detection system are harder to predict than for a misuse detector.
  • The basic challenge with regard to the semantic gap is understanding how the features the anomaly detection system operates on relate to the semantics of the network environment.
  • Arguably the most significant challenge an evaluation faces is the lack of appropriate public datasets for assessing anomaly detection systems.
  • The most convincing real-world test of any anomaly detection system is to solicit feedback from operators who run the system in their network.
  • Despite intensive efforts [52], [53], publishing such datasets has garnered little traction to date, mostly one suspects for the fear that information can still leak.