
Data mining approaches for intrusion detection

26 Jan 1998-pp 6-6
Abstract: In this paper we discuss our research in developing general and systematic methods for intrusion detection. The key ideas are to use data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior, and use the set of relevant system features to compute (inductively learned) classifiers that can recognize anomalies and known intrusions. Using experiments on the sendmail system call data and the network tcpdump data, we demonstrate that we can construct concise and accurate classifiers to detect anomalies. We provide an overview on two general data mining algorithms that we have implemented: the association rules algorithm and the frequent episodes algorithm. These algorithms can be used to compute the intra-and inter-audit record patterns, which are essential in describing program or user behavior. The discovered patterns can guide the audit data gathering process and facilitate feature selection. To meet the challenges of both efficient learning (mining) and real-time detection, we propose an agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated (detection) models to the detection agents.

Summary (4 min read)

1 Introduction

  • As network-based computer systems play increasingly vital roles in modern society, they have become the targets of their enemies and criminals.
  • Intrusion detection techniques can be categorized into misuse detection, which uses patterns of well-known attacks or weak spots of the system to identify intrusions; and anomaly detection, which tries to determine whether deviation from the established normal usage patterns can be flagged as intrusions.
  • Misuse detection systems, for example [KS95] and STAT [IKP95], encode and match the sequence of “signature actions” (e.g., change the ownership of a file) of known intrusion scenarios.
  • Section 4 briefly highlights the architecture of their proposed intrusion detection system.

2 Building Classification Models

  • In this section the authors describe in detail their experiments in constructing classification models for anomaly detection.
  • A flaw in the finger daemon allows the attacker to use “buffer overflow” to trick the program to execute his malicious code.
  • Forrest et al. discovered that the short sequences of system calls made by a program during its normal executions are very consistent, yet different from the sequences of its abnormal executions as well as the executions of other programs.
  • Stephanie Forrest provided us with a set of traces of the sendmail program used in her experiments [FHSL96].
  • The number “5” represents system call open.

2.1.2 Learning to Classify System Call Sequences

  • In order for a machine learning program to learn the classification models of the “normal” and “abnormal” system call sequences, the authors need to supply it with a set of training data containing pre-labeled “normal” and “abnormal” sequences.
  • See Table 1 for an example of the labeled sequences.
  • RIPPER outputs a set of if-then rules for the “minority” classes, and a default “true” rule for the remaining class.
  • The authors conjectured that a set of specific rules for normal sequences can be used as the “identity” of a program, and thus can be used to detect any known and unknown intrusions (anomaly intrusion detection).

2.1.3 Learning to Predict System Calls

  • Unlike the experiments in Section 2.1.2 which required abnormal traces in the training data, here the authors wanted to study how to compute an anomaly detector given just the normal traces.
  • The learning tasks were formulated as follows.
  • If a violation occurs (the actual system call is not the same as predicted by the rule), the “score” of the trace is incremented by 100 times the confidence of the violated rule.
  • Table 3 shows the results of the following experiments: Experiment A: predict the 11th system call; Experiment B: predict the middle system call in a sequence of length 7; Experiment C: predict the middle system call in a sequence of length 11; Experiment D: predict the 7th system call.

2.1.4 Discussion

  • The authors' experiments showed that the normal behavior of a program execution can be established and used to detect its anomalous usage.
  • Here the authors show that a machine learning program, RIPPER, was able to generalize the system call sequence information, from 80% of the normal sequences, to a set of concise and accurate rules (the rule sets have 200 to 280 rules, and each rule has 2 or 3 attribute tests).
  • The authors need to search for a more predictive classification model so that the anomaly detector has higher confidence in flagging intrusions.
  • The directories and the names of the files touched by a program can be used.
  • For the purposes of the shootout, filters were used so that tcpdump only collected Internet Transmission Control Protocol (TCP) and Internet User Datagram Protocol (UDP) packets.

2.2.2 Data Pre-processing

  • The authors developed a script to scan each tcpdump data file and extract the “connection” level information about the network traffic.
  • Since UDP is connectionless (no connection state), the authors simply treat each packet as a connection.
  • A connection record, in preparation for data mining, now has the following fields: start time, duration, participating hosts, ports, the statistics of the connection (e.g., bytes sent in each direction, resent rate, etc.), flag ("normal" or one of the recorded connection/termination errors), and protocol (TCP or UDP).
  • The authors call the host that initiates the connection, i.e., the one that sends the first SYN, the source, and the other the destination.
  • Depending on the direction from the source to the destination, a connection is in one of the three types: out-going - from the LAN to the external networks; in-coming - from the external networks to the LAN; and inter-LAN - within the LAN.
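The direction test described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code; the LAN prefix and dotted-quad hostnames are invented stand-ins for however the monitored network is actually identified:

```python
def connection_type(src_host, dst_host, lan_prefix="192.168.1."):
    """Classify a connection by its direction relative to the LAN.

    The source is the host that sent the first SYN; lan_prefix is a
    stand-in for the test for membership in the monitored LAN.
    """
    src_in = src_host.startswith(lan_prefix)
    dst_in = dst_host.startswith(lan_prefix)
    if src_in and dst_in:
        return "inter-LAN"          # both endpoints inside the LAN
    return "out-going" if src_in else "in-coming"

print(connection_type("192.168.1.5", "10.0.0.7"))   # LAN -> external
```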

2.2.3 Experiments and Results

  • For each type of the connections, the authors formulated the classification experiments as follows: each record uses the destination service (port) as the class label, and all the other features as attributes.
  • The process (training and testing) is repeated 5 times, each time using a different 80% of the normal data as the training data (and accordingly the different remaining 20% of the normal data as part of the test data), and the averaged accuracy of the classifiers from the 5 runs is reported.
  • The authors again applied RIPPER to the connection data.
  • The results from the first round of experiments, as shown in Table 4, were not very good: the differences in the misclassification rates of the normal and intrusion data were small, except for the inter-LAN traffic of some intrusions.
  • These additional temporal-statistical features provide additional information of the network activity from a continuous perspective, and provide more insight into anomalies.
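The repeated 80%/20% evaluation described above can be sketched as follows. This is a minimal illustration; `build_and_eval` and the toy evaluator are hypothetical stand-ins for training a RIPPER classifier and measuring its accuracy:

```python
import random

def repeated_holdout(normal_data, build_and_eval, runs=5, frac=0.8):
    """Average accuracy over `runs` different 80%/20% splits of the
    normal data; build_and_eval(train, test) returns an accuracy."""
    accs = []
    for seed in range(runs):
        data = normal_data[:]
        random.Random(seed).shuffle(data)   # a different split each run
        cut = int(len(data) * frac)
        accs.append(build_and_eval(data[:cut], data[cut:]))
    return sum(accs) / runs

# Toy evaluator: "accuracy" is the fraction of test items seen in training
def toy_eval(train, test):
    seen = set(train)
    return sum(t in seen for t in test) / len(test)

avg = repeated_holdout(list(range(10)) * 2, toy_eval)
```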

2.2.4 Discussion

  • The authors learned some important lessons from the experiments on the tcpdump data.
  • First, when the collected data is not designed specifically for security purposes or cannot be used directly to build a detection model, a considerable amount of data pre-processing is required.
  • Many trials were attempted before the authors came up with the current set of features and time intervals.
  • The authors need useful tools that can provide insight into the patterns that may be exhibited in the data.
  • Second, the authors should provide tools that can help administrative staff understand the nature of the anomalies.

2.3 Combining Multiple Classifiers

  • The classifiers described in this section each model a single aspect of the system behavior.
  • They are what the authors call the base (single level) classifiers.
  • A priority in their research plan is to study and experiment with (inductively learned) classification models that combine evidence from multiple (base) detection models.
  • The authors' research activities in JAM [SPT+97], which focus on the accuracy and efficiency of meta classifiers, will contribute significantly to their effort in building meta detection models.

3 Mining Patterns from Audit Data

  • In order to construct an accurate base classifier, the authors need to gather a sufficient amount of training data and identify a set of meaningful features.
  • Both of these tasks require insight into the nature of the audit data, and can be very difficult without proper tools and guidelines.
  • In this section the authors describe some algorithms that can address these needs.
  • Here the authors use the term “audit data” to refer to general data streams that have been properly processed for detection purposes.
  • An example of such data streams is the connection record data extracted from the raw tcpdump output.

3.1 Association Rules

  • The goal of mining association rules is to derive multifeature correlations from a database table.
  • The motivation for applying the association rules algorithm to audit data are: Audit data can be formatted into a database table where each row is an audit record and each column is a field (system feature) of the audit records;.
  • One of the reasons that "program policies", which codify the access rights of privileged programs, are concise and capable of detecting known attacks [KFL94] is that the intended behavior of a program, e.g., read and write files from certain directories with specific permissions, is very consistent.
  • The authors can continuously merge the rules from a new run to the aggregate rule set (of all previous runs).
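As a concrete, hypothetical illustration of the association-rule idea, the support and confidence of one candidate rule can be computed over a table of audit records. The record fields and values below are invented for the example, not taken from the paper:

```python
def rule_stats(records, lhs, rhs):
    """Support and confidence of the association rule lhs -> rhs over
    a table of audit records (each record is a dict feature -> value)."""
    match = lambda r, cond: all(r.get(k) == v for k, v in cond.items())
    n_lhs = sum(match(r, lhs) for r in records)          # records matching lhs
    n_both = sum(match(r, {**lhs, **rhs}) for r in records)  # matching lhs and rhs
    support = n_both / len(records)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence

# Invented audit records: one row per audit event
records = [
    {"program": "sendmail", "op": "read",  "path": "/etc/aliases"},
    {"program": "sendmail", "op": "read",  "path": "/etc/aliases"},
    {"program": "sendmail", "op": "write", "path": "/var/spool"},
    {"program": "login",    "op": "read",  "path": "/etc/passwd"},
]
s, c = rule_stats(records, {"program": "sendmail"}, {"op": "read"})
print(s, c)  # support 0.5, confidence 2/3
```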

3.2 Frequent Episodes

  • While the association rules algorithm seeks to find intra-audit record patterns, the frequent episodes algorithm, as described in [MTV95], can be used to discover inter-audit record patterns.
  • A frequent episode is a set of events that occur frequently within a time window (of a specified length).
  • The authors seek to apply the frequent episodes algorithm to analyze audit trails since there is evidence that the sequence information in program executions and user commands can be used to build profiles for anomaly detection [FHSL96, LB97].
  • The authors' implementation followed the description in [MTV95].
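The windowing idea behind frequent episodes can be sketched as below. This is a drastic simplification of the algorithm in [MTV95] (no serial/parallel episode distinction, fixed non-overlapping windows), intended only to show how often a set of event types co-occurs within a time window:

```python
def episode_frequency(events, episode, win):
    """Fraction of time windows of width `win` that contain every event
    type in `episode`.  events: time-ordered list of (time, event) pairs."""
    if not events:
        return 0.0
    t0, t1 = events[0][0], events[-1][0]
    windows = hits = 0
    t = t0
    while t <= t1:
        windows += 1
        in_win = {e for (ts, e) in events if t <= ts < t + win}
        if set(episode) <= in_win:   # every episode event appears in the window
            hits += 1
        t += win
    return hits / windows

# Invented event stream: (timestamp in seconds, event type)
events = [(0, "syn"), (1, "ack"), (2, "data"), (10, "syn"), (11, "ack")]
print(episode_frequency(events, ["syn", "ack"], win=5))
```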

3.3 Using the Discovered Patterns

  • The association rules and frequent episodes can be used to guide the audit process.
  • The authors run a program many times and under different settings.
  • Such a support system can speed up the iterative feature selection process, and help ensure the accuracy of a detection model.
  • In Figure 2 the authors see that the number of frequent episodes (raw episodes or serial episode rules) increases sharply as win goes from 2s to 30s, and then gradually stabilizes (note that by the nature of the frequent episodes algorithm, the number of episodes can only increase as win increases).

4 Architecture Support

  • The biggest challenge of using data mining approaches in intrusion detection is that it requires a large amount of audit data in order to compute the profile rule sets.
  • When a new version of system software arrives, the authors need to update the "normal" profile rules.
  • A learning agent, which may reside in a server machine for its computing power, is responsible for computing and maintaining the rule sets for programs and users.
  • A detection agent is generic and extensible.
  • In a network environment, a meta agent can combine reports from (base) detection agents running on each host, and make the final assertion on the state of the network.


Data Mining Approaches for Intrusion Detection
Wenke Lee Salvatore J. Stolfo
Computer Science Department
Columbia University
500 West 120th Street, New York, NY 10027
{wenke,sal}@cs.columbia.edu
Abstract
In this paper we discuss our research in developing general and systematic methods for intrusion detection. The key ideas are to use data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior, and use the set of relevant system features to compute (inductively learned) classifiers that can recognize anomalies and known intrusions. Using experiments on the sendmail system call data and the network tcpdump data, we demonstrate that we can construct concise and accurate classifiers to detect anomalies. We provide an overview on two general data mining algorithms that we have implemented: the association rules algorithm and the frequent episodes algorithm. These algorithms can be used to compute the intra- and inter-audit record patterns, which are essential in describing program or user behavior. The discovered patterns can guide the audit data gathering process and facilitate feature selection. To meet the challenges of both efficient learning (mining) and real-time detection, we propose an agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated (detection) models to the detection agents.
1 Introduction
As network-based computer systems play increasingly vital roles in modern society, they have become the targets of our enemies and criminals. Therefore, we need to find the best ways possible to protect our systems. The security of a computer system is compromised when an intrusion takes place. An intrusion can be defined [HLMS90] as "any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource". Intrusion prevention techniques, such as user authentication (e.g., using passwords or biometrics), avoiding programming errors, and information protection (e.g., encryption) have been used to protect computer systems as a first line of defense. Intrusion prevention alone is not sufficient because as systems become ever more complex, there are always exploitable weaknesses in the systems due to design and programming errors, or various "socially engineered" penetration techniques. For example, after it was first reported many years ago, exploitable "buffer overflow" still exists in some recent system software due to programming errors. The policies that balance convenience versus strict control of a system and information access also make it impossible for an operational system to be completely secure.

(This research is supported in part by grants from DARPA (F30602-96-1-0311) and NSF (IRI-96-32225 and CDA-96-25374).)

Intrusion detection is therefore needed as another wall to protect computer systems. The elements central to intrusion detection are: resources to be protected in a target system, i.e., user accounts, file systems, system kernels, etc.; models that characterize the "normal" or "legitimate" behavior of these resources; techniques that compare the actual system activities with the established models, and identify those that are "abnormal" or "intrusive".

Many researchers have proposed and implemented different models which define different measures of system behavior, with an ad hoc presumption that normalcy and anomaly (or illegitimacy) will be accurately manifested in the chosen set of system features that are modeled and measured. Intrusion detection techniques can be categorized into misuse detection, which uses patterns of well-known attacks or weak spots of the system to identify intrusions; and anomaly detection, which tries to determine whether deviation from the established normal usage patterns can be flagged as intrusions.
Misuse detection systems, for example [KS95] and STAT [IKP95], encode and match the sequence of "signature actions" (e.g., change the ownership of a file) of known intrusion scenarios. The main shortcomings of such systems are: known intrusion patterns have to be hand-coded into the system; they are unable to detect any future (unknown) intrusions that have no matched patterns stored in the system.

Anomaly detection (sub)systems, such as IDES [LTG+92], establish normal usage patterns (profiles) using statistical measures on system features, for example, the CPU and I/O activities by a particular user or program. The main difficulties of these systems are: intuition and experience are relied upon in selecting the system features, which can vary greatly among different computing environments; some intrusions can only be detected by studying the sequential interrelation between events because each event alone may fit the profiles.

Our research aims to eliminate, as much as possible, the manual and ad hoc elements from the process of building an intrusion detection system. We take a data-centric point of view and consider intrusion detection as a data analysis process. Anomaly detection is about finding the normal usage patterns from the audit data, whereas misuse detection is about encoding and matching the intrusion patterns using the audit data. The central theme of our approach is to apply data mining techniques to intrusion detection. Data mining generally refers to the process of (automatically) extracting models from large stores of data [FPSS96]. The recent rapid development in data mining has made available a wide variety of algorithms, drawn from the fields of statistics, pattern recognition, machine learning, and databases. Several types of algorithms are particularly relevant to our research:

Classification: maps a data item into one of several predefined categories. These algorithms normally output "classifiers", for example, in the form of decision trees or rules. An ideal application in intrusion detection will be to gather sufficient "normal" and "abnormal" audit data for a user or a program, then apply a classification algorithm to learn a classifier that will determine (future) audit data as belonging to the normal class or the abnormal class;

Link analysis: determines relations between fields in the database. Finding out the correlations in audit data will provide insight for selecting the right set of system features for intrusion detection;

Sequence analysis: models sequential patterns. These algorithms can help us understand what (time-based) sequences of audit events are frequently encountered together. These frequent event patterns are important elements of the behavior profile of a user or program.
We are developing a systematic framework for designing, developing and evaluating intrusion detection systems. Specifically, the framework consists of a set of environment-independent guidelines and programs that can assist a system administrator or security officer to

  • select appropriate system features from audit data to build models for intrusion detection;
  • architect a hierarchical detector system from component detectors;
  • update and deploy new detection systems as needed.

The key advantage of our approach is that it can automatically generate concise and accurate detection models from large amounts of audit data. The methodology itself is general and mechanical, and therefore can be used to build intrusion detection systems for a wide variety of computing environments.

The rest of the paper is organized as follows: Section 2 describes our experiments in building classification models for sendmail and network traffic. Section 3 presents the association rules and frequent episodes algorithms that can be used to compute a set of patterns from audit data. Section 4 briefly highlights the architecture of our proposed intrusion detection system. Section 5 outlines our future research plans.
2 Building Classification Models
In this section we describe in detail our experiments in constructing classification models for anomaly detection. The first set of experiments, first reported in [LSC97], is on the sendmail system call data, and the second is on the network tcpdump data.
2.1 Experiments on sendmail Data
There have been a lot of attacks on computer systems that are carried out as exploitations of the design and programming errors in privileged programs, those that can run as root. For example, a flaw in the finger daemon allows the attacker to use "buffer overflow" to trick the program to execute his malicious code. Recent research efforts by Ko et al. [KFL94] and Forrest et al. [FHSL96] attempted to build intrusion detection systems that monitor the execution of privileged programs and detect attacks on their vulnerabilities. Forrest et al. discovered that the short sequences of system calls made by a program during its normal executions are very consistent, yet different from the sequences of its abnormal (exploited) executions as well as the executions of other programs. Therefore a database containing these normal sequences can be used as the "self" definition of the normal behavior of a program, and as the basis to detect anomalies. Their findings motivated us to search for simple and accurate intrusion detection models.

Stephanie Forrest provided us with a set of traces of the sendmail program used in her experiments [FHSL96]. We applied machine learning techniques to produce classifiers that can distinguish the exploits from the normal runs.
2.1.1 The sendmail System Call Traces
The procedure for generating the sendmail traces was detailed in [FHSL96]. Briefly, each file of the trace data has two columns of integers: the first is the process ids and the second is the system call "numbers". These numbers are indices into a lookup table of system call names. For example, the number "5" represents system call open. The set of traces includes:

  • Normal traces: a trace of the sendmail daemon and a concatenation of several invocations of the sendmail program;
  • Abnormal traces: 3 traces of the sscp (sunsendmailcp) attacks, 2 traces of the syslog-remote attacks, 2 traces of the syslog-local attacks, 2 traces of the decode attacks, 1 trace of the sm5x attack and 1 trace of the sm565a attack. These are the traces of (various kinds of) abnormal runs of the sendmail program.
2.1.2 Learning to Classify System Call Sequences
In order for a machine learning program to learn the classification models of the "normal" and "abnormal" system call sequences, we need to supply it with a set of training data containing pre-labeled "normal" and "abnormal" sequences. We use a sliding window to scan the normal traces and create a list of unique sequences of system calls. We call this list the "normal" list. Next, we scan each of the intrusion traces. For each sequence of system calls, we first look it up in the normal list. If an exact match can be found then the sequence is labeled as "normal". Otherwise it is labeled as "abnormal" (note that the data gathering process described in [FHSL96] ensured that the normal traces include nearly all possible "normal" short sequences of system calls, as new runs of sendmail failed to generate new sequences). Needless to say, all sequences in the normal traces are labeled as "normal". See Table 1 for an example of the labeled sequences. It should be noted that an intrusion trace contains many normal sequences in addition to the abnormal sequences since the illegal activities only occur in some places within a trace.

System Call Sequences (length 7)    Class Labels
4 2 66 66 4 138 66                  "normal"
...                                 ...
5 5 5 4 59 105 104                  "abnormal"
...                                 ...

Table 1: Pre-labeled System Call Sequences of Length 7
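The sliding-window labeling procedure described above can be sketched as follows. This is a simplified illustration, not the authors' code; traces are given as plain lists of system call numbers:

```python
def label_sequences(normal_traces, intrusion_trace, n=7):
    """Label each length-n sequence of an intrusion trace: "normal" if it
    also occurs somewhere in the normal traces, otherwise "abnormal"."""
    # Build the "normal" list: all unique length-n sequences in normal traces
    normal_list = set()
    for trace in normal_traces:
        for i in range(len(trace) - n + 1):
            normal_list.add(tuple(trace[i:i + n]))
    # Scan the intrusion trace with the same sliding window
    labeled = []
    for i in range(len(intrusion_trace) - n + 1):
        seq = tuple(intrusion_trace[i:i + n])
        label = "normal" if seq in normal_list else "abnormal"
        labeled.append((seq, label))
    return labeled

# Toy traces of system call numbers (e.g., 5 is open)
normal = [[4, 2, 66, 66, 4, 138, 66, 4, 2]]
intrusion = [4, 2, 66, 66, 4, 138, 66, 5, 5]
result = label_sequences(normal, intrusion, n=7)
print(result[0][1])  # the first window also occurs in the normal traces
```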
We applied RIPPER [Coh95], a rule learning program, to our training data. The following learning tasks were formulated to induce the rule sets for normal and abnormal system call sequences:

  • Each record has n positional attributes, p1, p2, ..., pn, one for each of the system calls in a sequence of length n; plus a class label, "normal" or "abnormal";
  • The training data is composed of normal sequences taken from 80% of the normal traces, plus the abnormal sequences from 2 traces of the sscp attacks, 1 trace of the syslog-local attack, and 1 trace of the syslog-remote attack;
  • The testing data includes both normal and abnormal traces not used in the training data.

RIPPER outputs a set of if-then rules for the "minority" classes, and a default "true" rule for the remaining class. The following exemplar RIPPER rules were generated from the system call data:
normal :- p2 = 104, p7 = 112. [meaning: if p2 is 104 (vtimes) and p7 is 112 (vtrace) then the sequence is "normal"]

normal :- p6 = 19, p7 = 105. [meaning: if p6 is 19 (lseek) and p7 is 105 (sigvec) then the sequence is "normal"]

...

abnormal :- true. [meaning: if none of the above, the sequence is "abnormal"]
These RIPPER rules can be used to predict whether a sequence is "abnormal" or "normal". But what the intrusion detection system needs to know is whether the trace being analyzed is an intrusion or not. We use the following post-processing scheme to detect whether a given trace is an intrusion based on the RIPPER predictions of its constituent sequences:

1. Use a sliding window of length 2l + 1, e.g., 7, 9, 11, 13, etc., and a sliding (shift) step of l, to scan the predictions made by the RIPPER rules on system call sequences.

2. For each of the (length 2l + 1) regions of RIPPER predictions generated in Step 1, if more than l predictions are "abnormal" then the current region of predictions is an "abnormal" region. (Note that l is an input parameter.)

3. If the percentage of abnormal regions is above a threshold value, say 2%, then the trace is an intrusion.
This scheme is an attempt to filter out the spurious prediction errors. The intuition behind this scheme is that when an intrusion actually occurs, the majority of adjacent system call sequences are abnormal; whereas the prediction errors tend to be isolated and sparse. In [FHSL96], the percentage of the mismatched sequences (out of the total number of matches (lookups) performed for the trace) is used to distinguish normal from abnormal. The "mismatched" sequences are the abnormal sequences in our context. Our scheme is different in that we look for abnormal regions that contain more abnormal sequences than normal ones, and calculate the percentage of abnormal regions (out of the total number of regions). Our scheme is more sensitive to the temporal information, and is less sensitive to noise (errors).
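The three-step post-processing scheme can be sketched as below. This is an illustrative reading of the scheme; the value of l is an example and the 2% threshold follows the text, but the details of region handling at trace boundaries are simplified:

```python
def trace_is_intrusion(predictions, l=4, threshold=0.02):
    """Post-process per-sequence RIPPER predictions into a per-trace
    decision.  predictions: list of "normal"/"abnormal" labels.

    Scan with a window of length 2l+1 and shift step l; a region is
    abnormal if more than l of its predictions are abnormal; the trace
    is an intrusion if the fraction of abnormal regions exceeds the
    threshold (e.g., 2%).
    """
    w = 2 * l + 1
    regions = abnormal = 0
    for i in range(0, max(len(predictions) - w + 1, 1), l):
        region = predictions[i:i + w]
        regions += 1
        if sum(p == "abnormal" for p in region) > l:
            abnormal += 1
    return abnormal / regions > threshold

# A burst of adjacent abnormal predictions, as during a real intrusion
preds = ["normal"] * 50 + ["abnormal"] * 9 + ["normal"] * 50
print(trace_is_intrusion(preds))
```

The key property is that a contiguous burst of abnormal predictions yields abnormal regions, while the same number of abnormal predictions scattered through the trace does not.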
RIPPER only outputs rules for the "minority" class. For example, in our experiments, if the training data has fewer abnormal sequences than normal ones, the output RIPPER rules can be used to identify abnormal sequences, and the default (everything else) prediction is normal. We conjectured that a set of specific rules for normal sequences can be used as the "identity" of a program, and thus can be used to detect any known and unknown intrusions (anomaly intrusion detection), whereas having only the rules for abnormal sequences gives us the capability to identify only known intrusions (misuse intrusion detection).
           % abn.     % abn. in experiment
Traces     [FHSL96]   A     B     C     D
sscp-1     5.2        41.9  32.2  40.0  33.1
sscp-2     5.2        40.4  30.4  37.6  33.3
sscp-3     5.2        40.4  30.4  37.6  33.3
syslog-r-1 5.1        30.8  21.2  30.3  21.9
syslog-r-2 1.7        27.1  15.6  26.8  16.5
syslog-l-1 4.0        16.7  11.1  17.0  13.0
syslog-l-2 5.3        19.9  15.9  19.8  15.9
decode-1   0.3        4.7   2.1   3.1   2.1
decode-2   0.3        4.4   2.0   2.5   2.2
sm565a     0.6        11.7  8.0   1.1   1.0
sm5x       2.7        17.7  6.5   5.0   3.0
sendmail   0          1.0   0.1   0.2   0.3

Table 2: Comparing Detection of Anomalies. The column [FHSL96] is the percentage of the abnormal sequences of the traces. Columns A, B, C, and D are the percentages of abnormal regions (as measured by the post-processing scheme) of the traces. sendmail is the 20% normal traces not used in the training data. Traces in bold were included in the training data; the other traces were used as testing data only.
We compare the results of the following experiments that have different distributions of abnormal versus normal sequences in the training data:

Experiment A: 46% normal and 54% abnormal, sequence length is 11;

Experiment B: 46% normal and 54% abnormal, sequence length is 7;

Experiment C: 46% abnormal and 54% normal, sequence length is 11;

Experiment D: 46% abnormal and 54% normal, sequence length is 7.
Table 2 shows the results of using the classifiers from these experiments to analyze the traces. We report here the percentage of abnormal regions (as measured by our post-processing scheme) of each trace, and compare our results with those of Forrest et al., as reported in [FHSL96].

From Table 2, we can see that in general, intrusion traces generate much larger percentages of abnormal regions than the normal traces. We call these measured percentages the "scores" of the traces. In order to establish a threshold score for identifying intrusion traces, it is desirable that there is a sufficiently large gap between the scores of the normal sendmail traces and the low-end scores of the intrusion traces. Comparing experiments that used the same sequence length, we observe that such a gap in A, 3.4, is larger than the gap in C, 0.9; and 1.9 in B is larger than 0.7 in D. The RIPPER rules from experiments A and B describe the patterns of the normal sequences. Here the results show that these rules can be used to identify the intrusion traces, including those not seen in the training data, namely, the decode traces, the sm565a and sm5x traces. This confirms our conjecture that rules for normal patterns can be used for anomaly detection. The RIPPER rules from experiments C and D specify the patterns of abnormal sequences in the intrusion traces included in the training data. The results indicate that these rules are very capable of detecting the intrusion traces of the "known" types (those seen in the training data), namely, the sscp-3 trace, the syslog-remote-2 trace and the syslog-local-2 trace. But compared with the rules from A and B, the rules in C and D perform poorly on intrusion traces of "unknown" types. This confirms our conjecture that rules for abnormal patterns are good for misuse intrusion detection, but may not be as effective in detecting future ("unknown") intrusions.

The results from Forrest et al. showed that their method required a very low threshold in order to correctly detect the decode and sm565a intrusions. While the results here show that our approach generated much stronger "signals" of anomalies from the intrusion traces, it should be noted that their method used all of the normal traces, but not any of the intrusion traces, in training.
2.1.3 Learning to Predict System Calls
Unlike the experiments in Section 2.1.2 which required abnormal traces in the training data, here we wanted to study how to compute an anomaly detector given just the normal traces. We conducted experiments to learn the (normal) correlation among system calls: the nth system calls or the middle system calls in (normal) sequences of length n.

The learning tasks were formulated as follows:

  • Each record has n-1 positional attributes, p1, p2, ..., pn-1, each being a system call; plus a class label, the system call of the nth position or the middle position;
  • The training data is composed of (normal) sequences taken from 80% of the normal sendmail traces;
  • The testing data is the traces not included in the training data, namely, the remaining 20% of the normal sendmail traces and all the intrusion traces.
RIPPER outputs rules in the following form:

38 :- p3 = 40, p4 = 4. [meaning: if p3 is 40 (lstat) and p4 is 4 (write), then the 7th system call is 38 (stat).]

...

5 :- true. [meaning: if none of the above, then the 7th system call is 5 (open).]
Each of these RIPPER rules has some “confidence” in-
formation: the number of matched examples (records
that conform to the rule) and the number of unmatched
examples (records that are in conflict with the rule) in
the training data. For example, the rule for “38 (stat)”
covers 12 matched examples and 0 unmatched examples.
We measure the confidence value of a rule as the num-
ber of matched examples divided by the sum of matched
and unmatched examples. These rules can be used to an-
alyze a trace by examining each sequence of the trace. If
a violation occurs (the actual system call is not the same
as predicted by the rule), the “score” of the trace is in-
cremented by 100 times the confidence of the violated
rule. For example, if a sequence in the trace has p_3 = 40
and p_4 = 4, but p_7 = 44 instead of 38, the total score
of the trace is incremented by 100 since the confidence
value of this violated rule is 1. The averaged score (over
the total number of sequences) of the trace is then used
to decide whether an intrusion has occurred.
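The confidence-weighted scoring just described can be sketched as follows. The rule encoding is our own, and the default rule's example counts (80 and 20) are hypothetical; only the lstat/write rule's counts, 12 and 0, come from the text:

```python
# Each rule: (conditions, predicted_call, matched, unmatched).
# Conditions map a 0-indexed window position to a required call number,
# so {2: 40, 3: 4} encodes "p_3 = 40 (lstat) and p_4 = 4 (write)".
rules = [
    ({2: 40, 3: 4}, 38, 12, 0),  # -> 7th call is 38 (stat); counts from the text
    ({}, 5, 80, 20),             # default: 5 (open); counts are hypothetical
]

def confidence(matched, unmatched):
    # Fraction of covered training records that conform to the rule.
    return matched / (matched + unmatched)

def score_trace(trace, rules, n=7, label_pos=6):
    """Average per-sequence penalty: each violated rule adds
    100 * its confidence; the mean over all windows is the score."""
    total, count = 0.0, 0
    for i in range(len(trace) - n + 1):
        window = trace[i:i + n]
        for conds, predicted, m, u in rules:  # first matching rule fires
            if all(window[p] == v for p, v in conds.items()):
                if window[label_pos] != predicted:
                    total += 100 * confidence(m, u)
                break
        count += 1
    return total / count if count else 0.0

# The paper's example: p_3 = 40 and p_4 = 4, but p_7 = 44 instead of 38.
print(score_trace([1, 2, 40, 4, 9, 9, 44], rules))  # 100.0
```

A trace is then flagged as intrusive when this averaged score exceeds a chosen threshold.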
Table 3 shows the results of the following experiments:
Experiment A: predict the 11th system call;
Experiment B: predict the middle system call in a se-
quence of length 7;
Experiment C: predict the middle system call in a se-
quence of length 11;
Experiment D: predict the 7th system call.
We can see from Table 3 that the RIPPER rules from
experiments A and B are effective because the gaps be-
tween the score of normal sendmail and the low-end
scores of the intrusion traces, 3.9 and 3.3 respectively,
are large enough. However, the rules from C and D per-
form poorly. Since C predicts the middle system call of
a sequence of length 11 and D predicts the 7th system
call, we reason that the training data (the normal traces)
has no stable patterns for the 6th or 7th position in sys-
tem call sequences.

References

- William W. Cohen. Fast Effective Rule Induction. In Proceedings of the 12th International Conference on Machine Learning, July 1995. Evaluates the IREP rule learner on a large collection of benchmark problems and proposes RIPPERk, which is competitive with C4.5rules in error rate while scaling nearly linearly with the number of training examples.
- Vern Paxson. Bro: A System for Detecting Network Intruders in Real-Time. In Proceedings of the 7th USENIX Security Symposium, January 1998. Describes a stand-alone system that detects network intruders in real time by passively monitoring traffic, separating an event engine from a policy script interpreter.
- Stephanie Forrest, Steven A. Hofmeyr, Anil Somayaji, and Thomas A. Longstaff. A Sense of Self for Unix Processes. In Proceedings of the 1996 IEEE Symposium on Security and Privacy, May 1996. Defines “normal” by short-range correlations in a process's system calls and detects several common intrusions involving sendmail and lpr.
- Ramakrishnan Srikant and Rakesh Agrawal. Mining Generalized Association Rules. 1995. Presents a taxonomy-aware interest measure for association rules that, given a user-specified minimum interest level, prunes a large number of redundant rules.