Report•DOI•

Data mining approaches for intrusion detection

Wenke Lee¹, Salvatore J. Stolfo¹•Institutions (1)

26 Jan 1998-pp 6-6

TL;DR: An agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated (detection) models to the detection agents is proposed.

read less

Abstract: In this paper we discuss our research in developing general and systematic methods for intrusion detection. The key ideas are to use data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior, and use the set of relevant system features to compute (inductively learned) classifiers that can recognize anomalies and known intrusions. Using experiments on the sendmail system call data and the network tcpdump data, we demonstrate that we can construct concise and accurate classifiers to detect anomalies. We provide an overview on two general data mining algorithms that we have implemented: the association rules algorithm and the frequent episodes algorithm. These algorithms can be used to compute the intra-and inter-audit record patterns, which are essential in describing program or user behavior. The discovered patterns can guide the audit data gathering process and facilitate feature selection. To meet the challenges of both efficient learning (mining) and real-time detection, we propose an agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated (detection) models to the detection agents.

...read moreread less

Summary (4 min read)

Jump to: [1 Introduction] – [2 Building Classification Models] – [2.1.2 Learning to Classify System Call Sequences] – [2.1.3 Learning to Predict System Calls] – [2.1.4 Discussion] – [2.2.2 Data Pre-processing] – [2.2.3 Experiments and Results] – [2.2.4 Discussion] – [2.3 Combining Multiple Classifiers] – [3 Mining Patterns from Audit Data] – [3.1 Association Rules] – [3.2 Frequent Episodes] – [3.3 Using the Discovered Patterns] and [4 Architecture Support]

1 Introduction

As network-based computer systems play increasingly vital roles in modern society, they have become the targets of their enemies and criminals.
Intrusion detection techniques can be categorized into misuse detection, which uses patterns of wellknown attacks or weak spots of the system to identify intrusions; and anomaly detection, which tries to determine whether deviation from the established normal us- age patterns can be flagged as intrusions.
Misuse detection systems, for example [KS95] and STAT [IKP95], encode and match the sequence of “signature actions” (e.g., change the ownership of a file) of known intrusion scenarios.
Section 4 briefly highlights the architecture of their proposed intrusion detection system.

2 Building Classification Models

In this section the authors describe in detail their experiments in constructing classification models for anomaly detection.
A flaw in the finger daemon allows the attacker to use “buffer overflow” to trick the program to execute his malicious code.
Forrest et al. discovered that the short sequences of system calls made by a program during its normal executions are very consistent, yet different from the sequences of its abnormal executions as well as the executions of other programs.
Stephanie Forrest provided us with a set of traces of the sendmail program used in her experiments [FHSL96].
The number “5” represents system call open.

2.1.2 Learning to Classify System Call Sequences

In order for a machine learning program to learn the classification models of the “normal” and “abnormal” system call sequences, the authors need to supply it with a set of training data containing pre-labeled “normal” and “abnormal” sequences.
See Table 1 for an example of the labeled sequences.
RIPPER outputs a set of if-then rules for the “minority” classes, and a default “true” rule for the remaining class.
The authors conjectured that a set of specific rules for normal sequences can be used as the “identity” of a program, and thus can be used to detect any known and unknown intrusions (anomaly intrusion detection).

2.1.3 Learning to Predict System Calls

Unlike the experiments in Section 2.1.2 which required abnormal traces in the training data, here the authors wanted to study how to compute an anomaly detector given just the normal traces.
The learning tasks were formulated as follows:.
If a violation occurs (the actual system call is not the same as predicted by the rule), the “score” of the trace is incremented by 100 times the confidence of the violated rule.
Table 3 shows the results of the following experiments: Experiment A: predict the 11th system call; Experiment B: predict the middle system call in a sequence of length 7; Experiment C: predict the middle system call in a sequence of length 11; Experiment D: predict the 7th system call.

2.1.4 Discussion

The authors experiments showed that the normal behavior of a program execution can be established and used to detect its anomalous usage.
Here the authors show that a machine learning program, RIPPER, was able to generalize the system call sequence information, from 80% of the normal sequences, to a set of concise and accurate rules (the rule sets have 200 to 280 rules, and each rule has 2 or 3 attribute tests).
The authors need to search for a more predictive classification model so that the anomaly detector has higher confidence in flagging intrusions.
The directories and the names of the files touched by a program can be used.
For the purposes of the shootout, filters were used so that tcpdump only collected Internet Transmission Control Protocol (TCP) and Internet User Datagram Protocol (UDP) packets.

2.2.2 Data Pre-processing

The authors developed a script to scan each tcpdump data file and extract the “connection” level information about the network traffic.
Since UDP is connectionless (no connection state), the authors simply treat each packet as a connection.
A connection record, in preparation of data mining, now has the following fields : start time, duration, participating hosts, ports, the statistics of the connection (e.g., bytes sent in each direction, resent rate, etc.), flag (“normal” or one of the recorded connection/termination errors), and protocol (TCP or UDP).
The authors call the host that initiates the connection, i.e., the one that sends the first SYN, as the source, and the other as the destination.
Depending on the direction from the source to the destination, a connection is in one of the three types: out-going - from the LAN to the external networks; in-coming - from the external networks to the LAN; and inter-LAN - within the LAN.

2.2.3 Experiments and Results

For each type of the s, the authors formulated the classification experiments as the following: Each record uses the destination ser- vice (port) as the class label, and all the other features as attributes;.
The process (training and testing) is repeated 5 times, each time using a different 80% of the normal data as the training data (and accordingly the different remaining 20% of the normal data as part of the test data), and the averaged accuracy of the classifiers from the 5 runs is reported.
The authors again applied RIPPER to the connection data.
The results from the first round of experiments, as shown in Table 4, were not very good: the differences in the misclassification rates of the normal and intrusion data were small, except for the inter-LAN traffic of some intrusions.
These additional temporal-statistical features provide additional information of the network activity from a continuous perspective, and provide more insight into anomalies.

2.2.4 Discussion

The authors learned some important lessons from the experiments on the tcpdump data.
First, when the collected data is not designed specifically for security purposes or can not be used directly to build a detection model, a considerable amount of data pre-processing is required.
Many trials were attempted before the authors came up with the current set of features and time intervals.
The authors need useful tools that can provide insight into the patterns that may be exhibited in the data.
Second, the authors should provide tools that can help administrative staff understand the nature of the anomalies.

2.3 Combining Multiple Classifiers

The classifiers described in this section each models a single aspect of the system behavior.
They are what the authors call the base (single level) classifiers.
A priority in their research plan is to study and experiment with (inductively learned) classification models that combine evidence from multiple (base) detection models.
The authors research activities in JAM [SPT+97], which focus on the accuracy and efficiency of meta classifiers, will contribute significantly to their effort in building meta detection models.

3 Mining Patterns from Audit Data

In order to construct an accurate base classifier, the authors need to gather a sufficient amount of training data and identify a set of meaningful features.
Both of these tasks require insight into the nature of the audit data, and can be very difficult without proper tools and guidelines.
In this section the authors describe some algorithms that can address these needs.
Here the authors use the term “audit data” to refer to general data streams that have been properly processed for detection purposes.
An example of such data streams is the connection record data extracted from the raw tcpdump output.

3.1 Association Rules

The goal of mining association rules is to derive multifeature correlations from a database table.
The motivation for applying the association rules algorithm to audit data are: Audit data can be formatted into a database table where each row is an audit record and each column is a field (system feature) of the audit records;.
One of the reasons that “program policies”, which codify the access rights of privileged programs, are concise and capable to detect known attacks [KFL94] is that the intended behavior of a program, e.g., read and write files from certain directories with specific permissions, is very consistent.
The authors can continuously merge the rules from a new run to the aggregate rule set (of all previous runs).

3.2 Frequent Episodes

While the association rules algorithm seeks to find intraaudit record patterns, the frequent episodes algorithm, as described in [MTV95], can be used to discover inter- audit record patterns.
A frequent episode is a set of events that occur frequently within a time window (of a specified length).
The authors seek to apply the frequent episodes algorithm to analyze audit trails since there is evidence that the sequence information in program executions and user commands can be used to build profiles for anomaly detection [FHSL96, LB97].
The authors implementation followed the description in [MTV95].

3.3 Using the Discovered Patterns

The association rules and frequent episodes can be used to guide the audit process.
The authors run a program many times and under different settings.
Such a support system can speed up the iterative feature selection process, and help ensure the accuracy of a detection model.
In Figure 2 the authors see that the number of frequent episodes (raw episodes or serial episode rules) increases sharply as win goes from 2s to 30s, it then gradually stabilizes (note that by the nature of the frequent episodes algorithm, the number of episodes can only increase aswin increases).

4 Architecture Support

The biggest challenge of using data mining approaches in intrusion detection is that it requires a large amount of audit data in order to compute the profile rule sets.
As a new version of a system software arrives, the authors need to update the “normal” profile rules.
A learning agent, which may reside in a server machine for its computing power, is responsible for computing and maintaining the rule sets for programs and users.
A detection agent is generic and extensible.
In a network environment, a meta agent can combine reports from (base) detection agents running on each host, and make the final assertion on the state of the network.

Did you find this useful? Give us your feedback

Figures (2)

Figure 2: Effects of Window Sizes on the Number of Frequent Episodes.

Figure 3: An Architecture for Agent-Based Intrusion Detection System

Content maybe subject to copyright Report

Data Mining Approaches for Intrusion Detection



Wenke Lee Salvatore J. Stolfo

Computer Science Department

Columbia University

500 West 120th Street, New York, NY 10027

wenke,sal

@cs.columbia.edu

Abstract

In this paper we discuss our research in developing gen-

eral and systematic methods for intrusion detection. The

key ideas are to use data mining techniques to discover

consistent and useful patterns of system features that de-

scribe program and user behavior, and use the set of rel-

evant system features to compute (inductively learned)

classiﬁers that can recognize anomalies and known in-

trusions. Using experiments on the sendmail system call

data and the network tcpdump data, we demonstrate that

we can construct concise and accurate classiﬁers to de-

tect anomalies. We provide an overview on two general

data mining algorithms that we have implemented: the

association rules algorithm and the frequent episodes al-

gorithm. These algorithms can be used to compute the

intra- and inter- audit recordpatterns, which are essential

in describing program or user behavior. The discovered

patterns can guide the audit data gathering process and

facilitate feature selection. To meet the challenges of

both efﬁcient learning (mining) and real-time detection,

we propose an agent-based architecture for intrusion de-

tection systems where the learning agents continuously

compute and provide the updated (detection) models to

the detection agents.

1 Introduction

As network-based computer systems play increasingly

vital roles in modern society, they have become the tar-

gets of our enemies and criminals. Therefore, we need

to ﬁnd the best ways possible to protect our systems.

The security of a computer system is compromisedwhen



This research is supported in part by grants from DARPA

(F30602-96-1-0311) and NSF (IRI-96-32225 and CDA-96-25374)

an intrusion takes place. An intrusion can be deﬁned

[HLMS90] as “any set of actions that attempt to com-

promise the integrity, conﬁdentiality or availability of

a resource”. Intrusion prevention techniques, such as

user authentication (e.g. using passwords or biometrics),

avoiding programming errors, and information protec-

tion (e.g., encryption) have been used to protect com-

puter systems as a ﬁrst line of defense. Intrusion preven-

tion alone is not sufﬁcient because as systems become

ever more complex, there are always exploitable weak-

ness in the systems due to design and programming er-

rors, or various “socially engineered” penetration tech-

niques. For example, after it was ﬁrst reported many

years ago, exploitable “buffer overﬂow” still exists in

some recent system software due to programming er-

rors. The policies that balance convenience versus strict

control of a system and information access also make it

impossible for an operational system to be completely

secure.

Intrusion detection is therefore needed as another wall

to protect computer systems. The elements central to

intrusion detection are: resources to be protected in a

target system, i.e., user accounts, ﬁle systems, system

kernels, etc; models that characterize the “normal” or

“legitimate” behavior of these resources; techniques that

compare the actual system activities with the established

models, and identify those that are “abnormal” or “intru-

sive”.

Many researchers have proposed and implemented dif-

ferent models which deﬁne different measures of system

behavior, with an ad hoc presumption that normalcy and

anomaly (or illegitimacy) will be accurately manifested

in the chosen set of system features that are modeled and

measured. Intrusion detection techniques can be catego-

rized into misuse detection, which uses patterns of well-

known attacks or weak spots of the system to identify

intrusions; and anomaly detection, which tries to deter-

mine whether deviation from the established normal us-

age patterns can be ﬂagged as intrusions.

Misuse detection systems, for example [KS95] and

STAT [IKP95], encode and match the sequence of “sig-

nature actions” (e.g., change the ownership of a ﬁle) of

known intrusion scenarios. The main shortcomings of

such systems are: known intrusion patterns have to be

hand-coded into the system; they are unable to detect

any future (unknown) intrusions that have no matched

patterns stored in the system.

Anomaly detection (sub)systems, such as

IDES [LTG

92], establish normal usage patterns

(proﬁles) using statistical measures on system features,

for example, the CPU and I/O activities by a particular

user or program. The main difﬁculties of these systems

are: intuition and experience is relied upon in selecting

the system features, which can vary greatly among

different computing environments; some intrusions can

only be detected by studying the sequential interrelation

between events because each event alone may ﬁt the

proﬁles.

Our research aims to eliminate, as much as possible, the

manual and ad-hoc elements from the process of build-

ing an intrusion detection system. We take a data-centric

point of view and consider intrusion detection as a data

analysis process. Anomaly detection is about ﬁnding the

normal usage patterns from the audit data, whereas mis-

use detection is about encoding and matching the intru-

sion patterns using the audit data. The central theme of

our approach is to apply data mining techniques to in-

trusion detection. Data mining generally refers to the

process of (automatically) extracting models from large

stores of data [FPSS96]. The recent rapid development

in data mining has made available a wide variety of algo-

rithms, drawn from the ﬁelds of statistics, pattern recog-

nition, machine learning, and database. Several types of

algorithms are particularly relevant to our research:

Classiﬁcation: maps a data item into one of severalpre-

deﬁned categories. These algorithms normally out-

put “classiﬁers”, for example, in the form of deci-

sion trees or rules. An ideal application in intrusion

detection will be to gather sufﬁcient “normal” and

“abnormal” audit data for a user or a program, then

apply a classiﬁcation algorithm to learn a classiﬁer

that will determine (future) audit data as belonging

to the normal class or the abnormal class;

Link analysis: determines relations between ﬁelds in

the database. Finding out the correlations in audit

data will provide insight for selecting the right set

of system features for intrusion detection;

Sequence analysis: models sequential patterns. These

algorithms can help us understand what (time-

based) sequence of audit events are frequently en-

countered together. These frequent event patterns

are important elements of the behavior proﬁle of a

user or program.

We are developing a systematic framework for design-

ing, developing and evaluating intrusion detection sys-

tems. Speciﬁcally, the framework consists of a set of

environment-independent guidelines and programs that

can assist a system administrator or security ofﬁcer to



select appropriate system features from audit data

to build models for intrusion detection;



architect a hierarchical detector system from com-

ponent detectors;



update and deploy new detection systems as

needed.

The key advantage of our approach is that it can auto-

matically generate concise and accurate detection mod-

els from large amount of audit data. The methodology

itself is general and mechanical, and therefore can be

used to build intrusion detection systems for a wide va-

riety of computing environments.

The rest of the paper is organized as follows: Sec-

tion 2 describes our experiments in building classiﬁca-

tion models for sendmail and network trafﬁc. Section 3

presents the association rules and frequent episodes al-

gorithms that can be used to compute a set of patterns

from audit data. Section 4 brieﬂy highlights the archi-

tecture of our proposed intrusion detection system. Sec-

tion 5 outlines our future research plans.

2 Building Classiﬁcation Models

In this section we describe in detail our experiments

in constructing classiﬁcation models for anomaly de-

tection. The ﬁrst set of experiments, ﬁrst reported in

[LSC97], is on the sendmail system call data, and the

second is on the network tcpdump data.

2.1 Experiments on sendmail Data

There have been a lot of attacks on computer systems

that are carried out as exploitations of the design and

programming errors in privileged programs, those that

can run as root. For example, a ﬂaw in the ﬁnger dae-

mon allows the attacker to use “buffer overﬂow” to trick

the program to execute his malicious code. Recent re-

search efforts by Ko et al. [KFL94] and Forrest et al.

[FHSL96] attempted to build intrusion detection systems

that monitor the execution of privileged programs and

detect the attacks on their vulnerabilities. Forrest et al.

discovered that the short sequences of system calls made

by a program during its normal executions are very con-

sistent, yet different from the sequences of its abnormal

(exploited) executions as well as the executions of other

programs. Therefore a database containing these nor-

mal sequences can be used as the “self” deﬁnition of the

normal behavior of a program, and as the basis to de-

tect anomalies. Their ﬁndings motivated us to search for

simple and accurate intrusion detection models.

Stephanie Forrest provided us with a set of traces of the

sendmail program used in her experiments [FHSL96].

We applied machine learning techniques to produce

classiﬁers that can distinguish the exploits from the nor-

mal runs.

2.1.1 The sendmail System Call Traces

The procedure of generating the sendmail traces were

detailed in [FHSL96]. Brieﬂy, each ﬁle of the trace

data has two columns of integers, the ﬁrst is the process

ids and the second is the system call “numbers”. These

numbers are indices into a lookup table of system call

names. For example, the number “5” represents system

call open. The set of traces include:

Normal traces: a trace of the sendmail daemon and a

concatenation of several invocations of the send-

mail program;

Abnormal traces: 3 traces of the sscp (sunsendmailcp)

attacks, 2 traces of the syslog-remote attacks, 2

traces of the syslog-local attacks, 2 traces of the de-

code attacks, 1 trace of the sm5x attack and 1 trace

of the sm565a attack. These are the traces of (var-

ious kinds of) abnormal runs of the sendmail pro-

gram.

2.1.2 Learning to Classify System Call Sequences

In order for a machine learning programto learn the clas-

siﬁcation models of the “normal” and “abnormal” sys-

tem call sequences, we need to supply it with a set of

System Call Sequences (length 7) Class Labels

4 2 66 66 4 138 66 “normal”

... ...

5 5 5 4 59 105 104 “abnormal”

... ...

Table 1: Pre-labeled System Call Sequences of Length 7

training data containing pre-labeled “normal” and “ab-

normal” sequences. We use a sliding window to scan

the normal traces and create a list of unique sequences

of system calls. We call this list the “normal” list. Next,

we scan each of the intrusion traces. For each sequence

of system calls, we ﬁrst look it up in the normal list. If

an exact match can be found then the sequence is labeled

as “normal”. Otherwise it is labeled as “abnormal” (note

that the data gathering process described in [FHSL96]

ensured that the normal traces include nearly all possible

“normal” short sequences of system calls, as new runs of

sendmail

failed to generate new sequences). Needless

to say all sequences in the normal traces are labeled as

“normal”. See Table 1 for an example of the labeled se-

quences. It should be noted that an intrusion trace con-

tains many normal sequences in addition to the abnormal

sequences since the illegal activities only occur in some

places within a trace.

We applied RIPPER [Coh95], a rule learning program,

to our training data. The following learning tasks were

formulated to induce the rule sets for normal and abnor-

mal system call sequences:



Each record has

positional attributes,

, ...,

, one for each of the system calls in a sequence of

length

; plus a class label, “normal” or “abnormal”



The training data is composed of normal sequences

taken from

80%

of the normal traces, plus the ab-

normal sequences from 2 traces of the

sscp

attacks,

1 trace of the syslog-local attack, and 1 trace of the

syslog-remote attack



The testing data includes both normal and abnormal

traces not used in the training data.

RIPPER outputs a set of if-then rules for the “minority”

classes, and a default “true” rule for the remaining class.

The following exemplar RIPPER rules were generated

from the system call data:

normal:-

= 104

= 112

. [meaning: if

is 104 (

v times

) and

is 112 (

v tr ace

) then

the sequence is “normal”]

normal:-

= 19

= 105

. [meaning: if

is 19 (

lseek

) and

is 105 (

sig v ec

) then the

sequence is “normal”]

...

abnormal:- true. [meaning: if none of the

above, the sequence is “abnormal”]

These RIPPER rules can be used to predict whether a se-

quence is “abnormal” or “normal”. But what the intru-

sion detection system needs to know is whether the trace

being analyzed is an intrusion or not. We use the fol-

lowing post-processingscheme to detect whether a given

trace is an intrusion based on the RIPPER predictions of

its constituent sequences:

1. Use a sliding window of length

+ 1

, e.g., 7, 9, 11,

13, etc., and a sliding (shift) step of

, to scan the

predictions made by the RIPPER rules on system

call sequences.

2. For each of the (length

+ 1

) regions of RIPPER

predictions generated in Step

, if more than

pre-

dictions are “abnormal” then the current region of

predictions is an “abnormal” region. (Note that

an input parameter).

3. If the percentage of abnormal regions is above a

threshold value, say

, then the trace is an intru-

sion.

This scheme is an attempt to ﬁlter out the spurious pre-

diction errors. The intuition behind this scheme is that

when an intrusion actually occurs, the majority of adja-

cent system call sequences are abnormal; whereas the

prediction errors tend to be isolated and sparse. In

[FHSL96], the percentage of the mismatched sequences

(out of the total number of matches (lookups) performed

for the trace) is used to distinguish normal from abnor-

mal. The “mismatched” sequences are the abnormal se-

quences in our context. Our scheme is different in that

we look for abnormal regions that contain more abnor-

mal sequences than the normal ones, and calculate the

percentage of abnormal regions (out of the total number

of regions). Our scheme is more sensitive to the tempo-

ral information, and is less sensitive to noise (errors).

RIPPER only outputs rules for the “minority” class. For

example, in our experiments, if the training data has

fewer abnormal sequences than the normal ones, the

output RIPPER rules can be used to identify abnormal

sequences, and the default (everything else) prediction

is normal. We conjectured that a set of speciﬁc rules

for normal sequences can be used as the “identity” of

a program, and thus can be used to detect any known

and unknown intrusions (anomaly intrusion detection).

Whereas having only the rules for abnormal sequences

only gives us the capability to identify known intrusions

(misuse intrusion detection).

% abn. % abn. in experiment

Traces [FHSL96] A B C D

sscp-1 5.2 41.9 32.2 40.0 33.1

sscp-2 5.2 40.4 30.4 37.6 33.3

sscp-3 5.2 40.4 30.4 37.6 33.3

syslog-r-1 5.1 30.8 21.2 30.3 21.9

syslog-r-2 1.7 27.1 15.6 26.8 16.5

syslog-l-1 4.0 16.7 11.1 17.0 13.0

syslog-l-2 5.3 19.9 15.9 19.8 15.9

decode-1 0.3 4.7 2.1 3.1 2.1

decode-2 0.3 4.4 2.0 2.5 2.2

sm565a 0.6 11.7 8.0 1.1 1.0

sm5x 2.7 17.7 6.5 5.0 3.0

sendmail

0 1.0 0.1 0.2 0.3

Table 2: Comparing Detection of Anomalies. The col-

umn [FHSL96] is the percentage of the abnormal se-

quences of the traces. Columns A, B, C, and D are

the percentages of abnormal regions (as measured by

the post-processing scheme) of the traces.

sendmail

is the

20%

normal traces not used in the training data.

Traces in bold were included in the training data, the

other traces were used as testing data only.

We compare the results of the following experimentsthat

have different distributions of abnormal versus normal

sequences in the training data:

Experiment A:

46%

normal and

54%

abnormal, se-

quence length is 11;

Experiment B:

46%

normal and

54%

abnormal, se-

quence length is 7;

Experiment C:

46%

abnormal and

54%

normal, se-

quence length is 11;

Experiment D:

46%

abnormal and

54%

normal, se-

quence length is 7.

Table 2 shows the results of using the classiﬁers from

these experiments to analyze the traces. We report here

the percentage of abnormal regions (as measured by our

post-processing scheme) of each trace, and compare our

results with Forrest et al., as reported in [FHSL96].

From Table 2, we can see that in general, intrusion traces

generate much larger percentages of abnormal regions

than the normal traces. We call these measured percent-

ages the “scores” of the traces. In order to establish a

threshold score for identifying intrusion traces, it is de-

sirable that there is a sufﬁciently large gap between the

scores of the normal sendmail traces and the low-end

scores of the intrusion traces. Comparing experiments

that used the same sequence length, we observe that such

a gap in A,

, is larger than the gap in C,

; and

in B is larger than

in D. The RIPPER rules from

experiments A and B describe the patterns of the nor-

mal sequences. Here the results show that these rules

can be used to identify the intrusion traces, including

those not seen in the training data, namely, the decode

traces, the sm565a and sm5x traces. This conﬁrms our

conjecture that rules for normal patterns can be used for

anomaly detection. The RIPPER rules from experiments

C and D specify the patterns of abnormal sequences in

the intrusion traces included in the training data. The

results indicate that these rules are very capable of de-

tecting the intrusion traces of the “known” types (those

seen in the training data), namely, the sscp-3 trace, the

syslog-remote-2 trace and the syslog-local-2 trace. But

comparing with the rules from A and B, the rules in C

and D perform poorly on intrusion traces of “unknown”

types. This conﬁrms our conjecture that rules for abnor-

mal patterns are good for misuse intrusion detection, but

may not be as effective in detecting future (“unknown”)

intrusions.

The results from Forrest et al. showed that their method

required a very low threshold in order to correctly detect

the

decode

and

565

intrusions. While the results

here show that our approach generated much stronger

“signals” of anomalies from the intrusion traces, it

should be noted that their method used all of the normal

traces but not any of the intrusion traces in training.

2.1.3 Learning to Predict System Calls

Unlike the experiments in Section 2.1.2 which required

abnormal traces in the training data, here we wanted to

study how to compute an anomaly detector given just the

normal traces. We conducted experiments to learn the

(normal) correlation among system calls: the

th system

calls or the middle system calls in (normal) sequences of

length

The learning tasks were formulated as follows:



Each record has



positional attributes,

...,



, each being a system call; plus a class la-

bel, the system call of the

th position or the middle

position



The training data is composed of (normal) se-

quences taken from

80%

of the normal sendmail

traces



The testing data is the traces not included in the

training data, namely, the remaining

20%

of the

normal sendmail traces and all the intrusion traces.

RIPPER outputs rules in the following form:

38 :-

= 40

= 4

. [meaning: if

is 40

(

lstat

) and

is 4 (

wr ite

), then the 7th system

call is 38 (

stat

).]

...

5:- true. [meaning: if none of the above, then

the 7th system calls is 5 (

open

).]

Each of these RIPPER rules has some “conﬁdence” in-

formation: the number of matched examples (records

that conform to the rule) and the number of unmatched

examples (recordsthat are in conﬂict with the rule) in the

training data. For example, the rule for “38 (

stat

)” cov-

ers 12 matched examples and 0 unmatched examples.

We measure the conﬁdence value of a rule as the num-

ber of matched examples divided by the sum of matched

and unmatched examples. These rules can be used to an-

alyze a trace by examining each sequence of the trace. If

a violation occurs (the actual system call is not the same

as predicted by the rule), the “score” of the trace is in-

cremented by 100 times the conﬁdence of the violated

rule. For example, if a sequence in the trace has

= 40

and

= 4

, but

= 44

instead of 38, the total score

of the trace is incremented by 100 since the conﬁdence

value of this violated rule is 1. The averaged score (by

the total number of sequences) of the trace is then used

to decide whether an intrusion has occurred.

Table 3 shows the results of the following experiments:

Experiment A: predict the 11th system call;

Experiment B: predict the middle system call in a se-

quence of length 7;

Experiment C: predict the middle system call in a se-

quence of length 11;

Experiment D: predict the 7th system call.

We can see from Table 3 that the RIPPER rules from

experiments A and B are effective because the gap be-

tween the score of normal sendmail and the low-end

scores of intrusion traces, 3.9, and 3.3 respectively, are

large enough. However, the rules from C and D perform

poorly. Since C predicts the middle system call of a se-

quence of length 11 and D predicts the 7th system call,

we reason that the training data (the normal traces) has

no stable patterns for the 6th or 7th position in system

call sequences.

HTML Viewer

Frequently Asked Questions (15)

Q1. What contributions have the authors mentioned in the paper "Data mining approaches for intrusion detection" ?

In this paper the authors discuss their research in developing general and systematic methods for intrusion detection. Using experiments on the sendmail system call data and the network tcpdump data, the authors demonstrate that they can construct concise and accurate classifiers to detect anomalies. The authors provide an overview on two general data mining algorithms that they have implemented: the association rules algorithm and the frequent episodes algorithm. To meet the challenges of both efficient learning ( mining ) and real-time detection, the authors propose an agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated ( detection ) models to the detection agents.

Q2. What are the main difficulties of intrusion detection systems?

The main difficulties of these systems are: intuition and experience is relied upon in selecting the system features, which can vary greatly among different computing environments; some intrusions can only be detected by studying the sequential interrelation between events because each event alone may fit the profiles.

Q3. What is the reason why the normal traces are not stable?

Since C predicts the middle system call of a sequence of length 11 and D predicts the 7th system call, the authors reason that the training data (the normal traces) has no stable patterns for the 6th or 7th position in system call sequences.

Q4. What can be used to compute the consistent patterns from audit data?

The authors suggested that the association rules and frequent episodes algorithms can be used to compute the consistent patterns from audit data.

Q5. What is the key advantage of the approach?

The key advantage of their approach is that it can automatically generate concise and accurate detection models from large amount of audit data.

Q6. What is the priority of the research plan?

A priority in their research plan is to study and experiment with (inductively learned) classification models that combine evidence from multiple (base) detection models.

Q7. What are the main shortcomings of network-based computer systems?

The main shortcomings of such systems are: known intrusion patterns have to be hand-coded into the system; they are unable to detect any future (unknown) intrusions that have no matched patterns stored in the system.

Q8. What can be used to find inter- audit record patterns?

While the association rules algorithm seeks to find intraaudit record patterns, the frequent episodes algorithm, as described in [MTV95], can be used to discover inter- audit record patterns.

Q9. What is the way to collect information about a host system?

Many operating systems provide auditing utilities, such as the BSM audit of Solaris, that can be configured to collectabundant information (with many features) of the activities in a host system.

Q10. What can be the way to improve the accuracy of the RIPPER model?

Improvement in accuracy can come from adding more features, rather than just the system calls, into the models of program execution.

Q11. What makes it impossible for an operational system to be completely secure?

The policies that balance convenience versus strict control of a system and information access also make it impossible for an operational system to be completely secure.

Q12. What is the key advantage of the methodology?

The methodology itself is general and mechanical, and therefore can be used to build intrusion detection systems for a wide variety of computing environments.

Q13. What is the effect of the new features on the intrusion data?

as Figure 1 shows, for the in-comingtraffic, the misclassification rates on the intrusion data increase dramatically as the time interval goes from 5s to 30s, then stabilizes or tapers off afterwards.

Q14. What is the effect of the frequent episodes algorithm?

In Figure 2 the authors see that the number of frequent episodes (raw episodes or serial episode rules) increases sharply as win goes from 2s to 30s, it then gradually stabilizes (note that by the nature of the frequent episodes algorithm, the number of episodes can only increase aswin increases).

Q15. How do the authors update the rule sets?

For each new run, the authors compute its rule set (that consists of both the association rules and the frequent episodes) from the audit trail, and update the (existing) aggregate rule sets using the following merge process:

Data mining approaches for intrusion detection

Summary (4 min read)

1 Introduction

2 Building Classification Models

2.1.2 Learning to Classify System Call Sequences

2.1.3 Learning to Predict System Calls

2.1.4 Discussion

2.2.2 Data Pre-processing

2.2.3 Experiments and Results

2.2.4 Discussion

2.3 Combining Multiple Classifiers

3 Mining Patterns from Audit Data

3.1 Association Rules

3.2 Frequent Episodes

3.3 Using the Discovered Patterns

4 Architecture Support

Figures (2)

Citations

Cites background or methods from "Data mining approaches for intrusio..."

Cites methods from "Data mining approaches for intrusio..."

References

"Data mining approaches for intrusio..." refers background or methods or result in this paper

"Data mining approaches for intrusio..." refers background in this paper

Related Papers (5)

Frequently Asked Questions (15)

Q1. What contributions have the authors mentioned in the paper "Data mining approaches for intrusion detection" ?

Q2. What are the main difficulties of intrusion detection systems?

Q3. What is the reason why the normal traces are not stable?

Q4. What can be used to compute the consistent patterns from audit data?

Q5. What is the key advantage of the approach?

Q6. What is the priority of the research plan?

Q7. What are the main shortcomings of network-based computer systems?

Q8. What can be used to find inter- audit record patterns?

Q9. What is the way to collect information about a host system?

Q10. What can be the way to improve the accuracy of the RIPPER model?

Q11. What makes it impossible for an operational system to be completely secure?

Q12. What is the key advantage of the methodology?

Q13. What is the effect of the new features on the intrusion data?

Q14. What is the effect of the frequent episodes algorithm?

Q15. How do the authors update the rule sets?