Data mining approaches for intrusion detection
Summary (4 min read)
1 Introduction
- As network-based computer systems play increasingly vital roles in modern society, they have become targets for enemies and criminals.
- Intrusion detection techniques can be categorized into misuse detection, which uses patterns of well-known attacks or weak spots of the system to identify intrusions; and anomaly detection, which tries to determine whether deviations from the established normal usage patterns can be flagged as intrusions.
- Misuse detection systems, for example [KS95] and STAT [IKP95], encode and match the sequence of “signature actions” (e.g., change the ownership of a file) of known intrusion scenarios.
- Section 4 briefly highlights the architecture of their proposed intrusion detection system.
2 Building Classification Models
- In this section the authors describe in detail their experiments in constructing classification models for anomaly detection.
- A flaw in the finger daemon allows the attacker to use “buffer overflow” to trick the program to execute his malicious code.
- Forrest et al. discovered that the short sequences of system calls made by a program during its normal executions are very consistent, yet different from the sequences of its abnormal executions as well as the executions of other programs.
- Stephanie Forrest provided us with a set of traces of the sendmail program used in her experiments [FHSL96].
- The number “5” represents the system call open.
2.1.2 Learning to Classify System Call Sequences
- In order for a machine learning program to learn the classification models of the “normal” and “abnormal” system call sequences, the authors need to supply it with a set of training data containing pre-labeled “normal” and “abnormal” sequences.
- See Table 1 for an example of the labeled sequences.
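The labeling scheme above can be sketched as follows; this is a minimal illustration in the spirit of the described setup, and names such as `label_windows` and the window length used in the toy data are our own, not the paper's.

```python
# Sketch: slide a fixed-length window over a system-call trace and
# label each window by membership in a database of normal windows.

def sliding_windows(trace, length=11):
    """Yield every contiguous system-call window of the given length."""
    for i in range(len(trace) - length + 1):
        yield tuple(trace[i:i + length])

def label_windows(trace, normal_db, length=11):
    """Label a window 'normal' if it appears in the database of windows
    collected from normal runs, and 'abnormal' otherwise."""
    return [(w, "normal" if w in normal_db else "abnormal")
            for w in sliding_windows(trace, length)]

# Toy example: system calls encoded as integers (e.g., 5 = open),
# using short length-5 windows so the example stays readable.
normal_trace = [5, 2, 5, 3, 5, 2, 5, 3, 5, 2, 5, 3]
normal_db = set(sliding_windows(normal_trace, 5))
test_trace = [5, 2, 5, 3, 5, 9, 5, 3, 5]   # call "9" injected mid-trace
labels = label_windows(test_trace, normal_db, 5)
```

Windows that overlap the injected call fall outside the normal database and come back labeled "abnormal"; such labeled windows form the training data for the classifier.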
- RIPPER outputs a set of if-then rules for the “minority” classes, and a default “true” rule for the remaining class.
- The authors conjectured that a set of specific rules for normal sequences can be used as the “identity” of a program, and thus can be used to detect any known and unknown intrusions (anomaly intrusion detection).
2.1.3 Learning to Predict System Calls
- Unlike the experiments in Section 2.1.2 which required abnormal traces in the training data, here the authors wanted to study how to compute an anomaly detector given just the normal traces.
- The learning tasks were formulated as predicting a system call from the other calls in its sequence.
- If a violation occurs (the actual system call is not the same as predicted by the rule), the “score” of the trace is incremented by 100 times the confidence of the violated rule.
- Table 3 shows the results of the following experiments: Experiment A: predict the 11th system call; Experiment B: predict the middle system call in a sequence of length 7; Experiment C: predict the middle system call in a sequence of length 11; Experiment D: predict the 7th system call.
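The scoring scheme above can be sketched as follows; the rule representation (an ordered list of predicate/prediction/confidence triples standing in for RIPPER's if-then rules) and the toy rules are our own simplification.

```python
# Sketch: score a trace by rule violations, adding 100 times the
# confidence of each violated rule, as described above.

def score_trace(examples, rules):
    """examples: (context, actual_call) pairs drawn from a trace.
    rules: ordered list of (matches, predicted_call, confidence),
    where matches is a predicate on the context; the first matching
    rule makes the prediction."""
    score = 0.0
    for context, actual in examples:
        for matches, predicted, confidence in rules:
            if matches(context):
                if predicted != actual:            # a violation
                    score += 100 * confidence
                break
    return score

# Toy rules: "if the previous call is 5, the next call is 2" (conf 0.9),
# plus a default rule predicting 3 (conf 0.5).
rules = [
    (lambda ctx: ctx[0] == 5, 2, 0.9),
    (lambda ctx: True, 3, 0.5),
]
examples = [((5,), 2), ((5,), 7), ((1,), 3)]
print(score_trace(examples, rules))  # 90.0: one violation of the 0.9 rule
```

A high accumulated score over a trace region then marks that region as anomalous.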
2.1.4 Discussion
- The authors' experiments showed that the normal behavior of a program execution can be established and used to detect its anomalous usage.
- Here the authors show that a machine learning program, RIPPER, was able to generalize the system call sequence information from 80% of the normal sequences into a set of concise and accurate rules (the rule sets have 200 to 280 rules, and each rule has 2 or 3 attribute tests).
- The authors need to search for a more predictive classification model so that the anomaly detector has higher confidence in flagging intrusions.
- The directories and the names of the files touched by a program can be used.
- For the purposes of the shootout, filters were used so that tcpdump only collected Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) packets.
2.2.2 Data Pre-processing
- The authors developed a script to scan each tcpdump data file and extract the “connection” level information about the network traffic.
- Since UDP is connectionless (no connection state), the authors simply treat each packet as a connection.
- A connection record, in preparation for data mining, now has the following fields: start time, duration, participating hosts, ports, the statistics of the connection (e.g., bytes sent in each direction, resent rate, etc.), flag (“normal” or one of the recorded connection/termination errors), and protocol (TCP or UDP).
- The authors call the host that initiates the connection, i.e., the one that sends the first SYN, the source, and the other the destination.
- Depending on the direction from the source to the destination, a connection is in one of the three types: out-going - from the LAN to the external networks; in-coming - from the external networks to the LAN; and inter-LAN - within the LAN.
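The record layout and the three-way direction classification above can be sketched as follows; the field names and the prefix-based test for recognizing LAN hosts are our own assumptions for illustration.

```python
# Sketch: a connection record with the fields listed above, plus the
# out-going / in-coming / inter-LAN direction classification.
from dataclasses import dataclass

@dataclass
class ConnectionRecord:
    start_time: float
    duration: float
    src_host: str        # the initiator: the host sending the first SYN
    dst_host: str
    src_port: int
    dst_port: int
    bytes_sent: int      # bytes from source to destination
    bytes_received: int
    flag: str            # "normal" or a connection/termination error
    protocol: str        # "tcp" or "udp"

def direction(record, lan_prefix="192.168.1."):
    """Classify a connection by where its endpoints sit relative to
    the LAN (an address prefix is an assumed way to recognize LAN hosts)."""
    src_in = record.src_host.startswith(lan_prefix)
    dst_in = record.dst_host.startswith(lan_prefix)
    if src_in and dst_in:
        return "inter-LAN"
    return "out-going" if src_in else "in-coming"

rec = ConnectionRecord(0.0, 1.2, "192.168.1.5", "10.0.0.9",
                       1025, 80, 300, 1500, "normal", "tcp")
print(direction(rec))  # out-going: a LAN host initiated the connection
```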
2.2.3 Experiments and Results
- For each type of connection, the authors formulated the classification experiments as follows: each record uses the destination service (port) as the class label, and all the other features as attributes.
- The process (training and testing) is repeated 5 times, each time using a different 80% of the normal data as the training data (and accordingly the different remaining 20% of the normal data as part of the test data), and the averaged accuracy of the classifiers from the 5 runs is reported.
- The authors again applied RIPPER to the connection data.
- The results from the first round of experiments, as shown in Table 4, were not very good: the differences in the misclassification rates of the normal and intrusion data were small, except for the inter-LAN traffic of some intrusions.
- These additional temporal-statistical features describe the network activity from a continuous perspective, and provide more insight into anomalies.
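One plausible temporal-statistical feature of the kind described above is, for each connection, the number of recent connections to the same destination service; the function name and the toy records below are our own illustration, not the paper's exact feature set.

```python
# Sketch: for each connection, count preceding connections to the
# same service within a time window of `window` seconds.

def service_count_feature(records, window):
    """records: (timestamp, service) pairs sorted by timestamp.
    Returns, for each record, how many of the preceding connections
    within `window` seconds went to the same service."""
    counts = []
    for i, (t, service) in enumerate(records):
        n = sum(1 for (t2, s2) in records[:i]
                if t - t2 <= window and s2 == service)
        counts.append(n)
    return counts

records = [(0, "http"), (2, "http"), (3, "smtp"), (4, "http")]
print(service_count_feature(records, window=3))  # [0, 1, 0, 1]
```

Sweeping the window length (e.g., 5s to 30s) then trades off how much history each record summarizes, which is the effect discussed above.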
2.2.4 Discussion
- The authors learned some important lessons from the experiments on the tcpdump data.
- First, when the collected data is not designed specifically for security purposes or can not be used directly to build a detection model, a considerable amount of data pre-processing is required.
- Many trials were attempted before the authors came up with the current set of features and time intervals.
- The authors need useful tools that can provide insight into the patterns that may be exhibited in the data.
- Second, the authors should provide tools that can help administrative staff understand the nature of the anomalies.
2.3 Combining Multiple Classifiers
- The classifiers described in this section each model a single aspect of the system behavior.
- They are what the authors call the base (single level) classifiers.
- A priority in their research plan is to study and experiment with (inductively learned) classification models that combine evidence from multiple (base) detection models.
- The authors' research activities in JAM [SPT+97], which focus on the accuracy and efficiency of meta classifiers, will contribute significantly to their effort in building meta detection models.
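A minimal sketch of combining evidence from base detection models: the combination rule here (a majority vote over per-model verdicts) is our own illustration, not JAM's learned meta classifier.

```python
# Sketch: a meta detection model that combines the verdicts of several
# base classifiers (system-call model, connection model, ...).

def meta_predict(base_verdicts):
    """base_verdicts: list of 'normal'/'abnormal' outputs, one per base
    model.  Flag an intrusion when a majority of the base models do."""
    abnormal = sum(v == "abnormal" for v in base_verdicts)
    return "abnormal" if abnormal > len(base_verdicts) / 2 else "normal"

print(meta_predict(["abnormal", "abnormal", "normal"]))  # abnormal
```

An inductively learned meta classifier, as pursued in JAM, would instead learn this combination function from the base models' outputs on training data.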
3 Mining Patterns from Audit Data
- In order to construct an accurate base classifier, the authors need to gather a sufficient amount of training data and identify a set of meaningful features.
- Both of these tasks require insight into the nature of the audit data, and can be very difficult without proper tools and guidelines.
- In this section the authors describe some algorithms that can address these needs.
- Here the authors use the term “audit data” to refer to general data streams that have been properly processed for detection purposes.
- An example of such data streams is the connection record data extracted from the raw tcpdump output.
3.1 Association Rules
- The goal of mining association rules is to derive multi-feature correlations from a database table.
- The motivation for applying the association rules algorithm to audit data is that audit data can be formatted into a database table where each row is an audit record and each column is a field (system feature) of the audit records.
- One of the reasons that “program policies”, which codify the access rights of privileged programs, are concise and capable of detecting known attacks [KFL94] is that the intended behavior of a program, e.g., read and write files from certain directories with specific permissions, is very consistent.
- The authors can continuously merge the rules from a new run to the aggregate rule set (of all previous runs).
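A minimal association-rule miner over audit records can be sketched as follows; for brevity it only derives single-antecedent rules x → y with support and confidence thresholds, rather than a full Apriori pass over larger itemsets, and the item encoding is our own.

```python
# Sketch: mine rules x -> y from audit records, where each record is a
# set of feature=value items, keeping rules that meet both a minimum
# support (fraction of records containing x and y) and a minimum
# confidence (fraction of records with x that also contain y).
from itertools import permutations

def mine_rules(records, min_support, min_confidence):
    """Return (x, y, support, confidence) tuples for qualifying rules."""
    n = len(records)
    rules = []
    items = set().union(*records)
    for x, y in permutations(items, 2):
        both = sum(1 for r in records if x in r and y in r)
        have_x = sum(1 for r in records if x in r)
        support = both / n
        if have_x and support >= min_support:
            confidence = both / have_x
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

records = [{"service=smtp", "src=hostA"},
           {"service=smtp", "src=hostA"},
           {"service=http", "src=hostB"}]
rules = mine_rules(records, min_support=0.5, min_confidence=0.8)
```

The rule sets mined from successive runs of a program are what the merge process above accumulates into an aggregate profile.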
3.2 Frequent Episodes
- While the association rules algorithm seeks to find intra-audit record patterns, the frequent episodes algorithm, as described in [MTV95], can be used to discover inter-audit record patterns.
- A frequent episode is a set of events that occur frequently within a time window (of a specified length).
- The authors seek to apply the frequent episodes algorithm to analyze audit trails since there is evidence that the sequence information in program executions and user commands can be used to build profiles for anomaly detection [FHSL96, LB97].
- The authors' implementation followed the description in [MTV95].
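The windowed counting at the core of the approach can be sketched as follows; this is a simplified illustration in the spirit of [MTV95] (parallel episodes only, integer window steps), not the full algorithm with candidate generation.

```python
# Sketch: slide a window of length `win` over a timestamped event
# stream and measure in what fraction of window positions a candidate
# episode (a set of event types) occurs.

def episode_frequency(events, episode, win):
    """events: (time, event_type) pairs sorted by time.
    Window positions step at integer times over the stream's span,
    as in the [MTV95] definition of frequency."""
    start, end = events[0][0], events[-1][0]
    positions = 0
    hits = 0
    t = start - win + 1
    while t <= end:
        window_types = {e for (ts, e) in events if t <= ts < t + win}
        positions += 1
        if episode <= window_types:   # every event type is present
            hits += 1
        t += 1
    return hits / positions

events = [(1, "a"), (2, "b"), (5, "a"), (6, "b")]
print(episode_frequency(events, {"a", "b"}, win=2))  # 2/7 of the windows
```

Episodes whose frequency exceeds a threshold are the frequent episodes; widening win can only add occurrences, which is why the episode counts in the experiments grow monotonically with the window length.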
3.3 Using the Discovered Patterns
- The association rules and frequent episodes can be used to guide the audit process.
- The authors run a program many times and under different settings.
- Such a support system can speed up the iterative feature selection process, and help ensure the accuracy of a detection model.
- In Figure 2 the authors see that the number of frequent episodes (raw episodes or serial episode rules) increases sharply as win goes from 2s to 30s, and then gradually stabilizes (note that by the nature of the frequent episodes algorithm, the number of episodes can only increase as win increases).
4 Architecture Support
- The biggest challenge of using data mining approaches in intrusion detection is that it requires a large amount of audit data in order to compute the profile rule sets.
- As a new version of a system software arrives, the authors need to update the “normal” profile rules.
- A learning agent, which may reside in a server machine for its computing power, is responsible for computing and maintaining the rule sets for programs and users.
- A detection agent is generic and extensible.
- In a network environment, a meta agent can combine reports from (base) detection agents running on each host, and make the final assertion on the state of the network.
Frequently Asked Questions (15)
Q2. What are the main difficulties of intrusion detection systems?
The main difficulties of these systems are: intuition and experience are relied upon in selecting the system features, which can vary greatly among different computing environments; and some intrusions can only be detected by studying the sequential interrelation between events, because each event alone may fit the profiles.
Q3. What is the reason why the normal traces are not stable?
Since C predicts the middle system call of a sequence of length 11 and D predicts the 7th system call, the authors reason that the training data (the normal traces) has no stable patterns for the 6th or 7th position in system call sequences.
Q4. What can be used to compute the consistent patterns from audit data?
The authors suggested that the association rules and frequent episodes algorithms can be used to compute the consistent patterns from audit data.
Q5. What is the key advantage of the approach?
The key advantage of their approach is that it can automatically generate concise and accurate detection models from a large amount of audit data.
Q6. What is the priority of the research plan?
A priority in their research plan is to study and experiment with (inductively learned) classification models that combine evidence from multiple (base) detection models.
Q7. What are the main shortcomings of misuse detection systems?
The main shortcomings of such systems are: known intrusion patterns have to be hand-coded into the system; they are unable to detect any future (unknown) intrusions that have no matched patterns stored in the system.
Q8. What can be used to find inter- audit record patterns?
While the association rules algorithm seeks to find intra-audit record patterns, the frequent episodes algorithm, as described in [MTV95], can be used to discover inter-audit record patterns.
Q9. What is the way to collect information about a host system?
Many operating systems provide auditing utilities, such as the BSM audit of Solaris, that can be configured to collect abundant information (with many features) about the activities in a host system.
Q10. What can be the way to improve the accuracy of the RIPPER model?
Improvement in accuracy can come from adding more features, rather than just the system calls, into the models of program execution.
Q11. What makes it impossible for an operational system to be completely secure?
The policies that balance convenience versus strict control of a system and information access also make it impossible for an operational system to be completely secure.
Q12. What is the key advantage of the methodology?
The methodology itself is general and mechanical, and therefore can be used to build intrusion detection systems for a wide variety of computing environments.
Q13. What is the effect of the new features on the intrusion data?
As Figure 1 shows, for the in-coming traffic, the misclassification rates on the intrusion data increase dramatically as the time interval goes from 5s to 30s, then stabilize or taper off afterwards.
Q14. What is the effect of the frequent episodes algorithm?
In Figure 2 the authors see that the number of frequent episodes (raw episodes or serial episode rules) increases sharply as win goes from 2s to 30s, and then gradually stabilizes (note that by the nature of the frequent episodes algorithm, the number of episodes can only increase as win increases).
Q15. How do the authors update the rule sets?
For each new run, the authors compute its rule set (consisting of both the association rules and the frequent episodes) from the audit trail, and merge it into the (existing) aggregate rule set.