Fuzzy Decision Tree Based Inference Techniques for Network Forensic Analysis

01 Jan 2007-Journal of Software-Vol. 18, Iss: 10, pp 2635

TL;DR: The researcher develops a fuzzy decision tree based network forensics system to aid an investigator in analyzing computer crime in network environments and automatically extract digital evidence.
Abstract: Network forensics is an important extension to present security infrastructure, and is becoming the research focus of forensic investigators and network security researchers. However many challenges still exist in conducting network forensics: The sheer amount of data generated by the network; the comprehensibility of evidences extracted from collected data; the efficiency of evidence analysis methods, etc. Against above challenges, by taking the advantage of both the great learning capability and the comprehensibility of the analyzed results of decision tree technology and fuzzy logic, the researcher develops a fuzzy decision tree based network forensics system to aid an investigator in analyzing computer crime in network environments and automatically extract digital evidence. At the end of the paper, the experimental comparison results between our proposed method and other popular methods are presented. Experimental results show that the system can classify most kinds of events (91.16% correct classification rate on average), provide analyzed and comprehensible information for a forensic

Summary (2 min read)

1 Introduction

  • With the fast development and growth in networking connectivity, complexity and activity, there has been an increase in the number of crimes committed within networks.
  • The biggest challenge in conducting network forensics is the sheer amount of data generated by the network.
  • The remainder of the paper is organized as follows: Section 2 discusses the related work such as network forensics and fuzzy decision tree system.
  • Section 3 describes the proposed fuzzy decision tree based system for network forensics, and Section 4 explains the experimental data used in this paper and shows the experimental results.

2.1 Network forensics

  • The term network forensics was introduced by the computer security expert Marcus Ranum in the early 90’s[2], and is borrowed from the legal and criminology field where “forensics” pertains to the investigation of crimes.
  • Usually, network forensics based on audit trails is a difficult and time-consuming process.
  • In particular, systems based on artificial neural networks (ANN) and support vector machines (SVM) are complex, and the results they produce are not sufficiently comprehensible.
  • Besides these, an evidence graph-based analysis method has been proposed[3], and although it is nice to present evidence correlation in graphic mode, this system is still a prototype and lacks the effective capability of inference.
  • Finally, a fuzzy expert system has also been proposed for network forensics[4], but it still asks for experts to build a knowledge base and it lacks the capability of self-learning.

2.2 Fuzzy decision tree

  • Decision trees were popularized by Quinlan with the ID3 program[5].
  • ID3 is based on the Concept Learning System algorithm.
  • To date, many algorithms have merged fuzzy representation, with its approximate reasoning capabilities, and symbolic decision trees while preserving the advantages of both: the uncertainty handling and gradual processing of the former with the comprehensibility, popularity, and ease of application of the latter[7,8].
  • The authors develop a network forensic system based on fuzzy decision tree technology (NFSFDT).
  • Figure 1 shows the architecture of the proposed system.

3.1 Traffic capturer

  • The Traffic Capturer component is responsible for network traffic capture and preparation for traffic analysis.
  • The process of traffic capture is the first step of the proposed forensic system.
  • While the capturing function is simple and straightforward, it provides the base information for other components of the forensic system.
  • Currently the traffic capturer is based on the well-known packet capture program—TcpDump[9].

3.2 Feature extractor

  • The Feature Extractor extracts features from the network traffic captured by the Traffic Capturer component.
  • Feature extraction and selection from the available data is important to the effectiveness of the methods employed.
  • The most popular data structure for network event analysis is the connection log that consists of source address and port features, destination address and port features, etc.
  • The connection log has many advantages: it is readily available; it is much more compact than other log formats, such as packet logs; it is efficient because data stream contents are not examined; and each record identifies a unique connection.
  • The JAM Project found that combining temporal information with connection log significantly increased accuracy[10].

3.3 Fuzzy evidence analyzer

  • The Fuzzy Evidence Analyzer component is the core component of NFSFDT including three sub-components: Fuzzy Preprocessor, Fuzzy Rule Bases, and Fuzzy Decision Maker.
  • For each continuous attribute a(j), NFSFDT searches for candidate cut points and then calculates the membership functions of the attribute.
  • The process of building fuzzy rule bases is also the process of building fuzzy decision trees.
  • There are at least two benefits to categorizing the rule base according to service type.
  • Such decisions are very likely to cause conflicts.

4 Experiment and Result

  • The data for their experiments was prepared by the 1998 DARPA intrusion detection evaluation program from MIT Lincoln Labs[13].
  • In order to make the results even more comprehensible, the authors categorize the target into five different classes {R2L, DOS, Probe, U2R, Normal} rather than the usual two classes {Normal, Abnormal}.
  • The data set is divided into 5 subsets, and the following method is repeated 5 times.
  • In order to verify the performance, the authors run comparison experiments with several popular data mining algorithms (Naive Bayes algorithm[14], SMO algorithm[15], Decision Table majority classifier (DT)[16], C4.5[17]) using the Weka tool (Weka is an open source data mining software package[18]).
  • Figure 5 shows the results comparing the TP rate, while Fig.6 shows the corresponding results of the PRECISION measurements (FDT denotes the algorithm proposed in this paper).
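
The TP rate and precision compared in Fig.5 and Fig.6 can be illustrated with a minimal sketch. It assumes the standard per-class definitions (TP rate = TP/(TP+FN), precision = TP/(TP+FP)); the function name `per_class_metrics` and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
from collections import Counter

# The five target classes named in the summary above.
CLASSES = ["Normal", "DOS", "Probe", "R2L", "U2R"]


def per_class_metrics(actual, predicted, classes=CLASSES):
    """Return {class: (tp_rate, precision)} from two parallel label lists."""
    tp = Counter()  # instances of a class that were predicted as that class
    fn = Counter()  # instances of a class that were predicted as another class
    fp = Counter()  # instances of other classes that were predicted as the class
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fn[a] += 1
            fp[p] += 1
    metrics = {}
    for c in classes:
        actual_total = tp[c] + fn[c]
        predicted_total = tp[c] + fp[c]
        # Report 0.0 when a class never occurs or is never predicted.
        tp_rate = tp[c] / actual_total if actual_total else 0.0
        precision = tp[c] / predicted_total if predicted_total else 0.0
        metrics[c] = (tp_rate, precision)
    return metrics


# Toy labels for illustration only (not experimental data).
actual = ["Normal", "Normal", "DOS", "DOS", "Probe", "R2L"]
predicted = ["Normal", "DOS", "DOS", "DOS", "Probe", "Normal"]
metrics = per_class_metrics(actual, predicted)
```

In this toy run, `metrics["DOS"]` has a TP rate of 1.0 but a precision of 2/3, because one Normal connection was labeled DOS. This is why the two measures are reported separately.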

5 Conclusion

  • The authors developed an automated network intrusion forensic system, which can produce interpretable and accurate results for forensic experts by applying a fuzzy logic based decision tree data mining system.


ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software, Vol.18, No.10, October 2007, pp.2635-2644 http://www.jos.org.cn
DOI: 10.1360/jos182635 Tel/Fax: +86-10-62562563
© 2007 by Journal of Software. All rights reserved.
Fuzzy Decision Tree Based Inference Techniques for Network Forensic Analysis
LIU Zai-Qiang+, LIN Dong-Dai, FENG Deng-Guo
(State Key Laboratory of Information Security, Institute of Software, The Chinese Academy of Sciences, Beijing 100080, China)
+ Corresponding author: Phn: +86-10-62528254 ext 801, E-mail: liuzq@is.iscas.ac.cn
Liu ZQ, Lin DD, Feng DG. Fuzzy decision tree based inference techniques for network forensic analysis.
Journal of Software, 2007,18(10):2635-2644. http://www.jos.org.cn/1000-9825/18/2635.htm
Abstract: Network forensics is an important extension to present security infrastructure, and is becoming the
research focus of forensic investigators and network security researchers. However, many challenges still exist in
conducting network forensics: the sheer amount of data generated by the network; the comprehensibility of
evidence extracted from collected data; the efficiency of evidence analysis methods, etc. To address these challenges,
and taking advantage of both the strong learning capability and the comprehensible results of
decision tree technology and fuzzy logic, we develop a fuzzy decision tree based network forensics
system to aid an investigator in analyzing computer crime in network environments and to automatically extract digital
evidence. At the end of the paper, experimental results comparing our proposed method with other
popular methods are presented. Experimental results show that the system can classify most kinds of events
(91.16% correct classification rate on average), provide analyzed and comprehensible information for a forensic
expert and automate or semi-automate the process of forensic analysis.
Key words: network forensics; fuzzy decision tree; data-mining; feature extraction; intrusion detection
Chinese Library Classification: TP393    Document code: A
Supported by the National High-Tech Research and Development Plan of China (863 Program) under Grant Nos.2006AA01Z412, 2006AA01Z437,
2006AA01Z433
Received 2006-01-16; Accepted 2006-03-09

1 Introduction
With the fast development and growth in networking connectivity, complexity and activity, there has been an
increase in the number of crimes committed within networks. This is forcing both enterprises and law enforcement
to undertake highly specialized investigations. Network forensics is the act of capturing, recording and analyzing
network audit trails in order to discover the source of security breaches or other information assurance problems[1].
The biggest challenge in conducting network forensics is the sheer amount of data generated by the network.
Besides this, the comprehensibility of the process of analyzing evidence extracted from collected data is
also an important aspect for forensic experts. Therefore, investigators need the aid of an effective,
comprehensible and automated analysis system for network intrusion forensics. In this paper, we propose a fuzzy
decision tree based system for network intrusion forensics that can efficiently detect and analyze computer crime in
networked environments, and locate digital evidence automatically.
The remainder of the paper is organized as follows: Section 2 discusses related work on network
forensics and fuzzy decision tree systems. Section 3 describes the proposed fuzzy decision tree based system for
network forensics. Section 4 explains the experimental data used in this paper and shows the experimental
results. Finally, conclusions and further issues in network forensics are discussed in Section 5.
2 Related Work
2.1 Network forensics
The term network forensics was introduced by the computer security expert Marcus Ranum in the early 1990s[2],
and is borrowed from the legal and criminology field where “forensics” pertains to the investigation of crimes.
Network forensic systems are designed to identify unauthorized use, misuse, and attacks on information. Usually,
network forensics based on audit trails is a difficult and time-consuming process. Recently, artificial
intelligence technologies, such as artificial neural networks (ANN) and support vector machines (SVM)[1], were
developed to extract significant features for network forensics in order to automate and simplify the process. These
techniques are effective in reducing computing time and increasing intrusion detection accuracy to a certain
extent, but they are limited in forensic analysis. In particular, these systems are complex, and the results they
produce are not sufficiently comprehensible. Besides these, an evidence graph-based analysis method has been
proposed[3]; although it presents evidence correlation well in graphic mode, this system is still a prototype
and lacks an effective inference capability. Finally, a fuzzy expert system has also been proposed for network
forensics[4], but it still requires experts to build a knowledge base and it lacks the capability of self-learning. The
fuzzy decision tree-based forensic system proposed in this paper can effectively solve the above problems while
producing better analytical results.
2.2 Fuzzy decision tree
Decision trees were popularized by Quinlan with the ID3 program[5]. ID3 is based on the Concept Learning
System algorithm. ID3 works by searching through the attributes of the training instances $\{E \mid e_1, e_2, \ldots, e_i, \ldots, e_N\}$
(where N is the number of training samples) and extracting the attribute from the attribute set $\{A \mid a_1, a_2, \ldots, a_j, \ldots, a_M\}$
(where M is the number of attributes) that best separates the given examples. The algorithm uses a
greedy search to choose the best attribute and never looks back to reconsider earlier choices. Note that the
ID3 algorithm usually works well in symbolic domains but does not handle numerical attributes. The C4.5 and C5.0
algorithms extend ID3 by widening the domain of classification from categorical attributes to
numeric ones. Although decision tree technologies have already been shown to be interpretable, efficient, problem
independent and able to treat large scale applications, they are also recognized as highly unstable classifiers with
respect to minor perturbations in the training data; in other words, they are methods with high variance. Fuzzy logic
brings an improvement in these aspects due to the elasticity of the fuzzy set formalism. Fuzzy sets and fuzzy logic allow
the modeling of language-related uncertainties, while providing a symbolic framework for knowledge
comprehensibility[6]. To date, many algorithms have merged fuzzy representation, with its approximate reasoning
capabilities, and symbolic decision trees while preserving the advantages of both: the uncertainty handling and gradual
processing of the former with the comprehensibility, popularity, and ease of application of the latter[7,8]. Amending
decision trees with an additional knowledge component based on fuzzy representation further increases their
representative power and applicability.
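
The greedy attribute choice described above can be sketched as follows. The text does not spell out the selection criterion; this sketch assumes information gain, the measure used in Quinlan's ID3. The helper names and the toy connection records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the samples on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Greedy step: pick the attribute that best separates the examples."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy symbolic training set (illustrative values only).
rows = [
    {"protocol_type": "tcp", "flag": "SF"},
    {"protocol_type": "tcp", "flag": "S0"},
    {"protocol_type": "udp", "flag": "SF"},
    {"protocol_type": "icmp", "flag": "SF"},
]
labels = ["Normal", "DOS", "Normal", "Normal"]
chosen = best_attribute(rows, labels, ["protocol_type", "flag"])
```

Here `flag` is chosen because splitting on it leaves every partition with a single class. A full ID3 implementation would recurse on each partition with the remaining attributes and would never revisit this choice.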
3 Fuzzy Decision Tree-Based Network Forensic System
We develop a network forensic system based on fuzzy decision tree technology (NFSFDT). NFSFDT consists
of the following components: Traffic Capturer, Feature Extractor, and Fuzzy Evidence Analyzer. Figure 1 shows the
architecture of the proposed system. The following sections detail each component.
[Figure 1 components: network traffic; network traffic capturer; feature extractor; fuzzy evidence analyser (fuzzy processor, fuzzy rule base with fuzzy rule sub-bases, fuzzy decision maker); evidence documentor; digital evidence; court]
Fig.1 NFSFDT system
3.1 Traffic capturer
The Traffic Capturer component is responsible for network traffic capture and preparation for traffic analysis.
The process of traffic capture is the first step of the proposed forensic system. While the capturing function is
simple and straightforward, it provides the base information for other components of the forensic system. Currently
the traffic capturer is based on the well-known packet capture program TcpDump[9].

3.2 Feature extractor
The Feature Extractor extracts features from the network traffic captured by the Traffic Capturer
component. Feature extraction and selection from the available data is important to the effectiveness of the methods
employed. In a network environment, there are many traffic features that can be used for intrusion detection
or event analysis, such as source address and port number, destination address and port number, timestamp, etc. For
convenience, we use a group of features as a data structure characterizing network traffic. The most popular
data structure for network event analysis is the connection log, which consists of source address and port features,
destination address and port features, etc. It has many advantages: it is readily available; it is much more compact
than other log formats, such as packet logs; it is efficient because data stream contents are not examined; and each
record identifies a unique connection. Even though connection records provide numerous features specific to
each connection, additional features are needed to analyze network events effectively. Essential attributes provide vital
information about connections, but some secondary attributes are also needed, such as TCP flags, connection
duration and the volume of data passed in each direction. The JAM Project found that combining temporal
information with the connection log significantly increased accuracy[10]. Usually temporal information is determined by
calculating the average value of a feature (attribute), or by calculating the accumulated count of connections over a
time window (such as t seconds) or over n connections. The Feature Extractor extracts 41 different features in all,
consisting of connection log features and other calculated features. For more detailed information about feature selection,
please refer to Refs.[10,11].
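
As a rough illustration of the temporal information mentioned above, the sketch below counts, for each connection, how many connections went to the same destination host within the preceding t seconds. The function name `count_same_host`, the record layout `(timestamp, dst_host)` and the 2-second window are assumptions made for this example; the paper does not specify them here.

```python
from collections import deque


def count_same_host(connections, window_seconds=2.0):
    """For each connection, count connections to the same destination host
    seen in the last `window_seconds` seconds (the connection itself included).

    `connections` is a list of (timestamp, dst_host) tuples sorted by timestamp.
    """
    recent = deque()   # connections still inside the sliding time window
    per_host = {}      # running count per destination host inside the window
    counts = []
    for ts, dst in connections:
        # Drop connections that have fallen out of the time window.
        while recent and ts - recent[0][0] > window_seconds:
            _, old_dst = recent.popleft()
            per_host[old_dst] -= 1
        recent.append((ts, dst))
        per_host[dst] = per_host.get(dst, 0) + 1
        counts.append(per_host[dst])
    return counts


# Toy connection log (illustrative values only).
connections = [
    (0.0, "10.0.0.5"),
    (0.5, "10.0.0.5"),
    (1.0, "10.0.0.9"),
    (3.0, "10.0.0.5"),
]
counts = count_same_host(connections)
```

A count over the last n connections instead of t seconds could be obtained the same way by bounding the deque by length rather than by timestamp.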
3.3 Fuzzy evidence analyzer
The Fuzzy Evidence Analyzer component is the core component of NFSFDT including three sub-components:
Fuzzy Preprocessor, Fuzzy Rule Bases, and Fuzzy Decision Maker. The following sections detail the above
sub-components individually.
3.3.1 Fuzzy preprocessor
There exist two different kinds of domains for features extracted by the Feature Extractor: continuous and
discrete (such as service type: tcp, udp, icmp). Each input variable’s sharp (crisp) value needs to be first fuzzified
into linguistic values before the Fuzzy Decision-maker processes them with the Rule Base. Unlike classical sets, a
fuzzy set expresses the degree to which an element belongs to a set. The characteristic function of a fuzzy set
takes values between 0 and 1, which denote the degree of membership of an element in a given set.
The Fuzzy Preprocessor uses two different methods to fuzzify the continuous and the discrete features respectively. For the
discrete features, the Fuzzy Preprocessor component uses the same technique as the classical set. For example, let
protocol_type={tcp,udp,icmp} be the set of protocol types; then the membership function of each protocol type can
be expressed as follows

$$
\mu_{type}(x) = discrete(x) =
\begin{cases}
1, & x = \text{protocol type} \\
0, & \text{otherwise}
\end{cases}
$$

where $x \in \{tcp, udp, icmp\}$ and type is a fuzzy set. Besides the protocol_type feature, there are other discrete features (such
as service type, flag, etc.), which use the same fuzzifying method.
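
The crisp fuzzification of a discrete feature described above can be sketched in a few lines. The function names are hypothetical; only the protocol_type values come from the text.

```python
# Values of the protocol_type feature named in the text.
PROTOCOL_TYPES = ("tcp", "udp", "icmp")


def discrete_membership(x, category):
    """Classical-set membership: 1 if x is the given category, 0 otherwise."""
    return 1.0 if x == category else 0.0


def fuzzify_discrete(x, categories=PROTOCOL_TYPES):
    """Map a discrete feature value to its membership degree in each category."""
    return {c: discrete_membership(x, c) for c in categories}


memberships = fuzzify_discrete("udp")
```

Representing discrete features this way lets them pass through the same rule evaluation as continuous features, since both end up as membership degrees in [0, 1].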
For continuous features, we choose the trapezoidal function as their membership function. The trapezoidal set
is very popular in fuzzy theory due to its computational and storage efficiency and, more importantly, because it is
interpretable and comprehensible. A trapezoidal membership function is specified by four parameters $\{A_n, B_n, C_n, D_n\}$
as follows

$$
\mu_n(x) = trapezoid(x, A_n, B_n, C_n, D_n) =
\begin{cases}
0, & \text{for } x < A_n \\
(x - A_n)/(B_n - A_n), & \text{for } A_n \le x < B_n \\
1, & \text{for } B_n \le x \le C_n \\
(D_n - x)/(D_n - C_n), & \text{for } C_n < x \le D_n \\
0, & \text{for } D_n < x
\end{cases}
$$

where $\mu_n(x)$ represents the membership function of the n-th fuzzy subset. Note that if $B_n = C_n$ in the above formula,
then $\mu_n(x)$ becomes a triangular membership function (see $\mu_2(x)$ in Fig.2). Fig.2 presents the fuzzy subsets of
the universe of discourse num_failed_logins (a feature denoting the number of failed login attempts). Using the
membership functions defined for each fuzzy set of each linguistic variable, the degree of membership of a sharp
feature value in each fuzzy set is determined.
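[Figure 2: membership degree $\mu(x)$ (from 0 to 1) plotted against num_failed_logins (axis ticks 1 to 5), showing the fuzzy subsets $\mu_1(x)$, $\mu_2(x)$ and $\mu_3(x)$ with the parameters $A_n$, $B_n$, $C_n$, $D_n$ marked]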
Fig.2 Membership function for num_failed_logins feature
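
A minimal sketch of the trapezoidal membership function follows. The treatment of the exact boundary points and the parameter values of the three example subsets are assumptions made for illustration; they are not read off Fig.2.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership degree of x for parameters a <= b <= c <= d.

    Rises linearly on [a, b), equals 1 on [b, c], falls linearly on (c, d],
    and is 0 elsewhere. With b == c the shape is a triangle.
    """
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical fuzzy subsets of num_failed_logins (values chosen for the demo).
SUBSETS = {
    "low": (0, 0, 1, 2),
    "medium": (1, 3, 3, 5),   # b == c, so this one is triangular
    "high": (4, 5, 6, 6),
}


def fuzzify_continuous(x, subsets=SUBSETS):
    """Return the membership degree of a crisp value in every fuzzy subset."""
    return {name: trapezoid(x, *params) for name, params in subsets.items()}


degrees = fuzzify_continuous(1.5)
```

With these assumed parameters, 1.5 failed logins belongs to "low" with degree 0.5 and to "medium" with degree 0.25, which illustrates how one crisp value can belong to several linguistic values at once.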
Usually the membership function for continuous features can be user defined, but due to the large volume and
high dimensionality of network data, it is very difficult even for an expert to define the membership functions for all
the continuous features. So NFSFDT uses an automatic approach to create the membership functions for each
continuous feature. Assume a sample can be described by M attributes $\{A \mid a^{(1)}, a^{(2)}, \ldots, a^{(j)}, \ldots, a^{(M)}\}$ and each attribute
$a^{(j)}$ takes $p_j$ values of a fuzzy subset $\{a_1^{(j)}, a_2^{(j)}, \ldots, a_{p_j}^{(j)}\}$. The algorithm for finding cut points and
constructing membership functions for continuous attributes is as follows:
Step 1: If an attribute $a^{(j)}$ is continuous, then sort the training samples in ascending order according to the value
of the attribute.
Step 2: Preprocess the values of the attribute $a^{(j)}$ so that large values do not overwhelm small ones. For each
attribute $a^{(j)}$ apply the following conditional calculation: if $\max(a_i^{(j)})/\min(a_i^{(j)}) > \Gamma$, then $a_i^{(j)} = \log(a_i^{(j)})$, $(0 < i \le N)$.
Note: here $a_i^{(j)}$ denotes the value of $a^{(j)}$ before being fuzzified; $\Gamma$ denotes a positive integer, such as 10000.
Step 3: Search the candidate cut points of $a^{(j)}$. For each continuous attribute $a^{(j)}$ apply the following conditional
calculation: if $a_i^{(j)} \in C_j$, $a_{i+1}^{(j)} \notin C_j$ and $a_i^{(j)} \ne a_{i+1}^{(j)}$, then $M_i = (a_i^{(j)} + a_{i+1}^{(j)})/2$ is used as a candidate cut
point. Note: there is no candidate cut point between adjoining data with equal attribute values and different
classes[12].
Step 4: Calculate the membership functions of each continuous attribute $a^{(j)}$. Calculating the membership
function amounts to calculating the values of the parameters $\{A_n, B_n, C_n, D_n\}$ (see Fig.2). The values of
$\{A_n, B_n, C_n, D_n\}$ and their ranges are described in Fig.3.
Note: $(1+\lambda)\,Median(\mu_{n-1}) \le CP_n \le (1-\lambda)\,Median(\mu_n)$ and
$(1+\lambda)\,Median(\mu_n) \le CP_{n+1} \le (1-\lambda)\,Median(\mu_{n+1})$,
where $0 \le \lambda \le 1/2$ and $Median(\mu_n)$ is the median of the set $\{a_i^{(j)} \mid CP_n \le a_i^{(j)} \le CP_{n+1}\}$.
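
Steps 1 to 3 above can be sketched as follows. Several details are assumptions made for this example: the natural logarithm is used because the base is not stated, the scaling is skipped when the minimum is not positive, and a candidate cut point is taken as the midpoint of two adjoining values that differ in both value and class. Step 4 is not sketched because its parameter values depend on Fig.3.

```python
import math


def log_scale_if_needed(values, gamma=10000):
    """Step 2: apply log when max/min exceeds gamma, so that large values
    do not overwhelm small ones. Skipped if the minimum is not positive."""
    lo, hi = min(values), max(values)
    if lo > 0 and hi / lo > gamma:
        return [math.log(v) for v in values]
    return list(values)


def candidate_cut_points(values, labels, gamma=10000):
    """Steps 1 and 3: sort the samples by attribute value, then propose a cut
    point midway between adjoining samples that differ in value and class."""
    scaled = log_scale_if_needed(values, gamma)
    # Log is monotonic, so scaling before sorting gives the same order.
    samples = sorted(zip(scaled, labels))
    cuts = []
    for (v1, c1), (v2, c2) in zip(samples, samples[1:]):
        # No cut point between equal attribute values, even if classes differ.
        if c1 != c2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts


# Toy attribute values and class labels (illustrative only).
values = [0, 1, 1, 3, 4]
labels = ["Normal", "Normal", "R2L", "R2L", "Normal"]
cuts = candidate_cut_points(values, labels)
```

In the toy data, the two samples with value 1 belong to different classes but produce no cut point, in line with the note in Step 3. The only candidate is 3.5, between the R2L sample at 3 and the Normal sample at 4.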
