
Statistical Evaluation of Rough Set Dependency Analysis

Ivo Düntsch¹
School of Information and Software Engineering
University of Ulster
Newtownabbey, BT 37 0QB, N. Ireland
I.Duentsch@ulst.ac.uk

Günther Gediga¹
FB Psychologie / Methodenlehre
Universität Osnabrück
49069 Osnabrück, Germany
gg@Luce.Psycho.Uni-Osnabrueck.DE
and
Institut für semantische Informationsverarbeitung
Universität Osnabrück

December 12, 1996

¹ Equal authorship implied

Summary
Rough set data analysis (RSDA) has recently become a frequently studied symbolic method in data mining. Among other things, it is used for the extraction of rules from databases; it is, however, not clear from within the methods of rough set analysis whether the extracted rules are valid.

In this paper, we suggest enhancing RSDA by two simple statistical procedures, both based on randomization techniques, to evaluate the validity of predictions based on the approximation quality of attributes in rough set dependency analysis. The first procedure tests the casualness of a prediction, to ensure that the prediction is not based on only a few (casual) observations. The second procedure tests the conditional casualness of an attribute within a prediction rule.

The procedures are applied to three data sets originally published in the context of rough set analysis. We argue that several claims of these analyses need to be modified because they lack validity, and that other, possibly significant, results were overlooked.
Keywords: Rough sets, dependency analysis, statistical evaluation, validation, randomization test

1 Introduction
Rough set analysis, an emerging technology in artificial intelligence (Pawlak et al. (1995)), has been compared with statistical models, see for example Wong et al. (1986), Krusińska et al. (1992a) or Krusińska et al. (1992b). One area of application of rough set theory is the extraction of rules from databases; these rules are then sometimes claimed to be useful for future decision making or prediction of events. However, if such a rule is based on only a few observations, its usefulness for prediction is arguable (see also Krusińska et al. (1992a), p. 253 in this context).

The aim of this paper is to employ statistical methods which are compatible with the rough set philosophy to evaluate the “prediction quality” of rough set dependency analysis. The methods will be applied to three different data sets:

- The first set was published in Pawlak et al. (1986) and Słowiński & Słowiński (1990). It utilizes rough set analysis to describe patients after highly selective vagotomy (HSV) for duodenal ulcer. The statistical validity of the conclusions will be discussed.
- The second example is the discussion of earthquake data published by Teghem & Charlet (1992). The main reason why we use this example is that it demonstrates the applicability of our approach in a situation where the prediction success is perfect in terms of rough set analysis.
- The third example is used by Teghem & Benjelloun (1992) to compare statistical and rough set methods. We show how statistical methods within rough set analysis highlight some of their results in a different way.
2 Rough set data analysis
A major area of application of rough set theory is the study of dependencies among attributes of information systems. An information system S = ⟨U, Ω, (V_q)_{q∈Ω}, f⟩ consists of

1. A set U of objects,
2. A finite set Ω of attributes,
3. For each q ∈ Ω a set V_q of attribute values,
4. An information function f : U × Ω → V, where V := ⋃_{q∈Ω} V_q and f(x, q) ∈ V_q for all x ∈ U, q ∈ Ω.

We think of the descriptor f(x, q) as the value which object x takes at attribute q.
With each Q ⊆ Ω we associate an equivalence relation θ_Q on U by

x ≡ y (θ_Q) :⇐⇒ f(x, q) = f(y, q) for all q ∈ Q.

If x ∈ U, then θ_Q x is the equivalence class of θ_Q containing x.
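
To make these definitions concrete, the following Python sketch models a small information system as a column-indexed table and computes the classes of θ_Q by grouping objects on their Q-values. The data and all identifiers are illustrative inventions, not taken from the paper.

    from collections import defaultdict

    # Toy information system: U is the set of objects; each attribute is a
    # column mapping objects to attribute values (all values invented).
    U = ["a", "b", "c", "d"]
    table = {
        "colour": {"a": "red", "b": "red", "c": "blue", "d": "blue"},
        "size":   {"a": "big", "b": "big", "c": "big",  "d": "small"},
    }

    def theta_classes(table, U, Q):
        """Equivalence classes of theta_Q: x and y share a class
        iff f(x, q) = f(y, q) for every q in Q."""
        groups = defaultdict(set)
        for x in U:
            groups[tuple(table[q][x] for q in sorted(Q))].add(x)
        return list(groups.values())

    print(theta_classes(table, U, {"colour"}))  # two classes: {a, b} and {c, d}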

Intuitively, x ≡ y (θ_Q) if the objects x and y are indiscernible with respect to the values of their attributes from Q. If X ⊆ U, then the lower approximation of X by Q,

X_{θ_Q} = ⋃ { θ_Q x : θ_Q x ⊆ X },

is the set of all correctly classified elements of X with respect to θ_Q, i.e. with the information available from the attributes given in Q.
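
The lower approximation is thus the union of those θ_Q-classes that fit entirely inside X. A minimal sketch, reusing the theta_classes helper from the previous snippet:

    def lower_approximation(table, U, Q, X):
        """Union of all theta_Q classes contained in X: the elements of X
        that are classified correctly using only the attributes in Q."""
        X = set(X)
        return set().union(*[c for c in theta_classes(table, U, Q) if c <= X])

    # For the toy table, only the class {d} of theta_{size} lies inside X:
    print(lower_approximation(table, U, {"size"}, {"a", "b", "d"}))  # {'d'}
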
Suppose that P, Q ⊆ Ω. We say that P is dependent on Q, written Q → P, if every class of θ_P is a union of classes of θ_Q. In other words, the classification of U induced by θ_P can be expressed by the classification induced by θ_Q.

In order to simplify notation, we shall in the sequel usually write Q → p instead of Q → {p}, and θ_p instead of θ_{p}.
Each dependency Q → P leads to a set of rules as follows: Suppose that Q := {q_0, …, q_n} and P := {p_0, …, p_k}. For each set {t_0, …, t_n} where t_i ∈ V_{q_i} there is a uniquely determined set {s_0, …, s_k} with s_i ∈ V_{p_i} such that

(∀x ∈ U)[(f(x, q_0) = t_0 ∧ ⋯ ∧ f(x, q_n) = t_n) → (f(x, p_0) = s_0 ∧ ⋯ ∧ f(x, p_k) = s_k)].   (2.1)
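
In programming terms, when the dependency Q → P holds, the rules (2.1) form a lookup table from Q-value combinations to P-value combinations. The following sketch (same illustrative setup as above) extracts this mapping and flags violations of determinism:

    def extract_rules(table, U, Q, P):
        """Map each Q-value combination occurring in U to the P-value
        combination it determines; fails if some Q-combination does not
        determine the P-values uniquely."""
        rules = {}
        for x in U:
            key = tuple(table[q][x] for q in sorted(Q))
            val = tuple(table[p][x] for p in sorted(P))
            if rules.setdefault(key, val) != val:
                raise ValueError(f"Q-combination {key} does not determine P")
        return rules
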
Of particular interest in rough set dependency theory are those sets Q which use the least number of attributes and still have Q → P. A set with this property is called a minimal determining set for P. In other words, a set Q is minimal determining for P if Q → P and R ↛ P for all R ⊊ Q.

If such a Q is a subset of P, we call Q a reduct of P. It is not hard to see that each P has a reduct, though this need not be unique. The intersection of all reducts of P is called the core of P. Unless P has only one reduct, the core of P is not itself a reduct.
For each R ⊆ Ω let 𝒫_R be the partition of U induced by θ_R. Define

γ_Q(P) = ( Σ_{X ∈ 𝒫_P} |X_{θ_Q}| ) / |U|.   (2.2)

γ_Q(P) is the relative frequency of correctly Q-classified elements with respect to the partition induced by P. It is usually interpreted in rough set analysis as a measure of the prediction success of a set of inference rules based on value combinations of Q and value combinations of P of the form given in (2.1). The prediction success is perfect if γ_Q(P) = 1; in this case, Q → P.
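
Equation (2.2) translates directly into code: sum the sizes of the lower approximations of the θ_P-classes and divide by |U|. A sketch built on the hypothetical helpers above:

    def gamma(table, U, Q, P):
        """Approximation quality gamma_Q(P) of (2.2): the fraction of objects
        whose theta_P class is predicted correctly from the attributes in Q."""
        correct = sum(len(lower_approximation(table, U, Q, X))
                      for X in theta_classes(table, U, P))
        return correct / len(U)

    print(gamma(table, U, {"colour"}, {"size"}))  # 0.5 for the toy table
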
Suppose that Q is a reduct of P, so that Q → P and Q \ {q} ↛ P for any q ∈ Q. In rough set theory, the impact of attribute q on the fact that Q → P is usually measured by the drop of the approximation function γ from 1 to γ_{Q\{q}}(P): the larger the difference, the more important the contribution of q is regarded to be. We shall show below that this interpretation needs to be taken with care in some cases, and that additional statistical evidence may be needed.
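
This traditional importance measure is simply the drop 1 − γ_{Q\{q}}(P) for each q ∈ Q; for illustration, with the hypothetical gamma helper from above:

    def gamma_drops(table, U, Q, P):
        """Drop in approximation quality when a single attribute q is removed
        from the reduct Q; large drops are traditionally read as importance."""
        full = gamma(table, U, Q, P)
        return {q: full - gamma(table, U, set(Q) - {q}, P) for q in Q}
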

3 Casual rules and randomization analysis
3.1 Casual dependencies
In the sequel we consider the case that a rule Q → P was given before performing the data analysis, and was not obtained by optimizing the quality of approximation. The latter needs additional treatment and will be discussed briefly in Section 3.5.
Suppose that θ_Q is the identity relation id_U on U. Then θ_Q ⊆ θ_P for all P ⊆ Ω, i.e. Q → P for all P ⊆ Ω. Furthermore, each class of θ_Q consists of exactly one element, and therefore any rule Q → P is based on exactly one observation. We call such a rule deterministic casual.
If a rule is not deterministic casual, it may nevertheless be based on only a few observations, and thus its prediction quality could be limited; such rules may be called casual. Therefore, the need arises for a statistical procedure which tests the casualness of a rule based on the mechanisms of rough set analysis.

Assume that the information system is the realization of a random process in which the attribute values of Q and P are realized independently of each other. If no additional information is present, it may be assumed that the attribute value combinations within Q and P are fixed and that the matching of the Q, P combinations is drawn at random.
Let σ be a permutation of U, and Q ⊆ Ω. We define a new information function f_{σ(Q)} by

f_{σ(Q)}(x, r) := f(σ(x), r) if r ∈ Q, and f_{σ(Q)}(x, r) := f(x, r) otherwise,

and let γ_{σ(Q)}(P) be the approximation quality of the prediction of P by Q in the new information system. Note that the structure of the equivalence relation θ_{σ(Q)} determined by Q in the revised system is the same as that of the original θ_Q. In other words, there is a bijective mapping

τ : {θ_{σ(Q)} x : x ∈ U} → {θ_Q x : x ∈ U}

which preserves the cardinality of the classes. In particular, if θ_Q is the identity on U, so is θ_{σ(Q)}. It follows that for a rule Q → p with θ_Q = id_U, we have γ_{σ(Q)}(p) = 1 as well, for all permutations σ of U.
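
Computationally, f_{σ(Q)} merely reassigns which object carries which combination of Q-values, leaving the class structure of θ_Q intact. A sketch, drawing σ uniformly at random and reusing the toy setup from above:

    import random

    def permute_Q(table, U, Q, rng=random):
        """Table for f_sigma(Q): the Q-columns are rearranged by a random
        permutation sigma of U, so theta_sigma(Q) keeps the class sizes of
        theta_Q while its matching with the remaining attributes changes."""
        sigma = dict(zip(U, rng.sample(U, len(U))))
        new_table = {a: dict(col) for a, col in table.items()}
        for q in Q:
            for x in U:
                new_table[q][x] = table[q][sigma[x]]
        return new_table
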
The distribution of the prediction success is given by the set

R_{P,Q} := { γ_{σ(Q)}(P) : σ a permutation of U }.
Let H be the null hypothesis; we have to estimate the position of the observed approximation quality γ_obs := γ_Q(P) in the set R_{P,Q}, i.e. to estimate the probability p(γ_R ≥ γ_obs | H). Standard randomization techniques (see for example Manly (1991), Chapter 1) can now be applied to estimate this probability.
If p(γ_R ≥ γ_obs | H) is low (conventionally, when γ_obs lies in the upper 5% region), the assumption of randomness can be rejected; otherwise, if

p(γ_R ≥ γ_obs | H) > 0.05,

we call the rule (random) casual.
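
Putting the pieces together, p(γ_R ≥ γ_obs | H) can be estimated by Monte Carlo sampling over random permutations, in the spirit of Manly (1991). A sketch using the hypothetical helpers from the previous snippets; the add-one correction counts the observed system itself among the trials:

    def casualness_test(table, U, Q, P, trials=1000, alpha=0.05, rng=random):
        """Estimate p(gamma >= gamma_obs | H) under random matching of the
        Q-value rows to the remaining attributes; the rule Q -> P is called
        (random) casual when this estimate exceeds alpha."""
        gamma_obs = gamma(table, U, Q, P)
        hits = sum(gamma(permute_Q(table, U, Q, rng), U, Q, P) >= gamma_obs
                   for _ in range(trials))
        p_value = (hits + 1) / (trials + 1)
        return p_value, p_value > alpha  # (estimate, is the rule casual?)
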

References

Edgington, E. S. Randomization Tests. New York: Marcel Dekker.

Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology. London: Chapman & Hall.

Pawlak, Z., Grzymała-Busse, J., Słowiński, R. & Ziarko, W. (1995). Rough sets. Communications of the ACM, 38(11), 88–95.