Statistical Evaluation of Rough Set Dependency Analysis

Ivo Düntsch¹
School of Information and Software Engineering
University of Ulster
Newtownabbey, BT 37 0QB, N. Ireland
I.Duentsch@ulst.ac.uk

Günther Gediga¹
FB Psychologie / Methodenlehre
Universität Osnabrück
49069 Osnabrück, Germany
gg@Luce.Psycho.Uni-Osnabrueck.DE
and
Institut für semantische Informationsverarbeitung
Universität Osnabrück

December 12, 1996

¹ Equal authorship implied
Summary

Rough set data analysis (RSDA) has recently become a frequently studied symbolic method in data mining. Among other things, it is used for the extraction of rules from databases; it is, however, not clear from within the methods of rough set analysis whether the extracted rules are valid.

In this paper, we suggest enhancing RSDA by two simple statistical procedures, both based on randomization techniques, to evaluate the validity of predictions based on the approximation quality of attributes in rough set dependency analysis. The first procedure tests the casualness of a prediction, to ensure that the prediction is not based on only a few (casual) observations. The second procedure tests the conditional casualness of an attribute within a prediction rule.

The procedures are applied to three data sets originally published in the context of rough set analysis. We argue that several claims of these analyses need to be modified for lack of validity, and that other possibly significant results were overlooked.
Keywords: Rough sets, dependency analysis, statistical evaluation, validation, randomization test
1 Introduction
Rough set analysis, an emerging technology in artificial intelligence (Pawlak et al. (1995)), has been compared with statistical models, see for example Wong et al. (1986), Krusińska et al. (1992a) or Krusińska et al. (1992b). One area of application of rough set theory is the extraction of rules from databases; these rules are then sometimes claimed to be useful for future decision making or the prediction of events. However, if such a rule is based on only a few observations, its usefulness for prediction is arguable (see also Krusińska et al. (1992a), p. 253, in this context).
The aim of this paper is to employ statistical methods which are compatible with the rough set phi-
losophy to evaluate the “prediction quality” of rough set dependency analysis. The methods will be
applied to three different data sets:
• The first set was published in Pawlak et al. (1986) and Słowiński & Słowiński (1990). It utilizes rough set analysis to describe patients after highly selective vagotomy (HSV) for duodenal ulcer. The statistical validity of the conclusions will be discussed.
• The second example is the discussion of earthquake data published by Teghem & Charlet (1992). The main reason we use this example is that it demonstrates the applicability of our approach in a situation where the prediction success is perfect in terms of rough set analysis.
• The third example is used by Teghem & Benjelloun (1992) to compare statistical and rough set
methods. We show how statistical methods within rough set analysis highlight some of their
results in a different way.
2 Rough set data analysis
A major area of application of rough set theory is the study of dependencies among attributes of information systems. An information system S = ⟨U, Ω, (V_q)_{q∈Ω}, f⟩ consists of

1. a set U of objects,
2. a finite set Ω of attributes,
3. for each q ∈ Ω a set V_q of attribute values,
4. an information function f : U × Ω → V =_def ⋃_{q∈Ω} V_q with f(x, q) ∈ V_q for all x ∈ U, q ∈ Ω.
We think of the descriptor f(x, q) as the value which object x takes at attribute q.
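As a running illustration, such an information system can be represented directly in code; the following sketch (Python, with purely hypothetical objects and attribute values) mirrors the four components above:

```python
# A toy information system S = <U, Omega, (V_q), f>; all data are
# hypothetical and serve only to illustrate the four components above.
U = [0, 1, 2, 3, 4, 5]                  # (1) the set of objects
OMEGA = ["colour", "size", "class"]     # (2) the finite attribute set

# (4) the information function f : U x Omega -> V, as a nested dict,
#     where f[x][q] is the value object x takes at attribute q.
f = {
    0: {"colour": "red",   "size": "small", "class": "a"},
    1: {"colour": "red",   "size": "small", "class": "a"},
    2: {"colour": "blue",  "size": "small", "class": "b"},
    3: {"colour": "blue",  "size": "large", "class": "b"},
    4: {"colour": "green", "size": "large", "class": "a"},
    5: {"colour": "green", "size": "large", "class": "b"},
}

# (3) the value sets V_q, recovered from the range of f.
V = {q: {f[x][q] for x in U} for q in OMEGA}
```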
With each Q ⊆ Ω we associate an equivalence relation θ_Q on U by

x ≡ y (θ_Q) ⟺_def f(x, q) = f(y, q) for all q ∈ Q.

If x ∈ U, then θ_Q x is the equivalence class of θ_Q containing x.
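In code, the classes of θ_Q can be collected by grouping objects on their Q-value tuples; a minimal sketch on hypothetical data:

```python
def theta_classes(U, f, Q):
    """Classes of the indiscernibility relation theta_Q: two objects fall
    into the same class iff they agree on every attribute q in Q."""
    classes = {}
    for x in U:
        key = tuple(f[x][q] for q in sorted(Q))   # the Q-profile of x
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

# Hypothetical data: five objects described by a single attribute "q".
U = [0, 1, 2, 3, 4]
f = {0: {"q": "a"}, 1: {"q": "a"}, 2: {"q": "b"}, 3: {"q": "b"}, 4: {"q": "c"}}
parts = theta_classes(U, f, {"q"})    # classes {0,1}, {2,3}, {4}
```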
Intuitively, x ≡ y (θ_Q) if the objects x and y are indiscernible with respect to the values of their attributes from Q. If X ⊆ U, then the lower approximation of X by Q,

X_{θ_Q} = ⋃ {θ_Q x : θ_Q x ⊆ X},

is the set of all correctly classified elements of X with respect to θ_Q, i.e. with the information available from the attributes given in Q.
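The lower approximation is then the union of those classes that fit entirely inside X; a sketch under the same hypothetical dict representation:

```python
def lower_approximation(X, U, f, Q):
    """Union of all theta_Q-classes that are entirely contained in X,
    i.e. the correctly Q-classified elements of X."""
    classes = {}
    for x in U:
        classes.setdefault(tuple(f[x][q] for q in sorted(Q)), set()).add(x)
    lower = set()
    for c in classes.values():
        if c <= X:          # only classes fully inside X are kept
            lower |= c
    return lower

# Hypothetical example: {0,1} is a full theta_Q-class inside X, while
# the class {2,3} sticks out of X and is therefore dropped.
U = [0, 1, 2, 3]
f = {0: {"q": "a"}, 1: {"q": "a"}, 2: {"q": "b"}, 3: {"q": "b"}}
approx = lower_approximation({0, 1, 2}, U, f, {"q"})
```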
Suppose that P, Q ⊆ Ω. We say that P is dependent on Q – written as Q → P – if every class of θ_P is a union of classes of θ_Q. In other words, the classification of U induced by θ_P can be expressed by the classification induced by θ_Q.

In order to simplify notation we shall in the sequel usually write Q → p instead of Q → {p}, and θ_p instead of θ_{p}.
Each dependency Q → P leads to a set of rules as follows: Suppose that Q =_def {q_0, …, q_n} and P =_def {p_0, …, p_k}. For each set {t_0, …, t_n} where t_i ∈ V_{q_i} there is a uniquely determined set {s_0, …, s_k} with s_i ∈ V_{p_i} such that

(∀x ∈ U)[(f(x, q_0) = t_0 ∧ ⋯ ∧ f(x, q_n) = t_n) ⇒ (f(x, p_0) = s_0 ∧ ⋯ ∧ f(x, p_k) = s_k)].  (2.1)
Of particular interest in rough set dependency theory are those sets Q which use the least number of attributes and still have Q → P. A set with this property is called a minimal determining set for P. In other words, a set Q is minimal determining for P if Q → P, and R ↛ P for all R ⊊ Q.

If such a Q is a subset of P, we call Q a reduct of P. It is not hard to see that each P ⊆ Ω has a reduct, though this need not be unique. The intersection of all reducts of P is called the core of P. Unless P has only one reduct, the core of P is not itself a reduct.
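For small attribute sets, reducts can be found by brute force: check Q → P via refinement of the induced partitions, and keep the minimal subsets. A sketch (Python; the data and attribute names are hypothetical):

```python
from itertools import combinations

def depends(U, f, Q, P):
    """Q -> P: objects indiscernible on Q must be indiscernible on P,
    i.e. theta_Q refines theta_P."""
    seen = {}
    for x in U:
        qkey = tuple(f[x][q] for q in sorted(Q))
        pkey = tuple(f[x][p] for p in sorted(P))
        if seen.setdefault(qkey, pkey) != pkey:
            return False
    return True

def reducts(U, f, P):
    """All minimal Q subsets of P with Q -> P (exponential search,
    feasible only for small P)."""
    attrs = sorted(P)
    found = []
    for size in range(1, len(attrs) + 1):
        for Q in combinations(attrs, size):
            if depends(U, f, set(Q), P) and \
                    not any(set(R) < set(Q) for R in found):
                found.append(Q)
    return found

# Hypothetical data: "c" duplicates "a", so P = {a, b, c} has the two
# reducts {a, b} and {b, c}; the core is their intersection {b}.
U = [0, 1, 2, 3]
f = {0: {"a": 1, "b": 1, "c": 1}, 1: {"a": 1, "b": 2, "c": 1},
     2: {"a": 2, "b": 1, "c": 2}, 3: {"a": 2, "b": 2, "c": 2}}
rds = reducts(U, f, {"a", "b", "c"})
```

Note that the core {b} found here is indeed not itself a reduct, since P has two reducts.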
For each R ⊆ Ω let 𝒫_R be the partition of U induced by θ_R. Define

γ_Q(P) = ( Σ_{X ∈ 𝒫_P} |X_{θ_Q}| ) / |U|.  (2.2)

γ_Q(P) is the relative frequency of the number of correctly Q-classified elements with respect to the partition induced by P. It is usually interpreted in rough set analysis as a measurement of the prediction success of a set of inference rules based on value combinations of Q and value combinations of P of the form given in (2.1). The prediction success is perfect if γ_Q(P) = 1; in this case, Q → P.

Suppose that Q is a reduct of P, so that Q → P, and Q \ {q} ↛ P for any q ∈ Q. In rough set theory, the impact of attribute q on the fact that Q → P is usually measured by the drop of the approximation function γ from 1 to γ_{Q\{q}}(P): the larger the difference, the more important one regards the contribution of q. We shall show below that this interpretation needs to be taken with care in some cases, and additional statistical evidence may be needed.
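The approximation quality γ_Q(P) of (2.2) can be computed directly from the class structure; a minimal sketch on hypothetical data:

```python
def gamma(U, f, Q, P):
    """Approximation quality gamma_Q(P) of (2.2): the fraction of objects
    lying in the lower approximation of some theta_P-class w.r.t. theta_Q."""
    def classes(A):
        cs = {}
        for x in U:
            cs.setdefault(tuple(f[x][a] for a in sorted(A)), set()).add(x)
        return list(cs.values())
    p_classes = classes(P)
    correct = sum(len(c) for c in classes(Q)
                  if any(c <= X for X in p_classes))
    return correct / len(U)

# Hypothetical data: theta_Q has classes {0,1} and {2,3}; only {0,1}
# fits inside a theta_P-class, so gamma_Q(P) = 2/4.
U = [0, 1, 2, 3]
f = {0: {"q": "a", "p": 1}, 1: {"q": "a", "p": 1},
     2: {"q": "b", "p": 1}, 3: {"q": "b", "p": 2}}
quality = gamma(U, f, {"q"}, {"p"})   # 0.5
```

The drop γ_Q(P) − γ_{Q\{q}}(P) mentioned above is then a difference of two such calls.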
3 Casual rules and randomization analysis
3.1 Casual dependencies
In the sequel we consider the case that a rule Q → P was given before performing the data analysis, and not obtained by optimizing the quality of approximation. The latter needs additional treatment and will be discussed briefly in Section 3.5.
Suppose that θ_Q is the identity relation id_U on U. Then θ_Q ⊆ θ_P for all P ⊆ Ω, i.e. Q → P for all P ⊆ Ω. Furthermore, each class of θ_Q consists of exactly one element, and therefore any rule Q → P is based on exactly one observation. We call such a rule deterministic casual.

If a rule is not deterministic casual, it nevertheless may be based on only a few observations, and thus its prediction quality could be limited; such rules may be called casual. Therefore, the need arises for a statistical procedure which tests the casualness of a rule based on mechanisms of rough set analysis.

Assume that the information system is the realization of a random process in which the attribute values of Q and P are realized independently of each other. If no additional information is present, it may be assumed that the attribute value combinations within Q and P are fixed and the matching of the Q, P combinations is drawn at random.
Let σ be a permutation of U, and Q ⊆ Ω. We define a new information function f_{σ(Q)} by

f_{σ(Q)}(x, r) =_def  f(σ(x), r), if r ∈ Q,
                      f(x, r),    otherwise,

and let γ_{σ(Q)}(P) be the approximation quality of the prediction of P by Q in the new information system. Note that the structure of the equivalence relation θ_{σ(Q)} determined by Q in the revised system is the same as that of the original θ_Q. In other words, there is a bijective mapping

τ : {θ_{σ(Q)} x : x ∈ U} → {θ_Q x : x ∈ U}

which preserves the cardinality of the classes. In particular, if θ_Q is the identity on U, so is θ_{σ(Q)}. It follows that for a rule Q → p with θ_Q = id_U, we have γ_{σ(Q)}(p) = 1 as well for all permutations σ of U.
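The permuted information function f_{σ(Q)} re-matches the Q-part of the data to randomly chosen objects while leaving all other attributes in place; a sketch on hypothetical data:

```python
def permute_on_Q(U, f, Q, sigma):
    """f_{sigma(Q)}: attribute values inside Q are read off sigma(x),
    values outside Q stay attached to x."""
    return {x: {r: (f[sigma[x]][r] if r in Q else f[x][r]) for r in f[x]}
            for x in U}

# Hypothetical data; sigma cyclically shifts the Q-values. The multiset
# of values on "q" (and hence the class-size structure of theta_Q) is
# preserved, while the matching to the "p"-values is destroyed.
U = [0, 1, 2]
f = {0: {"q": "a", "p": 1}, 1: {"q": "a", "p": 2}, 2: {"q": "b", "p": 3}}
sigma = {0: 2, 1: 0, 2: 1}
g = permute_on_Q(U, f, {"q"}, sigma)
```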
The distribution of the prediction success is given by the set

R_{P,Q} =_def { γ_{σ(Q)}(P) : σ a permutation of U }.

Let H be the null hypothesis; we have to estimate the position of the observed approximation quality γ_obs =_def γ_Q(P) in the set R_{P,Q}, i.e. to estimate the probability p(γ_R ≥ γ_obs | H). Standard randomization techniques – for example Manly (1991), Chapter 1 – can now be applied to estimate this probability.

If p(γ_R ≥ γ_obs | H) is low – conventionally, if γ_obs lies in the upper 5% region of R_{P,Q} – the assumption of randomness can be rejected; otherwise, if

p(γ_R ≥ γ_obs | H) > 0.05,

we call the rule (random) casual.
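The whole procedure can be sketched as follows: sample random permutations σ, recompute γ_{σ(Q)}(P), and report the fraction of sampled values that reach γ_obs. The example data are hypothetical and chosen so that θ_Q is the identity, illustrating the point above: a "perfect" γ_obs = 1 can still be completely casual, since every permutation reproduces it and the estimated p-value is 1.

```python
import random

def gamma(U, f, Q, P):
    """Approximation quality gamma_Q(P), as in (2.2)."""
    def classes(A):
        cs = {}
        for x in U:
            cs.setdefault(tuple(f[x][a] for a in sorted(A)), set()).add(x)
        return list(cs.values())
    p_classes = classes(P)
    return sum(len(c) for c in classes(Q)
               if any(c <= X for X in p_classes)) / len(U)

def casualness_test(U, f, Q, P, n_perm=500, seed=0):
    """Estimate p(gamma_R >= gamma_obs | H) by sampling permutations sigma
    of U and re-matching the Q-values via sigma."""
    rng = random.Random(seed)
    gamma_obs = gamma(U, f, Q, P)
    hits = 0
    for _ in range(n_perm):
        image = list(U)
        rng.shuffle(image)
        sigma = dict(zip(U, image))
        g = {x: {r: (f[sigma[x]][r] if r in Q else f[x][r]) for r in f[x]}
             for x in U}
        if gamma(U, g, Q, P) >= gamma_obs:
            hits += 1
    return gamma_obs, hits / n_perm

# theta_Q is the identity (all Q-profiles are unique), so Q -> P holds
# with gamma_obs = 1, yet every permutation also yields 1: the rule is
# deterministic casual, and the test cannot reject randomness.
U = [0, 1, 2, 3]
f = {x: {"q": x, "p": x % 2} for x in U}
g_obs, p_hat = casualness_test(U, f, {"q"}, {"p"})
```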