Journal ArticleDOI

Statistical evaluation of rough set dependency analysis

TL;DR: This paper proposes to enhance RSDA by two simple statistical procedures, both based on randomization techniques, to evaluate the validity of prediction based on the approximation quality of attributes of rough set dependency analysis.
Abstract: Rough set data analysis (RSDA) has recently become a frequently studied symbolic method in data mining. Among other things, it is being used for the extraction of rules from databases; it is, however, not clear from within the methods of rough set analysis whether the extracted rules are valid. In this paper, we propose enhancing RSDA by two simple statistical procedures, both based on randomization techniques, to evaluate the validity of prediction based on the approximation quality of attributes of rough set dependency analysis. The first procedure tests the casualness of a prediction to ensure that the prediction is not based on only a few (casual) observations. The second procedure tests the conditional casualness of an attribute within a prediction rule. The procedures are applied to three data sets, originally published in the context of rough set analysis. We argue that several claims of these analyses need to be modified for lack of validity, and that other possibly significant results were overlooked.

Summary (3 min read)

1 Introduction

  • The methods will be applied to three different data sets.
  • The first data set utilizes rough set analysis to describe patients after highly selective vagotomy (HSV) for duodenal ulcer.
  • The authors show how statistical methods within rough set analysis highlight some of their results in a different way.

2 Rough set data analysis

  • Of particular interest in rough set dependency theory are those sets Q which use the least number of attributes, and still have Q → P.
  • The intersection of all reducts of P is called the core of P.
  • For each R ⊆ Ω let P_R be the partition of U induced by θ_R. Define γ_Q(P) = Σ_{X ∈ P_P} |X_{θ_Q}| / |U| (2.2); γ_Q(P) is the relative frequency of the number of correctly Q-classified elements with respect to the partition induced by P.
  • The larger the difference, the more important one regards the contribution of q.

3.1 Casual dependencies

  • In the sequel the authors consider the case that a rule Q → P was given before performing the data analysis, and not obtained by optimizing the quality of approximation.
  • The latter needs additional treatment and will be discussed briefly in Section 3.5.
  • There is a bijective mapping τ : {θ_{σ(Q)} x : x ∈ U} → {θ_Q x : x ∈ U} which preserves the cardinality of the classes.
  • Standard randomization techniques – for example Manly (1991), Chapter 1 – can now be applied to estimate this probability.
  • To decide whether the given rule is casual under the statistical assumption, the authors have to consider all 720 possible rules {σ(p), σ(q)} → d and their approximation qualities.

3.2 How the randomization procedure works

  • The proposed randomization test procedure is one way to model errors in terms of a statistical approach.
  • Because their approach is aimed at testing the casualness of a rule system – and assuming for a moment that this assumption really holds – the assumption of representativeness is a problem of any analysis in most real-life databases.
  • Any observation within the other six classes of θ_Q was randomly assigned to one of the three classes of θ_P.
  • The percentage of the three rules – which is the true value of the approximation quality γ – is varied over 0.0, 0.1, 0.2 and 0.3. Figure 1 shows the problem of granularity: given N = 10 observations and a true value of γ = 0.0, the expectation of γ̂ is about 0.32; the granularity overshoot vanishes at about N = 40.
  • The power curves of an effect γ > 0.0 show that the randomization test has a reasonable power – at least in the chosen situation.

3.3 Computational considerations

  • It is well known that randomization is a rather expensive procedure, and one might have objections against this technique because of its cost in real life applications.
  • If f(N) is the time complexity for performing the computation of γ, the time complexity of the simulation-based randomization procedure is 1000·f(N).
  • If randomization is too costly for a data set, RSDA itself will not be applicable in this case.
  • Some simple shortcuts, such as a check whether the entropy of the Q partition is near log2(N), may avoid superfluous computation (one possible reading of this check is sketched after this list).
  • For their re-analysis of the published data sets below it was not necessary to speed up the computations.
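
The following sketch (Python; not from the paper, and the tolerance is an invented illustrative value) shows one possible reading of the entropy shortcut: when the entropy of the partition induced by θ_Q is close to its maximum log2(N), nearly every class of θ_Q is a singleton, γ stays close to 1 under every permutation, and the 1000-fold randomization can be skipped as uninformative.

    import math
    from collections import Counter

    def q_partition_entropy(rows, Q):
        """Entropy (in bits) of the partition of U induced by theta_Q;
        `rows` is a list of dicts mapping attribute names to values."""
        counts = Counter(tuple(row[a] for a in Q) for row in rows)
        n = len(rows)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def randomization_worthwhile(rows, Q, tol=0.05):
        """Run the randomization test only when theta_Q is not (nearly) the
        identity, i.e. when the entropy is more than `tol` bits below log2(N)."""
        return q_partition_entropy(rows, Q) < math.log2(len(rows)) - tol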

3.4 Conditional casual attributes

  • In rough set analysis, the decline of the approximation quality when omitting one attribute is usually used to determine whether an attribute within a minimal determining set is of high value for the prediction.
  • This approach does not take into account that the decline of approximation quality may be due to chance.
  • Assume that an additional attribute r is conceptualized in three different ways, among them a fine-grained measure r1 using 8 categories and a medium-grained description r2 using 4 categories.
  • Therefore the authors cannot trust the rules derived from the description {q, r1} → p, because the attribute r1 is exchangeable with any randomly generated attribute s = σ(r1).
  • Whereas the statistical evaluation of the additional predictive power of the three chosen attributes differs, the analysis of the decline of the approximation quality tells us nothing about these differences (a sketch of such a conditional test follows this list).
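
A minimal sketch of such a conditional test (Python; our reading of the procedure, not code from the paper, so the exact null model may differ in detail): only the values of the additional attribute r are permuted, the remaining predictors Q and the decision attribute p stay fixed, and we record how often the randomized attribute predicts p at least as well as the real one.

    import random
    from collections import defaultdict

    def partition(rows, attrs):
        """Classes of theta_Q for the attribute list `attrs`."""
        classes = defaultdict(set)
        for i, row in enumerate(rows):
            classes[tuple(row[a] for a in attrs)].add(i)
        return list(classes.values())

    def gamma(rows, Q, P):
        """Approximation quality gamma_Q(P), as in equation (2.2)."""
        q_classes, p_classes = partition(rows, Q), partition(rows, P)
        hit = sum(len(qc) for qc in q_classes if any(qc <= pc for pc in p_classes))
        return hit / len(rows)

    def conditional_casualness(rows, Q, r, p, n_perm=1000, seed=1):
        """Estimate how often a permuted copy sigma(r) of the extra attribute r
        predicts p (together with Q) at least as well as r itself does."""
        rng = random.Random(seed)
        gamma_obs = gamma(rows, Q + [r], [p])
        r_values = [row[r] for row in rows]
        hits = 0
        for _ in range(n_perm):
            shuffled = r_values[:]
            rng.shuffle(shuffled)
            permuted = [{**row, r: v} for row, v in zip(rows, shuffled)]
            if gamma(permuted, Q + [r], [p]) >= gamma_obs:
                hits += 1
        return hits / n_perm  # large values: r is conditionally casual within {Q, r} -> p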

3.5 Cross validation of learned dependencies

  • If rough set analysis is used to learn the best subset ofΩ to determineP , a simple randomization procedure is not sufficient, because it does not reflect the optimization of the learning procedure.
  • Within the test subset the same procedure can be used to validate the chosen attributes.
  • If the test procedure does not show a significant result, there are too few rules which can be used to predict the decision attributes from the learned attributes.
  • Note that these rules need not be the same as those in the learning subset.
  • If the additional attribute is conditional casual, the hypothesis that the rules in both sets of objects are identical should be kept.

4.1 Duodenal ulcer data

  • All data used in this paper are obtainable from ftp://luce.psycho.uni-osnabrueck.de/.
  • Pawlak et al. (1986) found – using rough set analysis – that the attribute set R, consisting of 3: Duration of disease, 4: Complication, 5: Basic HCl concentration, 6: Basic Vol. of gastric juice, 9: Stimulated HCl concentration, 10: Stimulated Vol. of gastric juice, suffices to predict attribute 12 (“Visick grading”).
  • The attribute set discussed in Pawlak et al. (1986) was based on a reduct searching procedure.
  • In order to discuss the cross validation procedure, the authors split the data set into 2 subsets containing 61 cases each.
  • Furthermore, the result suggests a reduction of the number of attributes withinR, because all attributes are conditional casual.

4.2 Earthquake data

  • In Teghem & Benjelloun (1992), the authors search for premonitory factors for earthquakes by emphasizing gas geochemistry.
  • The partition attribute (attribute 16) was the seismic activity on 155 days measured on the Richter scale.
  • The other attributes were radon concentration measured at 8 different locations (attributes 1-8) and 7 measures of climatic factors (attributes 9-15).
  • A problem with the information system was that it has an empty core with respect to attribute 16, and that an evaluation of some reducts turned out to be difficult.
  • The statistical evaluation of some of the information systems proposed by Teghem & Benjelloun (1992) gives us additional insights (Tab. 6).

5 Conclusion

  • Gathering evidence in procedures of Artificial Intelligence should not be based upon casual observations.
  • The authors' approach shows how – in principle – a system using rough set dependency analysis can defend itself against randomness.
  • The reanalysis of three published data sets shows that there is an urgent need for such a technique: parts of the claimed results using the first two data sets are invalidated, some promising dependencies are overlooked, and, as the authors show using the data of Study 1, their proposed cross-validation technique offers a new horizon for interpretation.
  • Concerning Study 3, the conclusions of the authors are validated.



Statistical Evaluation of Rough Set Dependency Analysis
Ivo Düntsch¹
School of Information and Software Engineering
University of Ulster
Newtownabbey, BT 37 0QB, N. Ireland
I.Duentsch@ulst.ac.uk

Günther Gediga¹
FB Psychologie / Methodenlehre
Universität Osnabrück
49069 Osnabrück, Germany
gg@Luce.Psycho.Uni-Osnabrueck.DE
and
Institut für semantische Informationsverarbeitung
Universität Osnabrück

December 12, 1996

¹ Equal authorship implied

Summary
Rough set data analysis (RSDA) has recently become a frequently studied symbolic method in data mining. Among other things, it is being used for the extraction of rules from databases; it is, however, not clear from within the methods of rough set analysis whether the extracted rules are valid.
In this paper, we propose enhancing RSDA by two simple statistical procedures, both based on randomization techniques, to evaluate the validity of prediction based on the approximation quality of attributes of rough set dependency analysis. The first procedure tests the casualness of a prediction to ensure that the prediction is not based on only a few (casual) observations. The second procedure tests the conditional casualness of an attribute within a prediction rule.
The procedures are applied to three data sets, originally published in the context of rough set analysis. We argue that several claims of these analyses need to be modified for lack of validity, and that other possibly significant results were overlooked.
Keywords: Rough sets, dependency analysis, statistical evaluation, validation, randomization test

1 Introduction
Rough set analysis, an emerging technology in artificial intelligence (Pawlak et al. (1995)), has been compared with statistical models, see for example Wong et al. (1986), Krusińska et al. (1992a) or Krusińska et al. (1992b). One area of application of rough set theory is the extraction of rules from databases; these rules are then sometimes claimed to be useful for future decision making or prediction of events. However, if such a rule is based on only a few observations, its usefulness for prediction is arguable (see also Krusińska et al. (1992a), p. 253 in this context).
The aim of this paper is to employ statistical methods which are compatible with the rough set philosophy to evaluate the “prediction quality” of rough set dependency analysis. The methods will be applied to three different data sets:

  • The first set was published in Pawlak et al. (1986) and Słowiński & Słowiński (1990). It utilizes rough set analysis to describe patients after highly selective vagotomy (HSV) for duodenal ulcer. The statistical validity of the conclusions will be discussed.

  • The second example is the discussion of earthquake data published by Teghem & Charlet (1992). The main reason why we use this example is that it demonstrates the applicability of our approach in the situation when the prediction success is perfect in terms of rough analysis.

  • The third example is used by Teghem & Benjelloun (1992) to compare statistical and rough set methods. We show how statistical methods within rough set analysis highlight some of their results in a different way.
2 Rough set data analysis
A major area of application of rough set theory is the study of dependencies among attributes of information systems. An information system S = ⟨U, Ω, (V_q)_{q ∈ Ω}, f⟩ consists of

1. A set U of objects,
2. A finite set Ω of attributes,
3. For each q ∈ Ω a set V_q of attribute values,
4. An information function f : U × Ω → V, where V = ⋃_{q ∈ Ω} V_q, with f(x, q) ∈ V_q for all x ∈ U, q ∈ Ω.

We think of the descriptor f(x, q) as the value which object x takes at attribute q.
With each Q ⊆ Ω we associate an equivalence relation θ_Q on U by

    x ≡ y (θ_Q)  ⟺  f(x, q) = f(y, q) for all q ∈ Q.

If x ∈ U, then θ_Q x is the equivalence class of θ_Q containing x.

Intuitively, x ≡ y (θ_Q) if the objects x and y are indiscernible with respect to the values of their attributes from Q. If X ⊆ U, then the lower approximation of X by Q,

    X_{θ_Q} = ⋃ { θ_Q x : θ_Q x ⊆ X },

is the set of all correctly classified elements of X with respect to θ_Q, i.e. with the information available from the attributes given in Q.
Suppose that P, Q ⊆ Ω. We say that P is dependent on Q – written as Q → P – if every class of θ_P is a union of classes of θ_Q. In other words, the classification of U induced by θ_P can be expressed by the classification induced by θ_Q.
In order to simplify notation we shall in the sequel usually write Q → p instead of Q → {p}, and θ_p instead of θ_{{p}}.
Each dependency Q → P leads to a set of rules as follows: Suppose that Q = {q_0, …, q_n} and P = {p_0, …, p_k}. For each set {t_0, …, t_n} where t_i ∈ V_{q_i} there is a uniquely determined set {s_0, …, s_k} with s_i ∈ V_{p_i} such that

    (∀x ∈ U)[ (f(x, q_0) = t_0 ∧ ⋯ ∧ f(x, q_n) = t_n)  →  (f(x, p_0) = s_0 ∧ ⋯ ∧ f(x, p_k) = s_k) ].     (2.1)
Of particular interest in rough set dependency theory are those sets Q which use the least number of attributes, and still have Q → P. A set with this property is called a minimal determining set for P. In other words, a set Q is minimal determining for P, if Q → P, and R ↛ P for all R ⊊ Q.
If such a Q is a subset of P we call Q a reduct of P. It is not hard to see that each P has a reduct, though this need not be unique. The intersection of all reducts of P is called the core of P. Unless P has only one reduct, the core of P is not itself a reduct.
For each R ⊆ Ω let P_R be the partition of U induced by θ_R. Define

    γ_Q(P) = Σ_{X ∈ P_P} |X_{θ_Q}| / |U|.     (2.2)

γ_Q(P) is the relative frequency of the number of correctly Q-classified elements with respect to the partition induced by P. It is usually interpreted in rough set analysis as a measurement of the prediction success of a set of inference rules based on value combinations of Q and value combinations of P of the form given in (2.1). The prediction success is perfect if γ_Q(P) = 1; in this case, Q → P.
Suppose that Q is a reduct of P, so that Q → P, and Q \ {q} ↛ P for any q ∈ Q. In rough set theory, the impact of attribute q on the fact that Q → P is usually measured by the drop of the approximation function γ from 1 to γ_{Q\{q}}(P): the larger the difference, the more important one regards the contribution of q. We shall show below that this interpretation needs to be taken with care in some cases, and additional statistical evidence may be needed.
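
To make these definitions concrete, here is a minimal sketch (Python; not part of the original paper, and the toy table and attribute names are invented for illustration) that computes the classes of θ_Q, checks which of them lie inside a class of θ_P, and returns γ_Q(P) as in (2.2).

    from collections import defaultdict
    from fractions import Fraction

    # Invented toy information system: one dict per object in U,
    # keys are attribute names, values are attribute values.
    table = [
        {"q1": 0, "q2": 1, "d": "a"},
        {"q1": 0, "q2": 1, "d": "a"},
        {"q1": 1, "q2": 0, "d": "b"},
        {"q1": 1, "q2": 0, "d": "a"},
        {"q1": 1, "q2": 1, "d": "b"},
    ]

    def partition(rows, attrs):
        """Classes of theta_Q: objects with identical values on `attrs`."""
        classes = defaultdict(set)
        for i, row in enumerate(rows):
            classes[tuple(row[a] for a in attrs)].add(i)
        return list(classes.values())

    def gamma(rows, Q, P):
        """Approximation quality gamma_Q(P) of equation (2.2): the fraction of
        objects whose theta_Q class lies inside a single class of theta_P."""
        q_classes = partition(rows, Q)
        p_classes = partition(rows, P)
        in_lower_approx = sum(
            len(qc) for qc in q_classes if any(qc <= pc for pc in p_classes)
        )
        return Fraction(in_lower_approx, len(rows))

    print(gamma(table, ["q1", "q2"], ["d"]))   # 3/5 for this toy table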

3 Casual rules and randomization analysis
3.1 Casual dependencies
In the sequel we consider the case that a rule Q → P was given before performing the data analysis, and not obtained by optimizing the quality of approximation. The latter needs additional treatment and will be discussed briefly in Section 3.5.
Suppose that θ_Q is the identity relation id_U on U. Then θ_Q ⊆ θ_P for all P ⊆ Ω, i.e. Q → P for all P ⊆ Ω. Furthermore, each class of θ_Q consists of exactly one element, and therefore, any rule Q → P is based on exactly one observation. We call such a rule deterministic casual.
If a rule is not deterministic casual, it nevertheless may be based on a few observations only, and thus, its prediction quality could be limited; such rules may be called casual. Therefore, the need arises for a statistical procedure which tests the casualness of a rule based on mechanisms of rough set analysis.
Assume that the information system is the realization of a random process in which the attribute values of Q and P are realized independently of each other. If no additional information is present, it may be assumed that the attribute value combinations within Q and P are fixed and the matching of the Q, P combinations is drawn at random.
Let σ be a permutation of U, and Q ⊆ Ω. We define a new information function f_{σ(Q)} by

    f_{σ(Q)}(x, r) = f(σ(x), r),  if r ∈ Q,
    f_{σ(Q)}(x, r) = f(x, r),     otherwise,

and let γ_{σ(Q)}(P) be the approximation quality of the prediction of P by Q in the new information system. Note that the structure of the equivalence relation θ_{σ(Q)} determined by Q in the revised system is the same as that of the original θ_Q. In other words, there is a bijective mapping

    τ : {θ_{σ(Q)} x : x ∈ U} → {θ_Q x : x ∈ U}

which preserves the cardinality of the classes. In particular, if θ_Q is the identity on U, so is θ_{σ(Q)}. It follows that for a rule Q → p with θ_Q = id_U, we have γ_{σ(Q)}(p) = 1 as well for all permutations σ of U.
The distribution of the prediction success is given by the set

    R_{P,Q} = { γ_{σ(Q)}(P) : σ a permutation of U }.
Let H be the null hypothesis; we have to estimate the position of the observed approximation quality γ_obs = γ_Q(P) in the set R_{P,Q}, i.e. to estimate the probability p(γ_R ≥ γ_obs | H). Standard randomization techniques – for example Manly (1991), Chapter 1 – can now be applied to estimate this probability.
If p(γ_R ≥ γ_obs | H) is low – conventionally, in the upper 5% region – the assumption of randomness can be rejected; otherwise, if

    p(γ_R ≥ γ_obs | H) > 0.05,

we call the rule (random) casual.
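
A straightforward Monte Carlo version of this test can be sketched as follows (Python; not code from the paper, and the helper functions repeat the γ computation so that the sketch runs on its own). Each iteration applies a random permutation σ to the Q-part of the table while the P-part stays in place, recomputes γ_{σ(Q)}(P), and the fraction of permutations with γ_{σ(Q)}(P) ≥ γ_obs estimates p(γ_R ≥ γ_obs | H); 1000 permutations mirror the figure used in Section 3.3.

    import random
    from collections import defaultdict

    def partition(rows, attrs):
        """Classes of theta_Q: objects with identical values on `attrs`."""
        classes = defaultdict(set)
        for i, row in enumerate(rows):
            classes[tuple(row[a] for a in attrs)].add(i)
        return list(classes.values())

    def gamma(rows, Q, P):
        """Approximation quality gamma_Q(P), as in equation (2.2)."""
        q_classes, p_classes = partition(rows, Q), partition(rows, P)
        hit = sum(len(qc) for qc in q_classes if any(qc <= pc for pc in p_classes))
        return hit / len(rows)

    def casualness_test(rows, Q, P, n_perm=1000, seed=1):
        """Estimate p(gamma_R >= gamma_obs | H) via random permutations sigma of U."""
        rng = random.Random(seed)
        gamma_obs = gamma(rows, Q, P)
        idx = list(range(len(rows)))
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(idx)
            # f_sigma(Q): object x receives the Q-values of sigma(x); P-values stay put.
            permuted = [
                {**row, **{q: rows[j][q] for q in Q}}
                for row, j in zip(rows, idx)
            ]
            if gamma(permuted, Q, P) >= gamma_obs:
                hits += 1
        return gamma_obs, hits / n_perm

    # If the estimated probability exceeds 0.05, the rule Q -> P is called
    # (random) casual; otherwise the assumption of randomness can be rejected.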

Citations

Proceedings ArticleDOI
15 Apr 2013
TL;DR: A novel approach based on Bijective soft sets for the generation of classification rules from the data set is presented and the generated rules are compared to the well-known decision tree classifier algorithm and Naïve bayes.
Abstract: Classification is one of the main issues in Data Mining Research fields. The classification difficulties in medical area frequently classify medical dataset based on the result of medical diagnosis or description of medical treatment by the medical specialist. The Extensive amounts of information and data warehouse in medical databases need the development of specialized tools for storing, retrieving, investigation, and effectiveness usage of stored knowledge and data. Intelligent methods such as neural networks, fuzzy sets, decision trees, and expert systems are, slowly but steadily, applied in the medical fields. Recently, Bijective soft set theory has been proposed as a new intelligent technique for the discovery of data dependencies, data reduction, classification and rule generation from databases. In this paper, we present a novel approach based on Bijective soft sets for the generation of classification rules from the data set. Investigational results from applying the Bijective soft set analysis to the set of data samples are given and evaluated. In addition, the generated rules are also compared to the well-known decision tree classifier algorithm and Naive bayes. The learning illustrates that the theory of Bijective soft set seems to be a valuable tool for inductive learning and provides a valuable support for building expert systems.

36 citations

Journal ArticleDOI
TL;DR: An improvement to Pawlak's model is discussed and a new attribute dependency function is presented based on decision-relative discernibility matrices and measures how many times condition attributes are used to determine the decision value by referring to the matrix.

35 citations


Cites background from "Statistical evaluation of rough set..."

  • ...Düntsch and Gediga [9] pointed out that Pawlak’s model, as defined above, is inadequate....


  • ...Despite all these models, the problem discussed in [9] is no closer to being solved....


  • ...However, as Düntsch and Gediga [9] pointed out, Pawlak’s model is inadequate in the computation of the dependency degree....


Journal ArticleDOI
TL;DR: A simple way to improve the statistical strength of rules obtained by rough set data analysis is suggested by identifying attribute values and investigating the resulting information system, to reduce the granularity within attributes without assuming external structural information.

32 citations


Cites background from "Statistical evaluation of rough set..."

  • ...Assume w.l.o.g. that f_q(x) = v, f_q(y) = w, and v ≠ w....

  • ...There is a long standing tradition (for example [1,9]) to distinguish between symmetric and asymmetric binary attributes....

Journal ArticleDOI
TL;DR: A sequence of papers and conference contributions have developed the components of a non‐invasive method of data analysis, which is based on the RSDA principle, but is not restricted to “classical” RSDA applications.
Abstract: Rough set data analysis (RSDA), introduced by Pawlak, has become a much researched method of knowledge discovery with over 1200 publications to date. One feature which distinguishes RSDA from other data analysis methods is that, in its original form, it gathers all its information from the given data, and does not make external model assumptions as all statistical and most machine learning methods (including decision tree procedures) do. The price which needs to be paid for the parsimony of this approach, however, is that some statistical backup is required, for example, to deal with random influences to which the observed data may be subjected. In supplementing RSDA by such meta-procedures care has to be taken that the same non-invasive principles are applied. In a sequence of papers and conference contributions, we have developed the components of a non-invasive method of data analysis, which is based on the RSDA principle, but is not restricted to “classical” RSDA applications. In this article, we present for the first time in a unified way the foundation and tools of such rough information analysis. © 2001 John Wiley & Sons, Inc.

32 citations


Cites background or methods from "Statistical evaluation of rough set..."

  • ...Details and more examples can be found in [15]....

  • ...On the other hand, the statistical rough set analysis of [15] presented in the next section shows that there is no evidence that this dependency is not due to chance....

  • ...In [15] we have developed two procedures, both based on randomization techniques, which evaluate the validity of prediction based on the approximation quality of attributes of rough set dependency analysis....

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations

Journal ArticleDOI

14,009 citations


"Statistical evaluation of rough set..." refers methods in this paper

  • ...Teghem & Charlet (1992) use the famous Iris data first published by Fisher (1936) to show the applicability of rough set dependency analysis for problems normally treated by discriminant analysis....


Journal ArticleDOI
TL;DR: This approach seems to be of fundamental importance to artificial intelligence (AI) and cognitive sciences, especially in the areas of machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, decision support systems, inductive reasoning, and pattern recognition.
Abstract: Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s [11, 12], is a new mathematical tool to deal with vagueness and uncertainty. This approach seems to be of fundamental importance to artificial intelligence (AI) and cognitive sciences, especially in the areas of machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, decision support systems, inductive reasoning, and pattern recognition.

7,185 citations

Book
01 Jan 1980

1,999 citations

Book
01 Jan 1991
TL;DR: This book discusses the construction of tests in non-standard situations: testing for randomness of species co-occurrences on islands, examining time change in niche overlap, probing multivariate data with random skewers, and other examples.
Abstract: Part 1 Randomization tests and confidence intervals: the idea of a randomization test examples of a randomization test aspects of randomization testing raised by the examples confidence intervals from randomization. Part 2 Monte Carlo and other computer intensive methods: Monte Carlo tests jackknifing bootstrapping bootstrap tests of significance and confidence intervals. Part 3 Some general considerations: power determining how many randomizations are needed determining a randomization distribution exactly the computer generation of pseudo-random numbers generating random permutations. Part 4 One and two sample tests: the paired comparisons design the one sample randomization test the two sample randomization test the comparison of two samples on multiple measurements. Part 5 Analysis of variance: one factor analysis of variance Bartlett's test for constant variance examples of more complicated types of analysis of variance discussion computer program. Part 6 Regrssion analysis: simple regression testing for a non-zero beta value confidence limits for beta multiple linear regression randomizing X variable values. Part 7 Distance matrices and spatial data: testing for association between distance matrices Mantel's test determining significance by sampling randomization distribution confidence limits for a matrix regression coefficient problems involving more than two matrices. Part 8 Other analyses on spatial data: the study of spatial point patterns Mead's randomization test a test based on nearest neighbour distances testing for an association between two point patterns the Besag-Diggle test tests using distances between points. Part 9 Time series: randomization and time series randomization tests for serial correlation randomization tests for trend randomization tests for periodicity irregularly spaced series tests on times of occurence discussion of procedures for irregular series bootstrap and Monte Carlo tests. Part 10 Multivariate data: univariate and multivariate tests sample means and covariance matrices comparison on sample means vectors chi-squared analyses for count data principal component analysis and other one sample methods discriminate function analysis. Part 11 Ad hoc methods: the construction of tests in non-standard situations testing for randomness of species co-occurences on islands examining time change in niche ovelap probing multivariate data with random skewers other examples. Part 12 Conclusion: randomization methods bootstrap and Monte Carlo methods.

1,705 citations

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "Statistical evaluation of rough set dependency analysis"?

In this paper, the authors employ statistical methods which are compatible with the rough set philosophy to evaluate the "prediction quality" of rough set dependency analysis.