Data Mining and Knowledge Discovery, 13, 67–87, 2006
© 2006 Springer Science + Business Media, LLC. Manufactured in the United States.
DOI: 10.1007/s10618-005-0029-z
A Rule-Based Approach for Process Discovery:
Dealing with Noise and Imbalance in Process Logs
LAURA MĂRUŞTER l.maruster@rug.nl
University of Groningen, P.O. Box 800, 9700 AV, Groningen, NL
A.J.M.M. (TON) WEIJTERS a.j.m.m.weijters@tm.tue.nl
WIL M.P. VAN DER AALST w.m.p.v.d.aalst@tm.tue.nl
Eindhoven University of Technology, P.O. Box 513, 5600 MB, Eindhoven, NL
ANTAL VAN DEN BOSCH antal.vdnbosch@uvt.nl
Tilburg University, P.O. Box 90153, 5000 LE, Tilburg, NL
Published online: 12 May 2006
Abstract. Effective information systems require the existence of explicit process models. A completely
specified process design needs to be developed in order to enact a given business process. This development
is time consuming and often subjective and incomplete. We propose a method that constructs the process
model from process log data, by determining the relations between process tasks. To predict these relations, we
employ machine learning techniques to induce rule sets. These rule sets are induced from simulated process log
data generated by varying process characteristics such as noise and log size. Tests reveal that the induced rule
sets have a high predictive accuracy on new data. The effects of noise and imbalance of execution priorities
during the discovery of the relations between process tasks are also discussed. Knowing the causal, exclusive,
and parallel relations, a process model expressed in the Petri net formalism can be built. We illustrate our
approach with real world data in a case study.
Keywords: rule induction, process mining, knowledge discovery, Petri nets
1. Introduction
Managing complex business processes calls for the development of powerful infor-
mation systems, able to control and support the underlying processes. To support a
structured business process, such information systems have to offer generic process
modelling and process execution capabilities. Because problems are encountered when
designing and employing such information systems, the interest in Business Process
Analysis and Continuous Process Improvement Efforts increases. Yet, whatever the
goal is (e.g. modelling, designing, redesigning or implementing business processes),
it needs to be preceded by an analysis of the existing processes. The growing interest
in automating the analysis of existing processes, i.e., process mining, can be explained
by the availability of logged information, which most information systems (traditional
or process-aware) provide.
The goal of process mining is to abstract process information from transaction
logs (Aalst et al., 2003). Process mining focuses on different levels. Accordingly,
this leads to different mining perspectives, such as the process perspective, the or-
ganizational perspective, and the case perspective. The process perspective focuses
on the control flow, i.e., the ordering of activities. The goal of this type of mining is
to find the possible relations between tasks, expressed in terms of a process model,
e.g., expressed in terms of a Petri net (Reisig and Rosenberg, 1998) or an Event-
driven Process Chain (EPC) (IDS Scheer, 2002; Keller and Teufel, 1998). For process
mining with a focus on the process perspective, the specific terms process discov-
ery or workflow mining are used (Aalst et al., 2004). Using this perspective, it is
assumed that it is possible to record events such that (i) each event refers to a task,
(ii) each event occurs in a case (i.e., process instance) and (iii) events are totally ordered.
A set of such recorded sequences is called a process log. For mining the other perspec-
tives, we refer to (Aalst et al., 2003), and http://www.processmining.org. In this paper
we will focus on the process perspective.
The idea of discovering models from process logs was previously investigated in
contexts such as software engineering and workflow management (Agrawal et al., 1998;
Cook and Wolf, 1998a; Herbst, 2000a). Cook and Wolf propose alternative methods
for process discovery in case of software engineering, focusing on sequential (Cook
and Wolf, 1998a) and concurrent processes (Cook and Wolf, 1998b). Herbst and Kara-
giannis use a hidden Markov model in the context of workflow management, focusing
on sequential (Herbst and Karagiannis, 2000; Herbst, 2000b) and concurrent processes
(Herbst, 2000a). In Măruşter et al. (2002), a technique for discovering workflow pro-
cesses in hospital data is presented. Theoretical results are presented in Aalst et al.
(2004), providing proof that for certain subclasses of processes it is possible to find the
correct process model.
To illustrate the idea of process discovery, consider the process log from Figure 1(a). In
this example seven executed cases are logged. Twelve different tasks occur in these cases.
We can notice the following example regularities: for each case, the execution starts with
task a and ends with task l; if c is executed, then e is executed immediately afterwards.
Using the information shown in the process log from Figure 1(a), we can discover
the process model shown in Figure 1(b). We represented the model using Petri nets
(Reisig and Rosenberg, 1998), where all tasks are expressed as transitions.
Figure 1. An excerpt of a process log and the corresponding Petri net process model.
The Petri net formalism has several advantages, and it is therefore often used to represent process
models (Aalst, 1998): formal semantics (a clear and precise definition), graphical nature
(intuitive and easy to learn), expressiveness (supports all primitives needed to model a
process), properties (the mathematical foundation allows for reasoning about Petri net
properties), analysis (many analysis techniques to prove properties and calculate per-
formance measures), and vendor independence (not based on the software package of a specific
vendor). In Figure 1(b), after executing a, either task b or task f can be executed. If task
f is executed, tasks h and g can be executed in parallel. A parallel execution of tasks h
and g means that they can appear in any order.
In the case of real-world processes which can involve many more tasks and which
can exhibit higher levels of parallelism, the problem of discovering the underlying
process can become prohibitively complex. Moreover, process mining can be harmed
and hindered when process logs contain noise—random replacements or insertions of
incorrect symbols—or have missing information. A process log is complete when all
tasks that potentially directly follow each other, in fact do directly follow each other in
some trace in the log. In case of a complex process, incomplete process logs will not
contain enough information to detect the causal relation between tasks. The notion of
completeness is formally defined in Aalst et al. (2004). Note that a process log can be
complete without containing all possible cases. A heuristic process discovery method,
based on simple count statistics, able to handle certain levels of noise is described in
Weijters and Aalst (2001). Nevertheless, in some situations this heuristic method is not
robust enough for discovering the complete process. Process discovery was subsequently
tackled at a more robust level in Măruşter et al. (2002), using an empirical, data-driven
approach; more specifically, a logistic regression model able to detect the causal
relations (or direct successors) from process
logs. However, that logistic regression approach requires a global threshold value for
deciding when there is a direct succession relation between two tasks. The use of a global
threshold has the drawback of being too rigid, thus real relations may not be found and
false relations may be considered. In Medeiros, Weijters, and Aalst (2004), more
advanced issues in robustness towards noisy data and finding causality between tasks
are tackled using genetic algorithms. An overview of issues and related work about
Process Mining can be found in Aalst and Weijters (2004).
The problem of noisy and incomplete process logs is not the only difficulty which
may occur during process mining. A review of challenging process mining problems is
given in Aalst and Weijters (2004), which refers to mining hidden tasks, mining duplicate
tasks, mining loops, using time, mining different perspectives, and dealing with noise
and incompleteness.
In Aalst et al. (2004) an algorithm called the α algorithm is developed which, given
a complete process log, can (re-)discover quite a large class of Petri nets (the
discussion of the properties of these Petri nets is beyond the scope of this paper and
is addressed in Aalst et al. (2004)). However, the α algorithm has some limitations,
such as (i) mining loops and (ii) dealing with incomplete and noisy process logs. In
Medeiros et al. (2004), an extension of the α algorithm is provided that addresses the first
limitation, i.e., it can handle short loops. In this paper, we address the second limitation
of the α algorithm presented in Aalst et al. (2004), namely dealing with incomplete and
noisy process logs, to allow its applicability to real-world processes.

The aim of this article is two-fold. First, we describe a rule-based approach for process
discovery, assuming the existence of noisy information in the process log and imbalance
in execution priorities. Second, we want to gain insight into the effects of noise and
imbalance during process discovery. Our goal is to use machine learning techniques
to induce classification rules for (i) causal relations (i.e., for each task, find its direct
successor tasks) and (ii) parallel/exclusive relations (i.e., for tasks that share
the same cause or the same direct successor, detect whether they can be executed in parallel
or whether there is a choice between them). Knowing these relations between tasks, a process
model can be constructed by using the α algorithm (Aalst et al., 2004).
The article is organized as follows: in Section 2 the types of relations that can exist
between two tasks are described. The methodology for generating experimental data used
to induce the rule sets is presented in Section 3. In Section 4 the methods for inducing
the rule sets are introduced. In Section 5 we evaluate the rule sets, and in Section 6 we
discuss the results obtained, focusing on the influence of process characteristics on rule
set performance. In Section 7 we illustrate our approach using real data from a case
study. We end with discussing issues for further research in Section 8.
2. The log-based relations
Discovering a model from process logs involves determining the dependencies among
tasks. We choose to express these dependencies as log-based relations. The log-based
relations are formally introduced in Măruşter et al. (2002) and Aalst et al. (2004), in
the context of workflow logs and workflow traces. Because we focus on the process
perspective, we use the same definitions as in Aalst et al. (2004), this time referring to
process logs and process traces.
Definition 1. Process trace, process log
Let T be a set of tasks. δ ∈ T* is a process trace and W : T* → ℕ is a process log,
where W(δ) denotes the number of times the trace δ occurs in the log.
Figure 1(a) is an example of a process log; afghikl is an example of a process
trace, belonging to case 1. This process trace is unique (i.e., W(afghikl) = 1). However,
the process trace abcejl appears three times (e.g. for cases 2, 5 and 7) in the log
(i.e., W(abcejl) = 3). Especially when logs may contain noise, the use of
frequency information appears crucial.
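Definition 1, combined with the frequency counts just discussed, suggests a simple working representation of a process log: a mapping from each trace to the number of cases in which it occurs. The following Python sketch is not part of the original paper; it only encodes the two traces explicitly mentioned in the text (the remaining cases of Figure 1(a) are not reproduced), and the function name build_log is an illustrative choice.

```python
from collections import Counter

# A process log W maps each trace (a finite sequence of tasks) to the number
# of times it occurs in the log, i.e., W : T* -> N.  Traces are written here
# as strings of single-letter task names, as in Figure 1(a).
def build_log(case_traces):
    """Build the log W from one trace per executed case."""
    return Counter(case_traces)

# Only the traces explicitly mentioned in the text are reproduced here; the
# remaining cases of Figure 1(a) are omitted.
W = build_log(["afghikl", "abcejl", "abcejl", "abcejl"])

print(W["afghikl"])  # 1 -- the trace of case 1 is unique
print(W["abcejl"])   # 3 -- cases 2, 5 and 7 share this trace
```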
Definition 2. Succession relation
Let W be a process log over the tasks T with a, b ∈ T. Then between a and b there
is a succession relation (notation a > b), i.e., b succeeds a, if and only if there is a
trace δ = t_1 t_2 ... t_n in W (i.e., W(δ) > 0) with t_i = a and t_{i+1} = b for some
i ∈ {1, ..., n−1}. The succession relation > describes which tasks appeared in sequence,
i.e., one directly following the other. In the log from Figure 1(a), a > f, f > g, b > c,
h > g, g > h, etc.
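Definition 2 can be computed directly by scanning each trace in the log for adjacent pairs of tasks. A minimal sketch, assuming the Counter-based log representation from the previous snippet:

```python
def succession(W):
    """Return the set of pairs (a, b) with a > b, i.e., b directly follows a
    in some trace that actually occurs in the log (W(trace) > 0)."""
    pairs = set()
    for trace, count in W.items():
        if count <= 0:
            continue
        for i in range(len(trace) - 1):
            pairs.add((trace[i], trace[i + 1]))
    return pairs

# On the partial example log above this yields, among others, ('a', 'f'),
# ('f', 'g') and ('b', 'c'), i.e., a > f, f > g and b > c.
```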
Definition 3. Causal, exclusive and parallel relations
Let W be a process log over the tasks T with x, y ∈ T. If we assume that there is no
noise in W, then between x and y there is:
1. a causal relation (notation x → y), i.e., x causes y, if and only if x > y and y ≯ x. We
consider the inverse of the causal relation, →⁻¹, i.e., →⁻¹ = {(y, x) ∈ T × T | x →
y}. We call task x the cause of task y and task y the direct successor of task x.
2. an exclusive relation (notation x # y) if and only if x ≯ y and y ≯ x;
3. a parallel relation (notation x ∥ y) if x > y and y > x.
The relations →, →⁻¹, # and ∥ are mutually exclusive and partition T × T (Aalst
et al., 2004).
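Under the no-noise assumption, Definition 3 translates directly into membership tests on the succession relation. The sketch below is a literal transcription of the definition (not the rule-based method developed later in this paper), assuming the succession() helper from the previous snippet:

```python
def relation(x, y, succ):
    """Classify the pair (x, y) according to Definition 3, where succ is the
    set of succession pairs {(a, b) | a > b} of a noise-free process log."""
    xy = (x, y) in succ   # x > y ?
    yx = (y, x) in succ   # y > x ?
    if xy and not yx:
        return "causal"          # x -> y
    if yx and not xy:
        return "inverse causal"  # y -> x
    if xy and yx:
        return "parallel"        # x || y
    return "exclusive"           # x # y
```

On the complete log of Figure 1(a) (not fully reproduced in the snippets above), this would classify c and e as causal, b and f as exclusive, and h and i as parallel, matching the examples discussed next.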
To illustrate the above definitions, let’s consider again the process log from Figure
1(a) corresponding to the Petri net from Figure 1(b). If there is no noise, there are three
possible situations in which a pair of events (henceforth referred to as tasks) can be
related, namely causal, exclusive, and parallel:
causal relation. Tasks c and e have a causal relation, because c > e, e ≯ c, thus c → e;
exclusive relation. There is a choice between tasks b and f, because b ≯ f, f ≯ b, thus
b # f (and f # b);
parallel relation. Tasks h and i are in parallel, because h > i, i > h, thus h ∥ i (and
i ∥ h).
The information on all three types of relations occurring between all tasks is necessary
and sufficient to construct the Petri net model using the α algorithm (Aalst et al., 2004).
The α algorithm considers first all tasks that stand in a causal relation. Then, for all
tasks that share the same immediately-neighboring input or output task, their exclusive
or parallel relations are incorporated in the Petri net. Although this algorithm can (re-
)discover quite a large class of Petri nets, it also has some limitations, particularly with
respect to incomplete and noisy process logs.
The existence of incompleteness and noise in a process log disturbs the application
of the notions presented in Definition 3. Considering the Petri net from Figure 1(b),
suppose that we want to discover the relations between pairs of tasks c and e, b
and f,
and h and i, given a particular example log file. We may find in this file that c > e ten
times; however, because of some noisy sequences, we may also find that e > c once.
Applying Definition 3, we could conclude that c ∥ e, which is incorrect, because actually
c → e. Also, we have to find at least once in the log that c > e in order to determine c →
e, otherwise the log is incomplete and we cannot detect the causal relation between
c and e. Similarly, when noise exists, we may find in our noisy example log that both b
> f and f > b occur once, which according to Definition 3 means that b and f stand in a
parallel relation (actually, b # f!).
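This failure mode can be made concrete with a small count-based illustration. The sketch below is not the method proposed in this paper (the following sections instead induce noise-robust rule sets); it merely shows how a single noisy e > c observation is enough to make a literal application of Definition 3 flip the verdict for c and e from causal to parallel. The counts are hypothetical, chosen to match the example in the text.

```python
from collections import Counter

# Hypothetical direct-succession counts matching the example in the text:
# c > e is observed ten times, and one noisy trace also yields e > c once.
counts = Counter({("c", "e"): 10, ("e", "c"): 1})

c_gt_e = counts[("c", "e")] > 0   # c > e holds
e_gt_c = counts[("e", "c")] > 0   # e > c also holds, but only due to noise

# Definition 3 applied literally: both directions occur, so c and e are
# (incorrectly) classified as parallel, although the true relation is c -> e.
verdict = "parallel" if (c_gt_e and e_gt_c) else ("causal" if c_gt_e else "unknown")
print(verdict)  # -> "parallel"
```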
We want to be able to use the α algorithm on noisy logs. Therefore, instead of applying
Definition 3 directly, which breaks down in noisy circumstances, we use
machine learning techniques to induce noise-robust rule sets to determine the status
of relations among task pairs. Given these relations, we can apply the α algorithm to
construct the Petri net process model.
3. Experimental setting and data generation
Our experimental setup assumes the presence of learning material for inducing rule sets
to detect causal, parallel, and exclusive relations. This learning material should resemble
realistic process logs and should be sufficiently general to allow for generic rule sets
to be induced. We assume here that the following four characteristics underlie a typical
realistic process, where variations of these characteristics affect the process logs: (i) the

References
C4.5: Programs for Machine Learning (book).
Machine learning (journal article).
Fast effective rule induction (book chapter).
The application of Petri-nets to workflow management (journal article).
Workflow mining: discovering process models from event logs (journal article).