Data Mining and Knowledge Discovery, 13, 67–87, 2006
© 2006 Springer Science + Business Media, LLC. Manufactured in the United States.
DOI: 10.1007/s10618-005-0029-z
A Rule-Based Approach for Process Discovery:
Dealing with Noise and Imbalance in Process Logs
LAURA MĂRUŞTER l.maruster@rug.nl
University of Groningen, P.O. Box 800, 9700 AV, Groningen, NL
A.J.M.M. (TON) WEIJTERS a.j.m.m.weijters@tm.tue.nl
WIL M.P. VAN DER AALST w.m.p.v.d.aalst@tm.tue.nl
Eindhoven University of Technology, P.O. Box 513, 5600 MB, Eindhoven, NL
ANTAL VAN DEN BOSCH antal.vdnbosch@uvt.nl
Tilburg University, P.O. Box 90153, 5000 LE, Tilburg, NL
Published online: 12 May 2006
Abstract. Effective information systems require the existence of explicit process models. A completely
specified process design needs to be developed in order to enact a given business process. This development
is time consuming and often subjective and incomplete. We propose a method that constructs the process
model from process log data, by determining the relations between process tasks. To predict these relations, we
employ machine learning techniques to induce rule sets. These rule sets are induced from simulated process log
data generated by varying process characteristics such as noise and log size. Tests reveal that the induced rule
sets have a high predictive accuracy on new data. The effects of noise and imbalance of execution priorities
during the discovery of the relations between process tasks are also discussed. Knowing the causal, exclusive,
and parallel relations, a process model expressed in the Petri net formalism can be built. We illustrate our
approach with real world data in a case study.
Keywords: rule induction, process mining, knowledge discovery, Petri nets
1. Introduction
Managing complex business processes calls for the development of powerful infor-
mation systems, able to control and support the underlying processes. To support a
structured business process, such information systems have to offer generic process
modelling and process execution capabilities. Because problems are encountered when
designing and employing such information systems, the interest in Business Process
Analysis and Continuous Process Improvement Efforts increases. Yet, whatever the
goal is (e.g. modelling, designing, redesigning or implementing business processes),
it needs to be preceded by an analysis of the existing processes. The growing interest
in automating the analysis of existing processes, i.e., process mining, can be explained
by the availability of logged information, which most information systems (traditional
or process-aware) provide.
The goal of process mining is to abstract process information from transaction
logs (Aalst et al., 2003). Process mining focuses on different levels. Accordingly,
this leads to different mining perspectives, such as the process perspective, the or-
ganizational perspective, and the case perspective. The process perspective focuses
on the control flow, i.e., the ordering of activities. The goal of this type of mining is
to find the possible relations between tasks, expressed in terms of a process model,
e.g., expressed in terms of a Petri net (Reisig and Rosenberg, 1998) or an Event-
driven Process Chain (EPC) (IDS Scheer, 2002; Keller and Teufel, 1998). For process
mining with a focus on the process perspective, the specific terms process discov-
ery or workflow mining are used (Aalst et al., 2004). Using this perspective, it is
assumed that it is possible to record events such that (i) each event refers to a task,
(ii) each event occurs in a case (i.e., process instance) and (iii) events are totally ordered.
A set of such recorded sequences is called a process log. For mining the other perspec-
tives, we refer to (Aalst et al., 2003), and http://www.processmining.org. In this paper
we will focus on the process perspective.
The idea of discovering models from process logs was previously investigated in
contexts such as software engineering and workflow management (Agrawal et al., 1998;
Cook and Wolf, 1998a; Herbst, 2000a). Cook and Wolf propose alternative methods
for process discovery in case of software engineering, focusing on sequential (Cook
and Wolf, 1998a) and concurrent processes (Cook and Wolf, 1998b). Herbst and Kara-
giannis use a hidden Markov model in the context of workflow management, focusing
on sequential (Herbst and Karagiannis, 2000; Herbst, 2000b) and concurrent processes
(Herbst, 2000a). In Măruşter et al. (2002), a technique for discovering workflow pro-
cesses in hospital data is presented. Theoretical results are presented in Aalst et al.
(2004), providing proof that for certain subclasses of processes it is possible to find the
correct process model.
To illustrate the idea of process discovery, consider the process log from Figure 1(a). In
this example seven executed cases are logged. Twelve different tasks occur in these cases.
We can notice the following example regularities: for each case, the execution starts with
task a and ends with task l; if c is executed, then e is executed immediately afterwards.
Using the information shown in the process log from Figure 1(a), we can discover
the process model shown in Figure 1(b). We represented the model using Petri nets
(Reisig and Rosenberg, 1998), where all tasks are expressed as transitions.
Figure 1. An excerpt of a process log and the corresponding Petri net process model.
The Petri net formalism has several advantages, and it is therefore often used to represent process
models (Aalst, 1998): formal semantics (a clear and precise definition), graphical nature
(intuitive and easy to learn), expressiveness (supports all primitives needed to model a
process), properties (the mathematical foundation allows for reasoning about Petri net
properties), analysis (many analysis techniques to prove properties and calculate per-
formance measures), and vendor independence (not based on the software package of a specific
vendor). In Figure 1(b), after executing a, either task b or task f can be executed. If task
f is executed, tasks h and g can be executed in parallel. A parallel execution of tasks h
and g means that they can appear in any order.
In the case of real-world processes which can involve many more tasks and which
can exhibit higher levels of parallelism, the problem of discovering the underlying
process can become prohibitively complex. Moreover, process mining can be harmed
and hindered when process logs contain noise—random replacements or insertions of
incorrect symbols—or have missing information. A process log is complete when all
tasks that potentially directly follow each other, in fact do directly follow each other in
some trace in the log. In case of a complex process, incomplete process logs will not
contain enough information to detect the causal relation between tasks. The notion of
completeness is formally defined in Aalst et al. (2004). Note that a process log can be
complete without containing all possible cases. A heuristic process discovery method,
based on simple count statistics, able to handle certain levels of noise is described in
Weijters and Aalst (2001). Nevertheless, in some situations this heuristic method is not
robust enough for discovering the complete process. Process discovery was subsequently
tackled at a more robust level in Măruşter et al. (2002), using an empirical, data-driven
approach; more specifically, a logistic regression model able to detect the causal
relations (or direct successors) from process
logs. However, that logistic regression approach requires a global threshold value for
deciding when there is a direct succession relation between two tasks. The use of a global
threshold has the drawback of being too rigid, thus real relations may not be found and
false relations may be considered. In Medeiros, Weijters, and Aalst (2004), more
advanced issues in robustness towards noisy data and finding causality between tasks
are tackled using genetic algorithms. An overview of issues and related work about
Process Mining can be found in Aalst and Weijters (2004).
The problem of noisy and incomplete process logs is not the only difficulty which
may occur during process mining. A review of challenging process mining problems is
given in Aalst and Weijters (2004), which refers to mining hidden tasks, mining duplicate
tasks, mining loops, using time, mining different perspectives, and dealing with noise
and incompleteness.
In Aalst et al. (2004) an algorithm called the α algorithm is developed which, given
a complete process log, can (re-)discover quite a large class of Petri nets (the
discussion of the properties of these Petri nets is beyond the scope of this paper and
is addressed in Aalst et al. (2004)). However, the α algorithm has some limitations,
such as (i) mining loops and (ii) dealing with incomplete and noisy process logs. In
Medeiros et al. (2004), an extension of the α algorithm is provided that addresses the first
limitation, i.e., it can handle short loops. In this paper, we address the second limitation
of the α algorithm presented in Aalst et al. (2004), namely dealing with incomplete and
noisy process logs, to allow its applicability to real-world processes.

The aim of this article is two-fold. First, we describe a rule-based approach for process
discovery, assuming the existence of noisy information in the process log and imbalance
in execution priorities. Second, we want to gain insight into the effects of noise and
imbalance during process discovery. Our goal is to use machine learning techniques
to induce classification rules for (i) causal relations (i.e., for each task, find its direct
successor tasks) and (ii) parallel/exclusive relations (i.e., for tasks that share
the same cause or the same direct successor, detect whether they can be executed in parallel
or whether there is a choice between them). Knowing these relations between tasks, a process
model can be constructed by using the α algorithm (Aalst et al., 2004).
The article is organized as follows: in Section 2 the types of relations that can exist
between two tasks are described. The methodology for generating experimental data used
to induce the rule sets is presented in Section 3. In Section 4 the methods for inducing
the rule sets are introduced. In Section 5 we evaluate the rule sets, and in Section 6 we
discuss the results obtained, focusing on the influence of process characteristics on rule
set performance. In Section 7 we illustrate our approach using real data from a case
study. We end with discussing issues for further research in Section 8.
2. The log-based relations
Discovering a model from process logs involves determining the dependencies among
tasks. We choose to express these dependencies as log-based relations. The log-based
relations are formally introduced in Măruşter et al. (2002) and Aalst et al. (2004), in
the context of workflow logs and workflow traces. Because we focus on the process
perspective, we use the same definitions as in Aalst et al. (2004), this time referring to
process logs and process traces.
Definition 1. Process trace, process log
Let T be a set of tasks. δ ∈ T* is a process trace and W : T* → ℕ is a process log,
where W(δ) denotes the number of times the trace δ occurs in the log.
Figure 1(a) is an example of a process log; afghikl is an example of a process
trace, belonging to case 1. This process trace is unique (i.e., W(afghikl) = 1). However,
the process trace abcejl appears three times (e.g. for cases 2, 5 and 7) in the log
(i.e., W(abcejl) = 3). Especially when logs may contain noise, the use of
frequency information appears crucial.
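Definition 1, combined with the frequency counts just discussed, suggests a simple working representation of a process log: a mapping from each trace to the number of cases in which it occurs. The following Python sketch is not part of the original paper; it only encodes the two traces explicitly mentioned in the text (the remaining cases of Figure 1(a) are not reproduced), and the function name build_log is an illustrative choice.

```python
from collections import Counter

# A process log W maps each trace (a finite sequence of tasks) to the number
# of times it occurs in the log, i.e., W : T* -> N.  Traces are written here
# as strings of single-letter task names, as in Figure 1(a).
def build_log(case_traces):
    """Build the log W from one trace per executed case."""
    return Counter(case_traces)

# Only the traces explicitly mentioned in the text are reproduced here; the
# remaining cases of Figure 1(a) are omitted.
W = build_log(["afghikl", "abcejl", "abcejl", "abcejl"])

print(W["afghikl"])  # 1 -- the trace of case 1 is unique
print(W["abcejl"])   # 3 -- cases 2, 5 and 7 share this trace
```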
Definition 2. Succession relation
Let W be a process log over the tasks T with a, b ∈ T. Then between a and b there
is a succession relation (notation a > b), i.e., b succeeds a, if and only if there is a
trace δ = t_1 t_2 ... t_n in W (i.e., W(δ) > 0) with t_i = a and t_{i+1} = b for some
i ∈ {1, ..., n−1}. The succession relation > describes which tasks appeared in sequence,
i.e., one directly following the other. In the log from Figure 1(a), a > f, f > g, b > c,
h > g, g > h, etc.
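Definition 2 can be computed directly by scanning each trace in the log for adjacent pairs of tasks. A minimal sketch, assuming the Counter-based log representation from the previous snippet:

```python
def succession(W):
    """Return the set of pairs (a, b) with a > b, i.e., b directly follows a
    in some trace that actually occurs in the log (W(trace) > 0)."""
    pairs = set()
    for trace, count in W.items():
        if count <= 0:
            continue
        for i in range(len(trace) - 1):
            pairs.add((trace[i], trace[i + 1]))
    return pairs

# On the partial example log above this yields, among others, ('a', 'f'),
# ('f', 'g') and ('b', 'c'), i.e., a > f, f > g and b > c.
```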
Definition 3. Causal, exclusive and parallel relations
Let W be a process log over the tasks T with x, y ∈ T. If we assume that there is no
noise in W, then between x and y there is:
1. a causal relation (notation x → y), i.e., x causes y, if and only if x > y and y ≯ x. We
consider the inverse of the causal relation, →⁻¹, i.e., →⁻¹ = {(y, x) ∈ T × T | x →
y}. We call task x the cause of task y and task y the direct successor of task x.
2. an exclusive relation (notation x # y) if and only if x ≯ y and y ≯ x;
3. a parallel relation (notation x ∥ y) if x > y and y > x.
The relations →, →⁻¹, # and ∥ are mutually exclusive and partition T × T (Aalst
et al., 2004).
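Under the no-noise assumption, Definition 3 translates directly into membership tests on the succession relation. The sketch below is a literal transcription of the definition (not the rule-based method developed later in this paper), assuming the succession() helper from the previous snippet:

```python
def relation(x, y, succ):
    """Classify the pair (x, y) according to Definition 3, where succ is the
    set of succession pairs {(a, b) | a > b} of a noise-free process log."""
    xy = (x, y) in succ   # x > y ?
    yx = (y, x) in succ   # y > x ?
    if xy and not yx:
        return "causal"          # x -> y
    if yx and not xy:
        return "inverse causal"  # y -> x
    if xy and yx:
        return "parallel"        # x || y
    return "exclusive"           # x # y
```

On the complete log of Figure 1(a) (not fully reproduced in the snippets above), this would classify c and e as causal, b and f as exclusive, and h and i as parallel, matching the examples discussed next.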
To illustrate the above definitions, let’s consider again the process log from Figure
1(a) corresponding to the Petri net from Figure 1(b). If there is no noise, there are three
possible situations in which a pair of events (henceforth referred to as tasks) can be
related, namely causal, exclusive, and parallel:
causal relation. Tasks c and e have a causal relation, because c > e, e ≯ c, thus c → e;
exclusive relation. There is a choice between tasks b and f, because b ≯ f, f ≯ b, thus
b # f (and f # b);
parallel relation. Tasks h and i are in parallel, because h > i, i > h, thus h ∥ i (and
i ∥ h).
The information on all three types of relations occurring between all tasks is necessary
and sufficient to construct the Petri net model using the α algorithm (Aalst et al., 2004).
The α algorithm considers first all tasks that stand in a causal relation. Then, for all
tasks that share the same immediately-neighboring input or output task, their exclusive
or parallel relations are incorporated in the Petri net. Although this algorithm can (re-
)discover quite a large class of Petri nets, it also has some limitations, particularly with
respect to incomplete and noisy process logs.
The existence of incompleteness and noise in a process log disturbs the application
of the notions presented in Definition 3. Considering the Petri net from Figure 1(b),
suppose that we want to discover the relations between pairs of tasks c and e, b
and f,
and h and i, given a particular example log file. We may find in this file that c > e ten
times; however, because of some noisy sequences, we may also find that e > c once.
Applying Definition 3, we could conclude that c ∥ e, which is incorrect, because actually
c → e. Also, we have to find at least once in the log that c > e in order to determine c →
e, otherwise the log is incomplete and we cannot detect the causal relation between
c and e. Similarly, when noise exists, we may find in our noisy example log that both b
> f and f > b occur once, which according to Definition 3 means that b and f stand in a
parallel relation (actually, b # f!).
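This failure mode can be made concrete with a small count-based illustration. The sketch below is not the method proposed in this paper (the following sections instead induce noise-robust rule sets); it merely shows how a single noisy e > c observation is enough to make a literal application of Definition 3 flip the verdict for c and e from causal to parallel. The counts are hypothetical, chosen to match the example in the text.

```python
from collections import Counter

# Hypothetical direct-succession counts matching the example in the text:
# c > e is observed ten times, and one noisy trace also yields e > c once.
counts = Counter({("c", "e"): 10, ("e", "c"): 1})

c_gt_e = counts[("c", "e")] > 0   # c > e holds
e_gt_c = counts[("e", "c")] > 0   # e > c also holds, but only due to noise

# Definition 3 applied literally: both directions occur, so c and e are
# (incorrectly) classified as parallel, although the true relation is c -> e.
verdict = "parallel" if (c_gt_e and e_gt_c) else ("causal" if c_gt_e else "unknown")
print(verdict)  # -> "parallel"
```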
We want to be able to use the α algorithm on noisy logs. Therefore, instead of applying
Definition 3 directly, which breaks down in noisy circumstances, we use
machine learning techniques to induce noise-robust rule sets to determine the status
of relations among task pairs. Given these relations, we can apply the α algorithm to
construct the Petri net process model.
3. Experimental setting and data generation
Our experimental setup assumes the presence of learning material for inducing rule sets
to detect causal, parallel, and exclusive relations. This learning material should resemble
realistic process logs and should be sufficiently general to allow for generic rule sets
to be induced. We assume here that the following four characteristics underlie a typical
realistic process, where variations of these characteristics affect the process logs: (i) the

References
C4.5: Programs for Machine Learning (book).
Machine learning (journal article).
Fast effective rule induction (book chapter).
The application of Petri-nets to workflow management (journal article).
Workflow mining: discovering process models from event logs (journal article).