A Grid-Based Multi-relational Approach to
Process Mining
Antonio Turi, Annalisa Appice, Michelangelo Ceci, and Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari
via Orabona, 4 - 70126 Bari - Italy
{turi,appice,ceci,malerba}@di.uniba.it
Abstract. Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant
events are registered in massive logs and process mining techniques are
used to automatically discover knowledge that reveals the execution and
organization of the process instances (cases). In this paper, we investigate
the use of a multi-level relational frequent pattern discovery method as a
means of process mining. In order to process such massive logs we resort
to a Grid-based implementation of the knowledge discovery algorithm
that distributes the computation on several nodes of a Grid platform.
Experiments are performed on real event logs.
1 Introduction
Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs. Events are described in a structured form that includes properties of cases and activities. A case represents the process instance which is being handled, while an activity represents an operation on the case. Information on the timestamp and on the person executing the event (the performer) is available in the logs. Both activities and performers may belong to different categories. Event logs are stored in multi-terabyte warehouses, and sophisticated data mining techniques are required to process this huge amount of data and extract knowledge concerning the execution and organization of the recorded processes. Such data are the main concern of research in process mining, whose aim is to discover a description or prediction of real process, control, organizational, and social structures [10].
Process mining poses several challenges to traditional data mining tasks. In fact, data stored in event logs describe objects of different types (cases, activities and performers), which are naturally modeled as several relational data tables, one for each object type. Foreign key constraints express the relations between these objects. This (relational) data representation makes it necessary to distinguish between the reference objects of analysis (cases) and the other task-relevant objects (activities and performers), and to represent their interactions.
Another challenge is represented by the temporal autocorrelation. Events are
temporally related according to a timestamp. This means that the effect of a
S.S. Bhowmick, J. Küng, and R. Wagner (Eds.): DEXA 2008, LNCS 5181, pp. 701–709, 2008.
© Springer-Verlag Berlin Heidelberg 2008

property at any event may not be limited to that specific event. Furthermore, activities and performers are generally organized in hierarchies of categories (e.g. the performer of operations on a text file can be a writer or a reader). By descending or ascending through a hierarchy, it is possible to view the same object at different levels of abstraction (or granularity). Finally, reasoning is the process by which information about objects and their relations (e.g. the operator of an indirect successor) is used to arrive at valid conclusions regarding the object relations [7]. This source of knowledge cannot be ignored in the search.
Currently, many algorithms [2,1,11,3] have dealt with several of these challenges, and some of them are integrated into the ProM framework [12]. However, to the best of our knowledge, methods of process mining neither support a multi-level analysis nor use inferential mechanisms defined within a reasoning theory. Conversely, the multi-relational data mining method SPADA [5] offers a sufficiently complete solution to all the challenges posed by process mining tasks in the descriptive case. However, SPADA is not applicable in practice. Indeed, frequent pattern discovery is a very complex task, particularly in the multi-relational case [5]. In addition, SPADA, like most multi-relational data mining algorithms, operates with data in main memory, hence it is not appropriate for processing massive logs. The advantages of the multi-relational approach in facing the challenges of process mining justify the attempt to resort to the computational power of distributed high-performance environments (e.g., computational Grids [4]) to mitigate the complexity of relational frequent pattern discovery on massive event logs.
In this paper, we present G-SPADA, an extension of SPADA which discovers approximate multi-level relational frequent patterns by distributing the exact computation of locally frequent multi-level relational patterns on a computational Grid and then post-processing the local patterns in order to approximate the set of globally frequent patterns as well as their supports. Distributing relational frequent pattern discovery on a Grid poses several issues. Firstly, the relational data must be divided into subsets and each subset has to be distributed on the Grid. The split must take the relational structure of the data into account, that is, each data split must include a subset of the reference objects together with the task-relevant objects needed to reconstruct all interactions between them. Secondly, a framework is necessary for building Grid applications that utilize the power of distributed computation and storage resources across the Internet. Finally, processing local patterns to approximate global ones requires a way of combining distinct sets of patterns into a single one and obtaining an estimate of the global support.
2 Multi-level Relational Frequent Pattern Discovery
The multi-level relational pattern discovery task is formally defined as follows.
Given: a set S of reference objects; some sets R_k, 1 ≤ k ≤ m, of task-relevant objects; a background knowledge BK which includes hierarchies H_k on the objects in R_k and domain knowledge in the form of rules; a deductive database D formed by an extensional part (D_E), where properties and relations of reference objects and task-relevant objects are expressed in derived ground predicates, and an intensional part (D_I), where the domain knowledge in BK is expressed in the form of rules; M granularity levels in the descriptions (1 for the highest); a set of granularity assignments ψ_k which associate each object in H_k with a granularity level, in order to deal with several hierarchies at once; a threshold minsup[l] for each granularity level l (1 ≤ l ≤ M).
Find: for each granularity level l, the frequent¹ relational patterns which involve properties and relations of task-relevant objects at level l of H_k.
The relational formalization of the frequent pattern discovery task is based on the idea that each unit of analysis (or example) D[s] includes a reference object s ∈ S and all the task-relevant objects of R_k which are (directly or indirectly) related to s according to some foreign key path in D. The frequency (support) of a pattern is based on the number of units of analysis, i.e., reference objects, covered by the pattern. An example of relational pattern is:
Example 1. Let D_E be an extensional database of event logs. A possible relational pattern P1 on D has the form:
P1: case(A), activity(A,B), is_a(B,activity), before(B,C), is_a(C,activity), description(C,workinprogress), user(B,D), is_a(D,performer) [72.5%]
P1 expresses the fact that a process A is formed by two sequential activities, namely B and C, and the performer of B is generic. The support is 72.5%.
By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
Example 2. Let us consider the two-level hierarchies defined on performers and activities in the following: administrator and user are performers; namemaker, delete and workflow are activities.
P2 is a finer-grained relational pattern than P1, obtained by descending one level in the hierarchies. P2 has the form:
P2: case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress), is_a(D,administrator) [62.5%]
P2 provides better insight than P1 into the nature of B, C and D.
In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (≽_θ).
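As a toy illustration of the support notion used above (this is not SPADA's internal representation; all names below are hypothetical), a unit of analysis can be modeled as a case plus the facts reachable from it, with support computed as the fraction of units a pattern covers:

```python
# A unit of analysis D[s]: one reference object (a case) together with the
# ground facts about the task-relevant objects reachable from it.
def support(covers, units):
    """Fraction of units of analysis covered by a pattern."""
    return sum(1 for u in units if covers(u)) / len(units)

units = [
    {"case": "c1", "facts": {("is_a", "a1", "namemaker"), ("before", "a1", "a2")}},
    {"case": "c2", "facts": {("is_a", "a4", "workflow")}},
]

# Hand-written coverage test for "the case contains a namemaker activity".
def covers_namemaker(unit):
    return any(f[0] == "is_a" and f[2] == "namemaker" for f in unit["facts"])

print(support(covers_namemaker, units))  # 0.5
```

In the real setting the coverage test is a θ-subsumption check of the pattern against the unit, not a hand-written predicate.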
3 G-SPADA
Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently. It approximates multi-level relational frequent pattern discovery by means of a three-step strategy. In the first step, the set of the original N reference objects is partitioned into n approximately equally-sized subsets (n ≪ N). Each partition includes a subset of the reference objects and the set of task-relevant objects. In the second step, the frequent pattern
¹ With support greater than minsup[l].

computation is parallelized and distributed on n nodes of a Grid platform, one node per partition. In this way, G-SPADA generates n parallel executions of SPADA at the same time and retrieves the local patterns which are frequent in at least one of the data partitions. In the third step, G-SPADA approximates the set of globally frequent patterns by merging the patterns discovered at the nodes.
The basic idea in approximating the global patterns is that each globally frequent pattern must be locally frequent in at least k partitions of the original dataset. When k is set to 1, this guarantees that the union of all local solutions is a superset of the global solution. However, a merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent. Hence, the value of k should be adequately tuned between 1 and n in order to find the best trade-off between false positive and false negative frequent patterns. The merge step also attempts to approximate the support values of the global patterns starting from the local support values.
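The merge step just described can be sketched as follows; this is a simplified illustration in which patterns are plain strings, whereas G-SPADA recognizes the same pattern across partitions under θ-subsumption:

```python
from collections import defaultdict

def merge_local_patterns(local_results, k):
    """Approximate the globally frequent patterns from n local result sets.

    local_results: one dict {pattern: local_support} per partition. A pattern
    is kept when it is locally frequent in at least k partitions, and its
    global support is approximated by averaging over those partitions.
    """
    supports = defaultdict(list)
    for result in local_results:
        for pattern, sup in result.items():
            supports[pattern].append(sup)
    return {p: sum(s) / len(s) for p, s in supports.items() if len(s) >= k}

# p1 is frequent in 3 partitions, p2 and p3 in only one each.
local_results = [{"p1": 0.5, "p2": 0.6}, {"p1": 1.0}, {"p1": 0.75, "p3": 0.9}]
print(merge_local_patterns(local_results, k=2))  # {'p1': 0.75}
```

With k = 2 the one-partition patterns are filtered out; with k = 1 the union of all local solutions would be returned.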
3.1 Relational Data Partitioning
G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (D_E) with the information that is implicit in the domain knowledge (D_I). An example of this saturation step is:
Example 3. Let us consider the deductive database:
case(c1). case(c2). activity(c1,a1). activity(c1,a2). activity(c1,a3). activity(c2,a4). time(a1,10). time(a2,25). time(a3,29). time(a4,13). description(a1,create). ...
before(A1,A2) :- activity(C,A1), activity(C,A2), A1 \= A2, time(A1,T1), time(A2,T2), T1 < T2, not((activity(C,A), A \= A1, A \= A2, time(A,T), T1 < T, T < T2)).
By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
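A hypothetical helper reproducing this saturation outside the deductive engine might look as follows (assuming activities are grouped by case and totally ordered by their timestamps, as in the rule above):

```python
# Within each case, an activity is "before" another when it ends earlier and
# no third activity of the same case falls strictly between them, i.e. the
# relation holds only between consecutive activities.
def saturate_before(activities, times):
    """activities: {case: [activity, ...]}; times: {activity: timestamp}."""
    before = []
    for case, acts in activities.items():
        ordered = sorted(acts, key=times.get)      # order by timestamp
        before.extend(zip(ordered, ordered[1:]))   # consecutive pairs only
    return before

activities = {"c1": ["a1", "a2", "a3"], "c2": ["a4"]}
times = {"a1": 10, "a2": 25, "a3": 29, "a4": 13}
print(saturate_before(activities, times))  # [('a1', 'a2'), ('a2', 'a3')]
```

The output matches the ground predicates made explicit in Example 3: before(a1,a2) and before(a2,a3); the single activity of case c2 produces no pair.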
Saturation precedes data partitioning. In this way, redundant inferences are prevented for properties and relations of task-relevant objects shared by two or more reference objects belonging to different data partitions.
Data partitioning is performed by randomly splitting the set of reference objects into n approximately equal-sized partitions such that the union of the partitions is the entire set of reference objects. These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand. Subsequently, the properties and relations of the task-relevant objects related to those reference objects according to some foreign key path are also added to the partition.
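The partitioning scheme can be sketched as follows (a simplified illustration: facts are pre-indexed by case here, whereas the real system follows foreign key paths in the deductive database; all names are hypothetical):

```python
import random

def partition_reference_objects(cases, facts, n, seed=0):
    """Randomly split the reference objects (cases) into n roughly equal
    subsets, then enrich each subset with the ground predicates attached
    to its cases. facts maps each case to the predicates describing it
    and its related task-relevant objects."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)      # deterministic for the demo
    splits = [shuffled[i::n] for i in range(n)]
    return [{"cases": split,
             "facts": [f for c in split for f in facts.get(c, [])]}
            for split in splits]

cases = ["c1", "c2", "c3", "c4"]
facts = {"c1": ["activity(c1,a1)", "time(a1,10)"],
         "c2": ["activity(c2,a4)", "time(a4,13)"],
         "c3": ["activity(c3,a5)"],
         "c4": ["activity(c4,a6)"]}
partitions = partition_reference_objects(cases, facts, n=2)
print([sorted(p["cases"]) for p in partitions])
```

Each partition is self-contained: a SPADA run on it sees a complete unit of analysis for every reference object it holds.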
3.2 Distributing Computation on Grid
Each dataset partition is shipped, along with the G-SPADA pattern discovery algorithm, to computation nodes on the Grid using the gLite middleware (gLite, http://glite.web.cern.ch/glite/, is a next-generation middleware for Grid computing which provides a framework for building Grid applications). This is done

by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface). Submission of jobs on the Grid is divided into several steps: (i) authenticate on a UI (user interface) through the PKI-based authentication system with proxy credentials (GSI); (ii) prepare the jobs (JDL, a shell script to automate the procedure, input files); (iii) upload (stage-in) the dataset partitions; (iv) submit the corresponding parametric job; (v) check and wait for the results; (vi) finally, once the job has executed on the Grid, retrieve the output (stage-out) files containing the frequent pattern sets along with their supports for each sample.
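A parametric job description for step (iv) might look like the following JDL fragment. The attribute names follow the gLite WMS JDL conventions (`_PARAM_` is the placeholder the WMS substitutes with each parameter value); the file names and the number of partitions are illustrative, not taken from the paper:

```
[
  JobType        = "Parametric";
  Parameters     = 10;            // one job instance per data partition
  ParameterStart = 1;
  ParameterStep  = 1;
  Executable     = "run_spada.sh";
  Arguments      = "partition__PARAM_.db";
  InputSandbox   = {"run_spada.sh", "partition__PARAM_.db"};
  OutputSandbox  = {"patterns__PARAM_.out", "std.err"};
]
```

The WMS expands this single description into one job per partition, which matches the one-node-per-partition scheme of the second G-SPADA step.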
3.3 Computing Approximate Global Frequent Patterns
The n sets of locally frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns. For each local pattern discovered in at least k data partitions (1 ≤ k ≤ n), G-SPADA derives an approximation of the global support by averaging the support values collected on the partitions where the pattern is found to be frequent. The check that the same local pattern occurs in different partitions is based on an equivalence test between the two patterns under θ-subsumption, which corresponds to performing a double θ-subsumption test (P ≽_θ Q and Q ≽_θ P). Local patterns occurring in fewer than k partitions are filtered out. The global frequent patterns obtained following this merge procedure approximate the frequent patterns which could be mined on the entire dataset.
An example of approximate global process pattern is:
case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress) [7, 72.5%]
which describes the order of execution between two activities, namely B and C, in the process A. B is a name-maker activity while C is a workflow activity. In addition, C is described as work in progress. The value 7 means that this pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support obtained by averaging the support values computed on the 7 samples.
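The double θ-subsumption test used to recognize the same pattern across partitions can be sketched with a brute-force substitution search (a simplified, exponential illustration, not SPADA's implementation; literals are tuples and uppercase strings are variables):

```python
# Literals are tuples (predicate, arg1, ...); uppercase strings are variables.
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def subsumes(p, q, theta=None):
    """True if some substitution maps every literal of pattern p into q."""
    theta = dict(theta or {})
    if not p:
        return True
    head, *rest = p
    for lit in q:
        if lit[0] != head[0] or len(lit) != len(head):
            continue
        trial, ok = dict(theta), True
        for a, b in zip(head[1:], lit[1:]):
            if is_var(a):
                if trial.setdefault(a, b) != b:  # variable already bound elsewhere
                    ok = False
                    break
            elif a != b:                         # constants must match exactly
                ok = False
                break
        if ok and subsumes(rest, q, trial):
            return True
    return False

def equivalent(p, q):
    """Double θ-subsumption test: p and q denote the same pattern."""
    return subsumes(p, q) and subsumes(q, p)

P = [("case", "A"), ("activity", "A", "B"), ("is_a", "B", "namemaker")]
Q = [("case", "X"), ("activity", "X", "Y"), ("is_a", "Y", "namemaker")]
print(equivalent(P, Q))  # True
```

P and Q differ only in variable names, so each θ-subsumes the other and their local supports would be averaged together in the merge step.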
4 Experimental Results
Experiments are performed by processing event logs provided by THINK3 Inc. (http://www.think3.com/en/default.aspx) in the context of the TOCAI.It project (http://www.dis.uniroma1.it/tocai/index.php). THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes. G-SPADA is run on the deductive database obtained by boiling down the event logs from January 1st to February 28th, 2006, and considering as domain knowledge the definition of the "before" predicate. In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from the traced business processes. These patterns capture the possible relations between the order of activities and the properties of their performers.
