Book ChapterDOI

A Grid-Based Multi-relational Approach to Process Mining

01 Sep 2008-Vol. 5181, pp 701-709
TL;DR: This paper investigates the use of a multi-level relational frequent pattern discovery method as a means of process mining, resorting to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation over several nodes of a Grid platform.
Abstract: Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant events are registered in massive logs and process mining techniques are used to automatically discover knowledge that reveals the execution and organization of the process instances (cases). In this paper, we investigate the use of a multi-level relational frequent pattern discovery method as a means of process mining. In order to process such massive logs we resort to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation on several nodes of a Grid platform. Experiments are performed on real event logs.

Summary (3 min read)

1 Introduction

  • Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs.
  • Process mining poses several challenges to the traditional data mining tasks.
  • This (relational) data representation makes it necessary to distinguish between the reference objects of analysis (cases) and other task-relevant objects (activities and performers), and to represent their interactions.
  • The authors present G-SPADA, an extension of SPADA, which discovers approximate multi-level relational frequent patterns by distributing exact computation of locally frequent multi-level relational patterns on a computational Grid and then by post-processing local patterns in order to approximate the set of the globally frequent patterns as well as their supports.

2 Multi-level Relational Frequent Pattern Discovery

  • By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
  • P2 provides better insight than P1 into the nature of B, C and D. In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (≥θ).

3 G-SPADA

  • Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently.
  • Each partition includes a subset of the reference objects and the set of task-relevant objects.
  • In the second step, the frequent pattern computation is parallelized and distributed on n nodes of a Grid platform, one node for each partition.
  • In the third step, G-SPADA approximates the set of globally frequent patterns by merging patterns discovered at the nodes.
  • A merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent.

3.1 Relational Data Partitioning

  • G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (D_E) with the information that is implicit in the domain knowledge (D_I).
  • By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
  • These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand.

3.2 Distributing Computation on Grid

  • Each dataset partition is shipped along with the G-SPADA pattern discovery algorithm to computation nodes on the Grid using the gLite middleware (http://glite.web.cern.ch/glite/), a framework for building Grid applications.
  • This is done by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface).

3.3 Computing Approximate Global Frequent Patterns

  • The n sets of local frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns.
  • Local patterns occurring in less than k partitions are filtered out.
  • The global frequent patterns obtained following this merge procedure approximate the original frequent patterns which could possibly be mined on the entire dataset.
  • In a global pattern annotated with [7, 72.5%], 7 means that the pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support obtained by averaging the support values computed on the 7 samples.

4 Experimental Results

  • Experiments are performed by processing event logs provided by THINK3 Inc. in the context of the TOCAI.It project.
  • THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes.
  • G-SPADA is run on the deductive database that is obtained by boiling down the event logs from January 1st to February 28th, 2006 and considering as domain knowledge the definition of the “before” predicate.
  • In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from traced business processes.
  • These patterns capture the possible relation between the order of activities and the properties of their performers.

4.1 Data Description

  • Data trace the behavior of 21,256 instances of a business process recorded in the period under analysis.
  • This corresponds to modeling activities and performers by means of three-level hierarchies.
  • For each activity, a text description of the operation is registered in the event logs.
  • The right part is a characterization of the description of the operation provided in the left part.
  • Finally, each performer is described by the belonging group.

4.2 Local and Global Multi-level Relational Patterns Discovery

  • G-SPADA is run on the event logs including 395,404 ground predicates.
  • Indeed, SPADA generates a memory exception when running on the entire dataset.
  • Multi-level relational patterns are discovered at each node with minsup[l] = 0.2 (l = 1, 2) and max_len_path = 9. Finally, for each level of granularity, global patterns are approximated from the local ones by varying k between 1 and 20.
  • Global patterns provide a compact description of the instances of process traced in the logs.
  • Finally, the relational pattern P4: case(A), activity(A,B), before(B,C), before(C,D), is_a(B,namemaker), is_a(C,workflow), is_a(D,workflow), descleft(C,creation), descleft(D,wip2k) [k=16, avgSup=21.35%] describes the execution order among three sequential activities, namely B, C and D. B is a namemaker activity, while C and D are workflow activities.

5 Conclusions

  • The authors present G-SPADA, an extension of the system SPADA, to discover approximate multi-level relational frequent patterns in the context of process mining.
  • G-SPADA exploits a multi-relational approach in order to deal with both the multiple nature of data stored in event logs and their temporal autocorrelation.
  • G-SPADA faces the need of processing massive logs by resorting to a Grid-based architecture.
  • Experiments on the real event logs allow us to discover interpretable patterns which capture regularities in the execution of activities and the characteristics of the performers of a business process.
  • Such patterns can be used to deploy new systems supporting the execution of business processes or analyzing and improving already enacted business processes.


A Grid-Based Multi-relational Approach to
Process Mining
Antonio Turi, Annalisa Appice, Michelangelo Ceci, and Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari
via Orabona, 4 - 70126 Bari - Italy
{turi,appice,ceci,malerba}@di.uniba.it
Abstract. Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant events are registered in massive logs and process mining techniques are used to automatically discover knowledge that reveals the execution and organization of the process instances (cases). In this paper, we investigate the use of a multi-level relational frequent pattern discovery method as a means of process mining. In order to process such massive logs we resort to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation on several nodes of a Grid platform. Experiments are performed on real event logs.
1 Introduction
Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs. Events are described in a structured form that includes properties of cases and activities. A case represents the process instance which is being handled, while an activity represents the operation on the case. Information on the timestamp and on the person executing the event (performer) is available in the logs. Both activities and performers may belong to different categories. Event logs are stored in multi-terabyte warehouses and sophisticated data mining techniques are required to process this huge amount of data and extract knowledge concerning the execution and organization of the recorded processes. This huge amount of data is the main concern of research in process mining, whose aim is to discover a description or prediction of real process, control, organizational, and social structures [10].
Process mining poses several challenges to the traditional data mining tasks. In fact, data stored in event logs describe objects of different types (cases, activities and performers) which are naturally modeled as several relational data tables, one for each object type. Foreign key constraints express the relations between these objects. This (relational) data representation makes it necessary to distinguish between the reference objects of analysis (cases) and other task-relevant objects (activities and performers), and to represent their interactions. Another challenge is represented by the temporal autocorrelation. Events are temporally related according to a timestamp.
This means that the effect of a property at any event may not be limited to the specific event. Furthermore, activities and performers are generally organized in hierarchies of categories (e.g. the performer of operations on a text file can be a writer or a reader). By descending or ascending through a hierarchy, it is possible to view the same object at different levels of abstraction (or granularity). Finally, reasoning is the process by which information about objects and their relations (e.g. operator of indirect successor) is used to arrive at valid conclusions regarding the object relations [7]. This source of knowledge cannot be ignored in the search.
Currently, many algorithms [2,1,11,3] have dealt with several of these challenges and some of them are integrated into the ProM framework [12]. Anyway, to the best of our knowledge, methods of process mining neither support a multi-level analysis nor use inferential mechanisms defined within a reasoning theory. Conversely, the multi-relational data mining method SPADA [5] offers a sufficiently complete solution to all the challenges posed by the process mining tasks in the descriptive case. However, SPADA is not applicable in practice. Indeed, frequent pattern discovery is a very complex task, particularly in the multi-relational case [5]. In addition SPADA, similarly to most of the multi-relational data mining algorithms, operates with data in main memory, hence it is not appropriate for processing massive logs. The advantages of the multi-relational approach in facing the challenges of process mining justify the attempt of resorting to the computational power of distributed high-performance environments (e.g., computational Grids [4]) to mitigate the complexity of relational frequent pattern discovery on massive event logs.
In this paper, we present G-SPADA, an extension of SPADA, which discovers approximate multi-level relational frequent patterns by distributing the exact computation of locally frequent multi-level relational patterns on a computational Grid and then by post-processing the local patterns in order to approximate the set of the globally frequent patterns as well as their supports. Distributing relational frequent pattern discovery on a Grid poses several issues. Firstly, relational data must be divided into data subsets and each subset has to be distributed on the Grid. The split must take into account the relational structure of the data, that is, each data split must include a subset of reference objects and the task-relevant objects needed to reconstruct all interactions between them. Secondly, a framework is necessary for building Grid applications that exploit the power of distributed computation and storage resources across the Internet. Finally, processing local patterns to approximate global ones requires a way of combining distinct sets of patterns into a single one and obtaining an estimate of the global support.
2 Multi-level Relational Frequent Pattern Discovery
The multi-level relational pattern discovery task is formally defined as follows:
Given: a set S of reference objects; some sets R_k, 1 ≤ k ≤ m, of task-relevant objects; a background knowledge BK which includes hierarchies H_k on the objects in R_k and domain knowledge in the form of rules; a deductive database D that is formed by an extensional part (D_E), where properties and relations of reference objects and task-relevant objects are expressed in derived ground predicates, and an intensional part (D_I), where the domain knowledge in BK is expressed in the form of rules; M granularity levels in the descriptions (1 for the highest); a set of granularity assignments ψ_k which associate each object in H_k with a granularity level, in order to deal with several hierarchies at once; a threshold minsup[l] for each granularity level l (1 ≤ l ≤ M).
Find: for each granularity level l, the frequent (i.e., with support greater than minsup[l]) relational patterns which involve properties and relations of task-relevant objects at level l of H_k.
The relational formalization of the task of frequent pattern discovery is based on the idea that each unit of analysis (or example) D[s] includes a reference object s ∈ S and all the task-relevant objects of R_k which are (directly or indirectly) related to s according to some foreign key path in D. The frequency (support) of a pattern is based on the number of units of analysis, i.e., reference objects, covered by the pattern. An example of relational pattern is:
Example 1. Let D_E be an extensional database of event logs. A possible relational pattern P1 on D is in the form:
P1: case(A), activity(A,B), is_a(B,activity), before(B,C), is_a(C,activity), description(C,workinprogress), user(B,D), is_a(D,performer) [72.5%]
P1 expresses the fact that a process A is formed by two sequential activities, namely B and C, and that the performer of B is generic. The support is 72.5%.
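To make the notion of support concrete, the following minimal Python sketch (not part of SPADA; the unit construction and the coverage test covers(pattern, unit) are assumed to be given) computes the fraction of units of analysis covered by a pattern and filters candidate patterns against minsup[l]:

    # Illustrative sketch only: `units` holds one unit of analysis D[s] per
    # reference object s, and `covers(pattern, unit)` is an assumed coverage test.
    def support(pattern, units, covers):
        """Fraction of units of analysis (reference objects) covered by the pattern."""
        return sum(1 for u in units if covers(pattern, u)) / len(units)

    def frequent_patterns(candidates, units, covers, minsup):
        """Keep the candidate patterns whose support reaches minsup[l] at the current level l."""
        result = []
        for p in candidates:
            s = support(p, units, covers)
            if s >= minsup:
                result.append((p, s))
        return result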
By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
Example 2. Let us consider two-level hierarchies defined on performers and activities as follows: administrator and user are kinds of performer; namemaker, delete and workflow are kinds of activity. P2 is a finer-grained relational pattern than P1, obtained by descending one level in the hierarchies. P2 is in the form:
P2: case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress), is_a(D,administrator) [62.5%]
P2 provides better insight than P1 into the nature of B, C and D.
In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (≥θ).
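The following sketch (illustrative only; the dictionaries and the helper name are ours, not SPADA's) shows how the two-level hierarchies of Example 2 let the same task-relevant object be viewed at different granularity levels, e.g. turning the level-2 atom is_a(B,namemaker) into the level-1 atom is_a(B,activity):

    # Child-to-parent maps encode the hierarchies of Example 2 (level 2 -> level 1).
    ACTIVITY_HIERARCHY = {"namemaker": "activity", "delete": "activity", "workflow": "activity"}
    PERFORMER_HIERARCHY = {"administrator": "performer", "user": "performer"}

    def at_level(category, hierarchy, target_level, leaf_level=2):
        """Climb from a leaf category up to the requested granularity level (1 = highest)."""
        level = leaf_level
        while level > target_level and category in hierarchy:
            category = hierarchy[category]
            level -= 1
        return category

    print(at_level("namemaker", ACTIVITY_HIERARCHY, target_level=1))       # -> activity
    print(at_level("administrator", PERFORMER_HIERARCHY, target_level=2))  # -> administrator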
3 G-SPADA
Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently. It approximates the multi-level relational frequent pattern discovery by means of a three-step strategy. In the first step, the set of original N reference objects is partitioned into n approximately equally-sized subsets (n << N). Each partition includes a subset of the reference objects and the set of task-relevant objects. In the second step, the frequent pattern computation is parallelized and distributed on n nodes of a Grid platform, one node for each partition. In this way, G-SPADA generates n parallel executions of SPADA at the same time and retrieves local patterns which are frequent in at least one of the data partitions. In the third step, G-SPADA approximates the set of globally frequent patterns by merging the patterns discovered at the nodes. The basic idea in approximating the global patterns is that each globally frequent pattern must be locally frequent in at least k partitions of the original dataset. In the case k is set to 1, this guarantees that the union of all local solutions is a superset of the global solution. However, a merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent. Hence, the value of k should be adequately tuned between 1 and n in order to find the best trade-off between false positive and false negative frequent patterns. The merge step also attempts to approximate the support values of the global patterns starting from the local support values.
3.1 Relational Data Partitioning
G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (D_E) with the information that is implicit in the domain knowledge (D_I). An example of this saturation step is:
Example 3. Let us consider the deductive database:
case(c1). case(c2). activity(c1,a1). activity(c1,a2). activity(c1,a3). activity(c2,a4). time(a1,10). time(a2,25). time(a3,29). time(a4,13). description(a1,create). ...
before(A1,A2) :- activity(C,A1), activity(C,A2), time(A1,T1), A1 \= A2, time(A2,T2), T1 < T2, not(activity(C,A), A \= A1, A \= A2, time(A,T), T1 < T, T < T2).
By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
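As an illustration of what the saturation step computes, the sketch below derives the same before/2 facts from the ground facts of Example 3 in plain Python (the data layout is ours for readability; G-SPADA actually evaluates the intensional rule over the deductive database):

    from collections import defaultdict

    # Ground facts of Example 3: activity(Case, Act) and time(Act, Timestamp).
    activity = [("c1", "a1"), ("c1", "a2"), ("c1", "a3"), ("c2", "a4")]
    time = {"a1": 10, "a2": 25, "a3": 29, "a4": 13}

    def saturate_before(activity, time):
        """before(A1,A2) holds when A2 is the immediate successor of A1 in the same case."""
        per_case = defaultdict(list)
        for case, act in activity:
            per_case[case].append(act)
        before = []
        for acts in per_case.values():
            acts.sort(key=lambda a: time[a])       # order the case's activities by timestamp
            before.extend(zip(acts, acts[1:]))     # keep consecutive pairs only
        return before

    print(saturate_before(activity, time))   # -> [('a1', 'a2'), ('a2', 'a3')]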
Saturation precedes data partitioning. In this way, redundant inferences are prevented for properties and relations of task-relevant objects shared by two or more reference objects belonging to different data partitions.
Data partitioning is performed by randomly splitting the set of reference objects into n approximately equal-sized partitions such that the union of the partitions is the entire set of reference objects. These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand. Subsequently, properties and relations of task-relevant objects related to the reference objects according to some foreign key path are also added to the partition.
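A minimal sketch of this partitioning step is given below; the fact layout and the function names are assumptions made for illustration, not G-SPADA's actual data structures:

    import random

    def split_reference_objects(cases, n, seed=0):
        """Randomly split the reference objects (cases) into n roughly equal-sized subsets."""
        cases = list(cases)
        random.Random(seed).shuffle(cases)
        return [cases[i::n] for i in range(n)]

    def build_partition(case_subset, activity_facts, task_relevant_facts):
        """Enrich a subset of cases with their ground facts and the reachable task-relevant objects.

        activity_facts: list of (case, activity) pairs (the foreign key path);
        task_relevant_facts: dict mapping an activity to the ground facts about it."""
        case_subset = set(case_subset)
        acts = [(c, a) for (c, a) in activity_facts if c in case_subset]
        related = {a: task_relevant_facts.get(a, []) for (_, a) in acts}
        return {"cases": case_subset, "activity": acts, "task_relevant": related}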
3.2 Distributing Computation on Grid
Each dataset partition is shipped along with the G-SPADA pattern discovery algorithm to computation nodes on the Grid using the gLite middleware (http://glite.web.cern.ch/glite/), a next-generation middleware for Grid computing which provides a framework for building Grid applications. This is done by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface). Submission of jobs on the Grid is divided into several steps: (i) authenticate on a UI (user interface) through a PKI-based authentication system with proxy credentials (GSI); (ii) prepare the jobs (JDL, shell script to automate the procedure, input files); (iii) upload (stage-in) the datasets; (iv) submit the corresponding parametric job; (v) check/wait for results; (vi) finally, once the job is executed on the Grid, retrieve the output (stage-out) files containing the frequent pattern sets along with their supports for each sample.
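For illustration only, the sketch below generates a parametric JDL description for the per-partition runs. The JDL attribute names follow the gLite WMS parametric-job syntax as commonly documented, and the file names (run_spada.sh, partition__PARAM_.db) are invented, so the snippet should be read as an assumption rather than the exact job description used by the authors:

    def make_parametric_jdl(n_partitions, minsup, max_len_path):
        """Build a parametric job description: _PARAM_ is expanded to 0..n_partitions-1."""
        return f"""[
      JobType        = "Parametric";
      Parameters     = {n_partitions};
      ParameterStart = 0;
      ParameterStep  = 1;
      Executable     = "run_spada.sh";
      Arguments      = "partition__PARAM_.db {minsup} {max_len_path}";
      InputSandbox   = {{"run_spada.sh", "spada", "partition__PARAM_.db"}};
      OutputSandbox  = {{"patterns__PARAM_.out"}};
    ]"""

    with open("gspada_jobs.jdl", "w") as jdl_file:
        jdl_file.write(make_parametric_jdl(n_partitions=20, minsup=0.2, max_len_path=9))
    # The file is then submitted from the UI with the gLite command-line tools after
    # creating a proxy credential, and the stage-out files are retrieved on completion.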
3.3 Computing Approximate Global Frequent Patterns
The n sets of local frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns. For each local pattern discovered in at least k data partitions (1 ≤ k ≤ n), G-SPADA derives an approximation of the global support by averaging the support values collected on the partitions where the pattern is found to be frequent. The check that the same local pattern occurs in different partitions is based on an equivalence test between two patterns under θ-subsumption, which corresponds to performing a double θ-subsumption test (P ≥θ Q and Q ≥θ P). Local patterns occurring in less than k partitions are filtered out. The global frequent patterns obtained following this merge procedure approximate the original frequent patterns which could possibly be mined on the entire dataset.
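A compact, self-contained sketch of such a double θ-subsumption test is given below. It is illustrative only: patterns are encoded as lists of atom tuples and strings starting with an uppercase letter are treated as variables, which is an assumption about the representation, not SPADA's internal one.

    def is_var(term):
        return isinstance(term, str) and term[:1].isupper()

    def theta_subsumes(p, q, subst=None):
        """True if some substitution θ of p's variables makes every atom of pθ occur in q."""
        subst = dict(subst or {})
        if not p:
            return True
        first, rest = p[0], p[1:]
        for atom in q:
            if atom[0] != first[0] or len(atom) != len(first):
                continue                       # predicate symbol or arity mismatch
            trial, ok = dict(subst), True
            for t_p, t_q in zip(first[1:], atom[1:]):
                if is_var(t_p):
                    if trial.setdefault(t_p, t_q) != t_q:
                        ok = False             # variable already bound to a different term
                        break
                elif t_p != t_q:
                    ok = False                 # constants must match exactly
                    break
            if ok and theta_subsumes(rest, q, trial):
                return True
        return False

    def equivalent(p, q):
        """Double θ-subsumption test: P θ-subsumes Q and Q θ-subsumes P."""
        return theta_subsumes(p, q) and theta_subsumes(q, p)

    p = [("case", "A"), ("activity", "A", "B"), ("is_a", "B", "workflow")]
    q = [("case", "X"), ("activity", "X", "Y"), ("is_a", "Y", "workflow")]
    print(equivalent(p, q))   # -> True (the patterns differ only by variable renaming)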
An example of approximate global process pattern is:
case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress) [7, 72.5%]
which describes the order of execution between two activities, namely B and C, in the process A. B is a namemaker activity while C is a workflow activity. In addition, C is described as work in progress. The value 7 means that this pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support obtained by averaging the support values computed on the 7 samples.
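The merge step can be sketched as follows (illustrative only; the pattern representation is the one of the previous sketch, and a plain equality test stands in for the θ-subsumption check so that the snippet runs on its own):

    def merge_local_patterns(local_results, k, same=None):
        """local_results: one list of (pattern, local_support) pairs per Grid node.

        Patterns found in fewer than k partitions are filtered out; the global support
        is approximated by the macro average over the partitions where they are frequent."""
        # The paper identifies identical patterns via a double θ-subsumption test
        # (e.g. the equivalent() function sketched above); plain equality is the fallback here.
        same = same or (lambda a, b: a == b)
        merged = []                                  # [representative_pattern, [supports]]
        for node_patterns in local_results:
            for pattern, sup in node_patterns:
                for entry in merged:
                    if same(entry[0], pattern):
                        entry[1].append(sup)
                        break
                else:
                    merged.append([pattern, [sup]])
        return [(pat, len(sups), sum(sups) / len(sups))
                for pat, sups in merged if len(sups) >= k]

    # A returned triple such as (pattern, 7, 0.725) mirrors the annotation [7, 72.5%]
    # in the example above: frequent in 7 partitions with macro-average support 72.5%.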
4 Experimental Results
Experiments are performed by processing event logs provided by THINK3 Inc. (http://www.think3.com/en/default.aspx) in the context of the TOCAI.It project (http://www.dis.uniroma1.it/tocai/index.php). THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes. G-SPADA is run on the deductive database that is obtained by boiling down the event logs from January 1st to February 28th, 2006 and considering as domain knowledge the definition of the “before” predicate. In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from the traced business processes. These patterns capture the possible relation between the order of activities and the properties of their performers.

Citations
More filters
Journal ArticleDOI
TL;DR: The development of maritime traffic research in pattern mining and traffic forecasting affirms the importance of advanced maritime traffic studies and the great potential in maritime traffic safety and intelligence enhancement to accommodate the implementation of the Internet of Things, artificial intelligence technologies, and knowledge engineering and big data computing solution.
Abstract: Maritime traffic service networks and information systems play a vital role in maritime traffic safety management. The data collected from the maritime traffic networks are essential for the perception of traffic dynamics and predictive traffic regulation. This paper is devoted to surveying the key processing components in maritime traffic networks. Specifically, the latest progress on maritime traffic data mining technologies for maritime traffic pattern extraction and the recent effort on vessels’ motion forecasting for better situation awareness are reviewed. Through the review, we highlight that the traffic pattern knowledge presents valued insights for wide-spectrum domain application purposes, and serves as a prerequisite for the knowledge based forecasting techniques that are growing in popularity. The development of maritime traffic research in pattern mining and traffic forecasting reviewed in this paper affirms the importance of advanced maritime traffic studies and the great potential in maritime traffic safety and intelligence enhancement to accommodate the implementation of the Internet of Things, artificial intelligence technologies, and knowledge engineering and big data computing solution.

105 citations


Cites methods from "A Grid-Based Multi-relational Appro..."

  • ...SPADA [80]–[82] has been applied to discover associations between a vessel and a trajectory to represent navigation spatio-temporal pattern....

    [...]

Book ChapterDOI
01 Jan 2012
TL;DR: This paper proposes the FIT-metric as a tool to characterize the stability of existing service configurations based on three components: functionality, integration and traffic and applies it to configurations taken from a production-strength SOA-landscape.
Abstract: The paradigm of service-oriented architectures (SOA) is by now accepted for application integration and in widespread use. As an underlying key-technology of cloud computing and because of unresolved issues during operation and maintenance it remains a hot topic. SOA encapsulates business functionality in services, combining aspects from both the business and infrastructure level. The reuse of services results in hidden chains of dependencies that affect governance and optimization of service-based systems. To guarantee the cost-effective availability of the whole service-based application landscape, the real criticality of each dependency has to be determined for IT Service Management (ITSM) to act accordingly. We propose the FIT-metric as a tool to characterize the stability of existing service configurations based on three components: functionality, integration and traffic. In this paper we describe the design of FIT and apply it to configurations taken from a production-strength SOA-landscape. A prototype of FIT is currently being implemented at Deutsche Post MAIL.

7 citations

References
More filters
Journal ArticleDOI
TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Abstract: One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.

952 citations


"A Grid-Based Multi-relational Appro..." refers methods in this paper

  • ...In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6] that is based on a breadth-first search in the lattice of patterns spanned by θ-subsumption [8] generality order ( θ)....

    [...]

Journal ArticleDOI
TL;DR: This paper describes the application of process mining in one of the provincial offices of the Dutch National Public Works Department, responsible for the construction and maintenance of the road and water infrastructure.

804 citations


"A Grid-Based Multi-relational Appro..." refers background in this paper

  • ...This huge amount of data is the main concern of research in process mining whose aim is to discover a description or prediction of real process, control, organizational, and social structures [10]....

    [...]

Journal Article
TL;DR: In this paper, the authors present an approach for a system that constructs process models from logs of past, unstructured executions of the given process, which conforms to the dependencies and put executions present in the log.
Abstract: Modern enterprises increasingly use the workflow paradigm to prescribe how business processes should be performed. Processes are typically modeled as annotated activity graphs. We present an approach for a system that constructs process models from logs of past, unstructured executions of the given process, which conforms to the dependencies and past executions present in the log. By providing models that capture the previous executions of the process, this technique allows easier introduction of a workflow system and evaluation and evolution of existing process models. We also present results from applying the algorithm to synthetic data sets as well as process logs obtained from an IBM Flowmark installation.

784 citations

Book ChapterDOI
23 Mar 1998
TL;DR: This work presents an approach for a system that constructs process models from logs of past, unstructured executions of the given process, and presents results from applying the algorithm to synthetic data sets as well as process logs obtained from an IBM Flowmark installation.
Abstract: Modern enterprises increasingly use the workflow paradigm to prescribe how business processes should be performed. Processes are typically modeled as annotated activity graphs. We present an approach for a system that constructs process models from logs of past, unstructured executions of the given process. The graph so produced conforms to the dependencies and past executions present in the log. By providing models that capture the previous executions of the process, this technique allows easier introduction of a workflow system and evaluation and evolution of existing process models. We also present results from applying the algorithm to synthetic data sets as well as process logs obtained from an IBM Flowmark installation.

742 citations


"A Grid-Based Multi-relational Appro..." refers background in this paper

  • ...+ − − prpModify [1] | + −− o54318 + − − cast [7] + −− o1609,o1672,o1673,o8299,o8300,....

    [...]

  • ...Currently, many algorithms [2,1,11,3] have dealt with several of these challenges and some of them are integrated into the ProM framework [12]....

    [...]

  • ...Number of global frequent patterns discovered by varying k in [1,20]...

    [...]

01 Jan 2008

480 citations


"A Grid-Based Multi-relational Appro..." refers methods in this paper

  • ...In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6] that is based on a breadth-first search in the lattice of patterns spanned by θ-subsumption [8] generality order ( θ)....

    [...]