A Grid-Based Multi-relational Approach to
Process Mining
Antonio Turi, Annalisa Appice, Michelangelo Ceci, and Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari
via Orabona, 4 - 70126 Bari - Italy
{turi,appice,ceci,malerba}@di.uniba.it
Abstract. Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant
events are registered in massive logs and process mining techniques are
used to automatically discover knowledge that reveals the execution and
organization of the process instances (cases). In this paper, we investigate
the use of a multi-level relational frequent pattern discovery method as a
means of process mining. In order to process such massive logs we resort
to a Grid-based implementation of the knowledge discovery algorithm
that distributes the computation on several nodes of a Grid platform.
Experiments are performed on real event logs.
1 Introduction
Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs. Events are described in a structured form that includes properties of cases and activities. A case represents the process instance which is being handled, while an activity represents an operation on the case. Information on the timestamp and on the person executing the event (the performer) is available in the logs. Both activities and performers may belong to different categories. Event logs are stored in multi-terabyte warehouses, and sophisticated data mining techniques are required to process this huge amount of data and extract knowledge concerning the execution and organization of the recorded processes. Such data are the main concern of research in process mining, whose aim is to discover a description or prediction of real process, control, organizational, and social structures [10].
Process mining poses several challenges to traditional data mining tasks. In fact, data stored in event logs describe objects of different types (cases, activities and performers), which are naturally modeled as several relational data tables, one for each object type. Foreign key constraints express the relations between these objects. This (relational) data representation makes it necessary to distinguish between the reference objects of analysis (cases) and the other task-relevant objects (activities and performers), and to represent their interactions.
Another challenge is represented by the temporal autocorrelation. Events are
temporally related according to a timestamp. This means that the effect of a
S.S. Bhowmick, J. Küng, and R. Wagner (Eds.): DEXA 2008, LNCS 5181, pp. 701–709, 2008.
© Springer-Verlag Berlin Heidelberg 2008

property at any event may not be limited to that specific event. Furthermore, activities and performers are generally organized in hierarchies of categories (e.g. the performer of operations on a text file can be a writer or a reader). By descending or ascending through a hierarchy, it is possible to view the same object at different levels of abstraction (or granularity). Finally, reasoning is the process by which information about objects and their relations (e.g. the operator of an indirect successor) is used to arrive at valid conclusions regarding the object relations [7]. This source of knowledge cannot be ignored in the search.
Currently, many algorithms [2,1,11,3] have dealt with several of these challenges, and some of them are integrated into the ProM framework [12]. However, to the best of our knowledge, methods of process mining neither support a multi-level analysis nor use inferential mechanisms defined within a reasoning theory. Conversely, the multi-relational data mining method SPADA [5] offers a sufficiently complete solution to all the challenges posed by process mining tasks in the descriptive case. However, SPADA is not applicable in practice. Indeed, frequent pattern discovery is a very complex task, particularly in the multi-relational case [5]. In addition, SPADA, like most multi-relational data mining algorithms, operates with data in main memory, hence it is not appropriate for processing massive logs. The advantages of the multi-relational approach in facing the challenges of process mining justify the attempt to resort to the computational power of distributed high-performance environments (e.g., computational Grids [4]) to mitigate the complexity of relational frequent pattern discovery on massive event logs.
In this paper, we present G-SPADA, an extension of SPADA which discovers approximate multi-level relational frequent patterns by distributing the exact computation of locally frequent multi-level relational patterns on a computational Grid and then post-processing the local patterns in order to approximate the set of globally frequent patterns as well as their supports. Distributing relational frequent pattern discovery on a Grid poses several issues. Firstly, the relational data must be divided into subsets and each subset has to be distributed on the Grid. The split must take the relational structure of the data into account, that is, each data split must include a subset of the reference objects together with the task-relevant objects needed to reconstruct all interactions between them. Secondly, a framework is necessary for building Grid applications that utilize the power of distributed computation and storage resources across the Internet. Finally, processing local patterns to approximate global ones requires a way of combining distinct sets of patterns into a single one and obtaining an estimate of the global support.
2 Multi-level Relational Frequent Pattern Discovery
The multi-level relational pattern discovery task is formally defined as follows.
Given: a set S of reference objects; some sets R_k, 1 ≤ k ≤ m, of task-relevant objects; a background knowledge BK which includes hierarchies H_k on the objects in R_k and domain knowledge in the form of rules; a deductive database D formed by an extensional part (D_E), where properties and relations of reference objects and task-relevant objects are expressed in derived ground predicates, and an intensional part (D_I), where the domain knowledge in BK is expressed in the form of rules; M granularity levels in the descriptions (1 for the highest); a set of granularity assignments ψ_k which associate each object in H_k with a granularity level, in order to deal with several hierarchies at once; a threshold minsup[l] for each granularity level l (1 ≤ l ≤ M).
Find: for each granularity level l, the frequent¹ relational patterns which involve properties and relations of task-relevant objects at level l of H_k.
The relational formalization of the frequent pattern discovery task is based on the idea that each unit of analysis (or example) D[s] includes a reference object s ∈ S and all the task-relevant objects of R_k which are (directly or indirectly) related to s according to some foreign key path in D. The frequency (support) of a pattern is based on the number of units of analysis, i.e., reference objects, covered by the pattern. An example of relational pattern is:
Example 1. Let D_E be an extensional database of event logs. A possible relational pattern P1 on D has the form:
P1: case(A), activity(A,B), is_a(B,activity), before(B,C), is_a(C,activity), description(C,workinprogress), user(B,D), is_a(D,performer) [72.5%]
P1 expresses the fact that a process A is formed by two sequential activities, namely B and C, and the performer of B is generic. The support is 72.5%.
By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
Example 2. Let us consider the two-level hierarchies defined on performers and activities in the following: administrator and user are performers; namemaker, delete and workflow are activities.
P2 is a finer-grained relational pattern than P1, obtained by descending one level in the hierarchies. P2 has the form:
P2: case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress), is_a(D,administrator) [62.5%]
P2 provides better insight than P1 into the nature of B, C and D.
In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (≽_θ).
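As a toy illustration of the support notion used above (this is not SPADA's internal representation; all names below are hypothetical), a unit of analysis can be modeled as a case plus the facts reachable from it, with support computed as the fraction of units a pattern covers:

```python
# A unit of analysis D[s]: one reference object (a case) together with the
# ground facts about the task-relevant objects reachable from it.
def support(covers, units):
    """Fraction of units of analysis covered by a pattern."""
    return sum(1 for u in units if covers(u)) / len(units)

units = [
    {"case": "c1", "facts": {("is_a", "a1", "namemaker"), ("before", "a1", "a2")}},
    {"case": "c2", "facts": {("is_a", "a4", "workflow")}},
]

# Hand-written coverage test for "the case contains a namemaker activity".
def covers_namemaker(unit):
    return any(f[0] == "is_a" and f[2] == "namemaker" for f in unit["facts"])

print(support(covers_namemaker, units))  # 0.5
```

In the real setting the coverage test is a θ-subsumption check of the pattern against the unit, not a hand-written predicate.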
3 G-SPADA
Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently. It approximates multi-level relational frequent pattern discovery by means of a three-step strategy. In the first step, the set of the original N reference objects is partitioned into n approximately equally-sized subsets (n ≪ N). Each partition includes a subset of the reference objects and the set of task-relevant objects. In the second step, the frequent pattern
¹ With support greater than minsup[l].

computation is parallelized and distributed on n nodes of a Grid platform, one node per partition. In this way, G-SPADA generates n parallel executions of SPADA at the same time and retrieves the local patterns which are frequent in at least one of the data partitions. In the third step, G-SPADA approximates the set of globally frequent patterns by merging the patterns discovered at the nodes.
The basic idea in approximating the global patterns is that each globally frequent pattern must be locally frequent in at least k partitions of the original dataset. When k is set to 1, this guarantees that the union of all local solutions is a superset of the global solution. However, a merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent. Hence, the value of k should be adequately tuned between 1 and n in order to find the best trade-off between false positive and false negative frequent patterns. The merge step also attempts to approximate the support values of the global patterns starting from the local support values.
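The merge step just described can be sketched as follows; this is a simplified illustration in which patterns are plain strings, whereas G-SPADA recognizes the same pattern across partitions under θ-subsumption:

```python
from collections import defaultdict

def merge_local_patterns(local_results, k):
    """Approximate the globally frequent patterns from n local result sets.

    local_results: one dict {pattern: local_support} per partition. A pattern
    is kept when it is locally frequent in at least k partitions, and its
    global support is approximated by averaging over those partitions.
    """
    supports = defaultdict(list)
    for result in local_results:
        for pattern, sup in result.items():
            supports[pattern].append(sup)
    return {p: sum(s) / len(s) for p, s in supports.items() if len(s) >= k}

# p1 is frequent in 3 partitions, p2 and p3 in only one each.
local_results = [{"p1": 0.5, "p2": 0.6}, {"p1": 1.0}, {"p1": 0.75, "p3": 0.9}]
print(merge_local_patterns(local_results, k=2))  # {'p1': 0.75}
```

With k = 2 the one-partition patterns are filtered out; with k = 1 the union of all local solutions would be returned.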
3.1 Relational Data Partitioning
G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (D_E) with the information that is implicit in the domain knowledge (D_I). An example of this saturation step is:
Example 3. Let us consider the deductive database:
case(c1). case(c2). activity(c1,a1). activity(c1,a2). activity(c1,a3). activity(c2,a4). time(a1,10). time(a2,25). time(a3,29). time(a4,13). description(a1,create). ...
before(A1,A2) :- activity(C,A1), activity(C,A2), A1 \= A2, time(A1,T1), time(A2,T2), T1 < T2, not((activity(C,A), A \= A1, A \= A2, time(A,T), T1 < T, T < T2)).
By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
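A hypothetical helper reproducing this saturation outside the deductive engine might look as follows (assuming activities are grouped by case and totally ordered by their timestamps, as in the rule above):

```python
# Within each case, an activity is "before" another when it ends earlier and
# no third activity of the same case falls strictly between them, i.e. the
# relation holds only between consecutive activities.
def saturate_before(activities, times):
    """activities: {case: [activity, ...]}; times: {activity: timestamp}."""
    before = []
    for case, acts in activities.items():
        ordered = sorted(acts, key=times.get)      # order by timestamp
        before.extend(zip(ordered, ordered[1:]))   # consecutive pairs only
    return before

activities = {"c1": ["a1", "a2", "a3"], "c2": ["a4"]}
times = {"a1": 10, "a2": 25, "a3": 29, "a4": 13}
print(saturate_before(activities, times))  # [('a1', 'a2'), ('a2', 'a3')]
```

The output matches the ground predicates made explicit in Example 3: before(a1,a2) and before(a2,a3); the single activity of case c2 produces no pair.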
Saturation precedes data partitioning. In this way, redundant inferences are prevented for properties and relations of task-relevant objects shared by two or more reference objects belonging to different data partitions.
Data partitioning is performed by randomly splitting the set of reference objects into n approximately equal-sized partitions such that the union of the partitions is the entire set of reference objects. These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand. Subsequently, the properties and relations of the task-relevant objects related to those reference objects according to some foreign key path are also added to the partition.
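The partitioning scheme can be sketched as follows (a simplified illustration: facts are pre-indexed by case here, whereas the real system follows foreign key paths in the deductive database; all names are hypothetical):

```python
import random

def partition_reference_objects(cases, facts, n, seed=0):
    """Randomly split the reference objects (cases) into n roughly equal
    subsets, then enrich each subset with the ground predicates attached
    to its cases. facts maps each case to the predicates describing it
    and its related task-relevant objects."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)      # deterministic for the demo
    splits = [shuffled[i::n] for i in range(n)]
    return [{"cases": split,
             "facts": [f for c in split for f in facts.get(c, [])]}
            for split in splits]

cases = ["c1", "c2", "c3", "c4"]
facts = {"c1": ["activity(c1,a1)", "time(a1,10)"],
         "c2": ["activity(c2,a4)", "time(a4,13)"],
         "c3": ["activity(c3,a5)"],
         "c4": ["activity(c4,a6)"]}
partitions = partition_reference_objects(cases, facts, n=2)
print([sorted(p["cases"]) for p in partitions])
```

Each partition is self-contained: a SPADA run on it sees a complete unit of analysis for every reference object it holds.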
3.2 Distributing Computation on Grid
Each dataset partition is shipped, along with the G-SPADA pattern discovery algorithm, to computation nodes on the Grid using the gLite middleware (gLite, http://glite.web.cern.ch/glite/, is a next-generation middleware for Grid computing which provides a framework for building Grid applications). This is done

by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface). Submission of jobs on the Grid is divided into several steps: (i) authenticate on a UI (user interface) through the PKI-based authentication system with proxy credentials (GSI); (ii) prepare the jobs (JDL, a shell script to automate the procedure, input files); (iii) upload (stage-in) the dataset partitions; (iv) submit the corresponding parametric job; (v) check and wait for the results; (vi) finally, once the job has executed on the Grid, retrieve the output (stage-out) files containing the frequent pattern sets along with their supports for each sample.
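A parametric job description for step (iv) might look like the following JDL fragment. The attribute names follow the gLite WMS JDL conventions (`_PARAM_` is the placeholder the WMS substitutes with each parameter value); the file names and the number of partitions are illustrative, not taken from the paper:

```
[
  JobType        = "Parametric";
  Parameters     = 10;            // one job instance per data partition
  ParameterStart = 1;
  ParameterStep  = 1;
  Executable     = "run_spada.sh";
  Arguments      = "partition__PARAM_.db";
  InputSandbox   = {"run_spada.sh", "partition__PARAM_.db"};
  OutputSandbox  = {"patterns__PARAM_.out", "std.err"};
]
```

The WMS expands this single description into one job per partition, which matches the one-node-per-partition scheme of the second G-SPADA step.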
3.3 Computing Approximate Global Frequent Patterns
The n sets of locally frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns. For each local pattern discovered in at least k data partitions (1 ≤ k ≤ n), G-SPADA derives an approximation of the global support by averaging the support values collected on the partitions where the pattern is found to be frequent. The check that the same local pattern occurs in different partitions is based on an equivalence test between the two patterns under θ-subsumption, which corresponds to performing a double θ-subsumption test (P ≽_θ Q and Q ≽_θ P). Local patterns occurring in fewer than k partitions are filtered out. The global frequent patterns obtained following this merge procedure approximate the frequent patterns which could be mined on the entire dataset.
An example of approximate global process pattern is:
case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress) [7, 72.5%]
which describes the order of execution between two activities, namely B and C, in the process A. B is a name-maker activity while C is a workflow activity. In addition, C is described as work in progress. The value 7 means that this pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support obtained by averaging the support values computed on the 7 samples.
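The double θ-subsumption test used to recognize the same pattern across partitions can be sketched with a brute-force substitution search (a simplified, exponential illustration, not SPADA's implementation; literals are tuples and uppercase strings are variables):

```python
# Literals are tuples (predicate, arg1, ...); uppercase strings are variables.
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def subsumes(p, q, theta=None):
    """True if some substitution maps every literal of pattern p into q."""
    theta = dict(theta or {})
    if not p:
        return True
    head, *rest = p
    for lit in q:
        if lit[0] != head[0] or len(lit) != len(head):
            continue
        trial, ok = dict(theta), True
        for a, b in zip(head[1:], lit[1:]):
            if is_var(a):
                if trial.setdefault(a, b) != b:  # variable already bound elsewhere
                    ok = False
                    break
            elif a != b:                         # constants must match exactly
                ok = False
                break
        if ok and subsumes(rest, q, trial):
            return True
    return False

def equivalent(p, q):
    """Double θ-subsumption test: p and q denote the same pattern."""
    return subsumes(p, q) and subsumes(q, p)

P = [("case", "A"), ("activity", "A", "B"), ("is_a", "B", "namemaker")]
Q = [("case", "X"), ("activity", "X", "Y"), ("is_a", "Y", "namemaker")]
print(equivalent(P, Q))  # True
```

P and Q differ only in variable names, so each θ-subsumes the other and their local supports would be averaged together in the merge step.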
4 Experimental Results
Experiments are performed by processing event logs provided by THINK3 Inc. (http://www.think3.com/en/default.aspx) in the context of the TOCAI.It project (http://www.dis.uniroma1.it/tocai/index.php). THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes. G-SPADA is run on the deductive database obtained by boiling down the event logs from January 1st to February 28th, 2006, and considering as domain knowledge the definition of the "before" predicate. In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from the traced business processes. These patterns capture the possible relations between the order of activities and the properties of their performers.
