Book ChapterDOI

A Grid-Based Multi-relational Approach to Process Mining

01 Sep 2008-Vol. 5181, pp 701-709
TL;DR: This paper investigates the use of a multi-level relational frequent pattern discovery method as a means of process mining, resorting to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation over several nodes of a Grid platform.
Abstract: Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant events are registered in massive logs and process mining techniques are used to automatically discover knowledge that reveals the execution and organization of the process instances (cases). In this paper, we investigate the use of a multi-level relational frequent pattern discovery method as a means of process mining. In order to process such massive logs we resort to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation on several nodes of a Grid platform. Experiments are performed on real event logs.

Summary (3 min read)

1 Introduction

  • Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs.
  • Process mining poses several challenges to the traditional data mining tasks.
  • This (relational) data representation makes it necessary to distinguish between the reference objects of analysis (cases) and other task-relevant objects (activities and performers), and to represent their interactions.
  • The authors present G-SPADA, an extension of SPADA, which discovers approximate multi-level relational frequent patterns by distributing exact computation of locally frequent multi-level relational patterns on a computational Grid and then by post-processing local patterns in order to approximate the set of the globally frequent patterns as well as their supports.

2 Multi-level Relational Frequent Pattern Discovery

  • By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
  • P2 provides better insight than P1 into the nature of B, C and D. In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (≥θ).

3 G-SPADA

  • Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently.
  • Each partition includes a subset of the reference objects and the set of task-relevant objects.
  • In the second step, the frequent pattern computation is parallelized and distributed on n nodes of a Grid platform, one node for each partition.
  • In the third step, G-SPADA approximates the set of globally frequent patterns by merging patterns discovered at the nodes.
  • A merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent.

3.1 Relational Data Partitioning

  • G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (D_E) with the information that is implicit in the domain knowledge (D_I).
  • By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
  • These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand.

3.2 Distributing Computation on Grid

  • Each dataset partition is shipped along with the G-SPADA pattern discovery algorithm to computation nodes on the Grid using the gLite middleware (http://glite.web.cern.ch/glite/), a framework for building Grid applications.
  • This is done by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface).

3.3 Computing Approximate Global Frequent Patterns

  • The n sets of local frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns.
  • Local patterns occurring in less than k partitions are filtered out.
  • The global frequent patterns obtained following this merge procedure approximate the original frequent patterns which could possibly be mined on the entire dataset.
  • In a global pattern annotated with [7, 72.5%], 7 means that the pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support obtained by averaging the support values computed on the 7 samples.

4 Experimental Results

  • Experiments are performed by processing event logs provided by THINK3 Inc. in the context of the TOCAI.It project.
  • THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes.
  • G-SPADA is run on the deductive database that is obtained by boiling down the event logs from January 1st to February 28th, 2006 and considering as domain knowledge the definition of the “before” predicate.
  • In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from traced business processes.
  • These patterns capture the possible relation between the order of activities and the properties of their performers.

4.1 Data Description

  • Data trace the behavior of 21,256 instances of a business process recorded in the period under analysis.
  • This corresponds to modeling activities and performers by means of three-level hierarchies.
  • For each activity, a text description of the operation is registered in the event logs.
  • The right part is a characterization of the description of the operation provided in the left part.
  • Finally, each performer is described by the belonging group.

4.2 Local and Global Multi-level Relational Patterns Discovery

  • G-SPADA is run on the event logs including 395,404 ground predicates.
  • Indeed, SPADA generates a memory exception when running on the entire dataset.
  • Multi-level relational patterns are discovered at each node with minsup[l] = 0.2 (l = 1, 2) and max_len_path = 9. Finally, for each level of granularity, global patterns are approximated from the local ones by varying k between 1 and 20.
  • Global patterns provide a compact description of the instances of process traced in the logs.
  • Finally, the relational pattern P4: case(A), activity(A,B), before(B,C), before(C,D), is_a(B,namemaker), is_a(C,workflow), is_a(D,workflow), descleft(C,creation), descleft(D,wip2k) [k=16, avgSup=21.35%] describes the execution order among three sequential activities, namely B, C and D. B is a namemaker activity, while C and D are workflow activities.

5 Conclusions

  • The authors present G-SPADA, an extension of the system SPADA, to discover approximate multi-level relational frequent patterns in the context of process mining.
  • G-SPADA exploits a multi-relational approach in order to deal with both the multiple nature of data stored in event logs and their temporal autocorrelation.
  • G-SPADA faces the need of processing massive logs by resorting to a Grid-based architecture.
  • Experiments on the real event logs allow us to discover interpretable patterns which capture regularities in the execution of activities and the characteristics of the performers of a business process.
  • Such patterns can be used to deploy new systems supporting the execution of business processes or analyzing and improving already enacted business processes.


A Grid-Based Multi-relational Approach to
Process Mining
Antonio Turi, Annalisa Appice, Michelangelo Ceci, and Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari
via Orabona, 4 - 70126 Bari - Italy
{turi,appice,ceci,malerba}@di.uniba.it
Abstract. Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant events are registered in massive logs and process mining techniques are used to automatically discover knowledge that reveals the execution and organization of the process instances (cases). In this paper, we investigate the use of a multi-level relational frequent pattern discovery method as a means of process mining. In order to process such massive logs we resort to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation on several nodes of a Grid platform. Experiments are performed on real event logs.
1 Introduction
Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs. Events are described in a structured form that includes properties of cases and activities. A case represents the process instance which is being handled, while an activity represents the operation on the case. Information on the timestamp and on the person executing the event (performer) is available in the logs. Both activities and performers may belong to different categories. Event logs are stored in multi-terabyte warehouses and sophisticated data mining techniques are required to process this huge amount of data and extract knowledge concerning the execution and organization of the recorded processes. This huge amount of data is the main concern of research in process mining, whose aim is to discover a description or prediction of real process, control, organizational, and social structures [10].
Process mining poses several challenges to the traditional data mining tasks. In fact, data stored in event logs describe objects of different types (cases, activities and performers) which are naturally modeled as several relational data tables, one for each object type. Foreign key constraints express the relations between these objects. This (relational) data representation makes it necessary to distinguish between the reference objects of analysis (cases) and other task-relevant objects (activities and performers), and to represent their interactions. Another challenge is represented by the temporal autocorrelation. Events are temporally related according to a timestamp.
This means that the effect of a property at any event may not be limited to the specific event. Furthermore, activities and performers are generally organized in hierarchies of categories (e.g. the performer of operations on a text file can be a writer or a reader). By descending or ascending through a hierarchy, it is possible to view the same object at different levels of abstraction (or granularity). Finally, reasoning is the process by which information about objects and their relations (e.g. operator of indirect successor) is used to arrive at valid conclusions regarding the object relations [7]. This source of knowledge cannot be ignored in the search.
Currently, many algorithms [2,1,11,3] have dealt with several of these challenges and some of them are integrated into the ProM framework [12]. Anyway, to the best of our knowledge, methods of process mining neither support a multi-level analysis nor use inferential mechanisms defined within a reasoning theory. Conversely, the multi-relational data mining method SPADA [5] offers a sufficiently complete solution to all the challenges posed by the process mining tasks in the descriptive case. However, SPADA is not applicable in practice. Indeed, frequent pattern discovery is a very complex task, particularly in the multi-relational case [5]. In addition SPADA, similarly to most of the multi-relational data mining algorithms, operates with data in main memory, hence it is not appropriate for processing massive logs. The advantages of the multi-relational approach in facing the challenges of process mining justify the attempt of resorting to the computational power of distributed high-performance environments (e.g., computational Grids [4]) to mitigate the complexity of relational frequent pattern discovery on massive event logs.
In this paper, we present G-SPADA, an extension of SPADA, which discovers approximate multi-level relational frequent patterns by distributing the exact computation of locally frequent multi-level relational patterns on a computational Grid and then by post-processing the local patterns in order to approximate the set of the globally frequent patterns as well as their supports. Distributing relational frequent pattern discovery on a Grid poses several issues. Firstly, relational data must be divided into data subsets and each subset has to be distributed on the Grid. The split must take into account the relational structure of the data, that is, each data split must include a subset of reference objects and the task-relevant objects needed to reconstruct all interactions between them. Secondly, a framework is necessary for building Grid applications that exploit the power of distributed computation and storage resources across the Internet. Finally, processing local patterns to approximate global ones requires a way of combining distinct sets of patterns into a single one and obtaining an estimate of the global support.
2 Multi-level Relational Frequent Pattern Discovery
The multi-level relational pattern discovery task is formally defined as follows:
Given: a set S of reference objects; some sets R_k, 1 ≤ k ≤ m, of task-relevant objects; a background knowledge BK which includes hierarchies H_k on the objects in R_k and domain knowledge in the form of rules; a deductive database D that is formed by an extensional part (D_E), where properties and relations of reference objects and task-relevant objects are expressed in derived ground predicates, and an intensional part (D_I), where the domain knowledge in BK is expressed in the form of rules; M granularity levels in the descriptions (1 for the highest); a set of granularity assignments ψ_k which associate each object in H_k with a granularity level, in order to deal with several hierarchies at once; a threshold minsup[l] for each granularity level l (1 ≤ l ≤ M).
Find: for each granularity level l, the frequent (i.e., with support greater than minsup[l]) relational patterns which involve properties and relations of task-relevant objects at level l of H_k.
The relational formalization of the task of frequent pattern discovery is based on the idea that each unit of analysis (or example) D[s] includes a reference object s ∈ S and all the task-relevant objects of R_k which are (directly or indirectly) related to s according to some foreign key path in D. The frequency (support) of a pattern is based on the number of units of analysis, i.e., reference objects, covered by the pattern. An example of relational pattern is:
Example 1. Let D_E be an extensional database of event logs. A possible relational pattern P1 on D is in the form:
P1: case(A), activity(A,B), is_a(B,activity), before(B,C), is_a(C,activity), description(C,workinprogress), user(B,D), is_a(D,performer) [72.5%]
P1 expresses the fact that a process A is formed by two sequential activities, namely B and C, and that the performer of B is generic. The support is 72.5%.
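To make the notion of support concrete, the following minimal Python sketch (not part of SPADA; the unit construction and the coverage test covers(pattern, unit) are assumed to be given) computes the fraction of units of analysis covered by a pattern and filters candidate patterns against minsup[l]:

    # Illustrative sketch only: `units` holds one unit of analysis D[s] per
    # reference object s, and `covers(pattern, unit)` is an assumed coverage test.
    def support(pattern, units, covers):
        """Fraction of units of analysis (reference objects) covered by the pattern."""
        return sum(1 for u in units if covers(pattern, u)) / len(units)

    def frequent_patterns(candidates, units, covers, minsup):
        """Keep the candidate patterns whose support reaches minsup[l] at the current level l."""
        result = []
        for p in candidates:
            s = support(p, units, covers)
            if s >= minsup:
                result.append((p, s))
        return result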
By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
Example 2. Let us consider two-level hierarchies defined on performers and activities as follows: administrator and user are kinds of performer; namemaker, delete and workflow are kinds of activity. P2 is a finer-grained relational pattern than P1, obtained by descending one level in the hierarchies. P2 is in the form:
P2: case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress), is_a(D,administrator) [62.5%]
P2 provides better insight than P1 into the nature of B, C and D.
In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (≥θ).
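The following sketch (illustrative only; the dictionaries and the helper name are ours, not SPADA's) shows how the two-level hierarchies of Example 2 let the same task-relevant object be viewed at different granularity levels, e.g. turning the level-2 atom is_a(B,namemaker) into the level-1 atom is_a(B,activity):

    # Child-to-parent maps encode the hierarchies of Example 2 (level 2 -> level 1).
    ACTIVITY_HIERARCHY = {"namemaker": "activity", "delete": "activity", "workflow": "activity"}
    PERFORMER_HIERARCHY = {"administrator": "performer", "user": "performer"}

    def at_level(category, hierarchy, target_level, leaf_level=2):
        """Climb from a leaf category up to the requested granularity level (1 = highest)."""
        level = leaf_level
        while level > target_level and category in hierarchy:
            category = hierarchy[category]
            level -= 1
        return category

    print(at_level("namemaker", ACTIVITY_HIERARCHY, target_level=1))       # -> activity
    print(at_level("administrator", PERFORMER_HIERARCHY, target_level=2))  # -> administrator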
3 G-SPADA
Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently. It approximates the multi-level relational frequent pattern discovery by means of a three-step strategy. In the first step, the set of original N reference objects is partitioned into n approximately equally-sized subsets (n << N). Each partition includes a subset of the reference objects and the set of task-relevant objects. In the second step, the frequent pattern computation is parallelized and distributed on n nodes of a Grid platform, one node for each partition. In this way, G-SPADA generates n parallel executions of SPADA at the same time and retrieves local patterns which are frequent in at least one of the data partitions. In the third step, G-SPADA approximates the set of globally frequent patterns by merging the patterns discovered at the nodes. The basic idea in approximating the global patterns is that each globally frequent pattern must be locally frequent in at least k partitions of the original dataset. In the case k is set to 1, this guarantees that the union of all local solutions is a superset of the global solution. However, a merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent. Hence, the value of k should be adequately tuned between 1 and n in order to find the best trade-off between false positive and false negative frequent patterns. The merge step also attempts to approximate the support values of the global patterns starting from the local support values.
3.1 Relational Data Partitioning
G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (D_E) with the information that is implicit in the domain knowledge (D_I). An example of this saturation step is:
Example 3. Let us consider the deductive database:
case(c1). case(c2). activity(c1,a1). activity(c1,a2). activity(c1,a3). activity(c2,a4). time(a1,10). time(a2,25). time(a3,29). time(a4,13). description(a1,create). ...
before(A1,A2) :- activity(C,A1), activity(C,A2), time(A1,T1), A1 \= A2, time(A2,T2), T1 < T2, not(activity(C,A), A \= A1, A \= A2, time(A,T), T1 < T, T < T2).
By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
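As an illustration of what the saturation step computes, the sketch below derives the same before/2 facts from the ground facts of Example 3 in plain Python (the data layout is ours for readability; G-SPADA actually evaluates the intensional rule over the deductive database):

    from collections import defaultdict

    # Ground facts of Example 3: activity(Case, Act) and time(Act, Timestamp).
    activity = [("c1", "a1"), ("c1", "a2"), ("c1", "a3"), ("c2", "a4")]
    time = {"a1": 10, "a2": 25, "a3": 29, "a4": 13}

    def saturate_before(activity, time):
        """before(A1,A2) holds when A2 is the immediate successor of A1 in the same case."""
        per_case = defaultdict(list)
        for case, act in activity:
            per_case[case].append(act)
        before = []
        for acts in per_case.values():
            acts.sort(key=lambda a: time[a])       # order the case's activities by timestamp
            before.extend(zip(acts, acts[1:]))     # keep consecutive pairs only
        return before

    print(saturate_before(activity, time))   # -> [('a1', 'a2'), ('a2', 'a3')]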
Saturation precedes data partitioning. In this way, redundant inferences are prevented for properties and relations of task-relevant objects shared by two or more reference objects belonging to different data partitions.
Data partitioning is performed by randomly splitting the set of reference objects into n approximately equal-sized partitions such that the union of the partitions is the entire set of reference objects. These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand. Subsequently, properties and relations of task-relevant objects related to the reference objects according to some foreign key path are also added to the partition.
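A minimal sketch of this partitioning step is given below; the fact layout and the function names are assumptions made for illustration, not G-SPADA's actual data structures:

    import random

    def split_reference_objects(cases, n, seed=0):
        """Randomly split the reference objects (cases) into n roughly equal-sized subsets."""
        cases = list(cases)
        random.Random(seed).shuffle(cases)
        return [cases[i::n] for i in range(n)]

    def build_partition(case_subset, activity_facts, task_relevant_facts):
        """Enrich a subset of cases with their ground facts and the reachable task-relevant objects.

        activity_facts: list of (case, activity) pairs (the foreign key path);
        task_relevant_facts: dict mapping an activity to the ground facts about it."""
        case_subset = set(case_subset)
        acts = [(c, a) for (c, a) in activity_facts if c in case_subset]
        related = {a: task_relevant_facts.get(a, []) for (_, a) in acts}
        return {"cases": case_subset, "activity": acts, "task_relevant": related}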
3.2 Distributing Computation on Grid
Each dataset partition is shipped along with the G-SPADA pattern discovery algorithm to computation nodes on the Grid using the gLite middleware (http://glite.web.cern.ch/glite/), a next-generation middleware for Grid computing which provides a framework for building Grid applications. This is done by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface). Submission of jobs on the Grid is divided into several steps: (i) authenticate on a UI (user interface) through a PKI-based authentication system with proxy credentials (GSI); (ii) prepare the jobs (JDL, shell script to automate the procedure, input files); (iii) upload (stage-in) the datasets; (iv) submit the corresponding parametric job; (v) check/wait for results; (vi) finally, once the job is executed on the Grid, retrieve the output (stage-out) files containing the frequent pattern sets along with their supports for each sample.
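For illustration only, the sketch below generates a parametric JDL description for the per-partition runs. The JDL attribute names follow the gLite WMS parametric-job syntax as commonly documented, and the file names (run_spada.sh, partition__PARAM_.db) are invented, so the snippet should be read as an assumption rather than the exact job description used by the authors:

    def make_parametric_jdl(n_partitions, minsup, max_len_path):
        """Build a parametric job description: _PARAM_ is expanded to 0..n_partitions-1."""
        return f"""[
      JobType        = "Parametric";
      Parameters     = {n_partitions};
      ParameterStart = 0;
      ParameterStep  = 1;
      Executable     = "run_spada.sh";
      Arguments      = "partition__PARAM_.db {minsup} {max_len_path}";
      InputSandbox   = {{"run_spada.sh", "spada", "partition__PARAM_.db"}};
      OutputSandbox  = {{"patterns__PARAM_.out"}};
    ]"""

    with open("gspada_jobs.jdl", "w") as jdl_file:
        jdl_file.write(make_parametric_jdl(n_partitions=20, minsup=0.2, max_len_path=9))
    # The file is then submitted from the UI with the gLite command-line tools after
    # creating a proxy credential, and the stage-out files are retrieved on completion.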
3.3 Computing Approximate Global Frequent Patterns
The n sets of local frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns. For each local pattern discovered in at least k data partitions (1 ≤ k ≤ n), G-SPADA derives an approximation of the global support by averaging the support values collected on the partitions where the pattern is found to be frequent. The check that the same local pattern occurs in different partitions is based on an equivalence test between two patterns under θ-subsumption, which corresponds to performing a double θ-subsumption test (P ≥θ Q and Q ≥θ P). Local patterns occurring in less than k partitions are filtered out. The global frequent patterns obtained following this merge procedure approximate the original frequent patterns which could possibly be mined on the entire dataset.
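A compact, self-contained sketch of such a double θ-subsumption test is given below. It is illustrative only: patterns are encoded as lists of atom tuples and strings starting with an uppercase letter are treated as variables, which is an assumption about the representation, not SPADA's internal one.

    def is_var(term):
        return isinstance(term, str) and term[:1].isupper()

    def theta_subsumes(p, q, subst=None):
        """True if some substitution θ of p's variables makes every atom of pθ occur in q."""
        subst = dict(subst or {})
        if not p:
            return True
        first, rest = p[0], p[1:]
        for atom in q:
            if atom[0] != first[0] or len(atom) != len(first):
                continue                       # predicate symbol or arity mismatch
            trial, ok = dict(subst), True
            for t_p, t_q in zip(first[1:], atom[1:]):
                if is_var(t_p):
                    if trial.setdefault(t_p, t_q) != t_q:
                        ok = False             # variable already bound to a different term
                        break
                elif t_p != t_q:
                    ok = False                 # constants must match exactly
                    break
            if ok and theta_subsumes(rest, q, trial):
                return True
        return False

    def equivalent(p, q):
        """Double θ-subsumption test: P θ-subsumes Q and Q θ-subsumes P."""
        return theta_subsumes(p, q) and theta_subsumes(q, p)

    p = [("case", "A"), ("activity", "A", "B"), ("is_a", "B", "workflow")]
    q = [("case", "X"), ("activity", "X", "Y"), ("is_a", "Y", "workflow")]
    print(equivalent(p, q))   # -> True (the patterns differ only by variable renaming)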
An example of approximate global process pattern is:
case(A), activity(A,B), is_a(B,namemaker), before(B,C), is_a(C,workflow), description(C,workinprogress) [7, 72.5%]
which describes the order of execution between two activities, namely B and C, in the process A. B is a namemaker activity while C is a workflow activity. In addition, C is described as work in progress. The value 7 means that this pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support obtained by averaging the support values computed on the 7 samples.
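The merge step can be sketched as follows (illustrative only; the pattern representation is the one of the previous sketch, and a plain equality test stands in for the θ-subsumption check so that the snippet runs on its own):

    def merge_local_patterns(local_results, k, same=None):
        """local_results: one list of (pattern, local_support) pairs per Grid node.

        Patterns found in fewer than k partitions are filtered out; the global support
        is approximated by the macro average over the partitions where they are frequent."""
        # The paper identifies identical patterns via a double θ-subsumption test
        # (e.g. the equivalent() function sketched above); plain equality is the fallback here.
        same = same or (lambda a, b: a == b)
        merged = []                                  # [representative_pattern, [supports]]
        for node_patterns in local_results:
            for pattern, sup in node_patterns:
                for entry in merged:
                    if same(entry[0], pattern):
                        entry[1].append(sup)
                        break
                else:
                    merged.append([pattern, [sup]])
        return [(pat, len(sups), sum(sups) / len(sups))
                for pat, sups in merged if len(sups) >= k]

    # A returned triple such as (pattern, 7, 0.725) mirrors the annotation [7, 72.5%]
    # in the example above: frequent in 7 partitions with macro-average support 72.5%.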
4 Experimental Results
Experiments are performed by processing event logs provided by THINK3 Inc. (http://www.think3.com/en/default.aspx) in the context of the TOCAI.It project (http://www.dis.uniroma1.it/tocai/index.php). THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes. G-SPADA is run on the deductive database that is obtained by boiling down the event logs from January 1st to February 28th, 2006 and considering as domain knowledge the definition of the “before” predicate. In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from the traced business processes. These patterns capture the possible relation between the order of activities and the properties of their performers.

Citations
More filters
Journal ArticleDOI
TL;DR: The development of maritime traffic research in pattern mining and traffic forecasting affirms the importance of advanced maritime traffic studies and the great potential in maritime traffic safety and intelligence enhancement to accommodate the implementation of the Internet of Things, artificial intelligence technologies, and knowledge engineering and big data computing solution.
Abstract: Maritime traffic service networks and information systems play a vital role in maritime traffic safety management. The data collected from the maritime traffic networks are essential for the perception of traffic dynamics and predictive traffic regulation. This paper is devoted to surveying the key processing components in maritime traffic networks. Specifically, the latest progress on maritime traffic data mining technologies for maritime traffic pattern extraction and the recent effort on vessels’ motion forecasting for better situation awareness are reviewed. Through the review, we highlight that the traffic pattern knowledge presents valued insights for wide-spectrum domain application purposes, and serves as a prerequisite for the knowledge based forecasting techniques that are growing in popularity. The development of maritime traffic research in pattern mining and traffic forecasting reviewed in this paper affirms the importance of advanced maritime traffic studies and the great potential in maritime traffic safety and intelligence enhancement to accommodate the implementation of the Internet of Things, artificial intelligence technologies, and knowledge engineering and big data computing solution.

105 citations


Cites methods from "A Grid-Based Multi-relational Appro..."

  • ...SPADA [80]–[82] has been applied to discover associations between a vessel and a trajectory to represent navigation spatio-temporal pattern....

    [...]

Book ChapterDOI
01 Jan 2012
TL;DR: This paper proposes the FIT-metric as a tool to characterize the stability of existing service configurations based on three components: functionality, integration and traffic and applies it to configurations taken from a production-strength SOA-landscape.
Abstract: The paradigm of service-oriented architectures (SOA) is by now accepted for application integration and in widespread use. As an underlying key-technology of cloud computing and because of unresolved issues during operation and maintenance it remains a hot topic. SOA encapsulates business functionality in services, combining aspects from both the business and infrastructure level. The reuse of services results in hidden chains of dependencies that affect governance and optimization of service-based systems. To guarantee the cost-effective availability of the whole service-based application landscape, the real criticality of each dependency has to be determined for IT Service Management (ITSM) to act accordingly. We propose the FIT-metric as a tool to characterize the stability of existing service configurations based on three components: functionality, integration and traffic. In this paper we describe the design of FIT and apply it to configurations taken from a production-strength SOA-landscape. A prototype of FIT is currently being implemented at Deutsche Post MAIL.

7 citations

References
More filters
Journal ArticleDOI
TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Abstract: One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.

952 citations


"A Grid-Based Multi-relational Appro..." refers methods in this paper

  • ...In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6] that is based on a breadth-first search in the lattice of patterns spanned by θ-subsumption [8] generality order ( θ)....

    [...]

Journal ArticleDOI
TL;DR: This paper describes the application of process mining in one of the provincial offices of the Dutch National Public Works Department, responsible for the construction and maintenance of the road and water infrastructure.

804 citations


"A Grid-Based Multi-relational Appro..." refers background in this paper

  • ...This huge amount of data is the main concern of research in process mining whose aim is to discover a description or prediction of real process, control, organizational, and social structures [10]....

    [...]

Journal Article
TL;DR: In this paper, the authors present an approach for a system that constructs process models from logs of past, unstructured executions of the given process, which conforms to the dependencies and put executions present in the log.
Abstract: Modern enterprises increasingly use the workflow paradigm to prescribe how business processes should be performed. Processes are typically modeled as annotated activity graphs. We present an approach for a system that constructs process models from logs of past, unstructured executions of the given process, which conforms to the dependencies and past executions present in the log. By providing models that capture the previous executions of the process, this technique allows easier introduction of a workflow system and evaluation and evolution of existing process models. We also present results from applying the algorithm to synthetic data sets as well as process logs obtained from an IBM Flowmark installation.

784 citations

Book ChapterDOI
23 Mar 1998
TL;DR: This work presents an approach for a system that constructs process models from logs of past, unstructured executions of the given process, and presents results from applying the algorithm to synthetic data sets as well as process logs obtained from an IBM Flowmark installation.
Abstract: Modern enterprises increasingly use the workflow paradigm to prescribe how business processes should be performed. Processes are typically modeled as annotated activity graphs. We present an approach for a system that constructs process models from logs of past, unstructured executions of the given process. The graph so produced conforms to the dependencies and past executions present in the log. By providing models that capture the previous executions of the process, this technique allows easier introduction of a workflow system and evaluation and evolution of existing process models. We also present results from applying the algorithm to synthetic data sets as well as process logs obtained from an IBM Flowmark installation.

742 citations


"A Grid-Based Multi-relational Appro..." refers background in this paper

  • ...+ − − prpModify [1] | + −− o54318 + − − cast [7] + −− o1609,o1672,o1673,o8299,o8300,....

    [...]

  • ...Currently, many algorithms [2,1,11,3] have dealt with several of these challenges and some of them are integrated into the ProM framework [12]....

    [...]

  • ...Number of global frequent patterns discovered by varying k in [1,20]...

    [...]

01 Jan 2008

480 citations


"A Grid-Based Multi-relational Appro..." refers methods in this paper

  • ...In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6] that is based on a breadth-first search in the lattice of patterns spanned by θ-subsumption [8] generality order ( θ)....

    [...]