A Grid-Based Multi-relational Approach to Process Mining
Summary
1 Introduction
- Many information systems, such as Workflow Management Systems, ERP systems, Business-to-business systems and Firewall systems, trace the behavior of running processes by registering relevant events in massive logs.
- Process mining poses several challenges to the traditional data mining tasks.
- This data representation makes it necessary to distinguish between the reference objects of analysis and other task-relevant objects (activities and performers), and to represent their interactions.
- The authors present G-SPADA, an extension of SPADA, which discovers approximate multi-level relational frequent patterns by distributing exact computation of locally frequent multi-level relational patterns on a computational Grid and then by post-processing local patterns in order to approximate the set of the globally frequent patterns as well as their supports.
2 Multi-level Relational Frequent Pattern Discovery
- By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
- P2 provides better insight than P1 on the nature of B, C and D. In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption [8] generality order (⪰θ).
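The levelwise idea can be illustrated on a much simpler pattern language. The sketch below is not SPADA: it assumes patterns are plain itemsets ordered by set inclusion instead of relational patterns ordered by θ-subsumption, and the function name `levelwise_frequent` is hypothetical. It only shows the breadth-first structure: evaluate one level of candidates, keep those with support greater than the threshold, then refine only the survivors.

```python
from itertools import combinations


def levelwise_frequent(transactions, minsup):
    """Breadth-first (levelwise) search for frequent itemsets.

    Simplifying assumption: patterns are itemsets ordered by set
    inclusion, not relational patterns ordered by theta-subsumption.
    transactions: list of sets; minsup: threshold in [0, 1].
    """
    n = len(transactions)

    def support(pattern):
        return sum(pattern <= t for t in transactions) / n

    # Level 1: the most general non-empty patterns (single items).
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        # Evaluate the candidates of the current level against the data.
        current = {}
        for p in level:
            s = support(p)
            if s > minsup:
                current[p] = s
        frequent.update(current)
        # Refine: join frequent patterns that differ in one item and keep
        # a candidate only if all its immediate generalizations are frequent.
        candidates = set()
        for a, b in combinations(current, 2):
            c = a | b
            if len(c) == len(a) + 1 and all(c - {i} in current for i in c):
                candidates.add(c)
        level = sorted(candidates, key=sorted)
    return frequent
```

Pruning a candidate as soon as one of its generalizations is infrequent is what keeps the breadth-first search tractable: support can only decrease when a pattern is specialized.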
3 G-SPADA
- Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently.
- Each partition includes a subset of the reference objects and the set of task-relevant objects.
- In the second step, the frequent patterns (those with support greater than minsup[l]) are discovered locally on each partition.
- In the third step, G-SPADA approximates the set of globally frequent patterns by merging patterns discovered at the nodes.
- A merge step with k = 1 may generate several false positives, i.e. patterns that are locally frequent but globally infrequent.
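A small numeric illustration of such a false positive follows. The data are invented for the example: each partition is a list of 0/1 flags saying whether a hypothetical pattern covers each reference object, and the threshold value is arbitrary.

```python
# Hypothetical toy data: 1 means the pattern covers that case, 0 means it does not.
partitions = [
    [1, 1, 1, 0],  # local support 0.75
    [0, 0, 0, 0],  # local support 0.00
    [0, 0, 0, 1],  # local support 0.25
]
minsup = 0.5  # arbitrary threshold chosen for the illustration


def local_and_global_support(parts):
    """Return the per-partition supports and the support on the whole dataset."""
    local = [sum(p) / len(p) for p in parts]
    all_cases = [c for p in parts for c in p]
    return local, sum(all_cases) / len(all_cases)


local, global_sup = local_and_global_support(partitions)
# Number of partitions in which the pattern is locally frequent.
n_frequent_partitions = sum(s > minsup for s in local)
# The pattern survives a merge with k = 1 but is globally infrequent.
is_false_positive_at_k1 = n_frequent_partitions >= 1 and not global_sup > minsup
```

Here the pattern is frequent in one partition only, so a merge with k = 1 keeps it even though its support on the full dataset is 4/12; requiring k = 2 would filter it out.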
3.1 Relational Data Partitioning
- G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (DE) with the information that is implicit in the domain knowledge (DI).
- By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
- These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand.
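The following is a minimal sketch of saturation followed by partitioning, under stated assumptions. The source does not give the definition of the `before` predicate, so the code assumes it links activities that are adjacent in time within the same case. The event tuple layout, the round-robin assignment of cases to partitions, and the names `saturate_before` and `partition_cases` are all hypothetical.

```python
from collections import defaultdict


def saturate_before(events):
    """Derive ground facts per case, including before(a, b) facts.

    events: iterable of (case_id, activity_id, timestamp) tuples.
    Assumption: before(a, b) holds for activities that are adjacent in
    time within the same case; the source does not spell out the rule.
    """
    by_case = defaultdict(list)
    for case_id, activity_id, ts in events:
        by_case[case_id].append((ts, activity_id))
    facts = {}
    for case_id, acts in by_case.items():
        ordered = [a for _, a in sorted(acts)]
        case_facts = [("activity", case_id, a) for a in ordered]
        case_facts += [("before", a, b) for a, b in zip(ordered, ordered[1:])]
        facts[case_id] = case_facts
    return facts


def partition_cases(facts, n_partitions):
    """Assign cases round-robin to partitions, carrying their ground facts.

    Round-robin is an assumption made for the sketch; any scheme that
    splits the reference objects into disjoint subsets would fit.
    """
    partitions = [[] for _ in range(n_partitions)]
    for i, case_id in enumerate(sorted(facts)):
        partitions[i % n_partitions].extend(facts[case_id])
    return partitions
```

Keeping all facts of a case in the same partition matters because the support of a relational pattern is counted per reference object, so a case split across partitions could not be matched correctly.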
3.2 Distributing Computation on Grid
- Each dataset partition is shipped along with the G-SPADA pattern discovery algorithm to computation nodes on the Grid using the gLite middleware.
- This is done by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface).
- gLite (http://glite.web.cern.ch/glite/) is a next-generation middleware for Grid computing which provides a framework for building Grid applications.
3.3 Computing Approximate Global Frequent Patterns
- The n sets of local frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns.
- Local patterns occurring in fewer than k partitions are filtered out.
- The global frequent patterns obtained with this merge procedure approximate the frequent patterns that could be mined on the entire dataset.
- In an annotation such as [k=7, avgSup=72.5%], 7 means that the pattern is found in 7 partitions (sample-level support), while 72.5% is the macro-average support obtained by averaging the support values computed on those 7 samples.
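The merge step described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: patterns are represented as arbitrary hashable values (for example strings), each partition's result is a dictionary from pattern to local support, and the function name `merge_local_patterns` is hypothetical.

```python
from collections import defaultdict


def merge_local_patterns(local_results, k):
    """Approximate the global patterns from per-partition results.

    local_results: list of dicts, one per partition, mapping a pattern
    to its local support (only locally frequent patterns are listed).
    Returns {pattern: (n_partitions, macro_avg_support)} for patterns
    found in at least k partitions.
    """
    supports = defaultdict(list)
    for result in local_results:
        for pattern, sup in result.items():
            supports[pattern].append(sup)
    return {
        pattern: (len(sups), sum(sups) / len(sups))
        for pattern, sups in supports.items()
        if len(sups) >= k
    }
```

Calling the function for each k in a range and counting the returned patterns shows how the size of the approximated global set shrinks as k grows. Note that the macro average is taken only over the partitions where the pattern was found, as in the description above.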
4 Experimental Results
- Experiments are performed by processing event logs provided by THINK3 Inc. in the context of the TOCAI.
- THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes.
- G-SPADA is run on the deductive database that is obtained by boiling down the event logs from January 1st to February 28th, 2006 and considering as domain knowledge the definition of the “before” predicate.
- In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from traced business processes.
- These patterns capture the possible relation between the order of activities and the properties of their performers.
4.1 Data Description
- Data trace the behavior of 21,256 instances of a business process recorded in the period under analysis.
- This corresponds to modeling activities and performers by means of three-level hierarchies.
- For each activity, a text description of the operation is registered in the event logs.
- The right part is a characterization of the description of the operation provided in the left part.
- Finally, each performer is described by the group to which they belong.
4.2 Local and Global Multi-level Relational Patterns Discovery
- G-SPADA is run on the event logs including 395,404 ground predicates.
- Indeed, SPADA generates a memory exception when running on the entire dataset.
- Multi-level relational patterns are discovered at each node with minsup[l] = 0.2 (l = 1, 2) and max_len_path = 9. Finally, for each level of granularity, global patterns are approximated from the local ones by varying k between 1 and 20.
- Global patterns provide a compact description of the process instances traced in the logs.
- Finally, the relational pattern P4: case(A), activity(A,B), before(B,C), before(C,D), is_a(B,namemaker), is_a(C,workflow), is_a(D,workflow), descleft(C,creation), descleft(D,wip2k) [k=16, avgSup=21.35%] describes the execution order among three sequential activities, namely B, C and D. B is a namemaker activity, while C and D are workflow activities.
5 Conclusions
- The authors present G-SPADA, an extension of the system SPADA, to discover approximate multi-level relational frequent patterns in the context of process mining.
- G-SPADA exploits a multi-relational approach in order to deal with both the multiple nature of the data stored in event logs and temporal autocorrelation.
- G-SPADA addresses the need to process massive logs by resorting to a Grid-based architecture.
- Experiments on the real event logs allow us to discover interpretable patterns which capture regularities in the execution of activities and the characteristics of the performers of a business process.
- Such patterns can be used to deploy new systems supporting the execution of business processes, or to analyze and improve already enacted business processes.