TL;DR: This paper investigates a multi-level relational frequent pattern discovery method as a means of process mining, using a Grid-based implementation of the knowledge discovery algorithm that distributes the computation across several nodes of a Grid platform.
Abstract: Industrial, scientific, and commercial applications use information systems to trace the execution of a business process. Relevant events are registered in massive logs and process mining techniques are used to automatically discover knowledge that reveals the execution and organization of the process instances (cases). In this paper, we investigate the use of a multi-level relational frequent pattern discovery method as a means of process mining. In order to process such massive logs we resort to a Grid-based implementation of the knowledge discovery algorithm that distributes the computation on several nodes of a Grid platform. Experiments are performed on real event logs.
Many information systems, such as Workflow Management Systems, ERP systems, business-to-business systems, and firewall systems, trace the behavior of running processes by registering relevant events in massive logs.
Process mining poses several challenges to traditional data mining tasks.
This data representation makes it necessary to distinguish between the reference objects of the analysis and the other task-relevant objects (activities and performers), and to represent their interactions.
The authors present G-SPADA, an extension of SPADA, which discovers approximate multi-level relational frequent patterns by distributing the exact computation of locally frequent patterns across a computational Grid, and then post-processing the local patterns to approximate the set of globally frequent patterns as well as their supports.
By taking into account hierarchies on task-relevant objects, relational patterns can be discovered at multiple levels of granularity.
P2 provides better insight than P1 into the nature of B, C, and D. In SPADA [5], multi-level relational frequent patterns are discovered according to the levelwise method [6], which is based on a breadth-first search in the lattice of patterns spanned by the θ-subsumption generality order [8].
3 G-SPADA
Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently.
Each partition includes a subset of the reference objects and the set of task-relevant objects.
In the second step, locally frequent patterns (i.e., patterns with support greater than minsup[l]) are discovered independently on each node.
In the third step, G-SPADA approximates the set of globally frequent patterns by merging patterns discovered at the nodes.
A merge step with k = 1 may generate several false positives, i.e., patterns that are locally frequent but globally infrequent.
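The false-positive risk at k = 1 can be seen with a small numeric illustration (the figures below are hypothetical, not taken from the paper's logs):

```python
# Hypothetical illustration of a false positive under k = 1: a pattern
# matching 3 of the 5 cases in partition A and none of the 5 cases in
# partition B is locally frequent in A but globally infrequent.
part_a_cases, part_b_cases = 5, 5
hits_a, hits_b = 3, 0

local_support_a = hits_a / part_a_cases                             # 0.6
global_support = (hits_a + hits_b) / (part_a_cases + part_b_cases)  # 0.3

minsup = 0.5
# Locally frequent in A, yet below minsup on the whole dataset.
is_false_positive = local_support_a >= minsup and global_support < minsup
print(is_false_positive)  # → True
```

Raising k trades such false positives against false negatives: a pattern must be locally frequent in at least k partitions to survive the merge.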
3.1 Relational Data Partitioning
G-SPADA pre-processes the deductive database of logs and completes the description explicitly provided for each example (DE) with the information that is implicit in the domain knowledge (DI).
By performing the saturation step, the following predicates are made explicit in the database: before(a1,a2). before(a2,a3).
These data partitions are enriched by adding the ground predicates which describe properties and relations of the reference objects falling in the partition at hand.
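Assuming "before" is defined transitively in the domain knowledge, the saturation step can be sketched as a plain transitive-closure computation over ground facts (an illustrative sketch; the tuple encoding and function name are assumptions, not the system's actual implementation):

```python
def saturate_before(facts):
    """Compute the transitive closure of the 'before' relation, making
    implicit ordering facts explicit. Facts are encoded as (x, y) tuples
    standing for before(x, y); this mirrors the example in the text but
    is only an illustrative sketch of a saturation step."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # before(a, b) and before(b, d) entail before(a, d)
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

print(sorted(saturate_before({("a1", "a2"), ("a2", "a3")})))
# → [('a1', 'a2'), ('a1', 'a3'), ('a2', 'a3')]
```

Here the explicit facts before(a1,a2) and before(a2,a3) yield the implicit fact before(a1,a3).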
3.2 Distributing Computation on Grid
Each dataset partition is shipped, along with the G-SPADA pattern discovery algorithm, to computation nodes on the Grid using the gLite middleware (gLite, http://glite.web.cern.ch/glite/, is a next-generation middleware for Grid computing which provides a framework for building Grid applications). This is done by submitting parametric jobs described in JDL (Job Description Language) through the CLI (command line interface).
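A parametric gLite job of this kind could be described in JDL along the following lines (a hedged sketch: the executable, file names, and parameter values are invented for illustration; only the parametric-job attributes and the `_PARAM_` substitution token come from gLite's JDL):

```
[
  JobType        = "Parametric";
  Executable     = "run_gspada.sh";
  Arguments      = "partition_PARAM_.db";
  Parameters     = 20;   // one job per dataset partition
  ParameterStart = 1;
  ParameterStep  = 1;
  InputSandbox   = {"run_gspada.sh", "partition_PARAM_.db"};
  StdOutput      = "patterns_PARAM_.out";
  OutputSandbox  = {"patterns_PARAM_.out"};
]
```

At submission, `_PARAM_` is replaced by each parameter value in turn, so a single JDL file spawns one job per partition; such jobs are typically submitted through the gLite CLI (e.g., `glite-wms-job-submit`).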
3.3 Computing Approximate Global Frequent Patterns
The n sets of local frequent patterns are collected from the computation nodes of the Grid platform and then merged to approximate the set of global patterns.
Local patterns occurring in less than k partitions are filtered out.
The global frequent patterns obtained by this merge procedure approximate the frequent patterns that would be mined on the entire dataset.
A value k = 7 means that the pattern is found in 7 partitions (sample-level support), while 72.5% indicates the macro-average support, obtained by averaging the support values computed on those 7 samples.
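The merge described above can be sketched as follows (function and variable names are illustrative, not taken from the system):

```python
from collections import defaultdict

def approximate_global_patterns(local_sets, k):
    """Merge per-node pattern sets (pattern -> local support) into an
    approximate set of globally frequent patterns. Patterns occurring in
    fewer than k partitions are filtered out; each surviving pattern is
    annotated with its sample-level support (number of partitions it
    appears in) and its macro-average support."""
    supports = defaultdict(list)
    for node_patterns in local_sets:
        for pattern, sup in node_patterns.items():
            supports[pattern].append(sup)
    return {
        pattern: {"k": len(sups), "avgSup": sum(sups) / len(sups)}
        for pattern, sups in supports.items()
        if len(sups) >= k
    }

# Toy example with three nodes' local results.
local_sets = [{"P4": 0.20, "P9": 0.35}, {"P4": 0.25}, {"P4": 0.24}]
result = approximate_global_patterns(local_sets, k=2)
print(result)  # P9 is filtered out; P4 survives with k = 3
```

With k = 2, the pattern frequent in only one partition is dropped, while the pattern frequent in all three is kept with its macro-average support.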
4 Experimental Results
Experiments are performed by processing event logs provided by THINK3 Inc. in the context of the TOCAI project.
THINK3 is a global player in the CAD and PLM market whose mission is to help manufacturers optimize their entire product development processes.
G-SPADA is run on the deductive database obtained by boiling down the event logs from January 1st to February 28th, 2006, considering as domain knowledge the definition of the "before" predicate.
In the experiments, each case (process instance) traced in the logs is considered as a whole and multi-level relational patterns are discovered from traced business processes.
These patterns capture the possible relation between the order of activities and the properties of their performers.
4.1 Data Description
Data trace the behavior of 21,256 instances of a business process recorded in the period under analysis.
This corresponds to modeling activities and performers by means of three-level hierarchies.
For each activity, a text description of the operation is registered in the event logs.
The right part is a characterization of the description of the operation provided in the left part.
Finally, each performer is described by the group it belongs to.
4.2 Local and Global Multi-level Relational Patterns Discovery
G-SPADA is run on the event logs including 395,404 ground predicates.
Indeed, SPADA generates a memory exception when running on the entire dataset.
Multi-level relational patterns are discovered at each node with minsup[l] = 0.2 (l = 1, 2) and max_len_path = 9. Finally, for each level of granularity, global patterns are approximated from the local ones by varying k between 1 and 20.
Global patterns provide a compact description of the process instances traced in the logs.
Finally, the relational pattern
P4: case(A), activity(A,B), before(B,C), before(C,D), is_a(B,namemaker), is_a(C,workflow), is_a(D,workflow), descleft(C,creation), descleft(D,wip2k) [k=16, avgSup=21.35%]
describes the execution order among three sequential activities, namely B, C, and D: B is a namemaker activity, while C and D are workflow activities.
5 Conclusions
The authors present G-SPADA, an extension of the system SPADA, to discover approximate multi-level relational frequent patterns in the context of process mining.
G-SPADA exploits a multi-relational approach in order to deal with both the heterogeneous nature of the data stored in event logs and their temporal autocorrelation.
G-SPADA addresses the need to process massive logs by resorting to a Grid-based architecture.
Experiments on real event logs allow us to discover interpretable patterns that capture regularities in the execution of activities and the characteristics of the performers of a business process.
Such patterns can be used to deploy new systems that support the execution of business processes, or to analyze and improve already enacted ones.
TL;DR: The development of maritime traffic research in pattern mining and traffic forecasting affirms the importance of advanced maritime traffic studies and their great potential for enhancing maritime traffic safety and intelligence, accommodating the implementation of the Internet of Things, artificial intelligence technologies, knowledge engineering, and big data computing solutions.
Abstract: Maritime traffic service networks and information systems play a vital role in maritime traffic safety management. The data collected from the maritime traffic networks are essential for the perception of traffic dynamics and predictive traffic regulation. This paper is devoted to surveying the key processing components in maritime traffic networks. Specifically, the latest progress on maritime traffic data mining technologies for maritime traffic pattern extraction and the recent effort on vessels’ motion forecasting for better situation awareness are reviewed. Through the review, we highlight that traffic pattern knowledge presents valued insights for wide-spectrum domain application purposes, and serves as a prerequisite for the knowledge-based forecasting techniques that are growing in popularity. The development of maritime traffic research in pattern mining and traffic forecasting reviewed in this paper affirms the importance of advanced maritime traffic studies and their great potential for enhancing maritime traffic safety and intelligence, accommodating the implementation of the Internet of Things, artificial intelligence technologies, knowledge engineering, and big data computing solutions.
105 citations
Cites methods from "A Grid-Based Multi-relational Appro..."
...SPADA [80]–[82] has been applied to discover associations between a vessel and a trajectory to represent navigation spatio-temporal pattern....
TL;DR: This paper proposes the FIT-metric as a tool to characterize the stability of existing service configurations based on three components: functionality, integration, and traffic, and applies it to configurations taken from a production-strength SOA landscape.
Abstract: The paradigm of service-oriented architectures (SOA) is by now accepted for application integration and in widespread use. As an underlying key-technology of cloud computing and because of unresolved issues during operation and maintenance it remains a hot topic. SOA encapsulates business functionality in services, combining aspects from both the business and infrastructure level. The reuse of services results in hidden chains of dependencies that affect governance and optimization of service-based systems. To guarantee the cost-effective availability of the whole service-based application landscape, the real criticality of each dependency has to be determined for IT Service Management (ITSM) to act accordingly. We propose the FIT-metric as a tool to characterize the stability of existing service configurations based on three components: functionality, integration and traffic. In this paper we describe the design of FIT and apply it to configurations taken from a production-strength SOA-landscape. A prototype of FIT is currently being implemented at Deutsche Post MAIL.
TL;DR: Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Abstract: Current computing and storage technology is rapidly outstripping society's ability to make meaningful use of the torrent of available data. Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
TL;DR: This paper presents an efficient algorithm for mining association rules that is fundamentally different from known algorithms and not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases.
Abstract: Mining for association rules between items in a large database of sales transactions has been described as an important database mining problem. In this paper we present an efficient algorithm for mining association rules that is fundamentally different from known algorithms. Compared to previous algorithms, our algorithm not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases. We have performed extensive experiments and compared the performance of our algorithm with one of the best existing algorithms. It was found that for large databases, the CPU overhead was reduced by as much as a factor of four and I/O was reduced by almost an order of magnitude. Hence this algorithm is especially suitable for very large size databases.
1,822 citations
Additional excerpts
...Similarly to Partition [9], G-SPADA splits a dataset into several partitions to be processed independently....
TL;DR: The presented theory views inductive learning as a heuristic search through a space of symbolic descriptions, generated by an application of various inference rules to the initial observational statements.
Abstract: The presented theory views inductive learning as a heuristic search through a space of symbolic descriptions, generated by an application of various inference rules to the initial observational statements. The inference rules include generalization rules, which perform generalizing transformations on descriptions, and conventional truth-preserving deductive rules. The application of the inference rules to descriptions is constrained by problem background knowledge, and guided by criteria evaluating the “quality” of generated inductive assertions.
TL;DR: This paper introduces the concept of workflow mining and presents a common format for workflow logs, and discusses the most challenging problems and present some of the workflow mining approaches available today.
Abstract: Many of today's information systems are driven by explicit process models. Workflow management systems, but also ERP, CRM, SCM, and B2B, are configured on the basis of a workflow model specifying the order in which tasks need to be executed. Creating a workflow design is a complicated time-consuming process and typically there are discrepancies between the actual workflow processes and the processes as perceived by the management. To support the design of workflows, we propose the use of workflow mining. Starting point for workflow mining is a so-called "workflow log" containing information about the workflow process as it is actually being executed. In this paper, we introduce the concept of workflow mining and present a common format for workflow logs. Then we discuss the most challenging problems and present some of the workflow mining approaches available today.
1,168 citations
"A Grid-Based Multi-relational Appro..." refers background in this paper
...Currently, many algorithms [2,1,11,3] have dealt with several of these challenges and some of them are integrated into the ProM framework [12]....
TL;DR: The ProM framework is introduced and an overview of the plug-ins that have been developed and is flexible with respect to the input and output format, and is also open enough to allow for the easy reuse of code during the implementation of new process mining ideas.
Abstract: Under the umbrella of buzzwords such as “Business Activity Monitoring” (BAM) and “Business Process Intelligence” (BPI) both academic (e.g., EMiT, Little Thumb, InWoLvE, Process Miner, and MinSoN) and commercial tools (e.g., ARIS PPM, HP BPI, and ILOG JViews) have been developed. The goal of these tools is to extract knowledge from event logs (e.g., transaction logs in an ERP system or audit trails in a WFM system), i.e., to do process mining. Unfortunately, tools use different formats for reading/storing log files and present their results in different ways. This makes it difficult to use different tools on the same data set and to compare the mining results. Furthermore, some of these tools implement concepts that can be very useful in the other tools but it is often difficult to combine tools. As a result, researchers working on new process mining techniques are forced to build a mining infrastructure from scratch or test their techniques in an isolated way, disconnected from any practical applications. To overcome these kind of problems, we have developed the ProM framework, i.e., an “pluggable” environment for process mining. The framework is flexible with respect to the input and output format, and is also open enough to allow for the easy reuse of code during the implementation of new process mining ideas. This paper introduces the ProM framework and gives an overview of the plug-ins that have been developed.
958 citations
"A Grid-Based Multi-relational Appro..." refers background in this paper
...Currently, many algorithms [2,1,11,3] have dealt with several of these challenges and some of them are integrated into the ProM framework [12]....