A Rule-Based Approach for Process Discovery: Dealing with Noise and Imbalance in Process Logs
References
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Machine learning
Cohen, W.W. (1995). Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning.
Aalst, W.M.P. van der (1998). The application of Petri nets to workflow management. Journal of Circuits, Systems, and Computers.
Aalst, W.M.P. van der, Weijters, A.J.M.M., & Maruster, L. (2004). Workflow mining: discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering.
Frequently Asked Questions (16)
Q2. How do the authors control how their method misses or incorrectly predicts some relations?
The authors control how their method misses or incorrectly predicts some relations by generating experimental data in which the number of event types, the imbalance, the noise level, and the log size are systematically varied.
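The paper's exact log generator is not shown here, but noise is commonly injected into synthetic process logs by randomly removing an event from a trace or swapping two adjacent events. The following is a minimal sketch of that noise model; the function name and the 50/50 split between removal and swapping are illustrative assumptions, not the paper's procedure.

```python
import random

def add_noise(trace, noise_level, rng=random.Random(0)):
    """Corrupt a single trace with probability `noise_level` by either
    removing a random event or swapping two adjacent events.
    (A common noise model; the paper's generator may differ.)"""
    trace = list(trace)
    if len(trace) >= 2 and rng.random() < noise_level:
        if rng.random() < 0.5:
            del trace[rng.randrange(len(trace))]   # remove one event
        else:
            i = rng.randrange(len(trace) - 1)      # swap adjacent events
            trace[i], trace[i + 1] = trace[i + 1], trace[i]
    return trace

# Build a noisy log from 100 copies of the same clean trace.
noisy_log = [add_noise(t, 0.1) for t in [["a", "b", "c"]] * 100]
```

Varying `noise_level` and the number of generated traces then gives direct control over the noise and log-size dimensions of the experiment.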
Q3. What are the three performance indicators used to compare the 10 obtained models?
In order to compare the performance of the 10 obtained models, the authors consider three averaged performance indicators: the error rate, precision and recall.
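These three indicators can be computed directly from the predicted and true relation labels. The sketch below treats one relation type (e.g., causal) as the positive class; the function name and example labels are illustrative, not taken from the paper.

```python
from typing import List

def evaluate(predicted: List[str], actual: List[str], positive: str = "causal") -> dict:
    """Compute error rate, precision, and recall for predicted
    pairwise relations against the true relations."""
    assert len(predicted) == len(actual)
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return {
        "error_rate": errors / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

pred = ["causal", "causal", "other", "other"]
true = ["causal", "other", "causal", "other"]
print(evaluate(pred, true))  # → {'error_rate': 0.5, 'precision': 0.5, 'recall': 0.5}
```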
Q4. Why do the authors use the first rule set for future predictions?
Because the performance indicators do not differ significantly, the authors are justified in using the first induced rule set for performing future predictions of causal relations.
Q5. What are the three possible situations where a pair of events can be related?
If there is no noise, there are three possible situations in which a pair of events (henceforth referred to as tasks) can be related, namely the causal, exclusive, and parallel relations.
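In α-algorithm terms (Aalst et al., 2004), these three relations can be derived from the direct-succession relation observed in a noise-free log: a is causal to b if b directly follows a but never the reverse, parallel if both orders occur, and exclusive if neither occurs. A minimal sketch, assuming traces are given as Python lists of task names:

```python
def classify_relations(traces):
    """Derive causal/exclusive/parallel relations between task pairs
    from direct succession in a noise-free log (alpha-algorithm style)."""
    follows = set()  # (a, b) such that b directly follows a in some trace
    tasks = set()
    for trace in traces:
        tasks.update(trace)
        for a, b in zip(trace, trace[1:]):
            follows.add((a, b))
    relations = {}
    for a in tasks:
        for b in tasks:
            if a == b:
                continue
            ab, ba = (a, b) in follows, (b, a) in follows
            if ab and not ba:
                relations[(a, b)] = "causal"      # a -> b
            elif ab and ba:
                relations[(a, b)] = "parallel"    # a || b
            elif not ab and not ba:
                relations[(a, b)] = "exclusive"   # a # b
    return relations

log = [["a", "b", "d"], ["a", "c", "d"], ["a", "b", "c", "d"], ["a", "c", "b", "d"]]
rels = classify_relations(log)  # e.g. rels[("a", "b")] == "causal", rels[("b", "c")] == "parallel"
```

With noise, such hard rules break down, which is exactly why the paper replaces them with induced, threshold-based rules over relational metrics.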
Q6. What are the advantages of petri nets?
The Petri net formalism has several advantages, which is why Petri nets are often used to represent process models (Aalst, 1998): formal semantics (a clear and precise definition), graphical nature (intuitive and easy to learn), expressiveness (support for all primitives needed to model a process), properties (the mathematical foundation allows reasoning about Petri net properties), analysis (many analysis techniques exist to prove properties and calculate performance measures), and vendor independence (not tied to the software package of a specific vendor).
Q7. How is the training error rate calculated for RIPPER CAUS?
The training error rate for RIPPER CAUS is 0.08% (the training error rate is the rate of incorrect predictions the model makes when relabeling the training data set).
Q8. How do the authors estimate the generalization performance of a rule set?
Since the training error is not a reliable measure of the generalization performance and quality of a rule set, the authors estimate its generalization performance using held-out test material in Section 5.3.
Q9. What are the challenges of process mining?
A review of challenging process mining problems is given in Aalst and Weijters (2004), covering mining hidden tasks, mining duplicate tasks, mining loops, using time, mining different perspectives, and dealing with noise and incompleteness.
Q10. What are the CM, LM, YX and XY metrics?
In Section 4 the authors introduced five relational metrics CM, GM, LM, YX and XY to be used as predictive features for determining the causal and exclusive/parallel relations between pairs of events.
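An induced rule set over such metrics is typically applied as an ordered rule list in which the first matching rule wins (as in RIPPER). The sketch below shows only that shape: the threshold values and the rules themselves are hypothetical placeholders, not the rules actually induced in the paper.

```python
# Hypothetical illustration of applying a RIPPER-style ordered rule
# list over the five relational metrics. The thresholds below are
# invented placeholders, NOT the rules induced in the paper.

def predict_relation(features: dict) -> str:
    """Apply an ordered rule list to a feature vector with keys
    CM, GM, LM, YX, XY; the first matching rule wins."""
    if features["CM"] > 0.5 and features["XY"] > 0.0:   # hypothetical rule 1
        return "causal"
    if features["LM"] > 0.8:                            # hypothetical rule 2
        return "causal"
    return "exclusive/parallel"                         # default class

example = {"CM": 0.7, "GM": 0.1, "LM": 0.3, "YX": 0.0, "XY": 0.2}
print(predict_relation(example))  # → causal (matches hypothetical rule 1)
```

The ordered-list form matters: each rule only fires on pairs not already covered by an earlier rule, which is how RIPPER handles the class imbalance between the rare causal pairs and the many exclusive/parallel pairs.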
Q11. What is the effect of log size on generalization performance?
As expected, the incompleteness of the log affects the generalization performance of finding causal relations: as the log size increases (and the log becomes more complete), performance increases.
Q12. What is the main reason for the interest in process mining?
The growing interest in process mining, i.e., the automated analysis of existing processes, can be explained by the availability of logged information, which most information systems (traditional or process-aware) record.
Q13. Why do the authors use the same definitions as in Aalst et al. (2004)?
Because they focus on the process perspective, the authors use the same definitions as in Aalst et al. (2004), this time referring to process logs and process traces.
Q14. What are the limitations of the algorithm?
In this paper, the authors address the second limitation of the α algorithm presented in Aalst et al. (2004), namely dealing with incomplete and noisy process logs, to allow its applicability to real-world processes.
Q15. What is the effect of noise on the generalization performance of causal relations?
It appears that noise affects exclusive and parallel relations in a similar way as causal relations, i.e., as the level of noise increases, the accuracy of finding the exclusive/parallel relations decreases.
Q16. What is the LM measure for tasks a and c?
The LM measure for tasks a and b gives a value of LM = 0.85, and for tasks a and c a value of LM = 0.90, which is in line with the authors' intuition.