TL;DR: This paper presents an automated and scalable Machine Learning (ML) framework that requires minimal human input, designed for the domain of telecommunications risk management. The framework includes modules for task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment.
Abstract: Due to the growth of data and widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimal human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML) to select and tune the ML models and distributed ML to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that offer distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).
Nowadays, Machine Learning applications can make use of vast amounts of data, complex algorithms, and machines with high processing power to produce effective predictions and forecasts [11].
The possibility of adding new processing units enables ML applications to overcome time and memory restrictions [29].
The experiments used three real-world datasets from the domain of telecommunications.
The main novelty of this extended version is the technological architecture that is presented in Section 6.
This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training.
2 Related Work
To address this concern, classical distributed ML splits the work among different processors, each performing part of the algorithm.
Another current ML problem concerns the choice of ML algorithms and hyperparameters for a given task.
To the best of their knowledge, few studies directly compare AutoML tools.
Most studies compare one specific AutoML framework with state-of-the-art ML algorithms [15], do not present experimental tests [12,35], or are related to ML automation challenges [18–20].
Furthermore, none of the studies used datasets from the domain of telecommunications risk management, such as churn prediction or fraud detection.
3 Proposed Architecture
This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics.
Both scalability and automation are central requirements for the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimal human input (only the dataset and the target column).
Also, since the architecture was developed to work within a cluster with several processing nodes, users can handle datasets of any size simply by adjusting the number of cluster nodes.
3.1 Phases
The proposed architecture assumes two main phases (Fig. 1): a training phase and a testing phase.
The only human input needed by the user is the selection of the training dataset and the identification of the target column.
When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model.
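A minimal sketch of this step is shown below, assuming the Spark ML Pipeline API that the architecture later adopts (Section 6); the dataset path, stages, and column names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the training phase: stages are assembled into a pipeline and
# fitted to the training data, producing a reusable pipeline model. The dataset
# path and column names ("category", "amount") are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("automl-framework-sketch").getOrCreate()
train_df = spark.read.csv("train.csv", header=True, inferSchema=True)

# Example preprocessing stages; the framework builds its stages automatically.
indexer = StringIndexer(inputCol="category", outputCol="category_idx", handleInvalid="keep")
assembler = VectorAssembler(inputCols=["category_idx", "amount"], outputCol="features")

pipeline = Pipeline(stages=[indexer, assembler])
pipeline_model = pipeline.fit(train_df)          # fitting creates the pipeline model
train_features = pipeline_model.transform(train_df)
```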
The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions.
Performance metrics are also computed and presented to the user.
3.2 Components
The proposed architecture includes five main components: task detection, data preprocessing, feature selection, model training (with the usage of AutoML), and pipeline deployment.
The applied transformations depend on the data type of the columns, number of levels, and number of missing values.
The Feature Selection component removes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods.
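As an illustration of a filtering method, the sketch below drops numeric features whose absolute Pearson correlation with a numeric target falls below a threshold; the function name, threshold, and selection criterion are assumptions, since the paper does not detail the exact filters used.

```python
# Illustrative filter-based feature selection on a Spark DataFrame: numeric
# features weakly correlated with a numeric target are dropped. The criterion
# and threshold are assumptions, not the paper's method.
def filter_low_correlation(df, target_col, threshold=0.05):
    numeric_types = ("integer", "long", "float", "double")
    numeric_cols = [f.name for f in df.schema.fields
                    if f.dataType.typeName() in numeric_types and f.name != target_col]
    to_drop = [c for c in numeric_cols
               if abs(df.stat.corr(c, target_col)) < threshold]
    return df.drop(*to_drop)

# Usage (assuming the DataFrame from the previous sketch and a numeric "label"):
# filtered_df = filter_low_correlation(train_df, "label")
```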
The component also identifies the best model to be used on the test phase.
This module saves the pipeline that will be used on a test set, ensuring that the new data will pass through the same transformations as the training data.
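Continuing the earlier Spark sketch, pipeline deployment can be illustrated with the PipelineModel save and load API; the storage path and the test DataFrame `test_df` are illustrative.

```python
# Sketch of pipeline deployment: the fitted pipeline model is persisted and later
# reloaded so that new (test) data passes through exactly the same transformations
# as the training data. The path and `test_df` are illustrative.
from pyspark.ml import PipelineModel

pipeline_model.write().overwrite().save("/models/churn_pipeline")

# At test time:
loaded_model = PipelineModel.load("/models/churn_pipeline")
test_features = loaded_model.transform(test_df)
```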
4.1 Experimental Evaluation
For the experimental evaluation, the authors first examined the characteristics of the open-source AutoML tools.
Then, the authors used the tools that could be implemented in their architecture to perform a benchmark study.
In order to be considered for the experimental evaluation, the tools had to implement distributed ML.
4.2 AutoML Tools
The authors first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
Auto-Sklearn is an AutoML Python library, based on Scikit-Learn [28], that implements methods for automatic algorithm selection and hyperparameter tuning.
H2O AutoML uses H2O’s infrastructure to provide functions to automate algorithm selection and hyperparameter optimization [21].
Rminer is a package for the R tool that intends to facilitate the use of Machine Learning algorithms.
The last two rows are related to the stacking ensembles implemented by H2O AutoML: all, which combines all trained algorithms; and best, which combines only the best algorithm per family.
4.3 Data
For the benchmark study, the authors used three real-world datasets from the domain of telecommunications, provided by the IRMDA project analytics company.
Table 3 describes each attribute of the churn dataset.
The only attributes are the timestamp and the number of events in that interval, as described in Table 4.
The dataset contains more than 1 million examples, which correspond to one day of phone calls from one of the company clients.
5.1 Experimental Setup
The benchmark consisted of several computational experiments that used three real-world datasets to compare the selected AutoML tools (H2O AutoML and TransmogrifAI).
Every AutoML execution implemented a 10-fold cross-validation during the training of the algorithms.
The first scenario (1) considered all the attributes of the dataset as input features for the ML algorithms.
For event forecasting, the authors transformed the dataset, creating time lags as inputs for a regression task.
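A possible way to create such time lags with Spark window functions is sketched below, assuming a DataFrame with the timestamp and event count columns of Table 4; the number of lags is an illustrative choice.

```python
# Illustrative creation of time lags for the event forecasting regression task,
# assuming a Spark DataFrame `events_df` with "timestamp" and "events" columns
# (Table 4). Three lags are used here purely as an example.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy("timestamp")
lagged_df = events_df
for k in range(1, 4):
    lagged_df = lagged_df.withColumn(f"events_lag_{k}", F.lag("events", k).over(w))
lagged_df = lagged_df.dropna()  # drop the first rows, which lack a full lag history
```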
5.2 Discussion
The experimental results show that both AutoML tools require a small execution time to select the best ML model, with the highest mean execution time being slightly above 7 minutes.
The low training time can be explained by the usage of distributed ML, the small number of rows or columns of the datasets, and the removal of Deep Learning algorithms.
TransmogrifAI obtained the best predictive results in two regression scenarios and two classification scenarios.
The choice of H2O AutoML was supported by two main reasons.
First, H2O AutoML obtained better predictive results for most of the scenarios.
6 Technological Architecture
After the comparative ML experiments, the analytics company selected the H2O AutoML tool for the model training component.
The remaining technological modules were then designed in cooperation with the company.
Given that H2O can be integrated with Apache Spark (using the Sparkling Water module) and that Spark provides functions for data processing, the authors relied on Spark’s Application Programming Interface (API) functions to implement the remaining components of the architecture.
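A minimal sketch of this Spark-H2O integration through Sparkling Water (PySparkling) is shown below; the exact call signatures vary slightly across Sparkling Water versions, and the input DataFrame is assumed from the earlier sketches.

```python
# Minimal sketch of bridging Spark and H2O through Sparkling Water (PySparkling).
# Depending on the Sparkling Water version, the Spark session may need to be
# passed to getOrCreate(), and the conversion method may be named as_h2o_frame.
from pysparkling import H2OContext

hc = H2OContext.getOrCreate()               # starts H2O on top of the Spark cluster
train_h2o = hc.asH2OFrame(train_features)   # Spark DataFrame -> H2OFrame
```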
The updated architecture, with references to the technologies used, is illustrated in Fig. 2.
6.1 Components
This subsection describes the current implementation of each module of the architecture.
These changes resulted from feedback received from the analytics company or from technological restrictions.
Currently, the preprocessing transformations of the Data Preprocessing component (e.g., handling of missing data, encoding of categorical features, standardization of numerical features) are performed using Apache Spark's functions for extracting, transforming, and selecting features [1].
This function replaces the unknown values of a column with its mean value.
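The paper does not name the exact function; Spark's Imputer transformer with a mean strategy is a plausible match and is sketched here under that assumption, with illustrative column names.

```python
# Plausible mean-imputation step using Spark's Imputer transformer; the column
# names are illustrative and the exact function used by the framework is an
# assumption.
from pyspark.ml.feature import Imputer

imputer = Imputer(strategy="mean", inputCols=["amount"], outputCols=["amount_imputed"])
imputed_df = imputer.fit(train_df).transform(train_df)
```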
For classification (binary or multi-class) and regression tasks, the authors use H2O AutoML to automatically find and tune the best model.
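A sketch of this step with H2O AutoML's Python API follows, assuming an H2OFrame such as `train_h2o` from the Sparkling Water sketch; the target column name and runtime budget are illustrative, while the 10-fold cross-validation mirrors the experimental setup of Section 5.1.

```python
# Sketch of model training with H2O AutoML (Python API). The target column and
# runtime budget are illustrative; nfolds=10 mirrors the 10-fold cross-validation
# used in the experiments.
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_runtime_secs=600, nfolds=10, seed=1)
aml.train(y="label", training_frame=train_h2o)  # all remaining columns are used as inputs

print(aml.leaderboard)   # ranked models, including the two stacked ensembles
best_model = aml.leader  # best model, applied later in the test phase
```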
6.2 API
In order to facilitate the execution of the architecture, the authors also created a REST API to mediate the communication between the end-users and the pipelines.
Since the execution of each request consists of one Apache Spark job (using H2O’s capabilities through the Sparkling Water module), the API works as an intermediary between the end-user and the execution of the code inside Spark.
The server formats the response to the appropriate format (e.g., XML, JSON) and sends the response to the client interface.
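The paper does not specify the technology behind the REST API; the hypothetical Flask sketch below only illustrates its role as an intermediary, and the endpoint, payload fields, and submit_training_job helper are invented for illustration.

```python
# Hypothetical sketch of the REST API as an intermediary between the end-user
# and the Spark/H2O jobs. The endpoint, payload fields, and submit_training_job
# helper are illustrative; the paper does not specify the API technology.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/train", methods=["POST"])
def train():
    payload = request.get_json()
    dataset_path = payload["dataset"]   # training dataset selected by the user
    target_column = payload["target"]   # target column identified by the user
    job_id = submit_training_job(dataset_path, target_column)  # launches the Spark job
    return jsonify({"job_id": job_id, "status": "submitted"})
```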
7 Conclusions
This paper proposes an ML framework that automates the typical workflow of supervised ML applications with minimal human input.
The framework was developed within IRMDA, an R&D project of a leading Portuguese software and analytics company that provides services for the domain of telecommunications risk management.
In order to assess the most appropriate AutoML tool for the model training module, the authors initially conducted a benchmark experiment.
The authors selected technologies with distributed capabilities for the remaining modules of the initially proposed framework.
In addition, the authors intend to add more ML tasks to the framework, such as ordinal classification, multi-target regression, or multivariate time series.
TL;DR: In this article, four automated machine learning (AutoML) approaches were applied to model the effects of microplastics on anaerobic digestion processes, and integrated explainable analysis was explored to reveal the relationships between key variables (e.g., concentration, type, and size of microplastics) and methane production.
Abstract: Microplastics as emerging pollutants have been heavily accumulated in the waste activated sludge (WAS) during biological wastewater treatment, which showed significantly diverse impacts on the subsequent anaerobic sludge digestion for methane production. However, a robust modeling approach for predicting and unveiling the complex effects of accumulated microplastics within WAS on methane production is still missing. In this study, four automated machine learning (AutoML) approach was applied to model the effects of microplastics on anaerobic digestion processes, and integrated explainable analysis was explored to reveal the relationships between key variables (e.g., concentration, type, and size of microplastics) and methane production. The results showed that the gradient boosting machine had better prediction performance (mean squared error (MSE) = 17.0) than common neural networks models (MSE = 58.0), demonstrating that the AutoML algorithms succeeded in predicting the methane production and could select the best machine learning model without human intervention. Explainable analysis results indicated that the variable of microplastic types was more important than the variable of microplastic diameter and concentration. The existence of polystyrene was associated with higher methane production, whereas increasing microplastic diameter and concentration both inhibited methane production. This work also provided a novel modeling approach for comprehensively understanding the complex effects of microplastics on methane production, which revealed the dependence relationships between methane production and key variables and may be served as a reference for optimizing operational adjustments in anaerobic digestion processes.
TL;DR: A preoperative autoML prediction model for CSA-AKI that provided high predictive performance that was comparable to RF and superior to other ML and multivariable logistic regression models is presented.
Abstract: Background: We aimed to develop and validate an automated machine learning (autoML) prediction model for cardiac surgery-associated acute kidney injury (CSA-AKI). Methods: Using 69 preoperative variables, we developed several models to predict post-operative AKI in adult patients undergoing cardiac surgery. Models included autoML and non-autoML types, including decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and artificial neural network (ANN), as well as a logistic regression prediction model. We then compared model performance using area under the receiver operating characteristic curve (AUROC) and assessed model calibration using Brier score on the independent testing dataset. Results: The incidence of CSA-AKI was 36%. Stacked ensemble autoML had the highest predictive performance among autoML models, and was chosen for comparison with other non-autoML and multivariable logistic regression models. The autoML had the highest AUROC (0.79), followed by RF (0.78), XGBoost (0.77), multivariable logistic regression (0.77), ANN (0.75), and DT (0.64). The autoML had comparable AUROC with RF and outperformed the other models. The autoML was well-calibrated. The Brier score for autoML, RF, DT, XGBoost, ANN, and multivariable logistic regression was 0.18, 0.18, 0.21, 0.19, 0.19, and 0.18, respectively. We applied SHAP and LIME algorithms to our autoML prediction model to extract an explanation of the variables that drive patient-specific predictions of CSA-AKI. Conclusion: We were able to present a preoperative autoML prediction model for CSA-AKI that provided high predictive performance that was comparable to RF and superior to other ML and multivariable logistic regression models. The novel approaches of the proposed explainable preoperative autoML prediction model for CSA-AKI may guide clinicians in advancing individualized medicine plans for patients under cardiac surgery.
TL;DR: In this paper, Wang et al. proposed an automated machine learning (AutoML)-based indirect carbon emission analysis (ACIA) approach and predicted the specific indirect carbon emissions from electrical consumption (SEe; kg CO2/m3) successfully in nine full-scale WWTPs (W1-W9) with different treatment configurations based on the historical operational data.
Abstract: The indirect carbon emission from electrical consumption of wastewater treatment plants (WWTPs) accounts for large proportions of their total carbon emissions, which deserves intensive attention. This work proposed an automated machine learning (AutoML)-based indirect carbon emission analysis (ACIA) approach and predicted the specific indirect carbon emission from electrical consumption (SEe; kg CO2/m3) successfully in nine full-scale WWTPs (W1–W9) with different treatment configurations based on the historical operational data. The stacked ensemble models generated by the AutoML accurately predicted the SEe (mean absolute error = 0.02232–0.02352, R2 = 0.65107–0.67509). Then, the variable importance and Shapley additive explanations (SHAP) summary plots qualitatively revealed that the influent volume and the types of secondary and tertiary treatment processes were the most important variables associated with SEe prediction. The interpretation results of partial dependence and individual conditional expectation further verified quantitative relationships between input variables and SEe. Also, low energy efficiency with high indirect carbon emission of WWTPs was distinguished. Compared with traditional carbon emission analysis and prediction methods, the ACIA method could accurately evaluate and predict SEe of WWTPs with different treatment scales and processes with easily available variables and reveal qualitative and quantitative relationships inside datasets simultaneously, which is a powerful tool to benefit the “carbon neutrality” of WWTPs.
TL;DR: In this article , the authors take a systematic approach to review articles containing risk management in software development projects and find the most exciting topics for researchers in risk management, especially in software engineering projects.
Abstract: Risk Management is an integral part of every project. Risk management must estimate the risks’ significance, especially in the SDLC process, and mitigate those risks. Since 2016, many papers and journals have researched planning, design, and risk control in software development projects over the last five years. This study aims to find the most exciting topics for researchers in risk management, especially in software engineering projects. This paper takes a systematic approach to reviewing articles containing risk management in software development projects. This study collects papers and journals included in the international online library database, then summarizes them according to the stages of the PICOC methodology. This paper results in the focus of research in the last five years on Agile methods. The current issue is that many researchers are trying to explicitly integrate risk management into the Agile development process by creating a comprehensive risk management framework. This SLR helps future research get a theoretical basis to solve the studied problem. The SLR explains the focuses of previous research, analysis of research results, and the weaknesses of the investigation. For further study, take one of the topic papers, do a critical review, and find research gaps.
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
TL;DR: In this article, a method of over-sampling the minority class involves creating synthetic minority class examples, which is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
TL;DR: A survey of machine learning methods for handling data sets containing large amounts of irrelevant information can be found in this article, where the authors focus on two key issues: selecting relevant features and selecting relevant examples.
Abstract: In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area. © 1997 Elsevier Science B.V.
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management" ?
This paper presents an automated and scalable framework for ML that requires minimal human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that offer distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company.
Q2. What are the future works mentioned in the paper "A scalable and automated machine learning framework to support risk management" ?
In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities.