A scalable and automated machine learning framework to support risk management
Abstract: Due to the growth of data and widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimum human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML), to select and tune the ML models, and distributed ML, to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that address a distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).
Summary
- Nowadays, Machine Learning applications can make use of a great amount of data, complex algorithms, and machines with great processing power to produce effective predictions and forecasts.
- The fact that it is possible to add new processing units enables ML applications to surpass time and memory restrictions.
- The experiments used three real-world datasets from the domain of telecommunications.
- The main novelty of this extended version is the technological architecture that is presented in Section 6.
- This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training.
3 Proposed Architecture
- This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics.
- Both scalability and automation are central requirements for the ML system, since the company has many clients who hold diverse amounts of data (large or small) and are typically non-ML-experts.
- The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimum human input (only the dataset and the target column).
- Also, since the architecture was developed to work within a cluster with several processing nodes, users can handle datasets of any size just by managing the number of cluster nodes.
- The proposed architecture assumes two main phases (Fig. 1): a training phase and a testing phase.
- The only input required from the user is the selection of the training dataset and the identification of the target column.
- When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model.
- The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions.
- Performance metrics are also computed and presented to the user.
- The proposed architecture includes five main components: task detection, data preprocessing, feature selection, model training (with the usage of AutoML), and pipeline deployment.
- The applied transformations depend on the data type of the columns, number of levels, and number of missing values.
- Feature selection: removes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods.
- The component also identifies the best model to be used on the test phase.
- This module saves the pipeline that will be used on a test set, ensuring that the new data will pass through the same transformations as the training data.
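The task detection component decides between classification and regression from the target column alone. The summary does not state the rule that the authors use, so the following is only a minimal sketch of one plausible heuristic. The function name `detect_task` and the `max_classes` threshold are assumptions made for illustration, not part of the framework.

```python
from typing import Sequence


def detect_task(target: Sequence[object], max_classes: int = 20) -> str:
    """Guess the supervised task from the target column.

    Hypothetical heuristic (not taken from the paper):
    - non-numeric targets are treated as classification;
    - integer-valued targets with few distinct values are treated as
      classification;
    - everything else is treated as regression.
    """
    # Ignore missing values when inspecting the target.
    values = [v for v in target if v is not None]
    if not values:
        raise ValueError("target column has no known values")

    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not numeric:
        return "classification"

    integer_valued = all(float(v).is_integer() for v in values)
    if integer_valued and len(set(values)) <= max_classes:
        return "classification"
    return "regression"


print(detect_task(["churn", "stay", "churn"]))  # label strings
print(detect_task([0, 1, 1, 0]))                # binary codes
print(detect_task([0.5, 1.7, 3.2]))             # continuous values
```

A threshold on the number of distinct values is a design choice with obvious failure modes (for instance, a count target with few observed values), so a real system would likely let the user override the detected task.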
4.1 Experimental Evaluation
- For the experimental evaluation, the authors first examined the characteristics of the open-source AutoML tools.
- Then, the authors used the tools that could be implemented in their architecture to perform a benchmark study.
- In order to be considered for the experimental evaluation, the tools have to implement distributed ML.
4.2 AutoML Tools
- The authors first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
- Auto-Sklearn is an AutoML Python library based on Scikit-Learn that implements methods for automatic algorithm selection and hyperparameter tuning.
- H2O AutoML uses H2O’s infrastructure to provide functions to automate algorithm selection and hyperparameter optimization.
- Rminer is a package for the R tool that aims to facilitate the use of Machine Learning algorithms.
- The last two rows are related to the stacking ensembles implemented by H2O AutoML: all, which combines all trained algorithms; and best, which only combines the best algorithm per family.
- For the benchmark study, the authors used three real-world datasets from the domain of telecommunications, provided by the IRMDA project analytics company.
- Table 3 describes each attribute of the churn dataset.
- The only attributes are the timestamp and the number of events in that interval, as described in Table 4.
- The dataset contains more than 1 million examples, which correspond to one day of phone calls from one of the company clients.
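The difference between the two stacking ensembles mentioned above (all versus best) lies in which trained models are passed to the stacker as base learners. The sketch below illustrates only that selection step. The `TrainedModel` type, the model names, and the scores are made-up illustrative values, and the code does not call H2O itself.

```python
from typing import Dict, List, NamedTuple


class TrainedModel(NamedTuple):
    name: str
    family: str   # algorithm family, e.g. "GBM" or "GLM"
    score: float  # validation metric, assumed here to be higher-is-better


def best_per_family(models: List[TrainedModel]) -> List[TrainedModel]:
    """Keep only the top-scoring model of each algorithm family."""
    best: Dict[str, TrainedModel] = {}
    for model in models:
        current = best.get(model.family)
        if current is None or model.score > current.score:
            best[model.family] = model
    # Return the survivors ordered from best to worst score.
    return sorted(best.values(), key=lambda m: m.score, reverse=True)


# Illustrative leaderboard (scores are invented for the example).
leaderboard = [
    TrainedModel("gbm_1", "GBM", 0.91),
    TrainedModel("gbm_2", "GBM", 0.93),
    TrainedModel("glm_1", "GLM", 0.85),
    TrainedModel("drf_1", "DRF", 0.90),
]

all_base = leaderboard                     # "all": every trained model
best_base = best_per_family(leaderboard)   # "best": one model per family

print([m.name for m in best_base])
```

The "best" variant stacks fewer base learners, which usually makes the ensemble cheaper to score, while the "all" variant gives the stacker more candidates to weight.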
5.1 Experimental Setup
- The benchmark consisted of several computational experiments that used three real-world datasets to compare the selected AutoML tools (H2O AutoML and TransmogrifAI).
- Every AutoML execution implemented a 10-fold cross-validation during the training of the algorithms.
- The first scenario (1) considered all the attributes of the dataset as input features for the ML algorithms.
- For event forecasting, the authors transformed the dataset, creating time lags as inputs for a regression task.
- The experimental results show that both AutoML tools require a small execution time to select the best ML model, with the highest mean execution time being slightly higher than 7 minutes.
- The low training time can be explained by the usage of distributed ML, datasets with a small number of rows or columns, and the removal of Deep Learning algorithms.
- TransmogrifAI obtained the best predictive results in two regression scenarios and two classification scenarios.
- This choice was supported by two main reasons.
- First, H2O AutoML obtained better predictive results for most of the scenarios.
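The time-lag transformation used for the event forecasting dataset turns a univariate series into a regression table: each row holds the previous values as inputs and the current value as the target. The summary does not report the number of lags that the authors used, so `n_lags` and the sample series below are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def make_lagged_dataset(
    series: Sequence[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Build (inputs, targets) where each input row holds the n_lags
    values that precede the target, ordered from oldest to newest."""
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    if len(series) <= n_lags:
        raise ValueError("series must be longer than n_lags")

    inputs: List[List[float]] = []
    targets: List[float] = []
    for t in range(n_lags, len(series)):
        inputs.append(list(series[t - n_lags:t]))  # values at t-n_lags .. t-1
        targets.append(series[t])                  # value to predict at t
    return inputs, targets


# Illustrative event counts per interval (invented values).
events = [12, 15, 11, 18, 20, 17]
X, y = make_lagged_dataset(events, n_lags=3)

for row, target in zip(X, y):
    print(row, "->", target)
```

With six observations and three lags, the table has three rows. In general, the first `n_lags` observations are consumed as inputs only, so the regression dataset is `n_lags` rows shorter than the original series.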
6 Technological Architecture
- After the comparative ML experiments, the analytics company selected the H2O AutoML tool for the model training component.
- The remaining technological modules were then designed in cooperation with the company.
- Given that H2O can be integrated with Apache Spark (using the Sparkling Water module) and that Spark provides functions for data processing, the authors relied on Spark’s Application Programming Interface (API) functions to implement the remaining components of the architecture.
- The updated architecture, with references to the technologies used, is illustrated in Fig. 2.
- This subsection describes the current implementation of each module of the architecture.
- These changes were related to feedback received from the analytics company or due to technological restrictions.
- Data preprocessing: currently, the preprocessing transformations (e.g., dealing with missing data, the encoding of categorical features, standardization of numerical features) are done using Apache Spark’s functions for extracting, transforming and selecting features.
- This function replaces the unknown values of a column with its mean value.
- For classification (binary or multi-class) and regression tasks, the authors use H2O AutoML to automatically find and tune the best model.
- In order to facilitate the execution of the architecture, the authors also created a REST API to mediate the communication between the end-users and the pipelines.
- Since the execution of each request consists of one Apache Spark job (using H2O’s capabilities through the Sparkling Water module), the API works as an intermediary between the end-user and the execution of the code inside Spark.
- The server formats the response to the appropriate format (e.g., XML, JSON) and sends the response to the client interface.
- This paper proposes an ML framework to automate the typical workflow of supervised ML applications with minimal human input.
- The framework was developed within the IRMDA project, an R&D project developed by a leading Portuguese software and analytics company that provides services for the domain of telecommunications risk management.
- In order to assess the most appropriate AutoML tools for this model training module, the authors initially conducted a benchmark experiment.
- The authors selected technologies with distributed capabilities for the remaining modules of the initially proposed framework.
- Besides, the authors intend to add more ML tasks to the framework, such as ordinal classification, multi-target regression, or multivariate time series.
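The mean imputation described for the data preprocessing module, combined with the requirement that new data pass through the same transformations as the training data, can be sketched in plain Python as a fit step and an apply step. This is not the Spark implementation used by the authors. The function names `fit_mean` and `impute` and the sample values are assumptions for illustration.

```python
from typing import List, Optional, Sequence


def fit_mean(column: Sequence[Optional[float]]) -> float:
    """Compute the mean of the known (non-missing) training values."""
    known = [v for v in column if v is not None]
    if not known:
        raise ValueError("cannot compute a mean: all values are missing")
    return sum(known) / len(known)


def impute(column: Sequence[Optional[float]], fill_value: float) -> List[float]:
    """Replace every missing value with a previously fitted fill value."""
    return [fill_value if v is None else v for v in column]


# Illustrative columns (invented values); None marks an unknown value.
train_column = [10.0, None, 30.0, 20.0]
test_column = [None, 50.0]

# The mean is learned once, on the training data only ...
train_mean = fit_mean(train_column)

# ... and then reused unchanged on both training and test data.
train_filled = impute(train_column, train_mean)
test_filled = impute(test_column, train_mean)

print(train_mean, train_filled, test_filled)
```

Fitting the statistic on the training data and storing it with the pipeline is what keeps the test-phase transformation identical to the training-phase one. Recomputing the mean on the test set would silently change the transformation.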
Frequently Asked Questions (2)
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management"?
This paper presents an automated and scalable framework for ML that requires minimum human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that address a distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company.
Q2. What are the future works mentioned in the paper "A scalable and automated machine learning framework to support risk management"?
In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas.