Book Chapter DOI

A scalable and automated machine learning framework to support risk management

22 Feb 2020, pp. 291-307
TL;DR: In this paper, an automated and scalable ML framework that requires minimum human input is presented for the domain of telecommunications risk management. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment.
Abstract: Due to the growth of data and the widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimum human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML) to select and tune the ML models, and distributed ML to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that support distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).

Summary (3 min read)

1 Introduction

  • Nowadays, Machine Learning applications can make use of a great amount of data, complex algorithms, and machines with great processing power to produce effective predictions and forecasts [11].
  • The fact that it is possible to add new processing units enables ML applications to surpass time and memory restrictions [29].
  • The experiments used three real-world datasets from the domain of telecommunications.
  • The main novelty of this extended version is the technological architecture that is presented in Section 6.
  • This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training.

3 Proposed Architecture

  • This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics.
  • Both scalability and automation are central requirements for the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
  • The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimum human input (only the dataset and the target column).
  • Also, since the architecture was developed to work within a cluster with several processing nodes, the users can handle any size of datasets just by managing the number of cluster nodes.

3.1 Phases

  • The proposed architecture assumes two main phases (Fig. 1): a training phase and a testing phase.
  • The only human input needed by the user is the selection of the training dataset and the identification of the target column.
  • When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model.
  • The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions.
  • Performance metrics are also computed and presented to the user.

3.2 Components

  • The proposed architecture includes five main components: task detection, data preprocessing, feature selection, model training (with the usage of AutoML), and pipeline deployment.
  • The applied transformations depend on the data type of the columns, number of levels, and number of missing values.
  • Feature Selection deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods.
  • The component also identifies the best model to be used on the test phase.
  • This module saves the pipeline that will be used on a test set, ensuring that the new data will pass through the same transformations as the training data.

4.1 Experimental Evaluation

  • For the experimental evaluation, the authors first examined the characteristics of the open-source AutoML tools.
  • Then, the authors used the tools that could be implemented in their architecture to perform a benchmark study.
  • In order to be considered for the experimental evaluation, the tools had to implement distributed ML.

4.2 AutoML Tools

  • The authors first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
  • Auto-Sklearn is an AutoML Python library, based on Scikit-Learn [28], that implements methods for automatic algorithm selection and hyperparameter tuning.
  • H2O AutoML uses H2O’s infrastructure to provide functions to automate algorithm selection and hyperparameter optimization [21].
  • Rminer is a package for the R tool that intends to facilitate the use of Machine Learning algorithms.
  • The last two rows are related to the stacking ensembles implemented by H2O AutoML: all, which combines all trained algorithms; and best, which only combines the best algorithm per family.

4.3 Data

  • For the benchmark study, the authors used three real-world datasets from the domain of telecommunications, provided by the IRMDA project analytics company.
  • Table 3 describes each attribute of the churn dataset.
  • The only attributes are the timestamp and the number of events in that interval, as described in Table 4.
  • The dataset contains more than 1 million examples, which correspond to one day of phone calls from one of the company clients.

5.1 Experimental Setup

  • The benchmark consisted of several computational experiments that used three real-world datasets to compare the selected AutoML tools (H2O AutoML and TransmogrifAI).
  • Every AutoML execution implemented a 10-fold cross-validation during the training of the algorithms.
  • The first scenario (1) considered all the attributes of the dataset as input features for the ML algorithms.
  • For event forecasting, the authors transformed the dataset, creating time lags as inputs for a regression task.
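The lag transformation mentioned in the last bullet can be sketched as follows (a simplified, hypothetical illustration; the exact lag window used in the experiments is not specified here):

```python
def make_lagged_dataset(series, n_lags):
    """Turn a univariate time series into a supervised regression
    dataset: each row holds the previous n_lags values as inputs
    and the current value as the target."""
    rows = []
    for i in range(n_lags, len(series)):
        inputs = series[i - n_lags:i]  # lagged values x_{t-n}, ..., x_{t-1}
        target = series[i]             # value to predict, x_t
        rows.append((inputs, target))
    return rows

# Example: event counts per time interval turned into a 3-lag regression task.
counts = [5, 8, 6, 9, 12, 10]
rows = make_lagged_dataset(counts, n_lags=3)
# rows[0] == ([5, 8, 6], 9)
```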

5.2 Discussion

  • The experimental results show that both AutoML tools require a small execution time to select the best ML model, with the highest mean execution time being slightly higher than 7 minutes.
  • The low training time can be justified by the usage of distributed ML, datasets with a small number of rows or columns, and the removal of Deep Learning algorithms.
  • TransmogrifAI obtained the best predictive results in two regression scenarios and two classification scenarios.
  • Nevertheless, H2O AutoML was selected for the model training module, a choice supported by two main reasons.
  • First, H2O AutoML obtained better predictive results for most of the scenarios.

6 Technological Architecture

  • After the comparative ML experiments, the analytics company selected the H2O AutoML tool for the model training component.
  • The remaining technological modules were then designed in cooperation with the company.
  • Given that H2O can be integrated with Apache Spark (using the Sparkling Water module) and that Spark provides functions for data processing, the authors relied on Spark’s Application Programming Interface (API) functions to implement the remaining components of the architecture.
  • The updated architecture, with references to the technologies used, is illustrated in Fig. 2.

6.1 Components

  • This subsection describes the current implementation of each module of the architecture.
  • Some changes were made to the initially proposed design, resulting from feedback received from the analytics company or from technological restrictions.
  • Currently, the preprocessing transformations (e.g., dealing with missing data, the encoding of categorical features, standardization of numerical features) are done using Apache Spark’s functions for extracting, transforming and selecting features [1].
  • For missing data, an imputation function replaces the unknown values of a column with its mean value.
  • For classification (binary or multi-class) and regression tasks, the authors use H2O AutoML to automatically find and tune the best model.

6.2 API

  • In order to facilitate the execution of the architecture, the authors also created a REST API to mediate the communication between the end-users and the pipelines.
  • Since the execution of each request consists of one Apache Spark job (using H2O’s capabilities through the Sparkling Water module), the API works as an intermediary between the end-user and the execution of the code inside Spark.
  • The server formats the response to the appropriate format (e.g., XML, JSON) and sends the response to the client interface.
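The response-formatting step can be illustrated with a small helper (the function name and response schema are assumptions for illustration, not the project's actual API):

```python
import json
import xml.etree.ElementTree as ET

def format_predictions(predictions, fmt="json"):
    """Serialize a list of predictions into the format requested by the
    client (JSON or XML), as the API server does before responding."""
    if fmt == "json":
        return json.dumps({"predictions": predictions})
    if fmt == "xml":
        root = ET.Element("predictions")
        for p in predictions:
            ET.SubElement(root, "prediction").text = str(p)
        return ET.tostring(root, encoding="unicode")
    raise ValueError("unsupported format: " + fmt)

print(format_predictions([0.1, 0.9]))           # {"predictions": [0.1, 0.9]}
print(format_predictions([0.1, 0.9], "xml"))
```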

7 Conclusions

  • This paper proposes a ML framework to automate the typical workflow of supervised ML applications with minimal human input.
  • The framework was developed within project IRMDA, an R&D project developed by a leading Portuguese software and analytics company that provides services for the domain of telecommunications risk management.
  • In order to assess the most appropriate AutoML tools for this model training module, the authors initially conducted a benchmark experiment.
  • The authors selected technologies with distributed capabilities for the remaining modules of the initially proposed framework.
  • In addition, the authors intend to add more ML tasks to the framework, such as ordinal classification, multi-target regression, or multivariate time series.


A Scalable and Automated Machine Learning Framework to Support Risk Management

Luís Ferreira 1,2 [0000-0002-4790-5128], André Pilastri 2 [0000-0002-4380-3220], Carlos Martins 3 [0000-0002-0678-4868], Pedro Santos 3 [0000-0002-4269-5838], and Paulo Cortez 2 [0000-0002-7991-2090]

1 EPMQ - IT Engineering Maturity and Quality Lab, CCG ZGDV Institute, Guimarães, Portugal
{luis.ferreira, andre.pilastri}@ccg.pt
2 ALGORITMI Centre, Dep. Information Systems, University of Minho, Guimarães, Portugal
pcortez@dsi.uminho.pt
3 WeDo Technologies, Braga, Portugal
{pedro.santos, carlos.mmartins}@mobileum.com
Keywords: Automated Machine Learning · Distributed Machine Learning · Supervised Learning · Risk Management.

1 Introduction
Nowadays, Machine Learning applications can make use of a great amount of
data, complex algorithms, and machines with great processing power to produce
effective predictions and forecasts [11]. Currently, two of the most important
features of real-world ML applications are distributed learning and AutoML.
Distributed learning is particularly useful for ML applications in the context of
Big Data or when there are hardware constraints. Distributed learning consists
of using multiple machines or processors to process parts of the ML algorithm
or parts of the data. The fact that it is possible to add new processing units
enables ML applications to surpass time and memory restrictions [29]. AutoML
intends to allow people that are not experts in ML to efficiently choose and
apply ML algorithms. AutoML is particularly relevant since there is a growing
number of non-specialists working with ML [31]. It is also important for real-
world applications that require constant updates to ML models.
In this paper, we propose a technological architecture that addresses these
two ML challenges. The architecture was adapted to the area of telecommunica-
tions risk management, which is a domain that mostly uses supervised learning
algorithms (e.g., for churn prediction). Moreover, the ML models are constantly
updated by people that are not experts in ML and may involve Big Data. Thus,
the proposed architecture delineates a set of steps to automate the typical work-
flow of a ML application that uses supervised learning. The architecture includes
modules for task detection, data preprocessing, feature selection, model training,
and deployment.
The focus of this work is the model training module of the architecture,
which was designed to use a distributed AutoML tool. In order to select the
ML tool for this module, we initially evaluated the characteristics of eight open-
source AutoML tools (Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O
AutoML, Rminer, TPOT, and TransmogrifAI). We then performed a benchmark
to compare the two tools that allowed a distributed execution (H2O AutoML
and TransmogrifAI). The experiments used three real-world datasets from the
domain of telecommunications. These datasets were related to churn (regression),
event forecasting (time series), and fraud detection (binary classification).
This paper consists of an extended version of our previous work [14]. The
main novelty of this extended version is the technological architecture that is
presented in Section 6. This section describes the particular technologies that
were used to implement the components of the proposed AutoML distributed
framework apart from model training. Also, this section describes the REST
API that was developed to mediate the communication between the end-users
and the proposed framework.
The paper is organized as follows. Section 2 presents the related work. In
Section 3, we detail the proposed ML architecture. Next, Section 4 describes the
analyzed AutoML technologies and the datasets used during the experimental
tests. Then, Section 5 discusses the experimental results. Section 6 details the
technological architecture. Finally, Section 7 presents the main conclusions and
future work directions.

2 Related Work
In a Big Data context, it is critical to create and use scalable ML algorithms
to face the common constraints of memory and time [29]. To address that concern,
classical distributed ML distributes the work among different processors, each
performing part of the algorithm. Another current ML problem concerns the
choice of ML algorithms and hyperparameters for a given task. For ML experts,
this selection of algorithms and hyperparameters may use domain knowledge or
heuristics, but it is not an easy task for non-ML-experts. AutoML was developed
to combat this relevant issue [22]. AutoML can be described as
the search for the best algorithm and hyperparameters for a given dataset with
minimum human input.
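In essence, this search can be sketched as a loop over candidate algorithms and hyperparameter settings that keeps the configuration with the best validation score (a deliberately minimal toy illustration of the concept, not the implementation of any particular AutoML tool):

```python
def automl_search(candidates, train, valid, score):
    """Evaluate every (algorithm, hyperparameter) combination and
    return the one with the best validation score."""
    best = None
    for fit, grid in candidates:
        for params in grid:
            model = fit(train, **params)   # train with this configuration
            s = score(model, valid)        # evaluate on validation data
            if best is None or s > best[0]:
                best = (s, fit.__name__, params, model)
    return best

# Toy candidate: a "model" that always predicts a constant, tuned over it.
def constant_model(train, value):
    return lambda x: value

def accuracy(model, valid):
    return sum(model(x) == y for x, y in valid) / len(valid)

valid = [(0, 1), (1, 1), (2, 0)]
best = automl_search([(constant_model, [{"value": 0}, {"value": 1}])],
                     train=[], valid=valid, score=accuracy)
# best configuration: value=1, with validation accuracy 2/3
```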
In recent years, a large number of AutoML tools have been developed, such as Auto-Gluon [3], Auto-Keras [23], Auto-Sklearn [15], Auto-Weka [24], H2O AutoML [21], Rminer [10], TPOT [27], and TransmogrifAI [30]. To our knowledge,
few studies directly compare AutoML tools. Most studies compare one specific
AutoML framework with state-of-the-art ML algorithms [15], do not present
experimental tests [12, 35], or are related to ML automation challenges [18–20].
Recently, some studies focused on experimental comparisons of AutoML
tools. In 2019, [17] and [32] compared a set of AutoML tools using different
datasets and ML tasks. In 2020, a benchmark was conducted using publicly
available datasets from OpenML [33], comparing different types of AutoML
tools, which were grouped by their capabilities [36]. None of the mentioned com-
parison studies considered the distributed ML capability for the AutoML tools.
Furthermore, none of the studies used datasets from the domain of telecommu-
nications risk management, such as churn prediction or fraud detection.
3 Proposed Architecture
This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics. The purpose of the project is to develop a ML system to assist the company’s telecommunications clients. Both scalability and automation are central requirements for the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
The ML technological architecture that is proposed by this work identifies
and automates all typical tasks of a common supervised ML application, with
minimum human input (only the dataset and the target column). Also, since
the architecture was developed to work within a cluster with several processing
nodes, the users can handle any size of datasets just by managing the number
of cluster nodes. The architecture is illustrated in Fig. 1.

Fig. 1. The proposed automated and scalable ML architecture (adapted from [14]).
3.1 Phases
The proposed architecture assumes two main phases (Fig. 1): a training phase
and a testing phase.
Training Phase: The training phase includes the creation of a pipeline instance
and the definition of its stages. The only human input needed by the user is the
selection of the training dataset and the identification of the target column.
Depending on the dataset columns, each module defines a set of stages for the pipeline. Each stage either transforms the data directly or creates a model from the training data that is later used in the test phase to transform the data.
When all stages are defined, the pipeline is fitted to the training data, creating
a pipeline model. Finally, the pipeline model is exported to a file.
Testing Phase: The execution of the testing pipeline assumes the same trans-
formations that were applied to the training data. To execute the testing pipeline
the user only needs to specify the test data and a pipeline model (and a forecasting horizon in the case of a time series forecasting task). The last stage of the
testing pipeline is the application of the best model obtained during training,
generating the predictions. Performance metrics are also computed and presented
to the user.
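As a rough illustration of the two phases, the following pure-Python sketch mimics the fit, export, load, and transform steps (the actual framework uses Spark ML pipelines, as detailed in Section 6; the class names here are illustrative):

```python
import pickle

class Pipeline:
    """Minimal pipeline: each stage is fitted on the training data and
    later re-applied, unchanged, to the test data."""
    def __init__(self, stages):
        self.stages = stages  # objects exposing fit/transform

    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self  # the fitted "pipeline model"

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

class Standardize:
    """Example stage: learns mean/std on the training data only."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        self.std = (sum((x - self.mean) ** 2 for x in data) / len(data)) ** 0.5
    def transform(self, data):
        return [(x - self.mean) / self.std for x in data]

# Training phase: fit the pipeline and export the pipeline model to a file.
model = Pipeline([Standardize()]).fit([1.0, 2.0, 3.0])
with open("pipeline.pkl", "wb") as f:
    pickle.dump(model, f)

# Testing phase: load the pipeline model and apply the same
# transformations (with the training statistics) to new data.
with open("pipeline.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.transform([2.0]))  # [0.0], using the mean learned in training
```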
3.2 Components
The proposed architecture includes five main components: task detection, data
preprocessing, feature selection, model training (with the usage of AutoML),
and pipeline deployment.
Machine Learning Task Detection: Set to detect the ML task of the pipeline
(e.g., classification, regression, time series). This detection is made by analyzing
the number of levels of the target column and the existence (or not) of a time
column.
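A simple version of such detection logic might look like the following (the thresholds and type rules are assumptions for illustration, not the paper's exact criteria):

```python
def detect_task(target_values, has_time_column, max_levels=10):
    """Guess the ML task from the target column and the presence of a
    time column, mirroring the detection rules described above."""
    if has_time_column:
        return "time series"
    levels = set(target_values)
    if len(levels) == 2:
        return "binary classification"
    if len(levels) <= max_levels and all(isinstance(v, (str, bool)) for v in levels):
        return "multi-class classification"
    return "regression"

print(detect_task([0, 1, 1, 0], has_time_column=False))    # binary classification
print(detect_task([3.2, 1.7, 5.9], has_time_column=False)) # regression
```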

Data Preprocessing: Handles missing data, the encoding of categorical features, and the standardization of numerical features. The applied transformations depend on the data type of the columns, the number of levels, and the number of missing values.
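These three kinds of transformations can be sketched in plain Python (the real system relies on Spark's feature transformers, as Section 6 explains; the function names here are illustrative):

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Replace missing values (None) with the column mean."""
    known = [v for v in column if v is not None]
    m = mean(known)
    return [m if v is None else v for v in column]

def one_hot(column):
    """Encode a categorical column as one-hot vectors."""
    levels = sorted(set(column))
    return [[1 if v == level else 0 for level in levels] for v in column]

def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
print(one_hot(["a", "b", "a"]))       # [[1, 0], [0, 1], [1, 0]]
```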
Feature Selection: Deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods. Filtering methods are based on individual correlations between each feature and the target, removing several features that present the lowest correlations [4].
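Such a filter can be sketched as ranking features by their absolute correlation with the target and keeping only the strongest ones (the number of features kept is an illustrative parameter, not a value from the paper):

```python
def pearson(xs, ys):
    """Pearson correlation between two equally sized numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def filter_features(features, target, keep=2):
    """Keep the `keep` features most correlated (in absolute value)
    with the target; drop the rest."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:keep]

features = {
    "f1": [1, 2, 3, 4],  # perfectly correlated with the target
    "f2": [4, 3, 2, 1],  # perfectly anti-correlated (still informative)
    "f3": [1, 9, 2, 8],  # weakly related to the target
}
target = [10, 20, 30, 40]
print(filter_features(features, target))  # ['f1', 'f2']
```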
Model Training: Automatically trains and tunes a set of ML models using a
set of constraints (e.g., time limit, memory usage). The component also identifies
the best model to be used on the test phase.
Pipeline Deployment: Manages the saving and loading of the pipelines to
and from files. This module saves the pipeline that will be used on a test set,
ensuring that the new data will pass through the same transformations as the
training data. Also, the component stores the best model obtained during the
training to make predictions, discarding all other ML models.
4 Materials and Methods
4.1 Experimental Evaluation
For the experimental evaluation, we first examined the characteristics of the
open-source AutoML tools. Then, we used the tools that could be implemented
in our architecture to perform a benchmark study. In order to be considered for the experimental evaluation, the tools had to implement distributed ML.
4.2 AutoML Tools
We first analyzed eight recent open-source AutoML tools, to verify their com-
pliance with the project requirements.
Auto-Gluon: AutoGluon is an open-source AutoML toolkit with a focus on Deep Learning. It is written in Python and runs on the Linux operating system. AutoGluon is divided into four main modules: tabular data, image classification, object detection, and text classification [3]. In this article, only the tabular prediction functionalities are considered.
Auto-Keras: Auto-Keras is a Python library based on Keras [6] that implements AutoML methods with Deep Learning algorithms. The focus of Auto-Keras is the automatic search for Deep Learning architectures and hyperparameters, usually named Neural Architecture Search [13].

References
More filters
Proceedings ArticleDOI
15 Aug 2019
TL;DR: This paper investigates the current state of AutoML tools aiming to automate repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis.
Abstract: There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.

118 citations

Book ChapterDOI
01 Jan 2019
TL;DR: This chapter analyzes the results of a machine learning competition of progressive difficulty, which was followed by a one-round AutoML challenge (PAKDD 2018), and provides details about the datasets, which were not revealed to the participants.
Abstract: The ChaLearn AutoML Challenge (The authors are in alphabetical order of last name, except the first author who did most of the writing and the second author who produced most of the numerical analyses and plots.) (NIPS 2015 – ICML 2016) consisted of six rounds of a machine learning competition of progressive difficulty, subject to limited computational resources. It was followed bya one-round AutoML challenge (PAKDD 2018). The AutoML setting differs from former model selection/hyper-parameter selection challenges, such as the one we previously organized for NIPS 2006: the participants aim to develop fully automated and computationally efficient systems, capable of being trained and tested without human intervention, with code submission. This chapter analyzes the results of these competitions and provides details about the datasets, which were not revealed to the participants. The solutions of the winners are systematically benchmarked over all datasets of all rounds and compared with canonical machine learning algorithms available in scikit-learn. All materials discussed in this chapter (data and code) have been made publicly available at http://automl.chalearn.org/.

113 citations

Proceedings ArticleDOI
12 Jul 2015
TL;DR: The AutoML contest for IJCNN 2015 challenges participants to solve classification and regression problems without any human intervention, and will push the state of the art in fully automatic machine learning on a wide range of real-world problems.
Abstract: ChaLearn is organizing the Automatic Machine Learning (AutoML) contest for IJCNN 2015, which challenges participants to solve classification and regression problems without any human intervention. Participants' code is automatically run on the contest servers to train and test learning machines. However, there is no obligation to submit code; half of the prizes can be won by submitting prediction results only. Datasets of progressively increasing difficulty are introduced throughout the six rounds of the challenge (participants can enter the competition in any round). The rounds alternate phases in which learners are tested on datasets participants have not seen, and phases in which participants have limited time to tweak their algorithms on those datasets to improve performance. This challenge will push the state of the art in fully automatic machine learning on a wide range of real-world problems. The platform will remain available beyond the termination of the challenge.

105 citations

Proceedings Article
04 Dec 2016
TL;DR: This competition contributes to the development of fully automated environments by challenging practitioners to solve problems under specific constraints and sharing their approaches; the platform will remain available for post-challenge submissions at http://codalab.org/AutoML.
Abstract: The ChaLearn AutoML Challenge team conducted a large-scale evaluation of fully automatic, black-box learning machines for feature-based classification and regression problems. The test bed was composed of 30 data sets from a wide variety of application domains and ranged across different types of complexity. Over six rounds, participants succeeded in delivering AutoML software capable of being trained and tested without human intervention. Although improvements can still be made to close the gap between human-tweaked and AutoML models, this competition contributes to the development of fully automated environments by challenging practitioners to solve problems under specific constraints and sharing their approaches; the platform will remain available for post-challenge submissions at http://codalab.org/AutoML.

68 citations

Frequently Asked Questions (2)
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management" ?

This paper presents an automated and scalable framework for ML that requires minimum human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company.

In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities.