
Book Chapter DOI

A scalable and automated machine learning framework to support risk management

22 Feb 2020, pp. 291-307

Abstract: Due to the growth of data and widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimum human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML), to select and tune the ML models, and distributed ML, to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).


Summary

1 Introduction

  • Nowadays, Machine Learning applications can make use of a great amount of data, complex algorithms, and machines with great processing power to produce effective predictions and forecasts [11].
  • The fact that it is possible to add new processing units enables ML applications to surpass time and memory restrictions [29].
  • The experiments used three real-world datasets from the domain of telecommunications.
  • The main novelty of this extended version is the technological architecture that is presented in Section 6.
  • This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training.

3 Proposed Architecture

  • This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics.
  • Both scalability and automation are central requirements of the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
  • The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimum human input (only the dataset and the target column).
  • Also, since the architecture was developed to work within a cluster with several processing nodes, the users can handle any size of datasets just by managing the number of cluster nodes.

3.1 Phases

  • The proposed architecture assumes two main phases (Fig. 1): a training phase and a testing phase.
  • The only human input needed by the user is the selection of the training dataset and the identification of the target column.
  • When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model.
  • The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions.
  • Performance metrics are also computed and presented to the user.

3.2 Components

  • The proposed architecture includes five main components: task detection, data preprocessing, feature selection, model training (with the usage of AutoML), and pipeline deployment.
  • The applied transformations depend on the data type of the columns, number of levels, and number of missing values.
  • Feature Selection deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods.
  • The component also identifies the best model to be used on the test phase.
  • This module saves the pipeline that will be used on a test set, ensuring that the new data will pass through the same transformations as the training data.

4.1 Experimental Evaluation

  • For the experimental evaluation, the authors first examined the characteristics of the open-source AutoML tools.
  • Then, the authors used the tools that could be implemented in their architecture to perform a benchmark study.
  • In order to be considered for the experimental evaluation, the tools have to implement distributed ML.

4.2 AutoML Tools

  • The authors first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
  • Auto-Sklearn is an AutoML Python library based on Scikit-Learn [28] that implements methods for automatic algorithm selection and hyperparameter tuning.
  • H2O AutoML uses H2O’s infrastructure to provide functions to automate algorithm selection and hyperparameter optimization [21].
  • Rminer is a package for the R tool that aims to facilitate the use of Machine Learning algorithms.
  • The last two rows are related to the stacking ensembles implemented by H2O AutoML: all, which combines all trained algorithms; and best, which only combines the best algorithm per family.
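The stacking idea behind these two ensembles can be illustrated with a minimal sketch (not H2O's actual implementation): a meta-learner combines the base models' predictions, here by solving the 2x2 least-squares normal equations for two blending weights. The base models and data below are purely illustrative.

```python
# Illustrative stacking ensemble: a meta-learner learns how to blend the
# predictions of base models by minimizing squared error on the targets.

def stack_weights(p1, p2, y):
    """Weights (w1, w2) minimizing sum((w1*p1 + w2*p2 - y)^2), closed form."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

# Two imperfect base models: one overshoots, one undershoots the target.
y  = [1.0, 2.0, 3.0, 4.0]
p1 = [v + 0.5 for v in y]
p2 = [v - 0.5 for v in y]
w1, w2 = stack_weights(p1, p2, y)
blended = [w1 * a + w2 * b for a, b in zip(p1, p2)]
print(round(w1, 3), round(w2, 3))  # → 0.5 0.5 (the errors cancel exactly)
```

With symmetric base errors the meta-learner recovers equal weights and the blend reproduces the targets exactly, which is the intuition behind combining "all" or "best per family" models.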

4.3 Data

  • For the benchmark study, the authors used three real-world datasets from the domain of telecommunications, provided by the IRMDA project analytics company.
  • Table 3 describes each attribute of the churn dataset.
  • The only attributes are the timestamp and the number of events in that interval, as described in Table 4.
  • The dataset contains more than 1 million examples, which correspond to one day of phone calls from one of the company clients.

5.1 Experimental Setup

  • The benchmark consisted of several computational experiments that used three real-world datasets to compare the selected AutoML tools (H2O AutoML and TransmogrifAI).
  • Every AutoML execution implemented a 10-fold cross-validation during the training of the algorithms.
  • The first scenario (1) considered all the attributes of the dataset as input features for the ML algorithms.
  • For event forecasting, the authors transformed the dataset, creating time lags as inputs for a regression task.
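The time-lag transformation mentioned in the last bullet can be sketched as follows; the lag count and example series are illustrative, not the paper's actual configuration.

```python
# Sketch of the time-lag transformation: a univariate event-count series
# becomes a tabular regression dataset where each row's inputs are the
# n_lags previous values and the target is the current value.

def make_lags(series, n_lags):
    """Return (rows of lagged inputs, targets) for a regression task."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])  # the n_lags previous observations
        y.append(series[i])             # the value to predict
    return X, y

events = [3, 5, 4, 6, 8, 7]
X, y = make_lags(events, n_lags=2)
# X[0] == [3, 5] predicts y[0] == 4, and so on
```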

5.2 Discussion

  • The experimental results show that both AutoML tools require a small execution time to select the best ML model, with the highest mean execution time being slightly higher than 7 minutes.
  • The low training time can be explained by the use of distributed ML, datasets with a small number of rows or columns, and the removal of Deep Learning algorithms.
  • TransmogrifAI obtained the best predictive results in two regression scenarios and two classification scenarios.
  • Nevertheless, H2O AutoML was selected for the model training module. This choice was supported by two main reasons.
  • First, H2O AutoML obtained better predictive results for most of the scenarios.

6 Technological Architecture

  • After the comparative ML experiments, the analytics company selected the H2O AutoML tool for the model training component.
  • The remaining technological modules were then designed in cooperation with the company.
  • Given that H2O can be integrated with Apache Spark (using the Sparkling Water module) and that Spark provides functions for data processing, the authors relied on Spark’s Application Programming Interface (API) functions to implement the remaining components of the architecture.
  • The updated architecture, with references to the technologies used, is illustrated in Fig. 2.

6.1 Components

  • This subsection describes the current implementation of each module of the architecture.
  • Some changes were made relative to the initially proposed architecture, due to feedback received from the analytics company or to technological restrictions.
  • Data Preprocessing: currently, the preprocessing transformations (e.g., dealing with missing data, the encoding of categorical features, standardization of numerical features) are done using Apache Spark’s functions for extracting, transforming, and selecting features [1].
  • This function replaces the unknown values of a column with its mean value.
  • For classification (binary or multi-class) and regression tasks, the authors use H2O AutoML to automatically find and tune the best model.
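The mean-replacement rule described above can be sketched in plain Python; the actual framework uses Spark's feature-transformation functions, so this is only an illustration of the rule itself.

```python
# Mean imputation: replace missing entries (None) in a numeric column
# with the mean of the known values of that same column.

def impute_mean(column):
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

col = [4.0, None, 6.0, None, 8.0]
print(impute_mean(col))  # → [4.0, 6.0, 6.0, 6.0, 8.0]
```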

6.2 API

  • In order to facilitate the execution of the architecture, the authors also created a REST API to mediate the communication between the end-users and the pipelines.
  • Since the execution of each request consists of one Apache Spark job (using H2O’s capabilities through the Sparkling Water module), the API works as an intermediary between the end-user and the execution of the code inside Spark.
  • The server formats the response to the appropriate format (e.g., XML, JSON) and sends the response to the client interface.
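As a rough sketch of that formatting step, the same response payload can be serialized as JSON or XML using only the Python standard library. The field names below are illustrative, not the project's actual schema.

```python
# Serialize one prediction payload as JSON or XML, depending on the
# format the client requested.
import json
import xml.etree.ElementTree as ET

def format_response(payload, fmt="json"):
    if fmt == "json":
        return json.dumps(payload)
    root = ET.Element("response")
    for key, value in payload.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

payload = {"task": "classification", "prediction": "churn"}
print(format_response(payload, "json"))
print(format_response(payload, "xml"))
```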

7 Conclusions

  • This paper proposes a ML framework to automate the typical workflow of supervised ML applications with minimal human input.
  • The framework was developed within project IRMDA, an R&D project of a leading Portuguese software and analytics company that provides services for the domain of telecommunications risk management.
  • In order to assess the most appropriate AutoML tools for this model training module, the authors initially conducted a benchmark experiment.
  • The authors selected technologies with distributed capabilities for the remaining modules of the initially proposed framework.
  • Besides, the authors intend to add more ML tasks to the framework, such as ordinal classification, multi-target regression, or multivariate time series.


A Scalable and Automated Machine Learning
Framework to Support Risk Management
Luís Ferreira 1,2 [0000-0002-4790-5128], André Pilastri 2 [0000-0002-4380-3220],
Carlos Martins 3 [0000-0002-0678-4868], Pedro Santos 3 [0000-0002-4269-5838], and
Paulo Cortez 2 [0000-0002-7991-2090]

1 EPMQ - IT Engineering Maturity and Quality Lab, CCG ZGDV Institute, Guimarães, Portugal
{luis.ferreira, andre.pilastri}@ccg.pt
2 ALGORITMI Centre, Dep. Information Systems, University of Minho, Guimarães, Portugal
pcortez@dsi.uminho.pt
3 WeDo Technologies, Braga, Portugal
{pedro.santos, carlos.mmartins}@mobileum.com
Keywords: Automated Machine Learning · Distributed Machine Learning · Supervised Learning · Risk Management.

1 Introduction
Nowadays, Machine Learning applications can make use of a great amount of
data, complex algorithms, and machines with great processing power to produce
effective predictions and forecasts [11]. Currently, two of the most important
features of real-world ML applications are distributed learning and AutoML.
Distributed learning is particularly useful for ML applications in the context of
Big Data or when there are hardware constraints. Distributed learning consists
of using multiple machines or processors to process parts of the ML algorithm
or parts of the data. The fact that it is possible to add new processing units
enables ML applications to surpass time and memory restrictions [29]. AutoML
intends to allow people that are not experts in ML to efficiently choose and
apply ML algorithms. AutoML is particularly relevant since there is a growing
number of non-specialists working with ML [31]. It is also important for real-
world applications that require constant updates to ML models.
In this paper, we propose a technological architecture that addresses these
two ML challenges. The architecture was adapted to the area of telecommunications risk management, which is a domain that mostly uses supervised learning
algorithms (e.g., for churn prediction). Moreover, the ML models are constantly
updated by people that are not experts in ML and may involve Big Data. Thus,
the proposed architecture delineates a set of steps to automate the typical workflow of a ML application that uses supervised learning. The architecture includes
modules for task detection, data preprocessing, feature selection, model training,
and deployment.
The focus of this work is the model training module of the architecture,
which was designed to use a distributed AutoML tool. In order to select the
ML tool for this module, we initially evaluated the characteristics of eight open-
source AutoML tools (Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O
AutoML, Rminer, TPOT, and TransmogrifAI). We then performed a benchmark
to compare the two tools that allowed a distributed execution (H2O AutoML
and TransmogrifAI). The experiments used three real-world datasets from the
domain of telecommunications. These datasets were related to churn (regression),
event forecasting (time series), and fraud detection (binary classification).
This paper consists of an extended version of our previous work [14]. The
main novelty of this extended version is the technological architecture that is
presented in Section 6. This section describes the particular technologies that
were used to implement the components of the proposed AutoML distributed
framework apart from model training. Also, this section describes the REST
API that was developed to mediate the communication between the end-users
and the proposed framework.
The paper is organized as follows. Section 2 presents the related work. In
Section 3, we detail the proposed ML architecture. Next, Section 4 describes the
analyzed AutoML technologies and the datasets used during the experimental
tests. Then, Section 5 discusses the experimental results. Section 6 details the
technological architecture. Finally, Section 7 presents the main conclusions and
future work directions.

2 Related Work
In a Big Data context, it is critical to create and use scalable ML algorithms
to face the common constraints of memory and time [29]. To face that concern,
classical distributed ML distributes the work among different processors, each
performing part of the algorithm. Another current ML problem concerns the
choice of ML algorithms and hyperparameters for a given task. For ML experts,
this selection of algorithms and hyperparameters may use domain knowledge or
heuristics, but it is not an easy task for non-ML-experts. AutoML was developed
to address this issue [22]. AutoML can be defined as the search for the best
algorithm and hyperparameters for a given dataset with minimum human input.
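A toy illustration of this definition, assuming two hypothetical candidate algorithms and a holdout validation split; none of this reflects the tools benchmarked later in the paper.

```python
# Minimal sketch of the AutoML idea: search over candidate algorithms and
# keep the one with the lowest validation error, with no human in the loop.

def mean_model(xs, ys):
    """'Train' a constant predictor (the mean of the targets)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def linear_model(xs, ys):
    """Train a simple least-squares line y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def automl_search(train, valid, candidates):
    """Fit each candidate and return the one with the lowest validation MAE."""
    xs, ys = zip(*train)
    vx, vy = zip(*valid)
    best, best_err = None, float("inf")
    for name, fit in candidates.items():
        model = fit(xs, ys)
        err = sum(abs(model(x) - y) for x, y in zip(vx, vy)) / len(vx)
        if err < best_err:
            best, best_err = (name, model), err
    return best, best_err

train = [(x, 2 * x + 1) for x in range(10)]
valid = [(x, 2 * x + 1) for x in range(10, 15)]
(name, model), err = automl_search(train, valid,
                                   {"mean": mean_model, "linear": linear_model})
print(name, err)  # the linear candidate is selected on linear data
```

Real AutoML tools extend this loop with hyperparameter optimization, time and memory budgets, and ensembling, but the selection principle is the same.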
In recent years, a large number of AutoML tools was developed, such as Auto-
Gluon [3], Auto-Keras [23], Auto-Sklearn [15], Auto-Weka [24], H2O AutoML
[21], Rminer [10], TPOT [27], and TransmogrifAI [30]. Within our knowledge,
few studies directly compare AutoML tools. Most studies compare one specific
AutoML framework with state-of-the-art ML algorithms [15], do not present
experimental tests [12, 35], or are related to ML automation challenges [18–20].
Recently, some studies focused on experimental comparisons of AutoML
tools. In 2019, [17] and [32] compare a set of AutoML tools using different
datasets and ML tasks. In 2020, a benchmark was conducted using publicly
available datasets from OpenML [33], comparing different types of AutoML
tools, which were grouped by their capabilities [36]. None of the mentioned comparison studies considered the distributed ML capability of the AutoML tools. Furthermore, none of the studies used datasets from the domain of telecommunications risk management, such as churn prediction or fraud detection.
3 Proposed Architecture
This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area
of software and analytics. The purpose of the project is to develop a ML system
to assist the company's telecommunications clients. Both scalability and automation are central requirements of the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
The ML technological architecture that is proposed by this work identifies
and automates all typical tasks of a common supervised ML application, with
minimum human input (only the dataset and the target column). Also, since
the architecture was developed to work within a cluster with several processing
nodes, the users can handle any size of datasets just by managing the number
of cluster nodes. The architecture is illustrated in Fig. 1.

Fig. 1. The proposed automated and scalable ML architecture (adapted from [14]).
3.1 Phases
The proposed architecture assumes two main phases (Fig. 1): a training phase
and a testing phase.
Training Phase: The training phase includes the creation of a pipeline instance
and the definition of its stages. The only human input needed by the user is the
selection of the training dataset and the identification of the target column.
Depending on the dataset columns, each module defines a set of stages for
the pipeline. Each stage either transforms the data directly or fits a model on
the training data that is later used in the test phase to transform the data.
When all stages are defined, the pipeline is fitted to the training data, creating
a pipeline model. Finally, the pipeline model is exported to a file.
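The fit/transform contract described above can be sketched as follows; the class and method names are illustrative and deliberately simpler than Spark's pipeline API.

```python
# Sketch of the training-phase contract: stages learn statistics on fit,
# and the fitted pipeline reuses those statistics to transform new data.

class Standardize:
    """A stage that learns statistics on fit and reuses them on transform."""
    def fit(self, values):
        self.mean = sum(values) / len(values)
        self.std = (sum((v - self.mean) ** 2 for v in values) / len(values)) ** 0.5
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self  # the fitted "pipeline model"

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = Pipeline([Standardize()]).fit([1.0, 2.0, 3.0])
# New (test) data passes through the statistics learned at training time.
print(model.transform([2.0]))  # → [0.0]
```

Exporting the fitted pipeline to a file then amounts to serializing `model` with its learned statistics, as the deployment component does.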
Testing Phase: The execution of the testing pipeline assumes the same trans-
formations that were applied to the training data. To execute the testing pipeline
the user only needs to specify the test data and a pipeline model (and a forecasting horizon in the case of a time series forecasting task). The last stage of the
testing pipeline is the application of the best model obtained during training,
generating the predictions. Performance metrics are also computed and presented
to the user.
3.2 Components
The proposed architecture includes five main components: task detection, data
preprocessing, feature selection, model training (with the usage of AutoML),
and pipeline deployment.
Machine Learning Task Detection: Detects the ML task of the pipeline
(e.g., classification, regression, time series). This detection is made by analyzing
the number of levels of the target column and the existence (or not) of a time
column.
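A minimal sketch of this detection rule, where the level threshold is an assumed value for illustration (the paper does not state one):

```python
# Task detection: inspect the target column's number of distinct levels
# and whether a time column exists.

def detect_task(target, has_time_column, max_class_levels=10):
    if has_time_column:
        return "time series"
    n_levels = len(set(target))
    return "classification" if n_levels <= max_class_levels else "regression"

assert detect_task([0, 1, 1, 0], has_time_column=False) == "classification"
assert detect_task(list(range(100)), has_time_column=False) == "regression"
assert detect_task([3, 5, 4], has_time_column=True) == "time series"
```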

Data Preprocessing: Handles missing data, the encoding of categorical features, and the standardization of numerical features. The applied transformations depend on the data type of the columns, the number of levels, and the number of missing values.
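This dispatch by column type, level count, and missing values can be sketched as follows; the threshold and transformation names are assumptions for illustration, not the framework's actual rules.

```python
# Choose a column's preprocessing steps from its type, number of levels,
# and number of missing values.

def choose_transformations(dtype, n_levels, n_missing, max_onehot_levels=20):
    steps = []
    if n_missing > 0:
        steps.append("impute_mean" if dtype == "numeric" else "impute_mode")
    if dtype == "categorical":
        steps.append("one_hot" if n_levels <= max_onehot_levels else "index_encode")
    else:
        steps.append("standardize")
    return steps

print(choose_transformations("numeric", n_levels=0, n_missing=3))
print(choose_transformations("categorical", n_levels=50, n_missing=0))
```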
Feature Selection: Deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods. Filtering methods are based on individual correlations between each feature and the target, removing several features that present the lowest correlations [4].
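A minimal sketch of such a correlation filter, assuming Pearson correlation and an illustrative number of dropped features:

```python
# Filter-based feature selection: rank features by absolute Pearson
# correlation with the target and drop the weakest ones.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def filter_features(features, target, n_drop):
    """features: {name: column}. Drop the n_drop weakest-correlated names."""
    ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)))
    return [f for f in features if f not in set(ranked[:n_drop])]

target = [1.0, 2.0, 3.0, 4.0]
features = {
    "useful": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the target
    "noise":  [5.0, 1.0, 4.0, 2.0],   # weakly correlated
}
print(filter_features(features, target, n_drop=1))  # → ['useful']
```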
Model Training: Automatically trains and tunes a set of ML models using a
set of constraints (e.g., time limit, memory usage). The component also identifies
the best model to be used on the test phase.
Pipeline Deployment: Manages the saving and loading of the pipelines to
and from files. This module saves the pipeline that will be used on a test set,
ensuring that the new data will pass through the same transformations as the
training data. Also, the component stores the best model obtained during the
training to make predictions, discarding all other ML models.
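A rough sketch of this save/load contract using pickle; the real framework exports Spark pipeline models, and the names here are illustrative.

```python
# Persist a fitted pipeline to a file and restore it for the test phase,
# so new data passes through the same transformations as the training data.
import os
import pickle
import tempfile

def save_pipeline(pipeline, path):
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)

def load_pipeline(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Only the best model is kept for deployment; the rest are discarded.
fitted = {"stages": ["impute", "standardize"], "best_model": "gbm_3"}
path = os.path.join(tempfile.mkdtemp(), "pipeline.bin")
save_pipeline(fitted, path)
restored = load_pipeline(path)
print(restored["best_model"])  # → gbm_3
```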
4 Materials and Methods
4.1 Experimental Evaluation
For the experimental evaluation, we first examined the characteristics of the
open-source AutoML tools. Then, we used the tools that could be implemented
in our architecture to perform a benchmark study. In order to be considered for
the experimental evaluation, the tools have to implement distributed ML.
4.2 AutoML Tools
We first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
Auto-Gluon: AutoGluon is an open-source AutoML toolkit with a focus on Deep Learning. It is written in Python and runs on the Linux operating system. AutoGluon is divided into four main modules: tabular data, image classification, object detection, and text classification [3]. In this article, only the tabular prediction functionalities are considered.
Auto-Keras: Auto-Keras is a Python library based on Keras [6] that implements AutoML methods with Deep Learning algorithms. The focus of Auto-Keras is the automatic search for Deep Learning architectures and hyperparameters, usually named Neural Architecture Search [13].

References

  • Pedregosa, F., et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
  • Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artificial Intelligence 97(1–2), 245–271 (1997)

Frequently Asked Questions
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management" ?

This paper presents an automated and scalable framework for ML that requires minimum human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that address a distributed ML ( H2O AutoML and TransmogrifAI ). The experiments used three real-world datasets from the telecommunications domain ( churn, event forecasting, and fraud detection ), as provided by an analytics company. 

In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities.