Open Access Book Chapter (DOI)

A scalable and automated machine learning framework to support risk management


A Scalable and Automated Machine Learning Framework to Support Risk Management

Luís Ferreira 1,2 [0000-0002-4790-5128], André Pilastri 2 [0000-0002-4380-3220], Carlos Martins 3 [0000-0002-0678-4868], Pedro Santos 3 [0000-0002-4269-5838], and Paulo Cortez 2 [0000-0002-7991-2090]

1 EPMQ - IT Engineering Maturity and Quality Lab, CCG ZGDV Institute, Guimarães, Portugal
{luis.ferreira, andre.pilastri}@ccg.pt
2 ALGORITMI Centre, Dep. Information Systems, University of Minho, Guimarães, Portugal
pcortez@dsi.uminho.pt
3 WeDo Technologies, Braga, Portugal
{pedro.santos, carlos.mmartins}@mobileum.com
Abstract. Due to the growth of data and the widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimum human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML), to select and tune the ML models, and distributed ML, to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).

Keywords: Automated Machine Learning · Distributed Machine Learning · Supervised Learning · Risk Management.

1 Introduction
Nowadays, Machine Learning applications can make use of a great amount of data, complex algorithms, and machines with great processing power to produce effective predictions and forecasts [11]. Currently, two of the most important features of real-world ML applications are distributed learning and AutoML. Distributed learning is particularly useful for ML applications in the context of Big Data or when there are hardware constraints. Distributed learning consists of using multiple machines or processors to process parts of the ML algorithm or parts of the data. Since it is possible to add new processing units, ML applications can surpass time and memory restrictions [29]. AutoML intends to allow people who are not experts in ML to efficiently choose and apply ML algorithms. AutoML is particularly relevant since there is a growing number of non-specialists working with ML [31]. It is also important for real-world applications that require constant updates to ML models.
In this paper, we propose a technological architecture that addresses these two ML challenges. The architecture was adapted to the area of telecommunications risk management, which is a domain that mostly uses supervised learning algorithms (e.g., for churn prediction). Moreover, the ML models are constantly updated by people who are not experts in ML and may involve Big Data. Thus, the proposed architecture delineates a set of steps to automate the typical workflow of an ML application that uses supervised learning. The architecture includes modules for task detection, data preprocessing, feature selection, model training, and deployment.
The focus of this work is the model training module of the architecture, which was designed to use a distributed AutoML tool. In order to select the ML tool for this module, we initially evaluated the characteristics of eight open-source AutoML tools (Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI). We then performed a benchmark to compare the two tools that allowed a distributed execution (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the domain of telecommunications. These datasets were related to churn (regression), event forecasting (time series), and fraud detection (binary classification).
This paper consists of an extended version of our previous work [14]. The main novelty of this extended version is the technological architecture that is presented in Section 6. This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training. Also, this section describes the REST API that was developed to mediate the communication between the end-users and the proposed framework.
The paper is organized as follows. Section 2 presents the related work. In Section 3, we detail the proposed ML architecture. Next, Section 4 describes the analyzed AutoML technologies and the datasets used during the experimental tests. Then, Section 5 discusses the experimental results. Section 6 details the technological architecture. Finally, Section 7 presents the main conclusions and future work directions.

2 Related Work
In a Big Data context, it is critical to create and use scalable ML algorithms to face the common constraints of memory and time [29]. To address that concern, classical distributed ML distributes the work among different processors, each performing part of the algorithm. Another current ML problem concerns the choice of ML algorithms and hyperparameters for a given task. ML experts may use domain knowledge or heuristics for this selection, but it is not an easy task for non-ML-experts. AutoML was developed to address this relevant issue [22]. AutoML can be defined as the search for the best algorithm and hyperparameters for a given dataset with minimum human input.
In recent years, a large number of AutoML tools has been developed, such as Auto-Gluon [3], Auto-Keras [23], Auto-Sklearn [15], Auto-Weka [24], H2O AutoML [21], Rminer [10], TPOT [27], and TransmogrifAI [30]. To the best of our knowledge, few studies directly compare AutoML tools. Most studies compare one specific AutoML framework with state-of-the-art ML algorithms [15], do not present experimental tests [12, 35], or are related to ML automation challenges [18–20].
Recently, some studies focused on experimental comparisons of AutoML tools. In 2019, [17] and [32] compared a set of AutoML tools using different datasets and ML tasks. In 2020, a benchmark was conducted using publicly available datasets from OpenML [33], comparing different types of AutoML tools, which were grouped by their capabilities [36]. None of the mentioned comparison studies considered the distributed ML capability of the AutoML tools. Furthermore, none of the studies used datasets from the domain of telecommunications risk management, such as churn prediction or fraud detection.
3 Proposed Architecture
This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics. The purpose of the project is to develop an ML system to assist the company's telecommunications clients. Both scalability and automation are central requirements for the ML system, since the company has many clients that hold diverse amounts of data (large or small) and that are typically non-ML-experts.
The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimum human input (only the dataset and the target column). Also, since the architecture was developed to work within a cluster with several processing nodes, the users can handle datasets of any size just by managing the number of cluster nodes. The architecture is illustrated in Fig. 1.

Fig. 1. The proposed automated and scalable ML architecture (adapted from [14]).
3.1 Phases
The proposed architecture assumes two main phases (Fig. 1): a training phase
and a testing phase.
Training Phase: The training phase includes the creation of a pipeline instance and the definition of its stages. The only human input needed from the user is the selection of the training dataset and the identification of the target column. Depending on the dataset columns, each module defines a set of stages for the pipeline. Each stage either transforms the data directly or creates a model from the training data that is later used, in the test phase, to transform new data. When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model. Finally, the pipeline model is exported to a file.
Testing Phase: The execution of the testing pipeline applies the same transformations that were applied to the training data. To execute the testing pipeline, the user only needs to specify the test data and a pipeline model (and a forecasting horizon, in the case of a time series forecasting task). The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions. Performance metrics are also computed and presented to the user.
3.2 Components
The proposed architecture includes five main components: task detection, data
preprocessing, feature selection, model training (with the usage of AutoML),
and pipeline deployment.
Machine Learning Task Detection: Detects the ML task of the pipeline (e.g., classification, regression, time series). This detection is made by analyzing the number of levels of the target column and the existence (or not) of a time column.
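As an illustration, this detection rule can be sketched as follows (the function name and the level threshold are illustrative assumptions, not the framework's actual code):

```python
def detect_task(target_values, has_time_column=False, max_levels=10):
    """Infer the ML task from the target column.

    Heuristic: a time column implies time series forecasting; otherwise,
    a target with few distinct levels implies classification, and a
    target with many distinct levels implies regression. The max_levels
    threshold is an assumed cut-off, not the framework's actual value.
    """
    if has_time_column:
        return "time_series"
    if len(set(target_values)) <= max_levels:
        return "classification"
    return "regression"
```

For example, a binary target such as `[0, 1, 1, 0]` would be detected as classification, while a target with hundreds of distinct numeric values would be detected as regression.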

A Scalable and Automated ML Framework 5
Data Preprocessing: Handles missing data, the encoding of categorical features, and the standardization of numerical features. The applied transformations depend on the data type of the columns, the number of levels, and the number of missing values.
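A minimal sketch of such column-wise preprocessing is shown below; the specific imputation and encoding choices (mean imputation, sentinel level, integer encoding) are illustrative assumptions, not necessarily those of the framework:

```python
def preprocess_column(values):
    """Preprocess a single column (None marks a missing value):
    numeric columns are mean-imputed and standardized; categorical
    columns are imputed with a sentinel level and integer-encoded."""
    non_missing = [v for v in values if v is not None]
    if non_missing and all(isinstance(v, (int, float)) for v in non_missing):
        mean = sum(non_missing) / len(non_missing)
        filled = [v if v is not None else mean for v in values]
        var = sum((v - mean) ** 2 for v in filled) / len(filled)
        std = var ** 0.5 or 1.0  # avoid dividing by zero for constant columns
        return [(v - mean) / std for v in filled]
    # categorical branch: impute with a sentinel level, then integer-encode
    filled = [v if v is not None else "__missing__" for v in values]
    mapping = {level: i for i, level in enumerate(sorted(set(filled)))}
    return [mapping[v] for v in filled]
```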
Feature Selection: Deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods. Filtering methods are based on individual correlations between each feature and the target, removing several features that present the lowest correlations [4].
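As an illustration, a correlation-based filter can be sketched as follows (the use of the Pearson coefficient and the `keep` parameter are illustrative choices; the framework's exact filter may differ):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def filter_features(features, target, keep):
    """Keep the `keep` feature names with the highest absolute
    correlation to the target, discarding the rest."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:keep]
```

A feature whose values move with (or against) the target survives the filter, while a weakly correlated feature is dropped before model training.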
Model Training: Automatically trains and tunes a set of ML models using a set of constraints (e.g., time limit, memory usage). The component also identifies the best model, to be used in the test phase.
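The train-tune-select loop can be illustrated with a deliberately simplified sketch, in which each candidate model is fitted on a training split and the one with the lowest holdout error is kept. The candidate models and the MAE metric below are illustrative; in the actual framework this search is delegated to an AutoML tool (H2O AutoML):

```python
def mean_model(xs, ys):
    """Baseline candidate: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def linear_model(xs, ys):
    """Candidate: least-squares fit of y = a*x + b on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs) or 1.0
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    b = my - a * mx
    return lambda x: a * x + b


def automl_select(xs, ys, candidates):
    """Minimal AutoML-style loop: fit every candidate on the first half
    of the data, score each on the held-out second half with mean
    absolute error, and return the best fitted model."""
    cut = len(xs) // 2
    tr_x, tr_y = xs[:cut], ys[:cut]
    va_x, va_y = xs[cut:], ys[cut:]

    def mae(model):
        return sum(abs(model(x) - y) for x, y in zip(va_x, va_y)) / len(va_x)

    return min((c(tr_x, tr_y) for c in candidates), key=mae)
```

On data following y = 2x + 1, the linear candidate wins the holdout comparison and becomes the "best model" that the deployment component would keep.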
Pipeline Deployment: Manages the saving and loading of pipelines to and from files. This module saves the pipeline that will be used on a test set, ensuring that the new data passes through the same transformations as the training data. Also, the component stores the best model obtained during training to make predictions, discarding all other ML models.
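As a sketch, saving and loading a fitted pipeline can be done with Python's pickle module; the dictionary layout of the pipeline model below is an illustrative assumption, used only to keep the example self-contained (the paper's distributed setting would instead serialize the fitted pipeline through its own persistence mechanism):

```python
import os
import pickle
import tempfile


def save_pipeline(pipeline_model, path):
    """Persist the fitted pipeline model (stages plus the single best
    model kept after training) so the test phase can reload it."""
    with open(path, "wb") as fh:
        pickle.dump(pipeline_model, fh)


def load_pipeline(path):
    """Reload a previously exported pipeline model from a file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```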
4 Materials and Methods
4.1 Experimental Evaluation
For the experimental evaluation, we first examined the characteristics of the open-source AutoML tools. Then, we used the tools that could be implemented in our architecture to perform a benchmark study. In order to be considered for the experimental evaluation, a tool had to implement distributed ML.
4.2 AutoML Tools
We first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
Auto-Gluon: AutoGluon is an open-source AutoML toolkit with a focus on Deep Learning. It is written in Python and runs on the Linux operating system. AutoGluon is divided into four main modules: tabular data, image classification, object detection, and text classification [3]. In this article, only the tabular prediction functionalities are considered.
Auto-Keras: Auto-Keras is a Python library based on Keras [6] that implements AutoML methods with Deep Learning algorithms. The focus of Auto-Keras is the automatic search for Deep Learning architectures and hyperparameters, usually named Neural Architecture Search [13].

References
Auto-WEKA 2.0: Automatic Model Selection and Hyperparameter Optimization in WEKA (book chapter).
Auto-Keras: An Efficient Neural Architecture Search System (conference paper).
Initializing Bayesian Hyperparameter Optimization via Meta-Learning (conference paper).
The Impact of Microblogging Data for Stock Market Prediction: Using Twitter to Predict Returns, Volatility, Trading Volume and Survey Sentiment Indices (journal article).
Automating Biomedical Data Science Through Tree-Based Pipeline Optimization (book chapter).
Frequently Asked Questions (2)
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management"?

This paper presents an automated and scalable framework for ML that requires minimum human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company.

Q2. What future work do the authors suggest?

In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities.