
Book Chapter DOI

A scalable and automated machine learning framework to support risk management

22 Feb 2020, pp. 291-307

Abstract: Due to the growth of data and widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimum human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML), to select and tune the ML models, and distributed ML, to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).


Summary

1 Introduction

  • Nowadays, Machine Learning applications can make use of a great amount of data, complex algorithms, and machines with great processing power to produce effective predictions and forecasts [11].
  • The fact that it is possible to add new processing units enables ML applications to surpass time and memory restrictions [29].
  • The experiments used three real-world datasets from the domain of telecommunications.
  • The main novelty of this extended version is the technological architecture that is presented in Section 6.
  • This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training.

3 Proposed Architecture

  • This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics.
  • Both scalability and automation are central requirements of the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
  • The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimum human input (only the dataset and the target column).
  • Also, since the architecture was developed to work within a cluster with several processing nodes, the users can handle any size of datasets just by managing the number of cluster nodes.

3.1 Phases

  • The proposed architecture assumes two main phases (Fig. 1): a training phase and a testing phase.
  • The only human input needed by the user is the selection of the training dataset and the identification of the target column.
  • When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model.
  • The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions.
  • Performance metrics are also computed and presented to the user.

3.2 Components

  • The proposed architecture includes five main components: task detection, data preprocessing, feature selection, model training (with the usage of AutoML), and pipeline deployment.
  • The applied transformations depend on the data type of the columns, number of levels, and number of missing values.
  • Feature Selection deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods.
  • The component also identifies the best model to be used on the test phase.
  • This module saves the pipeline that will be used on a test set, ensuring that the new data will pass through the same transformations as the training data.

4.1 Experimental Evaluation

  • For the experimental evaluation, the authors first examined the characteristics of the open-source AutoML tools.
  • Then, the authors used the tools that could be implemented in their architecture to perform a benchmark study.
  • In order to be considered for the experimental evaluation, the tools have to implement distributed ML.

4.2 AutoML Tools

  • The authors first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
  • Auto-Sklearn is an AutoML Python library based on Scikit-Learn [28] that implements methods for automatic algorithm selection and hyperparameter tuning.
  • H2O AutoML uses H2O’s infrastructure to provide functions to automate algorithm selection and hyperparameter optimization [21].
  • Rminer is a package for the R tool that aims to facilitate the use of Machine Learning algorithms.
  • The last two rows are related to the stacking ensembles implemented by H2O AutoML: all, which combines all trained algorithms; and best, which only combines the best algorithm per family.
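The stacking idea behind these two ensembles can be illustrated with a minimal sketch (not H2O's actual implementation): a meta-learner combines the base models' predictions, here by solving the 2x2 least-squares normal equations for two blending weights. The base models and data below are purely illustrative.

```python
# Illustrative stacking ensemble: a meta-learner learns how to blend the
# predictions of base models by minimizing squared error on the targets.

def stack_weights(p1, p2, y):
    """Weights (w1, w2) minimizing sum((w1*p1 + w2*p2 - y)^2), closed form."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

# Two imperfect base models: one overshoots, one undershoots the target.
y  = [1.0, 2.0, 3.0, 4.0]
p1 = [v + 0.5 for v in y]
p2 = [v - 0.5 for v in y]
w1, w2 = stack_weights(p1, p2, y)
blended = [w1 * a + w2 * b for a, b in zip(p1, p2)]
print(round(w1, 3), round(w2, 3))  # → 0.5 0.5 (the errors cancel exactly)
```

With symmetric base errors the meta-learner recovers equal weights and the blend reproduces the targets exactly, which is the intuition behind combining "all" or "best per family" models.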

4.3 Data

  • For the benchmark study, the authors used three real-world datasets from the domain of telecommunications, provided by the IRMDA project analytics company.
  • Table 3 describes each attribute of the churn dataset.
  • The only attributes are the timestamp and the number of events in that interval, as described in Table 4.
  • The dataset contains more than 1 million examples, which correspond to one day of phone calls from one of the company clients.

5.1 Experimental Setup

  • The benchmark consisted of several computational experiments that used three real-world datasets to compare the selected AutoML tools (H2O AutoML and TransmogrifAI).
  • Every AutoML execution implemented a 10-fold cross-validation during the training of the algorithms.
  • The first scenario (1) considered all the attributes of the dataset as input features for the ML algorithms.
  • For event forecasting, the authors transformed the dataset, creating time lags as inputs for a regression task.
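The time-lag transformation mentioned in the last bullet can be sketched as follows; the lag count and example series are illustrative, not the paper's actual configuration.

```python
# Sketch of the time-lag transformation: a univariate event-count series
# becomes a tabular regression dataset where each row's inputs are the
# n_lags previous values and the target is the current value.

def make_lags(series, n_lags):
    """Return (rows of lagged inputs, targets) for a regression task."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])  # the n_lags previous observations
        y.append(series[i])             # the value to predict
    return X, y

events = [3, 5, 4, 6, 8, 7]
X, y = make_lags(events, n_lags=2)
# X[0] == [3, 5] predicts y[0] == 4, and so on
```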

5.2 Discussion

  • The experimental results show that both AutoML tools require a small execution time to select the best ML model, with the highest mean execution time being slightly higher than 7 minutes.
  • The low training time can be explained by the use of distributed ML, datasets with a small number of rows or columns, and the removal of Deep Learning algorithms.
  • TransmogrifAI obtained the best predictive results in two regression scenarios and two classification scenarios.
  • Nevertheless, H2O AutoML was selected for the model training module. This choice was supported by two main reasons.
  • First, H2O AutoML obtained better predictive results for most of the scenarios.

6 Technological Architecture

  • After the comparative ML experiments, the analytics company selected the H2O AutoML tool for the model training component.
  • The remaining technological modules were then designed in cooperation with the company.
  • Given that H2O can be integrated with Apache Spark (using the Sparkling Water module) and that Spark provides functions for data processing, the authors relied on Spark’s Application Programming Interface (API) functions to implement the remaining components of the architecture.
  • The updated architecture, with references to the technologies used, is illustrated in Fig. 2.

6.1 Components

  • This subsection describes the current implementation of each module of the architecture.
  • Some changes were made relative to the initially proposed architecture, due to feedback received from the analytics company or to technological restrictions.
  • Data Preprocessing: currently, the preprocessing transformations (e.g., dealing with missing data, the encoding of categorical features, standardization of numerical features) are done using Apache Spark’s functions for extracting, transforming, and selecting features [1].
  • This function replaces the unknown values of a column with its mean value.
  • For classification (binary or multi-class) and regression tasks, the authors use H2O AutoML to automatically find and tune the best model.
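The mean-replacement rule described above can be sketched in plain Python; the actual framework uses Spark's feature-transformation functions, so this is only an illustration of the rule itself.

```python
# Mean imputation: replace missing entries (None) in a numeric column
# with the mean of the known values of that same column.

def impute_mean(column):
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

col = [4.0, None, 6.0, None, 8.0]
print(impute_mean(col))  # → [4.0, 6.0, 6.0, 6.0, 8.0]
```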

6.2 API

  • In order to facilitate the execution of the architecture, the authors also created a REST API to mediate the communication between the end-users and the pipelines.
  • Since the execution of each request consists of one Apache Spark job (using H2O’s capabilities through the Sparkling Water module), the API works as an intermediary between the end-user and the execution of the code inside Spark.
  • The server formats the response to the appropriate format (e.g., XML, JSON) and sends the response to the client interface.
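As a rough sketch of that formatting step, the same response payload can be serialized as JSON or XML using only the Python standard library. The field names below are illustrative, not the project's actual schema.

```python
# Serialize one prediction payload as JSON or XML, depending on the
# format the client requested.
import json
import xml.etree.ElementTree as ET

def format_response(payload, fmt="json"):
    if fmt == "json":
        return json.dumps(payload)
    root = ET.Element("response")
    for key, value in payload.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

payload = {"task": "classification", "prediction": "churn"}
print(format_response(payload, "json"))
print(format_response(payload, "xml"))
```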

7 Conclusions

  • This paper proposes a ML framework to automate the typical workflow of supervised ML applications with minimal human input.
  • The framework was developed within project IRMDA, an R&D project of a leading Portuguese software and analytics company that provides services for the domain of telecommunications risk management.
  • In order to assess the most appropriate AutoML tools for this model training module, the authors initially conducted a benchmark experiment.
  • The authors selected technologies with distributed capabilities for the remaining modules of the initially proposed framework.
  • Besides, the authors intend to add more ML tasks to the framework, such as ordinal classification, multi-target regression, or multivariate time series.


A Scalable and Automated Machine Learning
Framework to Support Risk Management
Luís Ferreira 1,2 [0000-0002-4790-5128], André Pilastri 2 [0000-0002-4380-3220],
Carlos Martins 3 [0000-0002-0678-4868], Pedro Santos 3 [0000-0002-4269-5838], and
Paulo Cortez 2 [0000-0002-7991-2090]

1 EPMQ - IT Engineering Maturity and Quality Lab, CCG ZGDV Institute, Guimarães, Portugal
{luis.ferreira, andre.pilastri}@ccg.pt
2 ALGORITMI Centre, Dep. Information Systems, University of Minho, Guimarães, Portugal
pcortez@dsi.uminho.pt
3 WeDo Technologies, Braga, Portugal
{pedro.santos, carlos.mmartins}@mobileum.com
Keywords: Automated Machine Learning · Distributed Machine Learning · Supervised Learning · Risk Management.

1 Introduction
Nowadays, Machine Learning applications can make use of a great amount of
data, complex algorithms, and machines with great processing power to produce
effective predictions and forecasts [11]. Currently, two of the most important
features of real-world ML applications are distributed learning and AutoML.
Distributed learning is particularly useful for ML applications in the context of
Big Data or when there are hardware constraints. Distributed learning consists
of using multiple machines or processors to process parts of the ML algorithm
or parts of the data. The fact that it is possible to add new processing units
enables ML applications to surpass time and memory restrictions [29]. AutoML
intends to allow people that are not experts in ML to efficiently choose and
apply ML algorithms. AutoML is particularly relevant since there is a growing
number of non-specialists working with ML [31]. It is also important for real-
world applications that require constant updates to ML models.
In this paper, we propose a technological architecture that addresses these
two ML challenges. The architecture was adapted to the area of telecommunications risk management, which is a domain that mostly uses supervised learning
algorithms (e.g., for churn prediction). Moreover, the ML models are constantly
updated by people that are not experts in ML and may involve Big Data. Thus,
the proposed architecture delineates a set of steps to automate the typical workflow of a ML application that uses supervised learning. The architecture includes
modules for task detection, data preprocessing, feature selection, model training,
and deployment.
The focus of this work is the model training module of the architecture,
which was designed to use a distributed AutoML tool. In order to select the
ML tool for this module, we initially evaluated the characteristics of eight open-
source AutoML tools (Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O
AutoML, Rminer, TPOT, and TransmogrifAI). We then performed a benchmark
to compare the two tools that allowed a distributed execution (H2O AutoML
and TransmogrifAI). The experiments used three real-world datasets from the
domain of telecommunications. These datasets were related to churn (regression),
event forecasting (time series), and fraud detection (binary classification).
This paper consists of an extended version of our previous work [14]. The
main novelty of this extended version is the technological architecture that is
presented in Section 6. This section describes the particular technologies that
were used to implement the components of the proposed AutoML distributed
framework apart from model training. Also, this section describes the REST
API that was developed to mediate the communication between the end-users
and the proposed framework.
The paper is organized as follows. Section 2 presents the related work. In
Section 3, we detail the proposed ML architecture. Next, Section 4 describes the
analyzed AutoML technologies and the datasets used during the experimental
tests. Then, Section 5 discusses the experimental results. Section 6 details the
technological architecture. Finally, Section 7 presents the main conclusions and
future work directions.

2 Related Work
In a Big Data context, it is critical to create and use scalable ML algorithms
to face the common constraints of memory and time [29]. To face that concern,
classical distributed ML distributes the work among different processors, each
performing part of the algorithm. Another current ML problem concerns the
choice of ML algorithms and hyperparameters for a given task. For ML experts,
this selection of algorithms and hyperparameters may use domain knowledge or
heuristics, but it is not an easy task for non-ML-experts. AutoML was developed
to address this issue [22]. AutoML can be defined as the search for the best
algorithm and hyperparameters for a given dataset with minimum human input.
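A toy illustration of this definition, assuming two hypothetical candidate algorithms and a holdout validation split; none of this reflects the tools benchmarked later in the paper.

```python
# Minimal sketch of the AutoML idea: search over candidate algorithms and
# keep the one with the lowest validation error, with no human in the loop.

def mean_model(xs, ys):
    """'Train' a constant predictor (the mean of the targets)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def linear_model(xs, ys):
    """Train a simple least-squares line y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def automl_search(train, valid, candidates):
    """Fit each candidate and return the one with the lowest validation MAE."""
    xs, ys = zip(*train)
    vx, vy = zip(*valid)
    best, best_err = None, float("inf")
    for name, fit in candidates.items():
        model = fit(xs, ys)
        err = sum(abs(model(x) - y) for x, y in zip(vx, vy)) / len(vx)
        if err < best_err:
            best, best_err = (name, model), err
    return best, best_err

train = [(x, 2 * x + 1) for x in range(10)]
valid = [(x, 2 * x + 1) for x in range(10, 15)]
(name, model), err = automl_search(train, valid,
                                   {"mean": mean_model, "linear": linear_model})
print(name, err)  # the linear candidate is selected on linear data
```

Real AutoML tools extend this loop with hyperparameter optimization, time and memory budgets, and ensembling, but the selection principle is the same.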
In recent years, a large number of AutoML tools was developed, such as Auto-
Gluon [3], Auto-Keras [23], Auto-Sklearn [15], Auto-Weka [24], H2O AutoML
[21], Rminer [10], TPOT [27], and TransmogrifAI [30]. Within our knowledge,
few studies directly compare AutoML tools. Most studies compare one specific
AutoML framework with state-of-the-art ML algorithms [15], do not present
experimental tests [12, 35], or are related to ML automation challenges [18–20].
Recently, some studies focused on experimental comparisons of AutoML
tools. In 2019, [17] and [32] compare a set of AutoML tools using different
datasets and ML tasks. In 2020, a benchmark was conducted using publicly
available datasets from OpenML [33], comparing different types of AutoML
tools, which were grouped by their capabilities [36]. None of the mentioned comparison studies considered the distributed ML capability of the AutoML tools. Furthermore, none of the studies used datasets from the domain of telecommunications risk management, such as churn prediction or fraud detection.
3 Proposed Architecture
This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area
of software and analytics. The purpose of the project is to develop a ML system
to assist the company's telecommunications clients. Both scalability and automation are central requirements of the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
The ML technological architecture that is proposed by this work identifies
and automates all typical tasks of a common supervised ML application, with
minimum human input (only the dataset and the target column). Also, since
the architecture was developed to work within a cluster with several processing
nodes, the users can handle any size of datasets just by managing the number
of cluster nodes. The architecture is illustrated in Fig. 1.

Fig. 1. The proposed automated and scalable ML architecture (adapted from [14]).
3.1 Phases
The proposed architecture assumes two main phases (Fig. 1): a training phase
and a testing phase.
Training Phase: The training phase includes the creation of a pipeline instance
and the definition of its stages. The only human input needed by the user is the
selection of the training dataset and the identification of the target column.
Depending on the dataset columns, each module defines a set of stages for
the pipeline. Each stage either transforms the data directly or fits a model on
the training data that is later used in the test phase to transform the data.
When all stages are defined, the pipeline is fitted to the training data, creating
a pipeline model. Finally, the pipeline model is exported to a file.
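The fit/transform contract described above can be sketched as follows; the class and method names are illustrative and deliberately simpler than Spark's pipeline API.

```python
# Sketch of the training-phase contract: stages learn statistics on fit,
# and the fitted pipeline reuses those statistics to transform new data.

class Standardize:
    """A stage that learns statistics on fit and reuses them on transform."""
    def fit(self, values):
        self.mean = sum(values) / len(values)
        self.std = (sum((v - self.mean) ** 2 for v in values) / len(values)) ** 0.5
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self  # the fitted "pipeline model"

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = Pipeline([Standardize()]).fit([1.0, 2.0, 3.0])
# New (test) data passes through the statistics learned at training time.
print(model.transform([2.0]))  # → [0.0]
```

Exporting the fitted pipeline to a file then amounts to serializing `model` with its learned statistics, as the deployment component does.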
Testing Phase: The execution of the testing pipeline assumes the same trans-
formations that were applied to the training data. To execute the testing pipeline
the user only needs to specify the test data and a pipeline model (and a forecasting horizon in the case of a time series forecasting task). The last stage of the
testing pipeline is the application of the best model obtained during training,
generating the predictions. Performance metrics are also computed and presented
to the user.
3.2 Components
The proposed architecture includes five main components: task detection, data
preprocessing, feature selection, model training (with the usage of AutoML),
and pipeline deployment.
Machine Learning Task Detection: Detects the ML task of the pipeline
(e.g., classification, regression, time series). This detection is made by analyzing
the number of levels of the target column and the existence (or not) of a time
column.
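A minimal sketch of this detection rule, where the level threshold is an assumed value for illustration (the paper does not state one):

```python
# Task detection: inspect the target column's number of distinct levels
# and whether a time column exists.

def detect_task(target, has_time_column, max_class_levels=10):
    if has_time_column:
        return "time series"
    n_levels = len(set(target))
    return "classification" if n_levels <= max_class_levels else "regression"

assert detect_task([0, 1, 1, 0], has_time_column=False) == "classification"
assert detect_task(list(range(100)), has_time_column=False) == "regression"
assert detect_task([3, 5, 4], has_time_column=True) == "time series"
```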

Data Preprocessing: Handles missing data, the encoding of categorical features, and the standardization of numerical features. The applied transformations depend on the data type of the columns, the number of levels, and the number of missing values.
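This dispatch by column type, level count, and missing values can be sketched as follows; the threshold and transformation names are assumptions for illustration, not the framework's actual rules.

```python
# Choose a column's preprocessing steps from its type, number of levels,
# and number of missing values.

def choose_transformations(dtype, n_levels, n_missing, max_onehot_levels=20):
    steps = []
    if n_missing > 0:
        steps.append("impute_mean" if dtype == "numeric" else "impute_mode")
    if dtype == "categorical":
        steps.append("one_hot" if n_levels <= max_onehot_levels else "index_encode")
    else:
        steps.append("standardize")
    return steps

print(choose_transformations("numeric", n_levels=0, n_missing=3))
print(choose_transformations("categorical", n_levels=50, n_missing=0))
```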
Feature Selection: Deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods. Filtering methods are based on individual correlations between each feature and the target, removing several features that present the lowest correlations [4].
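A minimal sketch of such a correlation filter, assuming Pearson correlation and an illustrative number of dropped features:

```python
# Filter-based feature selection: rank features by absolute Pearson
# correlation with the target and drop the weakest ones.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def filter_features(features, target, n_drop):
    """features: {name: column}. Drop the n_drop weakest-correlated names."""
    ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)))
    return [f for f in features if f not in set(ranked[:n_drop])]

target = [1.0, 2.0, 3.0, 4.0]
features = {
    "useful": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the target
    "noise":  [5.0, 1.0, 4.0, 2.0],   # weakly correlated
}
print(filter_features(features, target, n_drop=1))  # → ['useful']
```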
Model Training: Automatically trains and tunes a set of ML models using a
set of constraints (e.g., time limit, memory usage). The component also identifies
the best model to be used on the test phase.
Pipeline Deployment: Manages the saving and loading of the pipelines to
and from files. This module saves the pipeline that will be used on a test set,
ensuring that the new data will pass through the same transformations as the
training data. Also, the component stores the best model obtained during the
training to make predictions, discarding all other ML models.
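A rough sketch of this save/load contract using pickle; the real framework exports Spark pipeline models, and the names here are illustrative.

```python
# Persist a fitted pipeline to a file and restore it for the test phase,
# so new data passes through the same transformations as the training data.
import os
import pickle
import tempfile

def save_pipeline(pipeline, path):
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)

def load_pipeline(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Only the best model is kept for deployment; the rest are discarded.
fitted = {"stages": ["impute", "standardize"], "best_model": "gbm_3"}
path = os.path.join(tempfile.mkdtemp(), "pipeline.bin")
save_pipeline(fitted, path)
restored = load_pipeline(path)
print(restored["best_model"])  # → gbm_3
```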
4 Materials and Methods
4.1 Experimental Evaluation
For the experimental evaluation, we first examined the characteristics of the
open-source AutoML tools. Then, we used the tools that could be implemented
in our architecture to perform a benchmark study. In order to be considered for
the experimental evaluation, the tools have to implement distributed ML.
4.2 AutoML Tools
We first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
Auto-Gluon: AutoGluon is an open-source AutoML toolkit with a focus on Deep Learning. It is written in Python and runs on the Linux operating system. AutoGluon is divided into four main modules: tabular data, image classification, object detection, and text classification [3]. In this article, only the tabular prediction functionalities are considered.
Auto-Keras: Auto-Keras is a Python library based on Keras [6] that implements AutoML methods with Deep Learning algorithms. The focus of Auto-Keras is the automatic search for Deep Learning architectures and hyperparameters, usually named Neural Architecture Search [13].

References

  • Pedregosa, F., et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
  • Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artificial Intelligence 97(1–2), 245–271 (1997)

Frequently Asked Questions
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management" ?

This paper presents an automated and scalable framework for ML that requires minimum human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that address a distributed ML ( H2O AutoML and TransmogrifAI ). The experiments used three real-world datasets from the telecommunications domain ( churn, event forecasting, and fraud detection ), as provided by an analytics company. 

In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities.