Book Chapter DOI

A scalable and automated machine learning framework to support risk management

22 Feb 2020, pp. 291-307
TL;DR: In this paper, an automated and scalable ML framework that requires minimum human input is presented for the domain of telecommunications risk management. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment.
Abstract: Due to the growth of data and the widespread usage of Machine Learning (ML) by non-experts, automation and scalability are becoming key issues for ML. This paper presents an automated and scalable framework for ML that requires minimum human input. We designed the framework for the domain of telecommunications risk management. This domain often requires non-ML-experts to continuously update supervised learning models that are trained on huge amounts of data. Thus, the framework uses Automated Machine Learning (AutoML) to select and tune the ML models, and distributed ML to deal with Big Data. The modules included in the framework are task detection (to detect classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we focus the experiments on the model training module. We first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, we performed a benchmark with the only two tools that support distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company. The experiments allowed us to measure the computational effort and predictive capability of the AutoML tools. Both tools obtained high-quality results and did not present substantial predictive differences. Nevertheless, H2O AutoML was selected by the analytics company for the model training module, since it was considered a more mature technology that presented a more interesting set of features (e.g., integration with more platforms). After choosing H2O AutoML for the ML training, we selected the technologies for the remaining components of the architecture (e.g., data preprocessing and web interface).

Summary (3 min read)

1 Introduction

  • Nowadays, Machine Learning applications can make use of a great amount of data, complex algorithms, and machines with great processing power to produce effective predictions and forecasts [11].
  • The fact that it is possible to add new processing units enables ML applications to surpass time and memory restrictions [29].
  • The experiments used three real-world datasets from the domain of telecommunications.
  • The main novelty of this extended version is the technological architecture that is presented in Section 6.
  • This section describes the particular technologies that were used to implement the components of the proposed AutoML distributed framework apart from model training.

3 Proposed Architecture

  • This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics.
  • Both scalability and automation are central requirements for the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
  • The ML technological architecture that is proposed by this work identifies and automates all typical tasks of a common supervised ML application, with minimum human input (only the dataset and the target column).
  • Also, since the architecture was developed to work within a cluster with several processing nodes, the users can handle any size of datasets just by managing the number of cluster nodes.

3.1 Phases

  • The proposed architecture assumes two main phases (Fig. 1): a training phase and a testing phase.
  • The only human input needed by the user is the selection of the training dataset and the identification of the target column.
  • When all stages are defined, the pipeline is fitted to the training data, creating a pipeline model.
  • The last stage of the testing pipeline is the application of the best model obtained during training, generating the predictions.
  • Performance metrics are also computed and presented to the user.

3.2 Components

  • The proposed architecture includes five main components: task detection, data preprocessing, feature selection, model training (with the usage of AutoML), and pipeline deployment.
  • The applied transformations depend on the data type of the columns, number of levels, and number of missing values.
  • Feature Selection deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods.
  • The component also identifies the best model to be used on the test phase.
  • This module saves the pipeline that will be used on a test set, ensuring that the new data will pass through the same transformations as the training data.

4.1 Experimental Evaluation

  • For the experimental evaluation, the authors first examined the characteristics of the open-source AutoML tools.
  • Then, the authors used the tools that could be implemented in their architecture to perform a benchmark study.
  • In order to be considered for the experimental evaluation, the tools had to implement distributed ML.

4.2 AutoML Tools

  • The authors first analyzed eight recent open-source AutoML tools, to verify their compliance with the project requirements.
  • Auto-Sklearn is an AutoML Python library, based on Scikit-Learn [28], that implements methods for automatic algorithm selection and hyperparameter tuning.
  • H2O AutoML uses H2O’s infrastructure to provide functions to automate algorithm selection and hyperparameter optimization [21].
  • Rminer is a package for the R tool that intends to facilitate the use of Machine Learning algorithms.
  • The last two rows are related to the stacking ensembles implemented by H2O AutoML: all, which combines all trained algorithms; and best, which only combines the best algorithm per family.

4.3 Data

  • For the benchmark study, the authors used three real-world datasets from the domain of telecommunications, provided by the IRMDA project analytics company.
  • Table 3 describes each attribute of the churn dataset.
  • The only attributes are the timestamp and the number of events in that interval, as described in Table 4.
  • The dataset contains more than 1 million examples, which correspond to one day of phone calls from one of the company clients.

5.1 Experimental Setup

  • The benchmark consisted of several computational experiments that used three real-world datasets to compare the selected AutoML tools (H2O AutoML and TransmogrifAI).
  • Every AutoML execution implemented a 10-fold cross-validation during the training of the algorithms.
  • The first scenario (1) considered all the attributes of the dataset as input features for the ML algorithms.
  • For event forecasting, the authors transformed the dataset, creating time lags as inputs for a regression task.
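The lag transformation mentioned in the last bullet can be sketched as follows (a simplified, hypothetical illustration; the exact lag window used in the experiments is not specified here):

```python
def make_lagged_dataset(series, n_lags):
    """Turn a univariate time series into a supervised regression
    dataset: each row holds the previous n_lags values as inputs
    and the current value as the target."""
    rows = []
    for i in range(n_lags, len(series)):
        inputs = series[i - n_lags:i]  # lagged values x_{t-n}, ..., x_{t-1}
        target = series[i]             # value to predict, x_t
        rows.append((inputs, target))
    return rows

# Example: event counts per time interval turned into a 3-lag regression task.
counts = [5, 8, 6, 9, 12, 10]
rows = make_lagged_dataset(counts, n_lags=3)
# rows[0] == ([5, 8, 6], 9)
```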

5.2 Discussion

  • The experimental results show that both AutoML tools require a small execution time to select the best ML model, with the highest mean execution time being slightly higher than 7 minutes.
  • The low training time can be justified by the usage of distributed ML, datasets with a small number of rows or columns, and the removal of Deep Learning algorithms.
  • TransmogrifAI obtained the best predictive results in two regression scenarios and two classification scenarios.
  • Nevertheless, H2O AutoML was selected for the model training module, a choice supported by two main reasons.
  • First, H2O AutoML obtained better predictive results for most of the scenarios.

6 Technological Architecture

  • After the comparative ML experiments, the analytics company selected the H2O AutoML tool for the model training component.
  • The remaining technological modules were then designed in cooperation with the company.
  • Given that H2O can be integrated with Apache Spark (using the Sparkling Water module) and that Spark provides functions for data processing, the authors relied on Spark’s Application Programming Interface (API) functions to implement the remaining components of the architecture.
  • The updated architecture, with references to the technologies used, is illustrated in Fig. 2.

6.1 Components

  • This subsection describes the current implementation of each module of the architecture.
  • Some changes were made to the initially proposed design, resulting from feedback received from the analytics company or from technological restrictions.
  • Currently, the preprocessing transformations (e.g., dealing with missing data, the encoding of categorical features, standardization of numerical features) are done using Apache Spark’s functions for extracting, transforming and selecting features [1].
  • For missing data, an imputation function replaces the unknown values of a column with its mean value.
  • For classification (binary or multi-class) and regression tasks, the authors use H2O AutoML to automatically find and tune the best model.

6.2 API

  • In order to facilitate the execution of the architecture, the authors also created a REST API to mediate the communication between the end-users and the pipelines.
  • Since the execution of each request consists of one Apache Spark job (using H2O’s capabilities through the Sparkling Water module), the API works as an intermediary between the end-user and the execution of the code inside Spark.
  • The server formats the response to the appropriate format (e.g., XML, JSON) and sends the response to the client interface.
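The response-formatting step can be illustrated with a small helper (the function name and response schema are assumptions for illustration, not the project's actual API):

```python
import json
import xml.etree.ElementTree as ET

def format_predictions(predictions, fmt="json"):
    """Serialize a list of predictions into the format requested by the
    client (JSON or XML), as the API server does before responding."""
    if fmt == "json":
        return json.dumps({"predictions": predictions})
    if fmt == "xml":
        root = ET.Element("predictions")
        for p in predictions:
            ET.SubElement(root, "prediction").text = str(p)
        return ET.tostring(root, encoding="unicode")
    raise ValueError("unsupported format: " + fmt)

print(format_predictions([0.1, 0.9]))           # {"predictions": [0.1, 0.9]}
print(format_predictions([0.1, 0.9], "xml"))
```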

7 Conclusions

  • This paper proposes a ML framework to automate the typical workflow of supervised ML applications with minimal human input.
  • The framework was developed within project IRMDA, an R&D project developed by a leading Portuguese software and analytics company that provides services for the domain of telecommunications risk management.
  • In order to assess the most appropriate AutoML tools for this model training module, the authors initially conducted a benchmark experiment.
  • The authors selected technologies with distributed capabilities for the remaining modules of the initially proposed framework.
  • In addition, the authors intend to add more ML tasks to the framework, such as ordinal classification, multi-target regression, or multivariate time series.


A Scalable and Automated Machine Learning Framework to Support Risk Management

Luís Ferreira 1,2 [0000-0002-4790-5128], André Pilastri 2 [0000-0002-4380-3220], Carlos Martins 3 [0000-0002-0678-4868], Pedro Santos 3 [0000-0002-4269-5838], and Paulo Cortez 2 [0000-0002-7991-2090]

1 EPMQ - IT Engineering Maturity and Quality Lab, CCG ZGDV Institute, Guimarães, Portugal
{luis.ferreira, andre.pilastri}@ccg.pt
2 ALGORITMI Centre, Dep. Information Systems, University of Minho, Guimarães, Portugal
pcortez@dsi.uminho.pt
3 WeDo Technologies, Braga, Portugal
{pedro.santos, carlos.mmartins}@mobileum.com
Keywords: Automated Machine Learning · Distributed Machine Learning · Supervised Learning · Risk Management.

1 Introduction
Nowadays, Machine Learning applications can make use of a great amount of
data, complex algorithms, and machines with great processing power to produce
effective predictions and forecasts [11]. Currently, two of the most important
features of real-world ML applications are distributed learning and AutoML.
Distributed learning is particularly useful for ML applications in the context of
Big Data or when there are hardware constraints. Distributed learning consists
of using multiple machines or processors to process parts of the ML algorithm
or parts of the data. The fact that it is possible to add new processing units
enables ML applications to surpass time and memory restrictions [29]. AutoML
intends to allow people that are not experts in ML to efficiently choose and
apply ML algorithms. AutoML is particularly relevant since there is a growing
number of non-specialists working with ML [31]. It is also important for real-
world applications that require constant updates to ML models.
In this paper, we propose a technological architecture that addresses these
two ML challenges. The architecture was adapted to the area of telecommunica-
tions risk management, which is a domain that mostly uses supervised learning
algorithms (e.g., for churn prediction). Moreover, the ML models are constantly
updated by people that are not experts in ML and may involve Big Data. Thus,
the proposed architecture delineates a set of steps to automate the typical work-
flow of a ML application that uses supervised learning. The architecture includes
modules for task detection, data preprocessing, feature selection, model training,
and deployment.
The focus of this work is the model training module of the architecture,
which was designed to use a distributed AutoML tool. In order to select the
ML tool for this module, we initially evaluated the characteristics of eight open-
source AutoML tools (Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O
AutoML, Rminer, TPOT, and TransmogrifAI). We then performed a benchmark
to compare the two tools that allowed a distributed execution (H2O AutoML
and TransmogrifAI). The experiments used three real-world datasets from the
domain of telecommunications. These datasets were related to churn (regression),
event forecasting (time series), and fraud detection (binary classification).
This paper consists of an extended version of our previous work [14]. The
main novelty of this extended version is the technological architecture that is
presented in Section 6. This section describes the particular technologies that
were used to implement the components of the proposed AutoML distributed
framework apart from model training. Also, this section describes the REST
API that was developed to mediate the communication between the end-users
and the proposed framework.
The paper is organized as follows. Section 2 presents the related work. In
Section 3, we detail the proposed ML architecture. Next, Section 4 describes the
analyzed AutoML technologies and the datasets used during the experimental
tests. Then, Section 5 discusses the experimental results. Section 6 details the
technological architecture. Finally, Section 7 presents the main conclusions and
future work directions.

2 Related Work
In a Big Data context, it is critical to create and use scalable ML algorithms
to face the common constraints of memory and time [29]. To address that concern,
classical distributed ML distributes the work among different processors, each
performing part of the algorithm. Another current ML problem concerns the
choice of ML algorithms and hyperparameters for a given task. For ML experts,
this selection of algorithms and hyperparameters may use domain knowledge or
heuristics, but it is not an easy task for non-ML-experts. AutoML was developed
to combat this relevant issue [22]. AutoML can be described as
the search for the best algorithm and hyperparameters for a given dataset with
minimum human input.
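In essence, this search can be sketched as a loop over candidate algorithms and hyperparameter settings that keeps the configuration with the best validation score (a deliberately minimal toy illustration of the concept, not the implementation of any particular AutoML tool):

```python
def automl_search(candidates, train, valid, score):
    """Evaluate every (algorithm, hyperparameter) combination and
    return the one with the best validation score."""
    best = None
    for fit, grid in candidates:
        for params in grid:
            model = fit(train, **params)   # train with this configuration
            s = score(model, valid)        # evaluate on validation data
            if best is None or s > best[0]:
                best = (s, fit.__name__, params, model)
    return best

# Toy candidate: a "model" that always predicts a constant, tuned over it.
def constant_model(train, value):
    return lambda x: value

def accuracy(model, valid):
    return sum(model(x) == y for x, y in valid) / len(valid)

valid = [(0, 1), (1, 1), (2, 0)]
best = automl_search([(constant_model, [{"value": 0}, {"value": 1}])],
                     train=[], valid=valid, score=accuracy)
# best configuration: value=1, with validation accuracy 2/3
```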
In recent years, a large number of AutoML tools have been developed, such as Auto-Gluon [3], Auto-Keras [23], Auto-Sklearn [15], Auto-Weka [24], H2O AutoML [21], Rminer [10], TPOT [27], and TransmogrifAI [30]. To our knowledge,
few studies directly compare AutoML tools. Most studies compare one specific
AutoML framework with state-of-the-art ML algorithms [15], do not present
experimental tests [12, 35], or are related to ML automation challenges [18–20].
Recently, some studies focused on experimental comparisons of AutoML
tools. In 2019, [17] and [32] compared a set of AutoML tools using different
datasets and ML tasks. In 2020, a benchmark was conducted using publicly
available datasets from OpenML [33], comparing different types of AutoML
tools, which were grouped by their capabilities [36]. None of the mentioned com-
parison studies considered the distributed ML capability for the AutoML tools.
Furthermore, none of the studies used datasets from the domain of telecommu-
nications risk management, such as churn prediction or fraud detection.
3 Proposed Architecture
This paper is part of “Intelligent Risk Management for the Digital Age” (IRMDA), an R&D project developed by a leading Portuguese company in the area of software and analytics. The purpose of the project is to develop a ML system to assist the company’s telecommunications clients. Both scalability and automation are central requirements for the ML system, since the company has many clients with diverse amounts of data (large or small) who are typically non-ML-experts.
The ML technological architecture that is proposed by this work identifies
and automates all typical tasks of a common supervised ML application, with
minimum human input (only the dataset and the target column). Also, since
the architecture was developed to work within a cluster with several processing
nodes, the users can handle any size of datasets just by managing the number
of cluster nodes. The architecture is illustrated in Fig. 1.

Fig. 1. The proposed automated and scalable ML architecture (adapted from [14]).
3.1 Phases
The proposed architecture assumes two main phases (Fig. 1): a training phase
and a testing phase.
Training Phase: The training phase includes the creation of a pipeline instance
and the definition of its stages. The only human input needed by the user is the
selection of the training dataset and the identification of the target column.
Depending on the dataset columns, each module defines a set of stages for the pipeline. Each stage either transforms the data directly or creates a model from the training data that is later used in the test phase to transform the data.
When all stages are defined, the pipeline is fitted to the training data, creating
a pipeline model. Finally, the pipeline model is exported to a file.
Testing Phase: The execution of the testing pipeline assumes the same trans-
formations that were applied to the training data. To execute the testing pipeline
the user only needs to specify the test data and a pipeline model (and a forecasting horizon in the case of a time series forecasting task). The last stage of the
testing pipeline is the application of the best model obtained during training,
generating the predictions. Performance metrics are also computed and presented
to the user.
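As a rough illustration of the two phases, the following pure-Python sketch mimics the fit, export, load, and transform steps (the actual framework uses Spark ML pipelines, as detailed in Section 6; the class names here are illustrative):

```python
import pickle

class Pipeline:
    """Minimal pipeline: each stage is fitted on the training data and
    later re-applied, unchanged, to the test data."""
    def __init__(self, stages):
        self.stages = stages  # objects exposing fit/transform

    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self  # the fitted "pipeline model"

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

class Standardize:
    """Example stage: learns mean/std on the training data only."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        self.std = (sum((x - self.mean) ** 2 for x in data) / len(data)) ** 0.5
    def transform(self, data):
        return [(x - self.mean) / self.std for x in data]

# Training phase: fit the pipeline and export the pipeline model to a file.
model = Pipeline([Standardize()]).fit([1.0, 2.0, 3.0])
with open("pipeline.pkl", "wb") as f:
    pickle.dump(model, f)

# Testing phase: load the pipeline model and apply the same
# transformations (with the training statistics) to new data.
with open("pipeline.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.transform([2.0]))  # [0.0], using the mean learned in training
```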
3.2 Components
The proposed architecture includes five main components: task detection, data
preprocessing, feature selection, model training (with the usage of AutoML),
and pipeline deployment.
Machine Learning Task Detection: Set to detect the ML task of the pipeline
(e.g., classification, regression, time series). This detection is made by analyzing
the number of levels of the target column and the existence (or not) of a time
column.
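A simple version of such detection logic might look like the following (the thresholds and type rules are assumptions for illustration, not the paper's exact criteria):

```python
def detect_task(target_values, has_time_column, max_levels=10):
    """Guess the ML task from the target column and the presence of a
    time column, mirroring the detection rules described above."""
    if has_time_column:
        return "time series"
    levels = set(target_values)
    if len(levels) == 2:
        return "binary classification"
    if len(levels) <= max_levels and all(isinstance(v, (str, bool)) for v in levels):
        return "multi-class classification"
    return "regression"

print(detect_task([0, 1, 1, 0], has_time_column=False))    # binary classification
print(detect_task([3.2, 1.7, 5.9], has_time_column=False)) # regression
```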

Data Preprocessing: Handles missing data, the encoding of categorical features, and the standardization of numerical features. The applied transformations depend on the data type of the columns, the number of levels, and the number of missing values.
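These three kinds of transformations can be sketched in plain Python (the real system relies on Spark's feature transformers, as Section 6 explains; the function names here are illustrative):

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Replace missing values (None) with the column mean."""
    known = [v for v in column if v is not None]
    m = mean(known)
    return [m if v is None else v for v in column]

def one_hot(column):
    """Encode a categorical column as one-hot vectors."""
    levels = sorted(set(column))
    return [[1 if v == level else 0 for level in levels] for v in column]

def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
print(one_hot(["a", "b", "a"]))       # [[1, 0], [0, 1], [1, 0]]
```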
Feature Selection: Deletes features from the dataset that may decrease the predictive performance of the ML models, using filtering methods. Filtering methods are based on individual correlations between each feature and the target, removing several features that present the lowest correlations [4].
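Such a filter can be sketched as ranking features by their absolute correlation with the target and keeping only the strongest ones (the number of features kept is an illustrative parameter, not a value from the paper):

```python
def pearson(xs, ys):
    """Pearson correlation between two equally sized numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def filter_features(features, target, keep=2):
    """Keep the `keep` features most correlated (in absolute value)
    with the target; drop the rest."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:keep]

features = {
    "f1": [1, 2, 3, 4],  # perfectly correlated with the target
    "f2": [4, 3, 2, 1],  # perfectly anti-correlated (still informative)
    "f3": [1, 9, 2, 8],  # weakly related to the target
}
target = [10, 20, 30, 40]
print(filter_features(features, target))  # ['f1', 'f2']
```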
Model Training: Automatically trains and tunes a set of ML models using a
set of constraints (e.g., time limit, memory usage). The component also identifies
the best model to be used on the test phase.
Pipeline Deployment: Manages the saving and loading of the pipelines to
and from files. This module saves the pipeline that will be used on a test set,
ensuring that the new data will pass through the same transformations as the
training data. Also, the component stores the best model obtained during the
training to make predictions, discarding all other ML models.
4 Materials and Methods
4.1 Experimental Evaluation
For the experimental evaluation, we first examined the characteristics of the
open-source AutoML tools. Then, we used the tools that could be implemented
in our architecture to perform a benchmark study. In order to be considered for the experimental evaluation, the tools had to implement distributed ML.
4.2 AutoML Tools
We first analyzed eight recent open-source AutoML tools, to verify their com-
pliance with the project requirements.
Auto-Gluon: AutoGluon is an open-source AutoML toolkit with a focus on Deep Learning. It is written in Python and runs on the Linux operating system. AutoGluon is divided into four main modules: tabular data, image classification, object detection, and text classification [3]. In this article, only the tabular prediction functionalities are considered.
Auto-Keras: Auto-Keras is a Python library based on Keras [6] that implements AutoML methods with Deep Learning algorithms. The focus of Auto-Keras is the automatic search for Deep Learning architectures and hyperparameters, usually named Neural Architecture Search [13].

References
More filters
Proceedings ArticleDOI
15 Aug 2019
TL;DR: This paper investigates the current state of AutoML tools aiming to automate repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis.
Abstract: There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.

118 citations

Book ChapterDOI
01 Jan 2019
TL;DR: This chapter analyzes the results of a machine learning competition of progressive difficulty, which was followed by a one-round AutoML challenge (PAKDD 2018), and provides details about the datasets, which were not revealed to the participants.
Abstract: The ChaLearn AutoML Challenge (The authors are in alphabetical order of last name, except the first author who did most of the writing and the second author who produced most of the numerical analyses and plots.) (NIPS 2015 – ICML 2016) consisted of six rounds of a machine learning competition of progressive difficulty, subject to limited computational resources. It was followed bya one-round AutoML challenge (PAKDD 2018). The AutoML setting differs from former model selection/hyper-parameter selection challenges, such as the one we previously organized for NIPS 2006: the participants aim to develop fully automated and computationally efficient systems, capable of being trained and tested without human intervention, with code submission. This chapter analyzes the results of these competitions and provides details about the datasets, which were not revealed to the participants. The solutions of the winners are systematically benchmarked over all datasets of all rounds and compared with canonical machine learning algorithms available in scikit-learn. All materials discussed in this chapter (data and code) have been made publicly available at http://automl.chalearn.org/.

113 citations

Proceedings ArticleDOI
12 Jul 2015
TL;DR: The AutoML contest for IJCNN 2015 challenges participants to solve classification and regression problems without any human intervention, and will push the state of the art in fully automatic machine learning on a wide range of real-world problems.
Abstract: ChaLearn is organizing the Automatic Machine Learning (AutoML) contest for IJCNN 2015, which challenges participants to solve classification and regression problems without any human intervention. Participants' code is automatically run on the contest servers to train and test learning machines. However, there is no obligation to submit code; half of the prizes can be won by submitting prediction results only. Datasets of progressively increasing difficulty are introduced throughout the six rounds of the challenge (participants can enter the competition in any round). The rounds alternate phases in which learners are tested on datasets participants have not seen, and phases in which participants have limited time to tweak their algorithms on those datasets to improve performance. This challenge will push the state of the art in fully automatic machine learning on a wide range of real-world problems. The platform will remain available beyond the termination of the challenge.

105 citations

Proceedings Article
04 Dec 2016
TL;DR: This competition contributes to the development of fully automated environments by challenging practitioners to solve problems under specific constraints and sharing their approaches; the platform will remain available for post-challenge submissions at http://codalab.org/AutoML.
Abstract: The ChaLearn AutoML Challenge team conducted a large-scale evaluation of fully automatic, black-box learning machines for feature-based classification and regression problems. The test bed was composed of 30 data sets from a wide variety of application domains and ranged across different types of complexity. Over six rounds, participants succeeded in delivering AutoML software capable of being trained and tested without human intervention. Although improvements can still be made to close the gap between human-tweaked and AutoML models, this competition contributes to the development of fully automated environments by challenging practitioners to solve problems under specific constraints and sharing their approaches; the platform will remain available for post-challenge submissions at http://codalab.org/AutoML.

68 citations

Frequently Asked Questions (2)
Q1. What are the contributions in "A scalable and automated machine learning framework to support risk management" ?

This paper presents an automated and scalable framework for ML that requires minimum human input. In this paper, the authors focus the experiments on the model training module. The authors first analyze the capabilities of eight AutoML tools: Auto-Gluon, Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, Rminer, TPOT, and TransmogrifAI. Then, to select the tool for model training, the authors performed a benchmark with the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). The experiments used three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), as provided by an analytics company.

In future work, the authors intend to use more telecommunications datasets to provide additional benchmarks for the model training module. Finally, even though the framework was developed specifically for the telecommunications risk management domain, the authors intend to study the applicability of the framework to other areas. Moreover, new AutoML tools can be considered, as long as they provide distributed capabilities.