Journal ArticleDOI

A framework for increasing the value of predictive data-driven models by enriching problem domain characterization with novel features

01 Jun 2017-Neural Computing and Applications (Springer London)-Vol. 28, Iss: 6, pp 1515-1523
TL;DR: This work proposes a framework drawn on three feature selection strategies, where the goal is to unveil novel features that can effectively increase the value of data by providing a richer characterization of the problem domain.
Abstract: The need to leverage knowledge through data mining has driven enterprises in a demand for more data. However, there is a gap between the availability of data and the application of extracted knowledge for improving decision support. In fact, more data do not necessarily imply better predictive data-driven marketing models, since it is often the case that the problem domain requires a deeper characterization. Aiming at such characterization, we propose a framework drawn on three feature selection strategies, where the goal is to unveil novel features that can effectively increase the value of data by providing a richer characterization of the problem domain. Such strategies involve encompassing context (e.g., social and economic variables), evaluating past history, and disaggregating the main problem into smaller but interesting subproblems. The framework is evaluated through an empirical analysis for a real bank telemarketing application, with the results proving the benefits of such an approach, as the area under the receiver operating characteristic curve increased with each stage, improving the previous model in terms of predictive performance.

Summary (2 min read)

1 Introduction

  • In a world capable of generating exponential amounts of data, the term big data has become a common denominator for every enterprise in every industry.
  • Such an issue has also been a subject of debate in the scientific community in the past few years, generating a large research effort in both theory and practice [18].
  • The next stage represents the current challenge for most companies: extracting valuable knowledge from data to effectively leverage business decision support, whether in the form of understandable reports explaining past events, or by translating insightful data-driven predictions into decisions that can feed operational systems.
  • Interestingly, the latter form was foreseen by Bucklin et al. [1] in what the authors denominated marketing decision automation systems.
  • This issue led to the emergence of an increasingly relevant branch in machine learning known as feature selection [20].

2 Proposed Framework and Method

  • The design of the experiments that ultimately led to the framework presented here focused mainly on adding value to data for data mining applications, considering that data is the key ingredient for any successful data-driven project.
  • The framework designed is based on three simple but highly relevant strategies: Include context features; Evaluate past history; Divide and conquer strategy.
  • The data-based sensitivity analysis (DSA) attempts to capture all input interactions between the F features, but with less computational effort, by using random samples taken from the original dataset instead of analyzing the whole dataset [9].
  • The new enriched list of features is then evaluated through modeling, for measuring its impact on predictive performance and the relevance of the newly proposed features, in a procedure similar to that of the previous strategy.
  • A feature reduction procedure using domain knowledge and DSA, similar to the one used in the first strategy, helps reduce the list of features to a manageable number.

3 Problem Description

  • The problem of selecting the best customers for targeting with promotional offers from an initial database is a complex task.
  • Nevertheless, such problem is typical in the marketing domain.
  • All of them were available for building the initial dataset to conduct modeling.
  • While a campaign may communicate the same or a different deposit than previous campaigns, the characteristics of the deposit type are incorporated in terms of features (e.g., interest rate offered, term period) into the dataset.
  • A typical characteristic of targeting problems is the low success rate.

4 Results and discussion

  • The bank telemarketing case study detailed in the previous section was chosen for testing the results of applying the framework.
  • Next, a tuning procedure adapted to each strategy leads to a reduced number of highly relevant features according to DSA, with some of the previously proposed features among those chosen (their number is identified in the “Nr. new features included” box).
  • Given these results, the ensemble of neural networks was the technique chosen for the remaining strategies.
  • The five most relevant features among the previous 27 were evaluated by a domain expert in bank telemarketing management, which led to the decision to split the problem and the corresponding dataset into outbound and inbound telemarketing (a minimal sketch of such a split is shown after this list).
  • One of the most interesting aspects of the framework is its extensibility.
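
The outbound/inbound split can be made concrete with a short, hypothetical sketch: one model is trained per subproblem and each is evaluated on its own holdout set. The column name `channel`, the helper `train_model`, and the 70/30 split are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical divide-and-conquer split: one model per contact channel.
# `train_model` is an assumed helper returning a fitted classifier with
# a scikit-learn-style predict_proba interface.
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def split_and_model(df, target, train_model):
    models, aucs = {}, {}
    for channel, subset in df.groupby("channel"):   # e.g., "outbound" / "inbound"
        X = subset.drop(columns=[target, "channel"])
        y = subset[target]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=0, stratify=y)
        model = train_model(X_tr, y_tr)
        models[channel] = model
        # Each subproblem gets its own model and its own holdout AUC
        aucs[channel] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return models, aucs
```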

5 Conclusions

  • Data-driven research is conditioned by the available data.
  • Even in this age of big data, it often happens in real-world environments that the available features for characterizing the addressed domain do not cover all aspects that affect the problem.
  • The proposed framework carries all features used by the tuned model in the previous strategy forward to the next, and then proposes novel features according to the current strategy.
  • An interesting case for applying the proposed framework would be for profiling online travel agency clients.



Repositório ISCTE-IUL
Deposited in Repositório ISCTE-IUL: 2018-03-20
Deposited version: Post-print
Peer-review status of attached file: Peer-reviewed
Citation for published item: Moro, S., Cortez, P. & Rita, P. (2017). A framework for increasing the value of predictive data-driven models by enriching problem domain characterization with novel features. Neural Computing and Applications, 28(6), 1515-1523.
Further information on publisher's website: https://dx.doi.org/10.1007/s00521-015-2157-8
Publisher's copyright statement: This is the peer-reviewed version of the above article, which has been published in final form at https://dx.doi.org/10.1007/s00521-015-2157-8. This article may be used for non-commercial purposes in accordance with the publisher's terms and conditions for self-archiving.
Use policy: Creative Commons CC BY 4.0

Neural Computing and Applications manuscript No.
(will be inserted by the editor)
A framework for increasing the value of predictive data-driven models by enriching problem domain characterization with novel features
Sérgio Moro · Paulo Cortez · Paulo Rita
Received: 2015 / Accepted: 2015
Abstract The need to leverage knowledge through data mining has driven enterprises in a demand for more data. However, there is a gap between the availability of data and the application of extracted knowledge for improving decision support. In fact, more data do not necessarily imply better predictive data-driven marketing models, since it is often the case that the problem domain requires a deeper characterization. Aiming at such characterization, we propose a framework drawn on three feature selection strategies, where the goal is to unveil novel features that can effectively increase the value of data by providing a richer characterization of the problem domain. Such strategies involve encompassing context (e.g., social and economic variables), evaluating past history, and disaggregating the main problem into smaller but interesting sub-problems. The framework is evaluated through an empirical analysis for a real bank telemarketing application, with the results proving the benefits of such an approach, as the area under the receiver operating characteristic curve increased with each stage, improving the previous model in terms of predictive performance.
Keywords Feature Selection · Decision Support · Data Mining · Telemarketing · Bank Marketing
S. Moro
ALGORITMI Research Centre, University of Minho, 4800-058 Guimarães, Portugal, and Business Research Unit (BRU-UNIDE), ISCTE - University Institute of Lisbon, 1649-026 Lisboa, Portugal
E-mail: scmoro@gmail.com
P. Cortez
Dep. of Information Systems/ALGORITMI Research Centre, University of Minho, 4800-058 Guimarães, Portugal
P. Rita
Business Research Unit (BRU-UNIDE), ISCTE - University Institute of Lisbon, 1649-026 Lisboa, Portugal

1 Introduction

In a world capable of generating exponential amounts of data, the term big data has become a common denominator for every enterprise in every industry. Such an issue has also been a subject of debate in the scientific community in the past few years, generating a large research effort in both theory and practice [18]. When companies are able to overcome the initial stage of acquiring large-size and high-performance, although still somewhat expensive, data store solutions, they find themselves grappling with large amounts of data and struggling to take real advantage of it [3, 22]. The next stage represents the current challenge for most companies: the extraction of valuable knowledge from data for effectively leveraging business decision support, whether in the form of understandable reports for explaining past events, or by translating insightful data-driven predictions into decisions that can feed operational systems. Interestingly, the latter form was foreseen by Bucklin et al. [1] in what the authors denominated marketing decision automation systems. Wang [35] defines this stage as the move toward analytic intelligence, where advanced artificial intelligence, machine learning and data mining techniques are incorporated in decision support systems for taking the most advantage from data. In fact, data mining has had a profound impact in selecting the best clients to contact, thus changing the paradigm of marketing [16, 31].
Data-driven analytical approaches are typically comprised of a knowledge discovery solution including steps such as business and data analysis, data mining on a pre-prepared dataset, and validation of results, whereas such a process may be fed with recent results for providing adaptation to a changing reality, in a continuous cycle [24]. Nevertheless, successfully finding hidden patterns requires that the features characterizing each event conceal relations with the problem addressed, more specifically, with the modeled outcome features. With the advent of big data, a large number of features are usually available for modeling. Often, learning algorithms cannot by themselves straightforwardly disentangle useful data attributes (also known as features or variables) from irrelevant ones, particularly when there is a very large number of attributes that can potentially be used. Instead, algorithms will expend considerable effort in finding relations between input features and the outcome to model, before finally realizing that a large portion of the features are useless. Since computational time and memory requirements typically increase exponentially with the number of variables, pruning the input variables to a manageable number is mandatory. This issue led to the emergence of an increasingly relevant branch in machine learning known as feature selection [20].

Feature selection and engineering is particularly relevant for database marketing problems [23, 33]. The creativity associated with marketing management implies that a vast number of characterizing features may influence a given problem, posing the difficult challenge of discovering them. Generic context features are known to be of high value for modeling problems through data mining. As an example, the research of Saarenpaa et al. [29] takes advantage of generic demographic indicators for distribution network planning. Associating context features that may potentially affect an instance of a problem has been shown to be highly beneficial for spatiotemporal problems [30]. The same study associates situational events for affecting the outcome of association rules. However, the contextual features may be used for feeding a model continuously in the same way as regular problem features do, through input variables. Following this trend, the work of Moro et al. [26] pioneered the introduction of social and economic context features for improving the prediction of telemarketing contacts. Usually, real problems addressed by data mining applications encompass the temporal dimension, as the instances of the problem occur at different moments in time. This type of problem is typically influenced by its historic past events, as happens in stock exchange markets [19], in retail sales [5], in fraud detection [13], and in marketing campaigns [25]. Traditionally, history information has been used in the marketing and sales domains in the form of the RFM (Recency, Frequency, Monetary) indicators [4]. However, other historic features specific to a problem can be incorporated for improving model performance, encompassing metrics for measuring customer lifetime value (LTV) [36]. It often occurs that a problem being studied is vast in its complexity, with a wide range of features influencing it in numerous ways. Data mining applications use techniques for reducing such complexity; decision tree modeling is a perfect example where this happens [28]. A few articles have conducted research on automated approaches for dividing the problem into smaller and more manageable sub-problems to reduce the feature selection search space [21]. However, none was found using a mixed approach of domain expert knowledge embedded in a general data mining method.
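
As a concrete illustration of the "evaluate past history" idea, the RFM indicators can be derived from a raw contact or transaction log with a few aggregations. The sketch below is a generic Python/pandas example; the table layout and column names (client_id, contact_date, amount) are assumptions for illustration, not the paper's actual schema.

```python
# Minimal sketch of RFM (Recency, Frequency, Monetary) feature engineering,
# one common way to encode a client's past history as model inputs.
# Column names (client_id, contact_date, amount) are hypothetical.
import pandas as pd

def rfm_features(history: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Aggregate a per-contact history table into one RFM row per client."""
    past = history[history["contact_date"] < as_of]
    grouped = past.groupby("client_id")
    return pd.DataFrame({
        # Recency: days since the client's most recent contact
        "recency": (as_of - grouped["contact_date"].max()).dt.days,
        # Frequency: number of past contacts
        "frequency": grouped.size(),
        # Monetary: total value transacted so far
        "monetary": grouped["amount"].sum(),
    })
```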
The majority of research studies on feature selection focus on finding the smallest feature subset from the original set of features given a certain generalization error, while making an effort to optimize model performance [34]. However, few studies have looked at the problem of generalizing methods for extending the boundaries of the feature set beyond the original dataset that is being explored for predictive analytics purposes. Even when a large number of features are available, it often happens that several relevant features are missing, mainly because a real-world problem is affected by a myriad of variables with intrinsic relations between each other and the problem. In this article, a framework is proposed for enriching datasets used for data mining procedures by unveiling previously unforeseen features that increase the value of the original dataset in terms of modeling performance. For validation purposes, a problem of selling long-term bank deposits through telemarketing campaigns is addressed. The results show a consistent increase in model performance, confirming the benefits of the suggested framework.
2 Proposed Framework and Method

The design of the experiments that ultimately led to the framework presented here focused mainly on adding value to data for data mining applications, considering that data is the key ingredient for any successful data-driven project. Starting with a typical dataset for any given problem to address with a data mining approach, the emphasis is on finding previously undiscovered features that can add value to the data, and at the same time on reducing the features to a smaller and more manageable number, allowing computationally feasible models within a reasonable amount of time and memory. For a proper validation of the proposed framework, some of the newly proposed features should stand out in terms of feature relevance when compared to the remaining features, while at the same time the model conceived must exceed the performance of the baseline model without the new features. The framework designed is based on three simple but highly relevant strategies:

Include context features;
Evaluate past history;
Divide and conquer strategy.
Figure 1 displays the schematic flow for finding novel features, materialized in the proposed framework. Traditional data mining projects encompass a pre-modeling stage for evaluating which of the available features may have an impact on the problem being modeled, in a process named feature selection. The main goal of the framework presented is to extend this problem feature list by adding new features for improving the prediction performance of a model based on this enriched list of features. Furthermore, some of the added features should have played a significant role in training the model. Thus another goal is to extract the relevance of some of the added features for assessing their impact.

The three strategies listed above are represented in rectangular boxes in the schematic framework. Any typical data mining problem starts with a list of features which characterize the problem, represented by series of occurrences for building the model. Then, the model is evaluated in terms of its predictive performance. The first strategy consists in assessing the context surrounding and likely affecting the problem through a domain expert. This leads to a list of newly proposed features related to problem context for enriching the initial dataset. Then, a feature reduction procedure takes place using a semi-automated method based on domain expertise and a sensitivity analysis [11] for assessing feature relevance (a minimal sketch of this staged loop is given below).
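
The staged flow just described can be summarized as a loop: each strategy proposes candidate features, a relevance-based reduction prunes them to a manageable subset, and the enriched model is retained only if predictive performance improves. The following minimal sketch assumes generic helpers (train_model, auc, dsa_relevance) and is only an outline of the flow, not the authors' implementation.

```python
# Minimal sketch of the staged enrichment loop: each strategy proposes
# candidate features, relevance-based reduction prunes them, and the
# enriched model is kept only if it beats the previous AUC.
# `train_model`, `auc`, and `dsa_relevance` are assumed helpers.
def run_framework(data, base_features, strategies, target,
                  train_model, auc, dsa_relevance, max_features=25):
    features = list(base_features)
    best_model = train_model(data, features, target)
    best_auc = auc(best_model, data, features, target)
    for propose_candidates in strategies:   # context, history, divide & conquer
        candidates = features + propose_candidates(data)
        wide_model = train_model(data, candidates, target)
        # Rank candidates by sensitivity-based relevance, then prune
        relevance = dsa_relevance(wide_model, data, candidates, target)
        trial = sorted(candidates, key=relevance.get, reverse=True)[:max_features]
        trial_model = train_model(data, trial, target)
        trial_auc = auc(trial_model, data, trial, target)
        if trial_auc > best_auc:            # keep enrichment only if AUC improves
            features, best_model, best_auc = trial, trial_model, trial_auc
    return best_model, features, best_auc
```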
The sensitivity analysis measures the effect that changing the input features through their range of possible values has on the outcome of a model. Because such analysis is completely independent of the model (it is based solely on the input and output values), it can virtually be applied to any supervised learning method, including complex "black-box" models (e.g., neural networks, support vector machines). There are a few possible variations for analyzing the sensitivity of a model. In its simplest form, all input features are kept constant except one, which is changed through its range of values, for measuring the effect on the outcome [17]. The most complex and computationally demanding sensitivity analysis method uses a set of F features that simultaneously vary with L levels [8]. The data-based sensitivity analysis (DSA) attempts to capture all input interactions between the F features, but with less computational effort, through the usage of random samples taken from the original dataset instead of analyzing the whole dataset [9].

Citations
Journal ArticleDOI
TL;DR: A decision process flow from the “Lifetime Post Consumers” model is drawn, which by complementing the sensitivity analysis information may be used to support managers' decisions on whether to publish a post.

174 citations


Cites background from "A framework for increasing the valu..."

  • ...Enriching the data set with such features may result in an increase in model's accuracy (Moro et al., 2016)....

    [...]

  • ...It would be interesting in a future study to consider feature enrichment strategies for improving the accuracy in predicting visualizations, as the viral reach is becoming more relevant in brand awareness (Moro et al., 2016)....

    [...]

Journal ArticleDOI
01 Jun 2017
TL;DR: This study proves that including standard telematics variables significantly improves the risk assessment of customers, and suggests that if a manager wants to implement Usage-Based-Insurances, Pay-As-You-Drive related variables are most valuable to tailor the premium to the risk.
Abstract: The advent of the Internet of Things enables companies to collect an increasing amount of sensor-generated data, which creates plenty of new business opportunities. This study investigates how this sensor data can improve the risk selection process in an insurance company. More specifically, several risk assessment models based on three different data mining techniques are augmented with driving behaviour data collected from In-Vehicle Data Recorders. This study proves that including standard telematics variables significantly improves the risk assessment of customers. As a result, insurers will be better able to tailor their products to the customers' risk profile. Moreover, this research illustrates the importance of including industry knowledge, combined with data expertise, in the variable creation process. Especially when a regulator forces the use of easily interpretable data mining techniques, expert-based telematics variables are able to improve the risk assessment model in addition to the standard telematics variables. Further, the results suggest that if a manager wants to implement Usage-Based-Insurances, Pay-As-You-Drive related variables are most valuable to tailor the premium to the risk. Finally, the study illustrates that this new type of telematics-based insurance product can quickly be implemented since three months of data is already sufficient to obtain the best risk estimations. Highlights: This study proves the value of telematics-based data in the risk selection process of an insurance company. It compares the performance of three models in this context: a logistic regression, random forests and an artificial neural networks model. This research illustrates the importance of industry knowledge in the variable creation process. Three months of data is sufficient to obtain the best risk estimations.

100 citations

Journal ArticleDOI
TL;DR: The findings unveiled that user features related to TripAdvisor membership experience play a key role in influencing the scores granted, clearly surpassing hotel features.

60 citations

Journal ArticleDOI
TL;DR: This paper provides a comprehensive review of the state-of-the-art methods and practice reported in the literature dealing with many different aspects of data-informed inverse design by reviewing the origins and common practice of inverse problems in engineering design.
Abstract: A significant body of knowledge exists on inverse problems and extensive research has been conducted on data-driven design in the past decade. This paper provides a comprehensive review of the state-of-the-art methods and practice reported in the literature dealing with many different aspects of data-informed inverse design. By reviewing the origins and common practice of inverse problems in engineering design, the paper presents a closed-loop decision framework of product usage data-informed inverse design. Specifically reviewed areas of focus include data-informed inverse requirement analysis by user generated content, data-informed inverse conceptual design for product innovation, data-informed inverse embodiment design for product families and product platforming, data-informed inverse analysis and optimization in detailed design, along with prevailing techniques for product usage data collection and analytics. The paper also discusses the challenges of data-informed inverse design and the prospects for future research.

54 citations

Journal ArticleDOI
TL;DR: This article reviews the latest research works to determine the most effective features that were investigated for spam detection in the literature and reveals the important role of some features like the reputation of the account, average length of the tweet, average mention per tweet, age of the accounts, and the average time between posts in the process of identifying spammers in the social network.
Abstract: Twitter is a social networking website that has gained a lot of popularity around the world in the last decade. This popularity made Twitter a common target for spammers and malicious users to spre...

46 citations

References
Book
28 Jul 2013
TL;DR: In this paper, the authors describe the important ideas in these areas in a common conceptual framework, and the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting: the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations

Journal ArticleDOI
TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

17,017 citations


"A framework for increasing the valu..." refers background in this paper

  • ...The receiver operating characteristic curve shows the performance of a two-class classifier across the range of possible threshold values, plotting one minus the specificity versus the sensitivity [32], while the lift cumulative curve is a popular measure of performance in marketing applications, providing an ordering by dividing the dataset into ten fractions and including the most likely subscribers in the top deciles [33]....

    [...]
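
For readers unfamiliar with the two measures quoted above, a small sketch can make them concrete: the AUC summarizes the ROC curve, and the cumulative lift table orders contacts by model score and compares the response rate in the top deciles against the overall rate. The helper below is a generic illustration (binary y_true labels and real-valued y_score scores are assumed), not code from the paper.

```python
# Generic sketch of the two evaluation measures described in the quote:
# ROC AUC, plus a ten-decile cumulative lift table.
import numpy as np
from sklearn.metrics import roc_auc_score

def cumulative_lift(y_true, y_score, deciles=10):
    order = np.argsort(y_score)[::-1]        # most likely subscribers first
    y = np.asarray(y_true)[order]
    parts = np.array_split(y, deciles)       # ten roughly equal fractions
    base_rate = y.mean()
    lift, taken, hits = [], 0, 0
    for part in parts:
        taken += len(part)
        hits += part.sum()
        lift.append((hits / taken) / base_rate)  # cumulative response vs. random
    return lift

# auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
```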

Journal ArticleDOI
TL;DR: The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

14,509 citations


"A framework for increasing the valu..." refers methods in this paper

  • ..., the “is gender relevant?” hypothesis was characterized by three features, related to the gender of the banking agent, the client, and the client–agent difference (0 if same sex; 1 otherwise), and then an automated feature reduction approach, based on an adapted forward selection method testing the impact on the model, resulted in the selection of the 22 features that maximized subscription performance, from which seven were chosen among the newly proposed [34]....

    [...]

Journal ArticleDOI
TL;DR: This introduction to the MIS Quarterly Special Issue on Business Intelligence Research first provides a framework that identifies the evolution, applications, and emerging research areas of BI&A, and introduces and characterized the six articles that comprise this special issue in terms of the proposed BI &A research framework.
Abstract: Business intelligence and analytics (BI&A) has emerged as an important area of study for both practitioners and researchers, reflecting the magnitude and impact of data-related problems to be solved in contemporary business organizations. This introduction to the MIS Quarterly Special Issue on Business Intelligence Research first provides a framework that identifies the evolution, applications, and emerging research areas of BI&A. BI&A 1.0, BI&A 2.0, and BI&A 3.0 are defined and described in terms of their key characteristics and capabilities. Current research in BI&A is analyzed and challenges and opportunities associated with BI&A research and education are identified. We also report a bibliometric study of critical BI&A publications, researchers, and research topics based on more than a decade of related academic and industry publications. Finally, the six articles that comprise this special issue are introduced and characterized in terms of the proposed BI&A research framework.

4,610 citations


"A framework for increasing the valu..." refers background in this paper

  • ...amounts of data and struggling for taking real advantage of data [2, 3]....

    [...]

Journal Article
TL;DR: The Elements of Statistical Learning; An Introduction to Statistical Learning; Pattern Recognition and Machine Learning; Data Mining IV; Statistics for Machine Learning; Statistical Learning for Biomedical Data; Geocomputation with R; The Science of Bradley Efron.

1,570 citations

Frequently Asked Questions (2)
Q1. What have the authors contributed in "A framework for increasing the value of predictive data-driven models by enriching problem domain characterization with novel features" ?

Aiming at such characterization, the authors propose a framework drawn on three feature selection strategies, where the goal is to unveil novel features that can effectively increase the value of data by providing a richer characterization of the problem domain. 

In the future, the framework could be applied to other types of marketing problem domains to test whether the results are consistent with those achieved for the bank telemarketing case study.