Author
Juan Eduardo Pérez
Other affiliations: University of the Andes, Chile
Bio: Juan Eduardo Pérez is an academic researcher at the University of Los Andes. He has contributed to research on multinomial logistic regression and support vector machines, has an h-index of 6, and has co-authored 11 publications receiving 180 citations. His previous affiliations include the University of the Andes, Chile.
Papers
TL;DR: Mixed-integer linear programming models are proposed for constructing classifiers that constrain acquisition costs while classifying adequately; experiments demonstrate the effectiveness of the methods in terms of predictive performance at low cost compared to well-known feature selection approaches.
Abstract: In this work we propose two formulations based on Support Vector Machines for simultaneous classification and feature selection that explicitly incorporate attribute acquisition costs. This is a challenging task for two main reasons: the estimation of the acquisition costs is not straightforward and may depend on multivariate factors, and the inter-dependence between variables must be taken into account for the modelling process since companies usually acquire groups of related variables rather than acquiring them individually. Mixed-integer linear programming models are proposed for constructing classifiers that constrain acquisition costs while classifying adequately. Experimental results using credit scoring datasets demonstrate the effectiveness of our methods in terms of predictive performance at a low cost compared to well-known feature selection approaches.
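As a rough illustration of the cost-constrained selection idea (not the paper's actual MILP, which jointly optimises the SVM classifier and the selection variables), the sketch below brute-forces the best feature subset whose group-level acquisition cost stays within a budget. All feature names, group costs, and score values are hypothetical.

```python
from itertools import combinations

# Hypothetical per-group acquisition costs: buying any feature in a
# group incurs the whole group's cost (features are sold in bundles).
group_cost = {"bureau": 5.0, "transactions": 3.0, "demographics": 1.0}
feature_group = {"debt_ratio": "bureau", "delinquencies": "bureau",
                 "avg_balance": "transactions", "age": "demographics"}
# Hypothetical stand-in for each feature's predictive value; in the
# paper this role is played by the SVM objective, not a fixed score.
feature_score = {"debt_ratio": 0.9, "delinquencies": 0.7,
                 "avg_balance": 0.6, "age": 0.2}

def best_subset(budget):
    """Brute-force analogue of the MILP: maximise total score subject
    to the summed cost of the *groups* touched staying within budget."""
    best, best_val = (), 0.0
    feats = list(feature_group)
    for r in range(1, len(feats) + 1):
        for subset in combinations(feats, r):
            groups = {feature_group[f] for f in subset}
            cost = sum(group_cost[g] for g in groups)
            val = sum(feature_score[f] for f in subset)
            if cost <= budget and val > best_val:
                best, best_val = subset, val
    return set(best), best_val

chosen, value = best_subset(budget=4.0)
```

With a budget of 4.0 the bureau group (cost 5.0) is unaffordable even though its features score highest, so the cheaper transaction and demographic features are selected instead; this is exactly the trade-off the formulations are built to control.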
110 citations
01 Dec 2017
TL;DR: The proposal incorporates a group penalty function in the SVM formulation in order to simultaneously penalize the variables that belong to the same group, assuming that companies often acquire groups of related variables for a given cost rather than acquiring them individually.
Abstract: In this paper, we propose a profit-driven approach for classifier construction and simultaneous variable selection based on linear Support Vector Machines. The main goal is to incorporate business-related information such as the variable acquisition costs, the Types I and II error costs, and the profit generated by correctly classified instances, into the modeling process. Our proposal incorporates a group penalty function in the SVM formulation in order to penalize the variables simultaneously that belong to the same group, assuming that companies often acquire groups of related variables for a given cost rather than acquiring them individually. The proposed framework was studied in a credit scoring problem for a Chilean bank, and led to superior performance with respect to business-related goals.
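The group-penalty idea can be sketched as follows. This is an assumed L2-norm-per-group penalty weighted by acquisition cost, in the spirit of group-lasso regularisation, not necessarily the paper's exact formulation; all variable names, groupings, and costs are illustrative.

```python
import math

# Hypothetical weight vector of a linear SVM and a grouping of its
# coefficients; the group names and costs are illustrative only.
w = {"debt_ratio": 0.8, "delinquencies": -0.3, "avg_balance": 0.5, "age": 0.0}
groups = {"bureau": ["debt_ratio", "delinquencies"],
          "transactions": ["avg_balance"],
          "demographics": ["age"]}
acquisition_cost = {"bureau": 5.0, "transactions": 3.0, "demographics": 1.0}

def group_penalty(w, groups, cost):
    """Cost-weighted group penalty (an L2-norm-per-group sketch): a
    group contributes its acquisition cost scaled by the norm of its
    coefficients, so zeroing a whole group removes its cost entirely."""
    return sum(cost[g] * math.sqrt(sum(w[f] ** 2 for f in members))
               for g, members in groups.items())

pen = group_penalty(w, groups, acquisition_cost)
```

Because the penalty is applied per group rather than per coefficient, the optimiser is pushed to discard entire variable bundles (here, "demographics" contributes nothing once its sole coefficient is zero), which matches the assumption that variables are purchased in groups.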
75 citations
TL;DR: A pricing and retail location model using the constrained multinomial logit (CMNL) is reported, which takes into account customers’ utility and maximum willingness to pay via cut-off soft constraints.
Abstract: The purpose of this paper is to report a pricing and retail location model using the constrained multinomial logit (CMNL), which takes into account customers’ utility and maximum willingness to pay via cut-off soft constraints. The proposed model is probabilistic and non-linear; therefore, a particle swarm optimization (PSO) metaheuristic was designed to determine the most suitable price, store locations and demand segmentation. The results obtained in test cases showed a close relationship between price and location decisions. In addition, the results suggest that not only price but also location decisions are affected when the consumers’ maximum willingness to pay is considered.
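To make the metaheuristic concrete, here is a minimal one-dimensional PSO sketch applied to a toy pricing objective. The willingness-to-pay cutoff of 10, the demand function, and all PSO hyper-parameters are assumptions for illustration; the paper's actual objective is the CMNL model over prices and locations jointly.

```python
import random

random.seed(0)

def negative_profit(price):
    """Toy stand-in for the CMNL objective: demand falls off linearly as
    price approaches a maximum willingness to pay of 10 (illustrative)."""
    demand = max(0.0, 10.0 - price)
    return -(price * demand)  # the PSO below minimises this

def pso(f, lo, hi, n_particles=20, iters=100):
    """Minimal particle swarm: each particle tracks its personal best,
    and all particles are pulled toward the swarm-wide best."""
    pos = [random.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]
    gbest = min(pos, key=f)
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            vel[i] = (0.7 * vel[i] + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=f)
    return gbest

best_price = pso(negative_profit, 0.0, 10.0)
```

For this toy revenue curve the optimum price is 5.0 (half the willingness-to-pay cutoff), and the swarm converges close to it; in the paper the search space additionally includes discrete store-location decisions.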
23 citations
TL;DR: The results show that by optimally reallocating the resources a 10–30% increase can be achieved in the number of emergency calls that are attended to with an adequate response in time and number of vehicles, without the need for additional fire stations or vehicles.
Abstract: The geographical distribution of the population of the city of Santiago, Chile, has changed significantly in recent years. In spite of this fact, the location of the fire stations has remained unchanged. We propose a model for the optimal location of the fire stations and a fleet assignment for the Santiago Fire Department (SFD), aimed at maximising the number of events attended to with a predefined standard response. The results of the model are compared with respect to the current location of fire stations and fleet assignment in the SFD. There are different types of resources (stations and vehicles), and different types of events in which the same types of vehicles are used. We analyse various possible current and future scenarios, using a forecast based on historical data. Our results show that by optimally reallocating the resources a 10–30% increase can be achieved in the number of emergency calls that are attended to with an adequate response in time and number of vehicles, without the need for additional fire stations or vehicles.
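The flavour of the siting decision can be illustrated with a greedy heuristic for the maximal covering location problem; the paper solves an exact optimisation model with multiple resource and event types, so this is only a simplified analogue, and all zone names, demand volumes, and coverage sets are hypothetical.

```python
# Hypothetical demand zones (with annual call volumes) and candidate
# station sites; a site "covers" the zones it can reach within the
# predefined standard response time.
demand = {"zone_a": 40, "zone_b": 25, "zone_c": 30, "zone_d": 5}
covers = {
    "site_1": {"zone_a", "zone_b"},
    "site_2": {"zone_b", "zone_c"},
    "site_3": {"zone_c", "zone_d"},
}

def greedy_cover(demand, covers, p):
    """Greedy heuristic for maximal covering location: repeatedly open
    the site that adds the most not-yet-covered demand, p times."""
    opened, covered = [], set()
    for _ in range(p):
        site = max(covers,
                   key=lambda s: sum(demand[z] for z in covers[s] - covered))
        opened.append(site)
        covered |= covers[site]
    return opened, sum(demand[z] for z in covered)

stations, served = greedy_cover(demand, covers, p=2)
```

Note that the greedy second pick is site_3, not the individually stronger site_2, because marginal (uncovered) demand is what matters once site_1 is open; the exact model in the paper captures the same interaction optimally.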
19 citations
TL;DR: In this paper, the authors propose a fleet assignment model for the Santiago Fire Department to maximize the number of incidents successfully attended (standard responses) by optimizing the existing siting via dynamic reallocation.
Abstract: The Santiago Fire Department (hereafter referred to as the SFD) lacks a fleet management strategy: its vehicles remain allocated to fixed fire stations, while the presence of seasonal patterns suggests that the frequency of events changes according to their geographical distribution. This fact has led to inequitable service in terms of response times among the nine zones of the SFD. In this empirical study we propose a fleet assignment model for the Santiago Fire Department to maximize the number of incidents successfully attended (standard responses). Results suggest that the implementation of the fleet management proposal will lead to an increase in the number of standard responses of between 6% and 20% with respect to the current situation. This increase in performance is especially important since it does not require new vehicles; it just optimizes the existing siting via dynamic reallocation.
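A minimal sketch of the dynamic-reallocation idea, assuming the simplest possible policy: distribute a fixed fleet across zones in proportion to the seasonal incident forecast. The paper's model optimises against the standard-response criterion rather than simple proportionality, and all zone names, seasons, and frequencies here are hypothetical.

```python
# Illustrative seasonal incident forecasts per zone (hypothetical).
forecast = {"summer": {"north": 50, "centre": 30, "south": 20},
            "winter": {"north": 20, "centre": 30, "south": 50}}

def reallocate(frequencies, fleet_size):
    """Largest-remainder apportionment of a fixed fleet to zones in
    proportion to forecast incident frequency."""
    total = sum(frequencies.values())
    shares = {z: fleet_size * f / total for z, f in frequencies.items()}
    alloc = {z: int(s) for z, s in shares.items()}
    leftover = fleet_size - sum(alloc.values())
    for z in sorted(shares, key=lambda z: shares[z] - alloc[z], reverse=True):
        if leftover == 0:
            break
        alloc[z] += 1
        leftover -= 1
    return alloc

summer = reallocate(forecast["summer"], fleet_size=10)
winter = reallocate(forecast["winter"], fleet_size=10)
```

The same ten vehicles end up distributed very differently across seasons, which is the core observation behind the proposal: performance gains come from moving existing vehicles, not from buying new ones.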
18 citations
Cited by
01 Jan 2013
TL;DR: In this article, the authors propose a theoretically and practically improved density-based hierarchical clustering method, which provides a clustering hierarchy from which a simplified tree of significant clusters can be constructed, and demonstrate that their approach outperforms the current state-of-the-art density-based clustering methods.
Abstract: We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.
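The flat-partition extraction can be sketched as a bottom-up comparison on the cluster tree: keep a cluster if its own stability exceeds the combined stability of the best selection among its descendants, otherwise keep the descendants. The tree structure and stability numbers below are illustrative, and this sketch omits details of the paper's condensed-tree construction.

```python
# Hypothetical condensed cluster tree: each node carries a stability
# score and a (possibly empty) list of child clusters.
tree = {
    "stability": 1.0,
    "children": [
        {"stability": 0.8, "children": [
            {"stability": 0.3, "children": []},
            {"stability": 0.2, "children": []},
        ]},
        {"stability": 0.9, "children": []},
    ],
}

def select_clusters(node):
    """Bottom-up stability selection: return the chosen clusters under
    this node and their total stability. A node replaces its descendants
    whenever its own stability is at least their combined stability."""
    if not node["children"]:
        return [node], node["stability"]
    picked, total = [], 0.0
    for child in node["children"]:
        c_picked, c_total = select_clusters(child)
        picked += c_picked
        total += c_total
    if node["stability"] >= total:
        return [node], node["stability"]
    return picked, total

selected, score = select_clusters(tree)
```

Here the left subtree's children (0.3 + 0.2) lose to their parent (0.8), but the root (1.0) loses to its children (0.8 + 0.9), so the flat partition mixes clusters from different levels of the hierarchy, i.e., different density thresholds.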
556 citations
TL;DR: In this article, the authors propose a new hybrid algorithm, the logit leaf model (LLM), which consists of two stages: a segmentation phase, in which customer segments are identified using decision rules, and a prediction phase, in which a model is created for every leaf of the resulting tree.
Abstract: Decision trees and logistic regression are two very popular algorithms in customer churn prediction with strong predictive performance and good comprehensibility. Despite these strengths, decision trees tend to have problems to handle linear relations between variables and logistic regression has difficulties with interaction effects between variables. Therefore a new hybrid algorithm, the logit leaf model (LLM), is proposed to better classify data. The idea behind the LLM is that different models constructed on segments of the data rather than on the entire dataset lead to better predictive performance while maintaining the comprehensibility from the models constructed in the leaves. The LLM consists of two stages: a segmentation phase and a prediction phase. In the first stage customer segments are identified using decision rules and in the second stage a model is created for every leaf of this tree. This new hybrid approach is benchmarked against decision trees, logistic regression, random forests and logistic model trees with regards to the predictive performance and comprehensibility. The area under the receiver operating characteristics curve (AUC) and top decile lift (TDL) are used to measure the predictive performance for which LLM scores significantly better than its building blocks logistic regression and decision trees and performs at least as well as more advanced ensemble methods random forests and logistic model trees. Comprehensibility is addressed by a case study for which we observe some key benefits using the LLM compared to using decision trees or logistic regression.
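The two-stage prediction can be sketched as follows: a decision rule routes each customer to a leaf, and that leaf's own logistic regression produces the churn probability. The single splitting rule, the coefficients, and the feature names below are hypothetical stand-ins for models that would in practice be fitted to data.

```python
import math

# Illustrative logit leaf model: one decision rule splits customers
# into two leaves; each leaf carries its own (hypothetical, pre-fitted)
# logistic regression coefficients.
leaf_models = {
    "tenure<12": {"intercept": -0.5, "usage": -1.2},   # newer customers
    "tenure>=12": {"intercept": -2.0, "usage": -0.4},  # loyal customers
}

def predict_churn(customer):
    """Stage 1: route the customer by the decision rule.
    Stage 2: apply that leaf's logistic regression."""
    leaf = "tenure<12" if customer["tenure"] < 12 else "tenure>=12"
    coefs = leaf_models[leaf]
    z = coefs["intercept"] + coefs["usage"] * customer["usage"]
    return leaf, 1.0 / (1.0 + math.exp(-z))

leaf, p = predict_churn({"tenure": 6, "usage": 0.2})
```

Because each leaf fits its own coefficients, the effect of usage on churn can differ by segment (here it is three times stronger for newer customers), which is how the LLM captures interaction effects while each leaf model stays as readable as an ordinary logistic regression.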
298 citations
TL;DR: A novel approach to feature selection in credit scoring applications is proposed, called the Information Gain Directed Feature Selection (IGDFS) algorithm, which ranks features by information gain and propagates the top m features through the genetic algorithm wrapper (GAW) using three classical machine learning algorithms, KNN, Naive Bayes and Support Vector Machine, for credit scoring.
Abstract: Financial credit scoring is one of the most crucial processes in the finance industry, used to assess the credit-worthiness of individuals and enterprises. Various statistics-based machine learning techniques have been employed for this task. The “curse of dimensionality” is still a significant challenge in machine learning. Some research has been carried out on feature selection (FS) using a genetic algorithm as a wrapper to improve the performance of credit scoring models. However, the challenge lies in finding an overall best method for credit scoring problems and in speeding up the time-consuming process of feature selection. In this study, the credit scoring problem is investigated through feature selection to improve classification performance. This work proposes a novel approach to feature selection in credit scoring applications, called the Information Gain Directed Feature Selection (IGDFS) algorithm, which ranks features by information gain and propagates the top m features through the genetic algorithm wrapper (GAW) using three classical machine learning algorithms, KNN, Naive Bayes and Support Vector Machine (SVM), for credit scoring. The first stage of information-gain-guided feature selection helps reduce the computational complexity of the GA wrapper, and the information gain of the features selected by IGDFS indicates their importance to decision making. Regarding classification accuracy, SVM is always better than KNN and NB for the baseline techniques, GAW and IGDFS. We can also conclude that IGDFS achieved better performance than generic GAW, and GAW obtained better performance than the corresponding single classifiers (baseline) in almost all cases; the exception is the German Credit dataset, where IGDFS + KNN performs worse than generic GAW and the single classifier KNN. Removing features with low information gain could conflict with the original data structure for KNN, and thus affect the performance of IGDFS + KNN. Regarding ROC performance, for the German Credit dataset the three classic machine learning algorithms, SVM, KNN and Naive Bayes, in the IGDFS GA wrapper obtained almost the same performance. For the Australian Credit dataset and the Taiwan Credit dataset, IGDFS + Naive Bayes achieved the largest area under the ROC curve.
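The first, filter stage of IGDFS (ranking features by information gain before the GA wrapper runs) can be sketched with the standard entropy-based definition; the toy credit features and labels below are hypothetical and assume discrete feature values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy after splitting on the
    (discrete) feature: the ranking criterion of IGDFS's first stage."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Hypothetical toy credit data: 'employed' separates good/bad risks
# perfectly, while 'owns_phone' carries no signal at all.
labels = ["good", "good", "bad", "bad"]
employed = [1, 1, 0, 0]
owns_phone = [1, 0, 1, 0]

gains = {"employed": information_gain(employed, labels),
         "owns_phone": information_gain(owns_phone, labels)}
top_m = sorted(gains, key=gains.get, reverse=True)  # fed to the GA wrapper
```

Only the top-ranked features are passed to the (much more expensive) GA wrapper stage, which is exactly where the claimed reduction in computational complexity comes from.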
191 citations
TL;DR: This article employs a systematic literature survey approach to review statistical and machine learning models in credit scoring, to identify limitations in the literature, to propose a guiding machine learning framework, and to point to emerging directions.
Abstract: In practice, as a well-known statistical method, the logistic regression model is used to evaluate the credit-worthiness of borrowers due to its simplicity and transparency in predictions. However, in the literature, sophisticated machine learning models can be found that can replace the logistic regression model. Despite the advances and applications of machine learning models in credit scoring, there are still two major issues: the inability of some machine learning models to explain their predictions, and the issue of imbalanced datasets. As such, there is a need for a thorough survey of recent literature in credit scoring. This article employs a systematic literature survey approach to review statistical and machine learning models in credit scoring, to identify limitations in the literature, to propose a guiding machine learning framework, and to point to emerging directions. The survey is based on 74 primary studies, comprising journal and conference articles published between 2010 and 2018. According to the meta-analysis of this survey, we found that, in general, an ensemble of classifiers performs better than single classifiers. Although deep learning models have not been applied extensively in the credit scoring literature, they show promising results.
141 citations