scispace - formally typeset
Author

Claudio Conversano

Bio: Claudio Conversano is an academic researcher at the University of Cagliari. His research focuses on topics including statistical modeling and sentiment analysis. He has an h-index of 11, has co-authored 65 publications, and has received 317 citations. His previous affiliations include the University of Cassino and the University of Naples Federico II.


Papers
Journal ArticleDOI
TL;DR: A new algorithm, the Simultaneous Threshold Interaction Modeling Algorithm (STIMA), is proposed to estimate a regression trunk model; it is more general and more efficient than the original regression trunk algorithm (RTA) and is implemented in the R package stima.
Abstract: Additive models and tree-based regression models are two main classes of statistical models used to predict the scores on a continuous response variable. It is known that additive models become very complex in the presence of higher order interaction effects, whereas some tree-based models, such as CART, have problems capturing linear main effects of continuous predictors. To overcome these drawbacks, the regression trunk model has been proposed: a multiple regression model with main effects and a parsimonious number of higher order interaction effects. The interaction effects can be represented by a small tree: a regression trunk. This article proposes a new algorithm, the Simultaneous Threshold Interaction Modeling Algorithm (STIMA), to estimate a regression trunk model; it is more general and more efficient than the original regression trunk algorithm (RTA) and is implemented in the R package stima. Results from a simulation study show that the performance of STIMA is satisfactory for sample sizes of 200 or higher. For sample sizes of 300 or higher, the 0.50 SE rule is the best pruning rule for a regression trunk in terms of power and Type I error. For sample sizes of 200, the 0.80 SE rule is recommended. Results from a comparative study of eight regression methods applied to ten benchmark datasets suggest that STIMA and GUIDE are the best performers in terms of cross-validated prediction error. STIMA appeared to be the best method for datasets containing many categorical variables. The characteristics of a regression trunk model are illustrated using the Boston house price dataset. Supplemental materials for this article, including the R package stima, are available online. © 2010 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
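The regression trunk idea (linear main effects plus a small tree whose leaves act as threshold-interaction terms) can be caricatured with a two-stage fit. This sketch is illustrative only: STIMA itself estimates both parts simultaneously, and the actual implementation is the R package stima; the data below are synthetic and every parameter choice is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
# Linear main effects plus a threshold interaction: an extra bump when
# x0 > 0 and x1 > 0 simultaneously
y = (1.5 * X[:, 0] - 2.0 * X[:, 1]
     + 3.0 * (X[:, 0] > 0) * (X[:, 1] > 0)
     + rng.normal(scale=0.5, size=n))

# Stage 1: fit the linear main-effects part
lin = LinearRegression().fit(X, y)
resid = y - lin.predict(X)

# Stage 2: fit a shallow tree (the "trunk") on the residuals; its few
# leaves play the role of threshold-interaction indicator terms
trunk = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20).fit(X, resid)

pred = lin.predict(X) + trunk.predict(X)
```

Unlike this two-stage caricature, STIMA re-estimates the regression coefficients and the trunk splits jointly, which is what makes it more efficient than sequential fitting.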

70 citations

Journal ArticleDOI
TL;DR: An incremental procedure based on the iterative use of tree-based methods is proposed, and a suitable Incremental Imputation Algorithm is introduced. The key idea is to define a lexicographic ordering of cases and variables so that conditional mean imputation via binary trees can be performed incrementally.
Abstract: In the framework of incomplete data analysis, this paper provides a nonparametric approach to missing data imputation based on Information Retrieval. In particular, an incremental procedure based on the iterative use of tree-based methods is proposed, and a suitable Incremental Imputation Algorithm is introduced. The key idea is to define a lexicographic ordering of cases and variables so that conditional mean imputation via binary trees can be performed incrementally. A simulation study and real data applications are carried out to assess its advantages and performance relative to standard approaches.

32 citations

Journal ArticleDOI
TL;DR: An integrated approach is proposed, which identifies a long list of key quality indicators (KQI), defines their properties, involves experts to elicit judgments for each KQI, evaluates the long list, and points out the most promising set.
Abstract: Recent interest in transit services has captured the attention of experts who monitor public transport quality. Previous research focused on relevant models and methods to monitor the quality of transit services and showed where and when different service quality levels occur. However, little attention has been paid to objectively identifying, from a large set, a pool of key quality indicators (KQI) to be monitored. This paper fills this gap by proposing an integrated approach that identifies a long list of KQI, defines their properties, involves experts to elicit judgments for each KQI, evaluates the long list, and points out the most promising set. The integrated approach is demonstrated with an application based on an international survey and a Monte Carlo simulation method. Moreover, a restricted and relevant set of 9 overlapping KQI is derived by linking these results with those obtained from two different approaches.
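The elicitation-plus-Monte-Carlo flavour of the approach can be illustrated with a toy stability analysis; everything here (panel size, scoring scale, the bootstrap scheme, the shortlist size of 9) is an assumption for illustration, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, n_kqi = 15, 30
# Hypothetical elicited judgments: each expert scores each candidate
# KQI on a 1..5 scale
scores = rng.integers(1, 6, size=(n_experts, n_kqi))

# Monte Carlo: bootstrap the expert panel and count how often each
# indicator lands in the top 9, mimicking a stability-based shortlist
top_k, n_sim = 9, 2000
counts = np.zeros(n_kqi)
for _ in range(n_sim):
    panel = rng.integers(0, n_experts, size=n_experts)  # resample experts
    mean_scores = scores[panel].mean(axis=0)
    counts[np.argsort(mean_scores)[-top_k:]] += 1

# The most consistently top-ranked KQIs form the restricted set
shortlist = np.argsort(counts)[-top_k:]
```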

28 citations

Journal ArticleDOI
TL;DR: The main methodological features and goals of pharmacoeconomic models, classified into three major categories (regression models, decision trees, and Markov models), are presented; decision makers are advised to interpret the results with extreme caution.
Abstract: We present an overview of the main methodological features and goals of pharmacoeconomic models, which we classify into three major categories: regression models, decision trees, and Markov models. In particular, we focus on Markov models and define a semi-Markov model for the cost-utility analysis of a vaccine for Dengue fever, discussing the key components of the model and the interpretation of its results. Next, we identify some criticalities of the decision rule arising from a possible incorrect interpretation of the model outcomes. Specifically, we focus on the difference between the median and mean ICER and on handling willingness-to-pay thresholds. We also show that the life span of the model and an incorrect hypothesis specification can lead to very different outcomes. Finally, we analyse the limits of Markov models when a large number of states is considered and focus on the implementation of tools that can bypass the memorylessness condition of Markov models. We conclude that decision makers should interpret the results of these models with extreme caution before deciding to fund any health care policy, and we give some recommendations about the appropriate use of these models.
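To make the Markov machinery concrete, here is a toy three-state cohort model computing an ICER. All transition probabilities, costs, utilities, and the vaccine price are invented for illustration; this bears no relation to the paper's actual Dengue model.

```python
import numpy as np

# Toy 3-state Markov cohort model (Healthy, Sick, Dead) comparing a
# hypothetical vaccine to no vaccine; every number is made up
P_novax = np.array([[0.90, 0.08, 0.02],    # rows sum to 1: transition
                    [0.00, 0.85, 0.15],    # probabilities out of each state
                    [0.00, 0.00, 1.00]])
P_vax = np.array([[0.95, 0.04, 0.01],
                  [0.00, 0.85, 0.15],
                  [0.00, 0.00, 1.00]])
cost = np.array([0.0, 2000.0, 0.0])        # annual cost per state
qaly = np.array([1.0, 0.6, 0.0])           # annual utility per state
vax_cost, horizon, disc = 150.0, 20, 0.03  # upfront cost, years, discount rate

def run(P, upfront):
    dist = np.array([1.0, 0.0, 0.0])       # whole cohort starts Healthy
    total_cost, total_qaly = upfront, 0.0
    for t in range(horizon):
        dist = dist @ P                    # one Markov cycle
        d = (1 + disc) ** -(t + 1)         # discount factor for this cycle
        total_cost += d * dist @ cost
        total_qaly += d * dist @ qaly
    return total_cost, total_qaly

c0, q0 = run(P_novax, 0.0)
c1, q1 = run(P_vax, vax_cost)
icer = (c1 - c0) / (q1 - q0)   # incremental cost per QALY gained
```

Note the memorylessness the abstract warns about: the transition out of "Sick" is the same regardless of how long the cohort has been sick, which is exactly the limitation that motivates semi-Markov extensions.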

20 citations

Book ChapterDOI
01 Jan 2009
TL;DR: Decision Tree Induction is a tool to induce a classification or regression model from (usually large) datasets characterized by n objects (records), each containing a set x of numerical or nominal attributes and a special feature y designated as its outcome.
Abstract: Decision Tree Induction (DTI) is a tool to induce a classification or regression model from (usually large) datasets characterized by n objects (records), each containing a set x of numerical or nominal attributes and a special feature y designated as its outcome. Statisticians use the term "predictors" for the attributes and "response variable" for the outcome. DTI builds a model that summarizes the underlying relationships between x and y. Two kinds of model can be estimated using decision trees: classification trees if y is nominal, and regression trees if y is numerical. Hereinafter we refer to classification trees to show the main features of DTI. For a detailed insight into the characteristics of regression trees see Hastie et al. (2001). As an example of classification tree, let us consider a sample of patients with prostate cancer on which data
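The nominal-outcome case described above (a classification tree) can be reproduced in a few lines. Since the chapter's prostate cancer data are not shown, a dataset bundled with scikit-learn stands in; the depth limit is an arbitrary choice for readability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# y is nominal (two classes), so DTI yields a classification tree
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)     # held-out classification accuracy

# Human-readable rendering of the induced splits
rules = export_text(clf)
```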

18 citations


Cited by

Book
29 Mar 2012
TL;DR: Covers the problem of missing data; the concepts of MCAR, MAR, and MNAR; simple solutions that do not (always) work; multiple imputation in a nutshell; and some dangers, some do's and some don'ts.
Abstract (table of contents):
Basics. Introduction: The problem of missing data; Concepts of MCAR, MAR and MNAR; Simple solutions that do not (always) work; Multiple imputation in a nutshell; Goal of the book; What the book does not cover; Structure of the book; Exercises.
Multiple imputation: Historic overview; Incomplete data concepts; Why and when multiple imputation works; Statistical intervals and tests; Evaluation criteria; When to use multiple imputation; How many imputations?; Exercises.
Univariate missing data: How to generate multiple imputations; Imputation under the normal linear model; Imputation under non-normal distributions; Predictive mean matching; Categorical data; Other data types; Classification and regression trees; Multilevel data; Non-ignorable methods; Exercises.
Multivariate missing data: Missing data pattern; Issues in multivariate imputation; Monotone data imputation; Joint Modeling; Fully Conditional Specification; FCS and JM; Conclusion; Exercises.
Imputation in practice: Overview of modeling choices; Ignorable or non-ignorable?; Model form and predictors; Derived variables; Algorithmic options; Diagnostics; Conclusion; Exercises.
Analysis of imputed data: What to do with the imputed data?; Parameter pooling; Statistical tests for multiple imputation; Stepwise model selection; Conclusion; Exercises.
Case studies. Measurement issues: Too many columns; Sensitivity analysis; Correct prevalence estimates from self-reported data; Enhancing comparability; Exercises.
Selection issues: Correcting for selective drop-out; Correcting for non-response; Exercises.
Longitudinal data: Long and wide format; SE Fireworks Disaster Study; Time raster imputation; Conclusion; Exercises.
Extensions. Conclusion: Some dangers, some do's and some don'ts; Reporting; Other applications; Future developments; Exercises.
Appendices: Software (R; S-Plus; Stata; SAS; SPSS; Other software). References. Author Index. Subject Index.

2,156 citations

Journal ArticleDOI
TL;DR: This article surveys the developments and briefly reviews the key ideas behind some of the major regression tree algorithms.
Abstract: Fifty years have passed since the publication of the first regression tree algorithm. New techniques have added capabilities that far surpass those of the early methods. Modern classification trees can partition the data with linear splits on subsets of variables and fit nearest neighbor, kernel density, and other models in the partitions. Regression trees can fit almost every kind of traditional statistical model, including least-squares, quantile, logistic, Poisson, and proportional hazards models, as well as models for longitudinal and multiresponse data. Greater availability and affordability of software (much of which is free) have played a significant role in helping the techniques gain acceptance and popularity in the broader scientific community. This article surveys the developments and briefly reviews the key ideas behind some of the major algorithms.

437 citations

Journal ArticleDOI
TL;DR: The authors present a nonparametric approach for implementing multiple imputation via chained equations by using sequential regression trees as the conditional models and demonstrate that the method can result in more plausible imputations, and hence more reliable inferences, in complex settings than the naive application of standard sequential regression imputation techniques.
Abstract: Multiple imputation is particularly well suited to deal with missing data in large epidemiologic studies, because typically these studies support a wide range of analyses by many data users. Some of these analyses may involve complex modeling, including interactions and nonlinear relations. Identifying such relations and encoding them in imputation models, for example, in the conditional regressions for multiple imputation via chained equations, can be daunting tasks with large numbers of categorical and continuous variables. The authors present a nonparametric approach for implementing multiple imputation via chained equations by using sequential regression trees as the conditional models. This has the potential to capture complex relations with minimal tuning by the data imputer. Using simulations, the authors demonstrate that the method can result in more plausible imputations, and hence more reliable inferences, in complex settings than the naive application of standard sequential regression imputation techniques. They apply the approach to impute missing values in data on adverse birth outcomes with more than 100 clinical and survey variables. They evaluate the imputations using posterior predictive checks with several epidemiologic analyses of interest.
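A stripped-down sketch of tree-based chained-equations imputation on synthetic data: fit a tree on the observed cases, impute each missing case by drawing a donor value from its predicted leaf, and repeat for a few sweeps. Drawing a random donor from the leaf (rather than the leaf mean) preserves imputation variability, which is the point of the tree-based MICE idea; the single-variable setting and all hyperparameters here are simplifications, not the authors' actual procedure.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 300
a = rng.normal(size=n)
# A nonlinear relation that a linear imputation model would miss
b = np.where(a > 0, a**2, -a) + rng.normal(scale=0.2, size=n)
df = pd.DataFrame({"a": a, "b": b})
df.loc[rng.choice(n, 60, replace=False), "b"] = np.nan

imp = df.copy()
imp["b"] = imp["b"].fillna(imp["b"].mean())   # crude initial fill

miss = df["b"].isna()                         # fixed missingness mask
for sweep in range(5):
    # Conditional model for b given a is a regression tree
    tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=sweep)
    tree.fit(imp.loc[~miss, ["a"]], imp.loc[~miss, "b"])
    leaves_obs = tree.apply(imp.loc[~miss, ["a"]])
    leaves_mis = tree.apply(imp.loc[miss, ["a"]])
    donors = imp.loc[~miss, "b"].to_numpy()
    # Redraw each missing value from the observed cases in its leaf
    new_vals = [rng.choice(donors[leaves_obs == leaf]) for leaf in leaves_mis]
    imp.loc[miss, "b"] = new_vals
```

Repeating this whole procedure M times with different seeds would yield the M completed datasets that multiple imputation requires.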

247 citations