Showing papers by "Jerome H. Friedman published in 2004"


Journal ArticleDOI
TL;DR: A new procedure is proposed for clustering attribute value data; used in conjunction with conventional distance-based clustering algorithms, it encourages them to automatically detect subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously.
Abstract: [Read before The Royal Statistical Society at a meeting organized by the Research Section on Wednesday, May 5th, 2004, Professor J. T. Kent in the Chair] Summary. A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms, this procedure encourages those algorithms to detect automatically subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different and partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity to especially low-cardinality groups that cluster on a small subset of the variables are discussed. Applications in different domains, including gene expression arrays, are presented.

440 citations
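
To make the idea concrete, here is a minimal Python sketch, not the published COSA algorithm: a k-means-style loop in which each cluster keeps its own attribute weights, updated through an inverse exponential of the within-cluster dispersion. The function name, the scale parameter lam, and the specific weight update are illustrative assumptions; the actual procedure in the paper works with pairwise distances supplied to conventional clustering algorithms.

```python
import numpy as np

def subset_weighted_kmeans(X, K, lam=0.2, n_iter=20, seed=0):
    """Toy k-means-style clustering with per-cluster attribute weights.

    Not the published COSA algorithm: just an illustration of the idea
    that each cluster maintains its own attribute weights, with the
    attributes on which the cluster is tight receiving larger weight
    via an inverse exponential of the within-cluster dispersion.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    centers = X[rng.choice(n, size=K, replace=False)].copy()
    W = np.full((K, p), 1.0 / p)                # one weight vector per cluster

    for _ in range(n_iter):
        # assign each object to the cluster with the smallest weighted distance
        d = np.stack([((X - centers[k]) ** 2 * W[k]).sum(axis=1)
                      for k in range(K)], axis=1)            # shape (n, K)
        labels = d.argmin(axis=1)

        for k in range(K):
            members = X[labels == k]
            if len(members) == 0:
                continue                        # keep previous center and weights
            centers[k] = members.mean(axis=0)
            s = ((members - centers[k]) ** 2).mean(axis=0)   # per-attribute spread
            w = np.exp(-s / lam)                # tight attributes -> larger weight
            W[k] = w / w.sum()

    return labels, W
```

On data where one group is compact on only a few attributes and diffuse on the rest, the returned weight matrix W concentrates that cluster's weight on those few attributes, which is the behaviour the abstract describes.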


ReportDOI
26 Jan 2004
TL;DR: It is shown how classification learning machines can be used to perform multivariate goodness-of-fit and two-sample testing.
Abstract: It is shown how classification learning machines can be used to do multivariate goodness-of-fit and two-sample testing.

76 citations
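
One standard way to operationalize this idea is a classifier two-sample test: pool the samples, label their origin, and check whether a learning machine separates them better than chance. The Python sketch below is a hypothetical helper, not the report's specific procedure or test statistic; it uses a random forest and a permutation test on cross-validated accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(X, Y, n_perm=100, seed=0):
    """Permutation two-sample test built on a classification learner.

    Pool the samples, label their origin 0/1, and score how well a
    classifier separates them (cross-validated accuracy).  Under the
    null hypothesis that X and Y share a distribution, the labels are
    exchangeable, so permuting them yields a reference distribution
    for the accuracy and hence a p-value.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    y = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])

    def cv_accuracy(labels):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(clf, Z, labels, cv=5).mean()

    observed = cv_accuracy(y)
    null = np.array([cv_accuracy(rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1.0 + np.sum(null >= observed)) / (1.0 + n_perm)
    return observed, p_value
```

If the two samples come from the same distribution, the cross-validated accuracy hovers around 0.5 and the p-value is large; systematic separability in any combination of variables pushes the accuracy above the permutation null.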


01 Jan 2004
TL;DR: Characteristics of popular ensemble methods such as bagging, random forests and boosting are examined and leveraged to create new predictive methodology, leading to accurate and interpretable RuleFit models.
Abstract: The goal of this dissertation is to study and develop automatic prediction technology that is accurate, fast and interpretable. The focus here is on decision tree ensemble methods and their extensions. Characteristics of popular ensemble methods such as bagging, random forests and boosting are examined and leveraged to create new predictive methodology. The classic ensembles are integrated into a unifying paradigm, Importance Sampled Learning Ensembles (ISLE). This framework explains some of the properties of these ensembles and suggests modifications that can significantly enhance their accuracy while dramatically improving their computational performance. ISLE methods are two-stage algorithms: a base-learner ensemble generation routine at the front end is followed by post-processing algorithms that perform a fast gradient-directed regularized fit for regression, robust regression and classification. The post-processing algorithms developed here can also serve as a stand-alone toolkit for fitting large linear systems. Decision tree ensembles can generate rules that are fit together with the gradient-directed regularized linear algorithms, leading to accurate and interpretable RuleFit models. ISLE and RuleFit are flexible methodologies, able to automatically handle non-linearities and interactions, mixtures of categorical and continuous variables with missing data, as well as feature selection.

8 citations
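
The two-stage structure described in the abstract can be illustrated with a small Python sketch: generate a library of shallow trees on random subsamples, then combine their predictions with a sparse linear fit. The helper below is an illustrative assumption, with an off-the-shelf Lasso standing in for the dissertation's gradient-directed regularized fit and raw tree predictions standing in for extracted rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso

def isle_style_fit(X, y, n_trees=100, max_depth=3, subsample=0.5,
                   alpha=0.01, seed=0):
    """Two-stage sketch: generate a tree library, then post-process it.

    Stage 1 fits shallow regression trees to random subsamples of the
    data (the base-learner generation step).  Stage 2 treats each tree's
    prediction as a basis function and selects a sparse linear
    combination with an L1 penalty -- a plain Lasso used here purely for
    illustration, in place of the gradient-directed regularized fit
    developed in the dissertation.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)

    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=max(1, int(subsample * n)), replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth,
                                     random_state=int(rng.integers(1 << 30)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)

    F = np.column_stack([t.predict(X) for t in trees])   # basis matrix: one column per tree
    post = Lasso(alpha=alpha).fit(F, y)                   # sparse combination of base learners

    def predict(X_new):
        F_new = np.column_stack([t.predict(np.asarray(X_new, dtype=float))
                                 for t in trees])
        return post.predict(F_new)

    return predict, post.coef_
```

The sparsity of the returned coefficient vector is what makes the post-processed ensemble both cheaper to evaluate and easier to interpret than the full tree library.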


Journal ArticleDOI
TL;DR: Discussions of three papers on the consistency of boosting methods: "Process consistency for AdaBoost" by W. Jiang, "On the Bayes-risk consistency of regularized boosting methods" by G. Lugosi and N. Vayatis, and "Statistical behavior and consistency of classification methods based on convex risk minimization" by T. Zhang, with rejoinders by the authors.
Abstract: Discussions of: "Process consistency for AdaBoost" [Ann. Statist. 32 (2004), no. 1, 13-29] by W. Jiang; "On the Bayes-risk consistency of regularized boosting methods" [ibid., 30-55] by G. Lugosi and N. Vayatis; and "Statistical behavior and consistency of classification methods based on convex risk minimization" [ibid., 56-85] by T. Zhang. Includes rejoinders by the authors.

7 citations


01 Jan 2004
TL;DR: This paper presents the motivation for clustering objects on subsets of attributes (COSA) and revisits the weights that are crucial in the COSA procedure but were rather underexposed as diagnostics in the original paper.
Abstract: The motivation for clustering objects on subsets of attributes (COSA) was given by consideration of data where the number of attributes is much larger than the number of objects. An obvious application is in systems biology (genomics, proteomics, and metabolomics). When we have a large number of attributes, objects might cluster on some attributes and be far apart on all others. Common data analysis approaches in systems biology are to cluster the attributes first, and only after having reduced the original many-attribute data set to a much smaller one does one try to cluster the objects. The problem here, of course, is that we would like to select those attributes that discriminate most among the objects (so we have to do this while regarding all attributes multivariately), and it is usually not good enough to inspect each attribute univariately. Therefore, two tasks have to be carried out simultaneously: cluster the objects into homogeneous groups, while selecting different subsets of variables (one for each group of objects). The attribute subset for any discovered group may completely overlap, partially overlap, or not overlap at all with those for other groups. The notorious local optima problem is dealt with by starting with the inverse exponential mean (rather than the arithmetic mean) of the separate attribute distances. By using a homotopy strategy, the algorithm creates a smooth transition of the inverse exponential distance to the mean of the ordinary Euclidean distances over attributes. New insight will be presented into the homotopy strategy and into the weights that are crucial in the COSA procedure but that were rather underexposed as diagnostics in the original paper.

1 citation
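
The inverse exponential mean mentioned above can be written down explicitly. The rendering below is an assumption about notation (d_{ijk} for the distance between objects i and j on attribute k, p attributes, scale parameter lambda) and should be checked against the original COSA paper for the definitive form.

```latex
% Inverse exponential mean of the p single-attribute distances d_{ijk}
% between objects i and j, with scale (homotopy) parameter \lambda > 0:
D^{(\lambda)}_{ij} \;=\; -\lambda \,\log\!\left( \frac{1}{p} \sum_{k=1}^{p} e^{-d_{ijk}/\lambda} \right)

% Limiting behaviour underlying the homotopy:
%   \lambda \to 0^{+}:    D^{(\lambda)}_{ij} \to \min_{k} d_{ijk}
%                          (closeness on even a few attributes dominates)
%   \lambda \to \infty:   D^{(\lambda)}_{ij} \to \tfrac{1}{p} \sum_{k=1}^{p} d_{ijk}
%                          (the ordinary mean distance over all attributes)
```

Increasing lambda during the iterations therefore moves the distance smoothly from a min-like regime, which rewards closeness on even a small subset of attributes, toward the ordinary mean over all attributes, which is the transition the homotopy strategy described in the abstract exploits.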