An Exploratory Technique for Investigating Large Quantities of Categorical Data

doi:10.2307/2986296

Journal Article•DOI•

01 Jun 1980-Applied statistics (JSTOR)-Vol. 29, Iss: 2, pp 119-127

TL;DR: The technique set out in the paper, CHAID, is an offshoot of AID (Automatic Interaction Detection) designed for a categorized dependent variable with built-in significance testing, multi-way splits, and a new type of predictor which is especially useful in handling missing information.

Abstract: The technique set out in the paper, CHAID, is an offshoot of AID (Automatic Interaction Detection) designed for a categorized dependent variable. Some important modifications which are relevant to standard AID include: built-in significance testing with the consequence of using the most significant predictor (rather than the most explanatory), multi-way splits (in contrast to binary) and a new type of predictor which is especially useful in handling missing information.
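
The idea of choosing "the most significant predictor" can be illustrated with a minimal sketch. It assumes a Pearson chi-square test on the cross-tabulation of each categorical predictor with the dependent variable and picks the predictor with the smallest p-value. The function names (`chi2_sf`, `pearson_chi2`, `most_significant_predictor`) are hypothetical, and the sketch omits the category merging and Bonferroni adjustment that CHAID applies before comparing predictors, so it is not the paper's full procedure.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, starting from
    Q(1/2, t) = erfc(sqrt(t)) or Q(1, t) = exp(-t).
    """
    if x <= 0:
        return 1.0
    t = x / 2
    odd = df % 2 == 1
    a = 0.5 if odd else 1.0
    q = math.erfc(math.sqrt(t)) if odd else math.exp(-t)
    while a < df / 2 - 1e-9:
        q += t ** a * math.exp(-t) / math.gamma(a + 1)
        a += 1
    return q


def pearson_chi2(xs, ys):
    """Return (statistic, df, p_value) for the contingency table of xs by ys."""
    n = len(xs)
    rows, cols, cells = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            stat += (cells[(r, c)] - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value


def most_significant_predictor(predictors, target):
    """Pick the predictor whose association with target has the smallest p-value.

    Assumption: no merging of categories and no Bonferroni multiplier.
    """
    p_values = {name: pearson_chi2(values, target)[2]
                for name, values in predictors.items()}
    return min(p_values, key=p_values.get), p_values
```

Calling `most_significant_predictor({"region": [...], "colour": [...]}, target)` with equal-length lists of category labels returns the chosen predictor name together with the p-value of every candidate.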

Citations

Book•

Data Mining: Concepts and Techniques

Jiawei Han¹, Micheline Kamber², Jian Pei²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Simon Fraser University²

08 Sep 2000

TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Journal Article•DOI•

Classification and regression trees

Wei-Yin Loh¹•Institutions (1)

University of Wisconsin-Madison¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.

Abstract: Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 14-23. DOI: 10.1002/widm.8. This article is categorized under: Technologies > Classification; Technologies > Machine Learning; Technologies > Prediction; Technologies > Statistical Fundamentals.
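
The recursive partitioning described in this abstract can be sketched for the regression case. This is a minimal illustration, assuming a single numeric predictor, binary splits at midpoints between observed values, squared-error loss, and a constant (mean) prediction in each leaf. The names `sse`, `best_split`, `grow`, and `predict` are hypothetical and do not correspond to any of the algorithms the article reviews.

```python
def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (threshold, sse_after_split) minimising squared error, or None
    when the predictor has fewer than two distinct values."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def grow(xs, ys, max_depth=2):
    """Recursively partition the data; each leaf predicts the mean response."""
    split = best_split(xs, ys)
    # Stop at the depth limit, or when no split reduces the squared error.
    if max_depth == 0 or split is None or split[1] >= sse(ys):
        return {"predict": sum(ys) / len(ys)}
    threshold = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return {
        "threshold": threshold,
        "left": grow([x for x, _ in left], [y for _, y in left], max_depth - 1),
        "right": grow([x for x, _ in right], [y for _, y in right], max_depth - 1),
    }


def predict(tree, x):
    """Follow the splits down to a leaf and return its mean."""
    while "predict" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["predict"]
```

The depth limit here is a simple direct stopping rule chosen for brevity; pruning, misclassification costs, and categorical predictors are left out.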

16,974 citations

Cites methods from "An Exploratory Technique for Invest..."

...All except C4.5 accept user-specified misclassification costs and all except C4.5 and CHAID accept user-specified class prior probabilities....
[...]
...5 (Quinlan, 1993), CHAID (Kass, 1980), CTREE (Hothorn et al....
[...]
...CHAID (Kass, 1980) • Extends AID to categorical and ordered dependent variables • Uses a direct stopping rule; no pruning • Uses significance tests to select split variables and split points • Uses Bonferroni method to control for multiple testing • Can split each node into more than two subnodes...
[...]
...Then, CHAID uses significance tests and Bonferroni corrections to try to iteratively merge pairs of child nodes....
[...]
...CHAID employs yet another strategy....
[...]
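
The pairwise merging mentioned in the excerpts above can be sketched as follows. This is a simplified illustration for a nominal predictor, assuming a Pearson chi-square test on each pair of groups, a hypothetical merge threshold `alpha_merge`, and a stop once two groups remain. The Bonferroni adjustment and the handling of ordered predictors are omitted, and the names `chi2_p_value` and `merge_categories` are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def chi2_p_value(xs, ys):
    """P-value of the Pearson chi-square test for the table of xs by ys."""
    n = len(xs)
    rows, cols, cells = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    dof = (len(rows) - 1) * (len(cols) - 1)
    if dof == 0:
        return 1.0
    stat = sum((cells[(r, c)] - n_r * n_c / n) ** 2 / (n_r * n_c / n)
               for r, n_r in rows.items() for c, n_c in cols.items())
    if stat <= 0:
        return 1.0
    # Upper tail of the chi-square distribution via the incomplete-gamma
    # recurrence Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1).
    t = stat / 2
    odd = dof % 2 == 1
    a = 0.5 if odd else 1.0
    q = math.erfc(math.sqrt(t)) if odd else math.exp(-t)
    while a < dof / 2 - 1e-9:
        q += t ** a * math.exp(-t) / math.gamma(a + 1)
        a += 1
    return q


def merge_categories(xs, ys, alpha_merge=0.05):
    """Repeatedly merge the pair of groups that differ least with respect to ys.

    Merging stops when the least different pair is significant at alpha_merge
    or when only two groups remain (an assumption of this sketch).
    """
    group_of = {x: frozenset([x]) for x in set(xs)}
    while len(set(group_of.values())) > 2:
        groups = sorted(set(group_of.values()), key=sorted)
        best_pair, best_p = None, -1.0
        for g1, g2 in combinations(groups, 2):
            rows = [(group_of[x], y) for x, y in zip(xs, ys)
                    if group_of[x] in (g1, g2)]
            p = chi2_p_value([g for g, _ in rows], [y for _, y in rows])
            if p > best_p:
                best_pair, best_p = (g1, g2), p
        if best_p <= alpha_merge:
            break  # every remaining pair differs significantly
        merged = best_pair[0] | best_pair[1]
        for x in merged:
            group_of[x] = merged
    return sorted(sorted(g) for g in set(group_of.values()))
```

Because the groups that survive merging become the child nodes, a predictor can yield more than two subnodes, which is how a multi-way split arises in this scheme.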

Journal Article•DOI•

Unbiased Recursive Partitioning: A Conditional Inference Framework

Torsten Hothorn¹, Kurt Hornik¹, Achim Zeileis¹•Institutions (1)

University of Erlangen-Nuremberg¹

01 Sep 2006-Journal of Computational and Graphical Statistics

TL;DR: A unified framework for recursive partitioning is proposed which embeds tree-structured regression models into a well defined theory of conditional inference procedures, and it is shown that the prediction accuracy of trees with early stopping is equivalent to that of pruned trees with unbiased variable selection.

Abstract: Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously affects the interpretability of tree-structured regression models. For some special cases unbiased procedures have been suggested, however lacking a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds tree-structured regression models into a well defined theory of conditional inference procedures. Stopping criteria based on multiple test procedures are implemented and it is shown that the predictive performance of the resulting trees is as good as the performance of established exhaustive search procedures. It turns out that the partitions and therefore the...

3,246 citations

Cites background or methods from "An Exploratory Technique for Invest..."

...Unbiased Recursive Partitioning:...
[...]
...This bias is induced by maximizing a splitting criterion over all possible splits simultaneously and was identified as a problem by many researchers (e.g., Kass 1980; Segal 1988; Breiman et al. 1984, p. 42)....
[...]
...The χ² automated interaction detection algorithm (CHAID, Kass 1980) is the first approach based on statistical significance tests for contingency tables....
[...]

Data Mining: Concepts and Techniques (2nd edition)

Jiawei Han, Micheline Kamber

01 Jan 2006

TL;DR: There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], and Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99].

Abstract: The book Knowledge Discovery in Databases, edited by Piatetsky-Shapiro and Frawley [PSF91], is an early collection of research papers on knowledge discovery from data. The book Advances in Knowledge Discovery and Data Mining, edited by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy [FPSSe96], is a collection of later research results on knowledge discovery and data mining. There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linoff [BL99], Building Data Mining Applications for CRM by Berson, Smith, and Thearling [BST99], Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank [WF05], Principles of Data Mining (Adaptive Computation and Machine Learning) by Hand, Mannila, and Smyth [HMS01], The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [HTF01], Data Mining: Introductory and Advanced Topics by Dunham, and Data Mining: Multimedia, Soft Computing, and Bioinformatics by Mitra and Acharya [MA03]. There are also books containing collections of papers on particular aspects of knowledge discovery, such as Machine Learning and Data Mining: Methods and Applications edited by Michalski, Bratko, and Kubat [MBK98], and Relational Data Mining edited by Dzeroski and Lavrac [De01], as well as many tutorial notes on data mining in major database, data mining and machine learning conferences.

2,591 citations

Journal Article•DOI•

X-tile: a new bio-informatics tool for biomarker assessment and outcome-based cut-point optimization

Robert L. Camp¹, Marisa Dolled-Filhart, David L. Rimm•Institutions (1)

Yale University¹

01 Nov 2004-Clinical Cancer Research

TL;DR: A graphical method is developed, the X-tile plot, that illustrates the presence of substantial tumor subpopulations and shows the robustness of the relationship between a biomarker and outcome by construction of a two dimensional projection of every possible subpopulation.

Abstract: The ability to parse tumors into subsets based on biomarker expression has many clinical applications; however, there is no global way to visualize the best cut-points for creating such divisions. We have developed a graphical method, the X-tile plot, that illustrates the presence of substantial tumor subpopulations and shows the robustness of the relationship between a biomarker and outcome by construction of a two-dimensional projection of every possible subpopulation. We validate X-tile plots by examining the expression of several established prognostic markers (human epidermal growth factor receptor-2, estrogen receptor, p53 expression, patient age, tumor size, and node number) in cohorts of breast cancer patients and show how X-tile plots of each marker predict population subsets rooted in the known biology of their expression.

2,551 citations

References

Book•

An introduction to probability theory and its applications

William Feller

01 Jan 1950

31,532 citations

Journal Article•DOI•

An Introduction to Probability Theory and Its Applications

David A. Freedman, William Feller

01 Jun 1958-Biometrika

16,450 citations

Journal Article•DOI•

An Introduction to Probability Theory and Its Applications.

A. T. Bharucha-Reid, William Feller

01 Apr 1952-American Mathematical Monthly

11,456 citations

Book•

Practical Nonparametric Statistics

W. J. Conover

01 Jan 1971

TL;DR: Probability Theory. Statistical Inference. Contingency Tables. Appendix Tables. Answers to Odd-Numbered Exercises.

Abstract: Probability Theory. Statistical Inference. Some Tests Based on the Binomial Distribution. Contingency Tables. Some Methods Based on Ranks. Statistics of the Kolmogorov-Smirnov Type. References. Appendix Tables. Answers to Odd-Numbered Exercises. Index.

10,382 citations

Journal Article•DOI•

The Advanced Theory of Statistics

Maurice G. Kendall, Alan Stuart

01 Apr 1963-Population

6,420 citations