
QUARTERLY OF APPLIED MATHEMATICS, VOL. 30, NO. 1, PP. 51-65
APRIL, 1972
SPECIAL ISSUE: SYMPOSIUM ON
"THE FUTURE OF APPLIED MATHEMATICS"
DATA ANALYSIS, COMPUTATION AND MATHEMATICS*
BY
JOHN W. TUKEY
Bell Telephone Laboratories, Murray Hill, and Princeton University
Abstract. "Data analysis" instead of "statistics" is a name that allows us to use
probability where it is needed and avoid it when we should. Data analysis has to analyze
real data. Most real data calls for data investigation, while almost all statistical theory
is concerned with data processing. This can be borne, in part because large segments
of data investigation are, by themselves, data processing. Summarizing a batch of
20 numbers is a convenient paradigm for more complex aims in data analysis. A partic-
ular summary, highly competitive among those known and known about in August 1971,
is a hybrid between two moderately complex summaries. Data investigation comes
in three stages: exploratory data analysis (no probability), rough confirmatory data
analysis (sign test procedures and the like), mustering and borrowing strength (the best
of modern robust techniques, and an art of knowing when to stop). Exploratory data
analysis can be improved by being made more resistant, either with medians or with
fancier summaries. Rough confirmatory data analysis can be improved by facing up to
the issues surrounding the choice of what is to be confirmed or disaffirmed. Borrowing
strength is imbedded in our classical procedures, though we often forget this. Mustering
strength calls for the best in robust summaries we can supply. The sampling behavior
of such a summary as the hybrid mentioned above is not going to be learned through
the mathematics of certainty, at least as we know it today, especially if we are realistic
about the diversity of non-Gaussian situations that are studied. The mathematics of
simulation, inevitably involving the mathematically sound "swindles" of Monte Carlo,
will be our trust and reliance. I illustrate results for a few summaries, including the
hybrid mentioned above. Bayesian techniques are still a problem to the author, mainly
because there seems to be no agreement on what their essence is. From my own point
of view, some statements of their essence are wholly acceptable and others are equally
unacceptable. The use of exogenous information in analyzing a given body of data is
a very different thing (a) depending on sample size and (b) depending on just how the
exogenous information is used. It would be a very fine thing if the questions that
practical data analysis has to have answered could be answered by the mathematics
of certainty. For my own part, I see no escape, for the next decade or so at least, from
a dependence on the mathematics of simulation, in which we should heed von Neumann's
aphorism as much as we can.
As one who was once a Brown chemist, I am happy to be back, and honored to
take part in this celebration.
* Prepared in part in connection with research at Princeton University sponsored by the Army
Research Office (Durham).

1. Names, what is in them? My title speaks of "data analysis" not "statistics",
and of "computation" not "computing science"; it does speak of "mathematics", but
only last. Why? The answers to these questions need a substructure of understanding
to which this talk will be devoted.
My brother-in-squared-law, Francis J. Anscombe [2, footnote to page 3] has com-
mented on my use of "data analysis" in the following words:
Whereas the content of Tukey's remarks is always worth pondering, some
of his terminology is hard to take. He seems to identify "statistics" with the
grotesque phenomenon generally known as "mathematical statistics", and finds
it necessary to replace "statistical analysis" by "data analysis". The change is a
little unfortunate because the statistician's data are the observer's facta, and
sometimes observer and statistician are the same person, in which case he is no
doubt primarily observer. Perhaps "facta analysis" is the answer.
One reason for being careful about names has been made clear by Gower and Ross
[5, pages 55-56] who say:
It is often argued that for a method to be statistical it should have some
probabilistic basis, but many methods profitably used by practicing statisticians
do not have this basis. In others (for example, analysis of variance) it is arguable
that the probabilistic features are not fundamental to the method.
Many of those who use the words "data analysis" adhere to the view that "It is well
to understand what you can do before you learn how to measure how well you seem
able to do it" [13]. I shall stick to this attitude today, and shall continue to use the words
"data analysis", in part to indicate that we can take probability seriously, or leave it
alone, as may from time to time be appropriate or necessary.
Data analysis is in important ways an antithesis of pure mathematics. I well remem-
ber a conversation in 1945 or 1946 with the late Walther Mayer, then at the Institute for
Advanced Study, who wondered at my interest in continuing at Bell Telephone Lab-
oratories, which he thought of as quite applied. He indicated how important it was for
him to know "that if I say $g_{ik}$ has certain properties, it really does". He knew that such
a fiat need not rule an application. A similar antithesis holds for many, perhaps all,
branches of applied mathematics, but often in a very much weaker form.
The practicing data analyst has to analyze data. The techniques that the theorizing
data analyst—often the same person—thinks about have to be used to analyze data.
It is never enough to be able to analyze simplified cases.
The membrane theory of shells did not have to design buildings, though hopefully
it guided their designing. The solution to the travelling salesman problem did not have
to take account of airline schedules. Early work in population genetics did not have to
consider the geographic structure and connectivity of each important type of ecological
niche. One analogue to these latter fields, where oversimplification has taught us much,
is statistical theory, which ought to do its share in guiding data analysis.
All too often, statistical theory is miscalled mathematical statistics, about which
too many practitioners (even some of them Englishmen!) take the dangerous view that
work can be good mathematical statistics without being either good mathematics or
good statistics. (Fortunately the number who do this are few.)

It will be our purpose to try to see how data analysis is reasonably practiced, how
statistical theory helps to guide it, and why computing will have to play a major role
in the development of newer and deeper guidance.
2. Flow charts, with or without switches? Procedures, theoretical or actual, for
passing from data to results fall naturally into two broad categories:
1. Those whose flow patterns involve significant switching points where the details
of the data determine, often by human intervention, what is to be done next to that
particular set of data—these are well called "data investigation".
2. Those whose flow patterns involve no significant data-driven switching (at least
to the hasty eye)—these we shall call "data processing".
It is a harsh fact, but true, that most data call for data investigation, while almost
all statistical theory is concerned with data processing. Harsh, but not quite as harsh
as it might seem.
I recall Professor Ducasse giving a paper to the philosophy seminar, a few hundred
yards west of here in Rhode Island Hall, in which he said deduction and induction could
be completely separated, that each of us was doing one or the other, not both. The tides
of controversy rose high, but ebbed away again once we all recognized that he meant
this separation to be instant by instant, rather than minute by minute or process by
process.
Statistical theory can, has, and will give very useful guidance to data investigation.
In most cases, however, this will be because its results are used to guide only some part,
smaller or larger, of the data-investigative process, a part that comes at least close
to being data processing.
If statistical theory really encompassed all the practical problems of data analysis,
and if we were able to implement all its derived precepts effectively, then there would
be no data investigation, only data processing. We are far from that tarnished Utopia
today—and not likely to attain it tomorrow.
3. Summarizing a batch—in various styles. The oldest problem of data analysis
is summarizing a batch of numbers—where a batch is a set of numbers, all of which
are taken as telling us about the same thing. When these numbers are the ages at death
for twelve children of a single parentage, they are not expected to be the same and often
differ widely. When these numbers are the precisely measured values of the same physical
constant obtained by twelve skilled observers they again ought not to be expected to
be the same—though too many forget this—but they often differ very little. In either
case summarization may be in order.
Until rather late in our intellectual history [4], summarization of such a set of obser-
vations was by picking out a good observation—and, I fear, claiming that it was the
best. By the time of Gauss, however, the use of the arithmetic mean was common,
and much effort was spent on trying to show that it was, in fact, the best. To be best
required that what was chosen as a summary could be compared with some "reality"
beyond the data. For this to be sensible, there had to be a process that generated varying
sets of data. The simplest situation that would produce sets of data showing many of
the idiosyncrasies of real data sets was random sampling from an infinite population.
Thus the supposed quality of the arithmetic mean was used to establish the Gaussian
distribution—and the supposed reality of the Gaussian distribution was used to es-
tablish the optimal nature of the arithmetic mean. No doubt the circularity was clear
to many who worked in this field, but there may have been more than a trace of the

attitude expressed by Hermann Weyl, who, when asked about his attitude to classical
and intuitionistic mathematics, said that he was only certain of what could be established
by intuitionistic methods, but that he liked to obtain results. It would have been a
great loss for mathematics had "the great, the noble, the holy Hermann" taken any
other view.
By the early 1930s, when I first met it, the practice of data analysis, at least in the
most skilled hands, had advanced to the point where summarization was data investiga-
tion, in the sense that apparently well-behaved data would be summarized by its means,
while apparently ill-behaved sets of numbers would be summarized in a more resistant
way, perhaps by their medians. Such a branching need not mean that the summary
cannot be a fixed function of the data; it does mean that this fixed function is not going
to be completely simple to write down.
If we can specify a general rule for the branching, we have effectively chosen a fixed
function that pleases us. When we go further and use this function routinely, we are
likely to reconvert summarization from data investigation to data processing. This will
surely be true if any switch-like character of the fixed function is sufficiently concealed.
Even when this is done, however, the microprocess of summarization is still likely to be
contained in a macroprocess of data investigation.
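As a concrete, invented illustration of such a fixed function with a built-in switch (the particular rule below is mine for illustration only, not one proposed in this article), a batch whose values all lie within two quartile-spreads of its rough quartiles is summarized by its mean, and any other batch by its median:

```python
import statistics


def branched_summary(batch):
    """Return the mean for an apparently well-behaved batch, the median otherwise."""
    x = sorted(batch)
    n = len(x)
    q1, q3 = x[n // 4], x[(3 * n) // 4]                          # rough quartiles
    spread = q3 - q1
    well_behaved = all(q1 - 2 * spread <= v <= q3 + 2 * spread for v in x)
    return statistics.mean(x) if well_behaved else statistics.median(x)
```

Written this way, the switch is a fixed, reproducible part of the summary, yet its branching character is easy to overlook once the routine is used mechanically.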
It has been nearly 30 years since I met data analysts smart enough to avoid the
arithmetic means most of the time. Yet our books and lectures still concentrate on it,
as if it were the good thing to do, and almost all our more sophisticated calculations—
analyses of variance, multiple regressions (even factor analyses)—use analogues of the
arithmetic mean rather than analogues of something safer and better. Progress has
been slow.
4. An example—the summary of the month. Since these lines are written on 31
August, I can designate a particular form of summary as the summary for August 1971
without claiming too much about the future, about how it will compare with those
summaries that will come to our attention in September, and in the months to come.
For simplicity—and because I happen to have better numbers for this special case—I
am going to restrict the definition to batches of 20. (A reasonable extension to all n is
easy; a really good extension may take thought and experimentation.)
Let us then consider 3 different summaries (central values) of 20 $x_i$'s ($i = 1, 2, \ldots, 20$),
namely CXO = ½(21B) + ½(C02), where 21B is defined implicitly, following a pattern
proposed by Hampel, while C02 is a sit mean (where "sit" is for skip-into-trim). (21B
was chosen here for simplicity of mathematical description; a closely related estimate—
perhaps the one presently designated 21E, the first step of a Newton-Raphson approxi-
mation to 21B, starting at C02—would tend to save computing time without appreciable
loss of performance.)
Let $\psi_{21B}$ be a polygonal function of a real variable defined by:

$$
\psi_{21B}(x) = (\operatorname{sign} x)\cdot
\begin{cases}
|x|, & 0 \le |x| < 2.1,\\[2pt]
2.1, & 2.1 \le |x| < 4.1,\\[2pt]
2.1\,\dfrac{9.1 - |x|}{9.1 - 4.1}, & 4.1 \le |x| < 9.1,\\[2pt]
0, & 9.1 \le |x|.
\end{cases}
$$

Then the value T of the estimate 21B is the solution of

$$\sum_{i} \psi_{21B}\!\left(\frac{x_i - T}{s}\right) = 0,$$

where s is the median of the absolute deviations $|x_i - \tilde{x}|$ of the $x_i$ from their median $\tilde{x}$.
(Replacing 2.1 etc. by 2.0 etc. would cost little. Results happen to be available for 2.1.)
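To make the definition concrete, here is a minimal sketch in Python (not Tukey's implementation). The cutoffs 2.1, 4.1 and 9.1 and the scale s come from the text; the bisection solver, the zero-scale fallback and all function names are assumptions added for illustration, whereas the text points to a Newton-Raphson step from C02 (the estimate designated 21E) as the practical route.

```python
import statistics


def psi_21b(x):
    """Hampel-type polygonal psi function with corners at 2.1, 4.1 and 9.1."""
    a = abs(x)
    sgn = 1.0 if x >= 0 else -1.0
    if a < 2.1:
        return x                                     # linear near zero
    if a < 4.1:
        return 2.1 * sgn                             # flat section
    if a < 9.1:
        return 2.1 * (9.1 - a) / (9.1 - 4.1) * sgn   # redescending section
    return 0.0                                       # far-out values get no weight


def estimate_21b(batch, tol=1e-10, max_iter=200):
    """Solve sum_i psi_21b((x_i - T)/s) = 0 for T by bisection.

    s is the median of absolute deviations from the batch median, as in the text.
    A redescending psi can admit several roots; this sketch returns one root
    bracketed by the smallest and largest observations.
    """
    med = statistics.median(batch)
    s = statistics.median([abs(x - med) for x in batch])
    if s == 0:                                       # degenerate batch: fall back to the median
        return med

    def g(t):
        return sum(psi_21b((x - t) / s) for x in batch)

    lo, hi = min(batch), max(batch)                  # g(lo) >= 0 >= g(hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```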
Let next $x_{(i)}$ be the $x_i$ rearranged in increasing order so that $x_{(i)} \le x_{(j)}$ for $i < j$,
where i and j run from 1 to 20. Let the hinges be $L = \tfrac{1}{2}x_{(5)} + \tfrac{1}{2}x_{(6)}$ and
$U = \tfrac{1}{2}x_{(15)} + \tfrac{1}{2}x_{(16)}$ (these are a form of quartile), and let the corners be
$C^- = L - 2(U - L) = 3L - 2U$ and $C^+ = U + 2(U - L) = 3U - 2L$. Identify and count any $x_{(i)}$ that are
$< C^-$ or $> C^+$. Such values will be called "detached". To calculate the estimate C02,
proceed as follows (a computational sketch is given after the list):
1) if no observations are detached, form the mean of all observations.
2) if exactly one observation is detached, set it aside (skipping), and then set aside
two more at each end of the remaining list (trimming); form the mean of those not set
aside (here 15 in number).
3) if more than one observation is detached, set them aside (skipping), and then set
aside 4 more at each end of the remaining list (trimming); form the mean of those not
set aside (here 10 to 4 in number).
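The skip-and-trim steps above translate directly into code. The following is a minimal sketch for batches of exactly 20 (the function and variable names are invented here), using the hinge and corner definitions just given:

```python
def estimate_c02(batch):
    """C02 "sit mean" (skip-into-trim) for a batch of 20 numbers."""
    x = sorted(batch)                       # x[0] is x(1), ..., x[19] is x(20)
    lower = 0.5 * (x[4] + x[5])             # lower hinge L = (x(5) + x(6)) / 2
    upper = 0.5 * (x[14] + x[15])           # upper hinge U = (x(15) + x(16)) / 2
    c_minus = 3 * lower - 2 * upper         # corner C- = L - 2(U - L)
    c_plus = 3 * upper - 2 * lower          # corner C+ = U + 2(U - L)

    kept = [v for v in x if c_minus <= v <= c_plus]   # skip the detached values
    n_detached = len(x) - len(kept)

    if n_detached == 0:
        return sum(kept) / len(kept)        # 1) mean of all observations
    trim = 2 if n_detached == 1 else 4      # 2) trim 2 per end, or 3) trim 4 per end
    trimmed = kept[trim:len(kept) - trim]
    return sum(trimmed) / len(trimmed)      # mean of those not set aside
```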
We shall return below to assessing the quality of these three estimates. Once we do,
it will be very hard to justify the arithmetic mean as a way of summarizing batches
(specifically for a batch of 20 numbers, actually for batches of more than 2 or perhaps 3).
The switching character of C02 is overt, that of 21B is covert. Both, as we shall see
later, perform well, while their mean, CXO, performs even better. Once a general com-
puting routine is coded, and we agree to use CXO "come hell or high water", which
would not be unwise in August or September 1971, we will (or would) have made this
kind of summarization into data processing again, at least for a time. If, as is so often
the case, however, we need to look at the data to see whether we want to summarize
x = y, x = √y or x = log y, where y represents the numbers given to us, this piece of
data processing, called summarization, is still embedded in an only slightly larger piece
of data investigation.
The class of estimators to which 21B belongs has been called "hampels" [1] and
that to which C02 belongs has been called "sit means". It is thus natural to call the
class to which CXO belongs "sitha estimates". The best comment about such estimates
that I know of was made 75 years in advance by Rudyard Kipling (1897) who wrote
" 'sitha', said he softly, 'thot's better than owt,' ..
We are, of course, quite likely, as we learn more, to come to like some other sitha
estimate even better than CXO.
5. Three stages—of data investigation. As we come to think over the process of
analyzing data, when done well, we can hardly fail to identify the unrealism of the
descriptions given or implied in our texts and lectures. The description I am about to
give emphasizes three kinds of stages. It is more realistic than the description we are
accustomed to but we dare not think it (or anything else) the ultimate in realism.
The first stage is exploratory data analysis, which does not need probability, signi-
ficance, or confidence, and which, when there is much data, may need to handle only
either a portion or a sample of what is available. That there is still much to be said and

References

P. J. Huber, Robust estimation of a location parameter, Annals of Mathematical Statistics, 1964.
J. C. Gower and G. J. S. Ross, Minimum spanning trees and single linkage cluster analysis, Applied Statistics, 1969.
F. Mosteller and D. L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, 1964.
J. M. Hammersley and K. W. Morton, A new Monte Carlo technique: antithetic variates, Proceedings of the Cambridge Philosophical Society, 1956.