
QUARTERLY OF APPLIED MATHEMATICS, VOL. 30, NO. 1, PP. 51-65
APRIL, 1972
SPECIAL ISSUE: SYMPOSIUM ON
"THE FUTURE OF APPLIED MATHEMATICS"
DATA ANALYSIS, COMPUTATION AND MATHEMATICS*
BY
JOHN W. TUKEY
Bell Telephone Laboratories, Murray Hill, and Princeton University
Abstract. "Data analysis" instead of "statistics" is a name that allows us to use
probability where it is needed and avoid it when we should. Data analysis has to analyze
real data. Most real data calls for data investigation, while almost all statistical theory
is concerned with data processing. This can be borne, in part because large segments
of data investigation are, by themselves, data processing. Summarizing a batch of
20 numbers is a convenient paradigm for more complex aims in data analysis. A partic-
ular summary, highly competitive among those known and known about in August 1971,
is a hybrid between two moderately complex summaries. Data investigation comes
in three stages: exploratory data analysis (no probability), rough confirmatory data
analysis (sign test procedures and the like), mustering and borrowing strength (the best
of modern robust techniques, and an art of knowing when to stop). Exploratory data
analysis can be improved by being made more resistant, either with medians or with
fancier summaries. Rough confirmatory data analysis can be improved by facing up to
the issues surrounding the choice of what is to be confirmed or disaffirmed. Borrowing
strength is imbedded in our classical procedures, though we often forget this. Mustering
strength calls for the best in robust summaries we can supply. The sampling behavior
of such a summary as the hybrid mentioned above is not going to be learned through
the mathematics of certainty, at least as we know it today, especially if we are realistic
about the diversity of non-Gaussian situations that are studied. The mathematics of
simulation, inevitably involving the mathematically sound "swindles" of Monte Carlo,
will be our trust and reliance. I illustrate results for a few summaries, including the
hybrid mentioned above. Bayesian techniques are still a problem to the author, mainly
because there seems to be no agreement on what their essence is. From my own point
of view, some statements of their essence are wholly acceptable and others are equally
unacceptable. The use of exogenous information in analyzing a given body of data is
a very different thing (a) depending on sample size and (b) depending on just how the
exogenous information is used. It would be a very fine thing if the questions that
practical data analysis has to have answered could be answered by the mathematics
of certainty. For my own part, I see no escape, for the next decade or so at least, from
a dependence on the mathematics of simulation, in which we should heed von Neumann's
aphorism as much as we can.
As one who was once a Brown chemist, I am happy to be back, and honored to
take part in this celebration.
* Prepared in part in connection with research at Princeton University sponsored by the Army
Research Office (Durham).

1. Names, what is in them? My title speaks of "data analysis" not "statistics",
and of "computation" not "computing science"; it does speak of "mathematics", but
only last. Why? The answers to these questions need a substructure of understanding
to which this talk will be devoted.
My brother-in-squared-law, Francis J. Anscombe [2, footnote to page 3] has com-
mented on my use of "data analysis" in the following words:
Whereas the content of Tukey's remarks is always worth pondering, some
of his terminology is hard to take. He seems to identify "statistics" with the
grotesque phenomenon generally known as "mathematical statistics", and finds
it necessary to replace "statistical analysis" by "data analysis". The change is a
little unfortunate because the statistician's data are the observer's facta, and
sometimes observer and statistician are the same person, in which case he is no
doubt primarily observer. Perhaps "facta analysis" is the answer.
One reason for being careful about names has been made clear by Gower and Ross
[5, pages 55-56] who say:
It is often argued that for a method to be statistical it should have some
probabilistic basis, but many methods profitably used by practicing statisticians
do not have this basis. In others (for example, analysis of variance) it is arguable
that the probabilistic features are not fundamental to the method.
Many of those who use the words "data analysis" adhere to the view that "It is well
to understand what you can do before you learn how to measure how well you seem
able to do it" [13]. I shall stick to this attitude today, and shall continue to use the words
"data analysis", in part to indicate that we can take probability seriously, or leave it
alone, as may from time to time be appropriate or necessary.
Data analysis is in important ways an antithesis of pure mathematics. I well remem-
ber a conversation in 1945 or 1946 with the late Walther Mayer, then at the Institute for
Advanced Study, who wondered at my interest in continuing at Bell Telephone Lab-
oratories, which he thought of as quite applied. He indicated how important it was for
him to know "that if I say $g_{ik}$ has certain properties, it really does". He knew that such
a fiat need not rule an application. A similar antithesis holds for many, perhaps all,
branches of applied mathematics, but often in a very much weaker form.
The practicing data analyst has to analyze data. The techniques that the theorizing
data analyst—often the same person—thinks about have to be used to analyze data.
It is never enough to be able to analyze simplified cases.
The membrane theory of shells did not have to design buildings, though hopefully
it guided their designing. The solution to the travelling salesman problem did not have
to take account of airline schedules. Early work in population genetics did not have to
consider the geographic structure and connectivity of each important type of ecological
niche. One analogue to these latter fields, where oversimplification has taught us much,
is statistical theory, which ought to do its share in guiding data analysis.
All too often, statistical theory is miscalled mathematical statistics, about which
too many practitioners (even some of them Englishmen!) take the dangerous view that
work can be good mathematical statistics without being either good mathematics or
good statistics. (Fortunately the number who do this are few.)

It will be our purpose to try to see how data analysis is reasonably practiced, how
statistical theory helps to guide it, and why computing will have to play a major role
in the development of newer and deeper guidance.
2. Flow charts, with or without switches? Procedures, theoretical or actual, for
passing from data to results fall naturally into two broad categories:
1. Those whose flow patterns involve significant switching points where the details
of the data determine, often by human intervention, what is to be done next to that
particular set of data—these are well called "data investigation".
2. Those whose flow patterns involve no significant data-driven switching (at least
to the hasty eye)—these we shall call "data processing".
It is a harsh fact, but true, that most data call for data investigation, while almost
all statistical theory is concerned with data processing. Harsh, but not quite as harsh
as it might seem.
I recall Professor Ducasse giving a paper to the philosophy seminar, a few hundred
yards west of here in Rhode Island Hall, in which he said deduction and induction could
be completely separated, that each of us was doing one or the other, not both. The tides
of controversy rose high, but ebbed away again once we all recognized that he meant
this separation to be instant by instant, rather than minute by minute or process by
process.
Statistical theory can, has, and will give very useful guidance to data investigation.
In most cases, however, this will be because its results are used to guide only some part,
smaller or larger, of the data-investigative process, a part that comes at least close
to being data processing.
If statistical theory really encompassed all the practical problems of data analysis,
and if we were able to implement all its derived precepts effectively, then there would
be no data investigation, only data processing. We are far from that tarnished Utopia
today—and not likely to attain it tomorrow.
3. Summarizing a batch—in various styles. The oldest problem of data analysis
is summarizing a batch of numbers—where a batch is a set of numbers, all of which
are taken as telling us about the same thing. When these numbers are the ages at death
for twelve children of a single parentage, they are not expected to be the same and often
differ widely. When these numbers are the precisely measured values of the same physical
constant obtained by twelve skilled observers they again ought not to be expected to
be the same—though too many forget this—but they often differ very little. In either
case summarization may be in order.
Until rather late in our intellectual history [4], summarization of such a set of obser-
vations was by picking out a good observation—and, I fear, claiming that it was the
best. By the time of Gauss, however, the use of the arithmetic mean was common,
and much effort was spent on trying to show that it was, in fact, the best. To be best
required that what was chosen as a summary could be compared with some "reality"
beyond the data. For this to be sensible, there had to be a process that generated varying
sets of data. The simplest situation that would produce sets of data showing many of
the idiosyncrasies of real data sets was random sampling from an infinite population.
Thus the supposed quality of the arithmetic mean was used to establish the Gaussian
distribution—and the supposed reality of the Gaussian distribution was used to es-
tablish the optimal nature of the arithmetic mean. No doubt the circularity was clear
to many who worked in this field, but there may have been more than a trace of the

attitude expressed by Hermann Weyl, who, when asked about his attitude to classical
and intuitionistic mathematics, said that he was only certain of what could be established
by intuitionistic methods, but that he liked to obtain results. It would have been a
great loss for mathematics had "the great, the noble, the holy Hermann" taken any
other view.
By the early 1930s, when I first met it, the practice of data analysis, at least in the
most skilled hands, had advanced to the point where summarization was data investiga-
tion, in the sense that apparently well-behaved data would be summarized by its means,
while apparently ill-behaved sets of numbers would be summarized in a more resistant
way, perhaps by their medians. Such a branching need not mean that the summary
cannot be a fixed function of the data; it does mean that this fixed function is not going
to be completely simple to write down.
If we can specify a general rule for the branching, we have effectively chosen a fixed
function that pleases us. When we go further and use this function routinely, we are
likely to reconvert summarization from data investigation to data processing. This will
surely be true if any switch-like character of the fixed function is sufficiently concealed.
Even when this is done, however, the microprocess of summarization is still likely to be
contained in a macroprocess of data investigation.
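As a concrete, invented illustration of such a fixed function with a built-in switch (the particular rule below is mine for illustration only, not one proposed in this article), a batch whose values all lie within two quartile-spreads of its rough quartiles is summarized by its mean, and any other batch by its median:

```python
import statistics


def branched_summary(batch):
    """Return the mean for an apparently well-behaved batch, the median otherwise."""
    x = sorted(batch)
    n = len(x)
    q1, q3 = x[n // 4], x[(3 * n) // 4]                          # rough quartiles
    spread = q3 - q1
    well_behaved = all(q1 - 2 * spread <= v <= q3 + 2 * spread for v in x)
    return statistics.mean(x) if well_behaved else statistics.median(x)
```

Written this way, the switch is a fixed, reproducible part of the summary, yet its branching character is easy to overlook once the routine is used mechanically.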
It has been nearly 30 years since I met data analysts smart enough to avoid the
arithmetic means most of the time. Yet our books and lectures still concentrate on it,
as if it were the good thing to do, and almost all our more sophisticated calculations—
analyses of variance, multiple regressions (even factor analyses)—use analogues of the
arithmetic mean rather than analogues of something safer and better. Progress has
been slow.
4. An example—the summary of the month. Since these lines are written on 31
August, I can designate a particular form of summary as the summary for August 1971
without claiming too much about the future, about how it will compare with those
summaries that will come to our attention in September, and in the months to come.
For simplicity—and because I happen to have better numbers for this special case—I
am going to restrict the definition to batches of 20. (A reasonable extension to all n is
easy; a really good extension may take thought and experimentation.)
Let us then consider 3 different summaries (central values) of 20 $x_i$'s ($i = 1, 2, \ldots, 20$),
namely CXO = ½(21B) + ½(C02), where 21B is defined implicitly, following a pattern
proposed by Hampel, while C02 is a sit mean (where "sit" is for skip-into-trim). (21B
was chosen here for simplicity of mathematical description; a closely related estimate—
perhaps the one presently designated 21E, the first step of a Newton-Raphson approxi-
mation to 21B, starting at C02—would tend to save computing time without appreciable
loss of performance.)
Let $\psi_{21B}$ be a polygonal function of a real variable defined by:

$$
\psi_{21B}(x) = (\operatorname{sign} x)\cdot
\begin{cases}
|x|, & 0 \le |x| < 2.1,\\[2pt]
2.1, & 2.1 \le |x| < 4.1,\\[2pt]
2.1\,\dfrac{9.1 - |x|}{9.1 - 4.1}, & 4.1 \le |x| < 9.1,\\[2pt]
0, & 9.1 \le |x|.
\end{cases}
$$

Then the value T of the estimate 21B is the solution of

$$\sum_{i} \psi_{21B}\!\left(\frac{x_i - T}{s}\right) = 0,$$

where s is the median of the absolute deviations $|x_i - \tilde{x}|$ of the $x_i$ from their median $\tilde{x}$.
(Replacing 2.1 etc. by 2.0 etc. would cost little. Results happen to be available for 2.1.)
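To make the definition concrete, here is a minimal sketch in Python (not Tukey's implementation). The cutoffs 2.1, 4.1 and 9.1 and the scale s come from the text; the bisection solver, the zero-scale fallback and all function names are assumptions added for illustration, whereas the text points to a Newton-Raphson step from C02 (the estimate designated 21E) as the practical route.

```python
import statistics


def psi_21b(x):
    """Hampel-type polygonal psi function with corners at 2.1, 4.1 and 9.1."""
    a = abs(x)
    sgn = 1.0 if x >= 0 else -1.0
    if a < 2.1:
        return x                                     # linear near zero
    if a < 4.1:
        return 2.1 * sgn                             # flat section
    if a < 9.1:
        return 2.1 * (9.1 - a) / (9.1 - 4.1) * sgn   # redescending section
    return 0.0                                       # far-out values get no weight


def estimate_21b(batch, tol=1e-10, max_iter=200):
    """Solve sum_i psi_21b((x_i - T)/s) = 0 for T by bisection.

    s is the median of absolute deviations from the batch median, as in the text.
    A redescending psi can admit several roots; this sketch returns one root
    bracketed by the smallest and largest observations.
    """
    med = statistics.median(batch)
    s = statistics.median([abs(x - med) for x in batch])
    if s == 0:                                       # degenerate batch: fall back to the median
        return med

    def g(t):
        return sum(psi_21b((x - t) / s) for x in batch)

    lo, hi = min(batch), max(batch)                  # g(lo) >= 0 >= g(hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```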
Let next $x_{(i)}$ be the $x_i$ rearranged in increasing order so that $x_{(i)} \le x_{(j)}$ for $i < j$,
where i and j run from 1 to 20. Let the hinges be $L = \tfrac{1}{2}x_{(5)} + \tfrac{1}{2}x_{(6)}$ and
$U = \tfrac{1}{2}x_{(15)} + \tfrac{1}{2}x_{(16)}$ (these are a form of quartile), and let the corners be
$C^- = L - 2(U - L) = 3L - 2U$ and $C^+ = U + 2(U - L) = 3U - 2L$. Identify and count any $x_{(i)}$ that are
$< C^-$ or $> C^+$. Such values will be called "detached". To calculate the estimate C02,
proceed as follows (a computational sketch is given after the list):
1) if no observations are detached, form the mean of all observations.
2) if exactly one observation is detached, set it aside (skipping), and then set aside
two more at each end of the remaining list (trimming); form the mean of those not set
aside (here 15 in number).
3) if more than one observation is detached, set them aside (skipping), and then set
aside 4 more at each end of the remaining list (trimming); form the mean of those not
set aside (here 10 to 4 in number).
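The skip-and-trim steps above translate directly into code. The following is a minimal sketch for batches of exactly 20 (the function and variable names are invented here), using the hinge and corner definitions just given:

```python
def estimate_c02(batch):
    """C02 "sit mean" (skip-into-trim) for a batch of 20 numbers."""
    x = sorted(batch)                       # x[0] is x(1), ..., x[19] is x(20)
    lower = 0.5 * (x[4] + x[5])             # lower hinge L = (x(5) + x(6)) / 2
    upper = 0.5 * (x[14] + x[15])           # upper hinge U = (x(15) + x(16)) / 2
    c_minus = 3 * lower - 2 * upper         # corner C- = L - 2(U - L)
    c_plus = 3 * upper - 2 * lower          # corner C+ = U + 2(U - L)

    kept = [v for v in x if c_minus <= v <= c_plus]   # skip the detached values
    n_detached = len(x) - len(kept)

    if n_detached == 0:
        return sum(kept) / len(kept)        # 1) mean of all observations
    trim = 2 if n_detached == 1 else 4      # 2) trim 2 per end, or 3) trim 4 per end
    trimmed = kept[trim:len(kept) - trim]
    return sum(trimmed) / len(trimmed)      # mean of those not set aside
```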
We shall return below to assessing the quality of these three estimates. Once we do,
it will be very hard to justify the arithmetic mean as a way of summarizing batches
(specifically for a batch of 20 numbers, actually for batches of more than 2 or perhaps 3).
The switching character of C02 is overt, that of 21B is covert. Both, as we shall see
later, perform well, while their mean, CXO, performs even better. Once a general com-
puting routine is coded, and we agree to use CXO "come hell or high water", which
would not be unwise in August or September 1971, we will (or would) have made this
kind of summarization into data processing again, at least for a time. If, as is so often
the case, however, we need to look at the data to see whether we want to summarize
x = y, x = √y or x = log y, where y represents the numbers given to us, this piece of
data processing, called summarization, is still embedded in an only slightly larger piece
of data investigation.
The class of estimators to which 21B belongs has been called "hampels" [1] and
that to which C02 belongs has been called "sit means". It is thus natural to call the
class to which CXO belongs "sitha estimates". The best comment about such estimates
that I know of was made 75 years in advance by Rudyard Kipling (1897) who wrote
" 'sitha', said he softly, 'thot's better than owt,' ..
We are, of course, quite likely, as we learn more, to come to like some other sitha
estimate even better than CXO.
5. Three stages—of data investigation. As we come to think over the process of
analyzing data, when done well, we can hardly fail to identify the unrealism of the
descriptions given or implied in our texts and lectures. The description I am about to
give emphasizes three kinds of stages. It is more realistic than the description we are
accustomed to but we dare not think it (or anything else) the ultimate in realism.
The first stage is exploratory data analysis, which does not need probability, signi-
ficance, or confidence, and which, when there is much data, may need to handle only
either a portion or a sample of what is available. That there is still much to be said and

References

P. J. Huber, Robust estimation of a location parameter, Annals of Mathematical Statistics, 1964.
J. C. Gower and G. J. S. Ross, Minimum spanning trees and single linkage cluster analysis, Applied Statistics, 1969.
F. Mosteller and D. L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, 1964.
J. M. Hammersley and K. W. Morton, A new Monte Carlo technique: antithetic variates, Proceedings of the Cambridge Philosophical Society, 1956.