Showing papers in "Journal of Computational and Graphical Statistics in 1993"


Journal ArticleDOI
TL;DR: A comparison and evaluation of recent proposals for multivariate matched sampling in observational studies finds that optimal matching is sometimes noticeably better than greedy matching in the sense of producing closely matched pairs, sometimes only marginally better, but it is no better than greedy matching in terms of producing balanced matched samples.
Abstract: A comparison and evaluation is made of recent proposals for multivariate matched sampling in observational studies, where the following three questions are answered: (1) Algorithms: In current statistical practice, matched samples are formed using “nearest available” matching, a greedy algorithm. Greedy matching does not minimize the total distance within matched pairs, though good algorithms exist for optimal matching that do minimize the total distance. How much better is optimal matching than greedy matching? We find that optimal matching is sometimes noticeably better than greedy matching in the sense of producing closely matched pairs, sometimes only marginally better, but it is no better than greedy matching in the sense of producing balanced matched samples. (2) Structures: In common practice, treated units are matched to one control, called pair matching or 1–1 matching, or treated units are matched to two controls, called 1–2 matching, and so on. It is known, however, that the optimal st...

635 citations
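
To make the greedy/optimal contrast concrete, here is a minimal sketch (not the article's code) that compares nearest-available greedy 1–1 matching with an optimal 1–1 assignment on a hypothetical distance matrix; the optimal match uses scipy.optimize.linear_sum_assignment (the Hungarian algorithm) as a stand-in for whichever optimal matching algorithm the article evaluates.

```python
# Sketch: greedy "nearest available" 1-1 matching vs. optimal 1-1 matching on a
# hypothetical distance matrix (e.g., Mahalanobis distances on covariates).
# The optimal match uses the Hungarian algorithm via scipy; it is a stand-in for
# the optimal matching algorithms evaluated in the article, not the article's code.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_treated, n_control = 20, 40
dist = rng.exponential(size=(n_treated, n_control))   # rows: treated, cols: controls

def greedy_match(dist):
    """Each treated unit, taken in order, grabs the closest unused control."""
    available = set(range(dist.shape[1]))
    pairs = []
    for t in range(dist.shape[0]):
        c = min(available, key=lambda j: dist[t, j])
        available.remove(c)
        pairs.append((t, c))
    return pairs

def optimal_match(dist):
    """Optimal 1-1 matching: minimizes the total within-pair distance."""
    rows, cols = linear_sum_assignment(dist)
    return list(zip(rows, cols))

for name, pairs in [("greedy", greedy_match(dist)), ("optimal", optimal_match(dist))]:
    total = sum(dist[t, c] for t, c in pairs)
    print(f"{name:7s} total within-pair distance: {total:.3f}")
```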


Journal ArticleDOI
TL;DR: In this article, the properties of normal/independent distributions are reviewed and several new results are presented for adaptive, robust regression with non-normal error distributions, such as the t, slash, and contaminated normal families.
Abstract: Maximum likelihood estimation with nonnormal error distributions provides one method of robust regression. Certain families of normal/independent distributions are particularly attractive for adaptive, robust regression. This article reviews the properties of normal/independent distributions and presents several new results. A major virtue of these distributions is that they lend themselves to EM algorithms for maximum likelihood estimation. EM algorithms are discussed for least Lp regression and for adaptive, robust regression based on the t, slash, and contaminated normal families. Four concrete examples illustrate the performance of the different methods on real data.

321 citations
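
As a concrete illustration of the EM idea for one member of the normal/independent family, here is a minimal sketch of EM (equivalently, iteratively reweighted least squares) for linear regression with t errors and fixed degrees of freedom nu. It is not the article's code, and the article additionally treats adaptive estimation as well as the slash and contaminated normal families.

```python
# Sketch: EM / IRLS for linear regression with t-distributed errors, nu fixed.
# E-step: conditional weights w_i = (nu + 1) / (nu + r_i^2 / sigma^2).
# M-step: weighted least squares for beta, weighted residual mean square for sigma^2.
import numpy as np

def t_regression_em(X, y, nu=4.0, n_iter=50):
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS start
    sigma2 = np.mean((y - X @ beta) ** 2)
    for _ in range(n_iter):
        r = y - X @ beta
        w = (nu + 1.0) / (nu + r ** 2 / sigma2)           # E-step weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]  # M-step: WLS
        sigma2 = np.sum(w * (y - X @ beta) ** 2) / n
    return beta, sigma2

# Example with a gross outlier: the t-based fit downweights it, OLS does not.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)
y[0] += 20.0                                              # contaminate one observation
print("OLS :", np.linalg.lstsq(X, y, rcond=None)[0])
print("EM-t:", t_regression_em(X, y)[0])
```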


Journal ArticleDOI
TL;DR: An attempt is made to isolate the mechanisms that produce the observed features by synthesizing similar features in dotplots of artificial sequences, and an approximation that makes the calculation of dotplots practical for use in an interactive browser is introduced.
Abstract: An interactive program, dotplot, has been developed for browsing millions of lines of text and source code, using an approach borrowed from biology for studying homology (self-similarity) in DNA sequences. With conventional browsing tools such as a screen editor, it is difficult to identify structures that are too big to fit on the screen. In contrast, with dotplots we find that many of these structures show up as diagonals, squares, textures, and other visually recognizable features, as will be illustrated in examples selected from biology and two new application domains, text (AP news, Canadian Hansards) and source code (5ESS®). In an attempt to isolate the mechanisms that produce these features, we have synthesized similar features in dotplots of artificial sequences. We also introduce an approximation that makes the calculation of dotplots practical for use in an interactive browser.

150 citations
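
A minimal sketch of the basic dotplot computation (not the article's implementation, which adds an approximation for interactive speed): tokenize a sequence and mark every position pair whose tokens match, so repeats show up as diagonals and other visible structure. The toy sentence below is a hypothetical input; any token stream (lines of source code, words of text, DNA bases) works the same way.

```python
# Sketch of a dotplot: a boolean matrix with a dot at (i, j) wherever token i
# equals token j. Self-comparison makes repeated phrases appear as diagonals.
import numpy as np
import matplotlib.pyplot as plt

def dotplot(tokens_a, tokens_b):
    a = np.asarray(tokens_a, dtype=object)
    b = np.asarray(tokens_b, dtype=object)
    return a[:, None] == b[None, :]        # boolean match matrix

text = "the cat sat on the mat because the cat was tired".split()  # toy input
mask = dotplot(text, text)                 # self-similarity
plt.imshow(mask, cmap="Greys", interpolation="nearest")
plt.xlabel("token index"); plt.ylabel("token index")
plt.title("Dotplot of a toy token sequence")
plt.show()
```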


Journal ArticleDOI
TL;DR: The use of a mode tree in adaptive multimodality investigations is proposed, and an example is given to show the value in using a normal kernel, as opposed to the biweight or other kernels, in such investigations.
Abstract: Recognition and extraction of features in a nonparametric density estimate are highly dependent on correct calibration. The data-driven choice of bandwidth h in kernel density estimation is a difficult one that is compounded by the fact that the globally optimal h is not generally optimal for all values of x. In recognition of this fact a new type of graphical tool, the mode tree, is proposed. The basic mode tree plot relates the locations of modes in density estimates with the bandwidths of those estimates. Additional information can be included on the plot indicating factors such as the size of modes, how modes split, and the locations of antimodes and bumps. The use of a mode tree in adaptive multimodality investigations is proposed, and an example is given to show the value in using a normal kernel, as opposed to the biweight or other kernels, in such investigations. Examples of such investigations are provided for Ahrens's chondrite data and van Winkle's Hidalgo stamp data. Finally, the biva...

146 citations
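
A minimal sketch of the basic mode tree plot under the simplest assumptions: for a grid of bandwidths, locate the modes of a normal-kernel density estimate and plot mode location against bandwidth on a log scale. The article's mode tree also encodes mode sizes, splits, antimodes, and bumps, none of which are attempted here.

```python
# Sketch of a basic mode tree: mode locations of a normal-kernel density estimate
# traced across a grid of bandwidths (simulated two-component data for illustration).
import numpy as np
import matplotlib.pyplot as plt

def kde(x_grid, data, h):
    """Normal-kernel density estimate evaluated on x_grid."""
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def modes(f):
    """Indices of interior local maxima of f."""
    return np.where((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:]))[0] + 1

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 0.7, 100)])
x_grid = np.linspace(data.min() - 1, data.max() + 1, 400)

for h in np.geomspace(0.05, 2.0, 60):
    f = kde(x_grid, data, h)
    for m in modes(f):
        plt.plot(x_grid[m], h, "k.", markersize=2)
plt.yscale("log")
plt.xlabel("mode location"); plt.ylabel("bandwidth h")
plt.title("Mode tree (normal kernel)")
plt.show()
```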


Journal ArticleDOI
TL;DR: In this article, a multivariate smoothing spline estimate of a function of several variables is studied, based on an ANOVA decomposition as sums of main effect functions (of one variable), two-factor interaction functions (of two variables), etc., and Bayesian "confidence intervals" for the components of this decomposition are derived.
Abstract: We study a multivariate smoothing spline estimate of a function of several variables, based on an ANOVA decomposition as sums of main effect functions (of one variable), two-factor interaction functions (of two variables), etc. We derive the Bayesian “confidence intervals” for the components of this decomposition and demonstrate that, even with multiple smoothing parameters, they can be efficiently computed using the publicly available code RKPACK, which was originally designed just to compute the estimates. We carry out a small Monte Carlo study to see how closely the actual properties of these component-wise confidence intervals match their nominal confidence levels. Lastly, we analyze some lake acidity data as a function of calcium concentration, latitude, and longitude, using both polynomial and thin plate spline main effects in the same model.

124 citations


Journal ArticleDOI
TL;DR: The Legendre and Hermite indexes, as discussed by the authors, are weighted L2 distances between the density of the projected data and a standard normal density, and a general form for this type of index that encompasses both indexes is introduced.
Abstract: Projection pursuit describes a procedure for searching high-dimensional data for “interesting” low-dimensional projections via the optimization of a criterion function called the projection pursuit index. By empirically examining the optimization process for several projection pursuit indexes, we observed differences in the types of structure that maximized each index. We were especially curious about differences between two indexes based on expansions in terms of orthogonal polynomials: the Legendre index and the Hermite index. Being fast to compute, these indexes are ideally suited for dynamic graphics implementations. Both the Legendre and Hermite indexes are weighted L2 distances between the density of the projected data and a standard normal density. A general form for this type of index is introduced that encompasses both indexes. The form clarifies the effects of the weight function on the index's sensitivity to differences from normality, highlighting some conceptual problems with the Legen...

104 citations
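
The general weighted-L2 form mentioned in the abstract can be sketched directly: numerically integrate a weight w(z) times the squared difference between an estimate of the projected density and the standard normal density. The sketch below uses a plain kernel estimate and a normal weight as stand-ins; the article's Legendre and Hermite indexes instead use orthogonal polynomial expansions, so this illustrates the general form, not either specific index.

```python
# Sketch of a weighted-L2 projection pursuit index:
#   integral of w(z) * (f_hat(z) - phi(z))^2 dz
# where f_hat estimates the density of the (standardized) projected data and phi
# is the standard normal density. Kernel estimate and weight are stand-in choices.
import numpy as np

def weighted_l2_index(data, direction, h=0.3):
    z = data @ direction / np.linalg.norm(direction)     # 1-D projection
    z = (z - z.mean()) / z.std()                         # center and scale
    grid = np.linspace(-4.0, 4.0, 401)
    u = (grid[:, None] - z[None, :]) / h
    f_hat = np.exp(-0.5 * u ** 2).sum(axis=1) / (len(z) * h * np.sqrt(2 * np.pi))
    phi = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
    weight = np.exp(-0.5 * grid ** 2)                    # one possible weight choice
    dx = grid[1] - grid[0]
    return np.sum(weight * (f_hat - phi) ** 2) * dx

rng = np.random.default_rng(3)
gaussian = rng.normal(size=(500, 2))
bimodal = np.column_stack([np.concatenate([rng.normal(-2, 1, 250),
                                           rng.normal(2, 1, 250)]),
                           rng.normal(size=500)])
e1 = np.array([1.0, 0.0])
print("index, Gaussian cloud:", weighted_l2_index(gaussian, e1))
print("index, bimodal cloud :", weighted_l2_index(bimodal, e1))
```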


Journal ArticleDOI
TL;DR: In this paper, the performance of three Monte Carlo Markov chain samplers, namely the Gibbs sampler, the Hit-and-Run (H&R) sampler, and the Metropolis sampler, was investigated.
Abstract: We consider the performance of three Monte Carlo Markov-chain samplers—the Gibbs sampler, which cycles through coordinate directions; the Hit-and-Run (H&R) sampler; and the Metropolis sampler, which moves with a probability that is a ratio of likelihoods. We obtain several analytical results. We provide a sufficient condition for geometric convergence on a bounded region S for the H&R sampler. For a general region S, we review the Schervish and Carlin sufficient geometric convergence condition for the Gibbs sampler. We show that for a multivariate normal distribution this Gibbs sufficient condition holds and for a bivariate normal distribution the Gibbs marginal sample paths are each an AR(1) process, and we obtain the standard errors of sample means and sample variances, which we later use to verify empirical Monte Carlo results. We empirically compare the Gibbs and H&R samplers on bivariate normal examples. For zero correlation, the Gibbs sampler provid...

93 citations
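
The bivariate normal case is easy to reproduce. The sketch below runs the two-coordinate Gibbs sampler for a standard bivariate normal with correlation rho and checks the AR(1) behavior of a marginal sample path; under the standard derivation for this two-block sampler, its lag-1 autocorrelation should be close to rho squared.

```python
# Sketch: Gibbs sampler for a standard bivariate normal with correlation rho,
# cycling through the two coordinates. The marginal x1 path behaves as an AR(1)
# process; its lag-1 autocorrelation should be close to rho**2.
import numpy as np

def gibbs_bivariate_normal(rho, n_sweeps, rng):
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)               # conditional standard deviation
    path = np.empty((n_sweeps, 2))
    for i in range(n_sweeps):
        x1 = rng.normal(rho * x2, sd)          # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)          # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        path[i] = x1, x2
    return path

rng = np.random.default_rng(4)
rho = 0.8
path = gibbs_bivariate_normal(rho, 50_000, rng)
x1 = path[1000:, 0]                            # drop burn-in
lag1 = np.corrcoef(x1[:-1], x1[1:])[0, 1]
print(f"lag-1 autocorrelation of x1 path: {lag1:.3f} (rho^2 = {rho**2:.3f})")
```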


Journal ArticleDOI
William S. Cleveland
TL;DR: A model has been developed to provide a framework for the study of visual decoding, including a specification of the visual operations that are employed to carry out pattern perception and table look-up.
Abstract: A method of statistical graphics consists of two parts: a selection of statistical information to be displayed and a selection of a visual display method to encode the information. Some display methods lead to efficient, accurate visual decoding of encoded information, and others lead to inefficient, inaccurate decoding. It is only through rigorous studies of visual decoding that informed judgments can be made about how to choose display methods. A model has been developed to provide a framework for the study of visual decoding. The model consists of three parts: (1) a two-way classification of information on displays—quantitative-scale, quantitative-physical, categorical-scale, and categorical-physical; (2) a division of the visual processing of graphical displays into pattern perception and table look-up; (3) a specification of visual operations that are employed to carry out pattern perception and table look-up. Display methods are assessed by studying the visual operations to which they lead....

77 citations


Journal ArticleDOI
TL;DR: Methods of heuristic search are applied to the MVE estimator, including simulated annealing, genetic algorithms, and tabu search, and the results are compared to the undirected random search algorithm that is often cited.
Abstract: A method of robust estimation of multivariate location and shape that has attracted a lot of attention recently is Rousseeuw's minimum volume ellipsoid estimator (MVE). This estimator has a high breakdown point but is difficult to compute successfully. In this article, we apply methods of heuristic search to this problem, including simulated annealing, genetic algorithms, and tabu search, and compare the results to the undirected random search algorithm that is often cited. Heuristic search provides several effective algorithms that are far more computationally efficient than random search. Furthermore, random search, as currently implemented, is shown to be ineffective for larger problems.

67 citations
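
For orientation, here is a sketch of one common random-search ("basic resampling") formulation of the MVE criterion that heuristic search aims to improve on: score random elemental subsets by the log-volume of the ellipsoid obtained by inflating the subset's covariance until it covers half the data. This is an assumed, simplified formulation for illustration, not the article's code; the article replaces the blind random draws with simulated annealing, genetic algorithms, and tabu search over subsets.

```python
# Sketch: undirected random search for the minimum volume ellipsoid criterion.
# Each elemental subset of size p+1 is scored by log(volume) of the ellipsoid
# with the subset's covariance shape, inflated by the median squared Mahalanobis
# distance of all points to the subset mean.
import numpy as np

def mve_objective(X, subset):
    S = np.cov(X[subset].T)
    center = X[subset].mean(axis=0)
    d2 = np.einsum("ij,jk,ik->i", X - center, np.linalg.inv(S), X - center)
    m2 = np.median(d2)
    p = X.shape[1]
    return 0.5 * np.linalg.slogdet(S)[1] + 0.5 * p * np.log(m2)

def random_search_mve(X, n_draws=2000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n, p = X.shape
    best, best_val = None, np.inf
    for _ in range(n_draws):
        subset = rng.choice(n, size=p + 1, replace=False)
        try:
            val = mve_objective(X, subset)
        except np.linalg.LinAlgError:          # singular elemental subset
            continue
        if val < best_val:
            best, best_val = subset, val
    return best, best_val

rng = np.random.default_rng(5)
clean = rng.normal(size=(80, 3))
outliers = rng.normal(loc=6.0, size=(20, 3))
X = np.vstack([clean, outliers])
subset, val = random_search_mve(X, rng=rng)
print("best elemental subset:", subset, " log-volume criterion:", round(val, 3))
```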


Journal ArticleDOI
TL;DR: Empirical likelihood methods have been developed for constructing confidence bands in problems of nonparametric density estimation as discussed by the authors, which have an advantage over more conventional methods in that the shape of the bands is determined solely by the data.
Abstract: Empirical likelihood methods are developed for constructing confidence bands in problems of nonparametric density estimation. These techniques have an advantage over more conventional methods in that the shape of the bands is determined solely by the data. We show how to construct an empirical likelihood functional, rather than a function, and contour it to produce the confidence bands. Analogs of Wilks's theorem are established in this infinite-parameter setting and may be used to select the appropriate contour. An alternative calibration, based on the bootstrap, is also suggested. Large-sample theory is developed to show that the bands have asymptotically correct coverage, and a numerical example is presented to demonstrate the technique. Comparisons are made with the use of bootstrap replications to choose both the shape and size of the bands.

58 citations


Journal ArticleDOI
Neal Thomas
TL;DR: Asymptotic corrections are used to compute the means and the variance-covariance matrix of multivariate posterior distributions that are formed from a normal prior distribution and a likelihood function that factors into separate functions for each variable in the posterior distribution as discussed by the authors.
Abstract: Asymptotic corrections are used to compute the means and the variance-covariance matrix of multivariate posterior distributions that are formed from a normal prior distribution and a likelihood function that factors into separate functions for each variable in the posterior distribution. The approximations are illustrated using data from the National Assessment of Educational Progress (NAEP). These corrections produce much more accurate approximations than those produced by two different normal approximations. In a second potential application, the computational methods are applied to logistic regression models for severity adjustment of hospital-specific mortality rates.

Journal ArticleDOI
TL;DR: In this article, the severe numerical instability of computing the orthogonal tapers from the basic defining equation (the eigenproblem is very poorly conditioned) is illustrated, and a technique for stable calculation of the tapers, namely inverse iteration, is described.
Abstract: Spectral estimation using a set of orthogonal tapers is becoming widely used and appreciated in scientific research. It produces direct spectral estimates with more than 2 df at each Fourier frequency, resulting in spectral estimators with reduced variance. Computation of the orthogonal tapers from the basic defining equation is difficult, however, due to the instability of the calculations—the eigenproblem is very poorly conditioned. In this article the severe numerical instability problems are illustrated and then a technique for stable calculation of the tapers—namely, inverse iteration—is described. Each iteration involves the solution of a matrix equation. Because the matrix has Toeplitz form, the Levinson recursions are used to rapidly solve the matrix equation. FORTRAN code for this method is available through the Statlib archive. An alternative stable method is also briefly reviewed.
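
A rough sketch of the inverse-iteration idea for the leading (order-0) taper only, assuming the standard Toeplitz defining matrix with off-diagonal entries sin(2*pi*W*(j-k))/(pi*(j-k)) and diagonal 2W: each iteration solves a Toeplitz system, here via scipy.linalg.solve_toeplitz (which uses Levinson-type recursions). The article treats shifts, higher-order tapers, and the numerical difficulties far more carefully; scipy.signal.windows.dpss is used below only as an independent check.

```python
# Sketch: inverse iteration for the order-0 discrete prolate spheroidal taper.
# All eigenvalues of the defining matrix lie below 1, so a shift of 1.0 targets
# the eigenvector with eigenvalue nearest 1 (the leading taper).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal.windows import dpss

N, W = 64, 4.0 / 64                       # series length and bandwidth (NW = 4)
k = np.arange(N)
col = np.where(k == 0, 2 * W,
               np.sin(2 * np.pi * W * k) / (np.pi * np.maximum(k, 1)))

shift = 1.0
shifted_col = col.copy()
shifted_col[0] -= shift                   # first column of (A - shift*I), still Toeplitz

v = np.sin(np.pi * (k + 1) / (N + 1))     # smooth, positive starting vector
v /= np.linalg.norm(v)
for _ in range(5):                        # inverse iteration: solve, normalize, repeat
    v = solve_toeplitz(shifted_col, v)
    v /= np.linalg.norm(v)

ref = dpss(N, 4.0, 1)[0]                  # independent reference taper
ref = ref / np.linalg.norm(ref)
print("agreement with scipy's order-0 taper:", abs(v @ ref))
```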

Journal ArticleDOI
TL;DR: Individual (inCI) and simultaneous (siCI) confidence intervals for individual values (determinations), and related but different structures (interference notches) for (pairwise) comparisons provide an important area of application for linkages.
Abstract: The graphical display of two or more numerical aspects, in comparing several circumstances, has widespread applications. Like almost all graphic display, emphasis is on comparison and on phenomena (on appearances describable in nonnumerical words). Impact of the highest quality attainable is important. Digital information can, and should, be included when helpful. A variety of schemes—linkages—for displaying first two and then several numerical aspects are explored. Circular forms for the elements involved do relatively well. Next we look at linkages in a substantial and moderately diversified field of applications. Individual (inCI) and simultaneous (siCI) confidence intervals for individual values (determinations), and related but different structures (interference notches) for (pairwise) comparisons provide an important area of application for linkages, both simple and of moderate complexity. The same desire to simplify the viewer's task that contributed to the introduction of (simple) simulta

Journal ArticleDOI
TL;DR: In this paper, the authors developed stochastic shape and color models for detecting defects in color images of potatoes; statistical tests based on these models can classify an unknown object as either "acceptable potato" or "unacceptable potato."
Abstract: Automatic defect detection in color images of potatoes is complicated by the random variability ordinarily observed in a collection of normal potatoes. Stochastic models are developed that explicitly describe the random nature of the shape and color texture observed in normal potatoes. Shape and color simulations based on these models are realistic. Statistical tests based on the stochastic shape and color models are developed. The tests are capable of classifying an unknown object as either “acceptable potato” or “unacceptable potato.” Twelve potatoes are analyzed, and the experimental results are presented.

Journal ArticleDOI
TL;DR: The comments of the discussants reflect a thorough knowledge of visual perception and add substance to the issues raised in the article, speaking well of the ability of a new journal to attract researchers in vision, bringing greater depth to the important area of graphical perception.
Abstract: (1993). Rejoinder: A Model for Studying Display Methods of Statistical Graphics. Journal of Computational and Graphical Statistics: Vol. 2, No. 4, pp. 361-364.

Journal ArticleDOI
TL;DR: The interface is a convention by which plots communicate with data sets, allowing plots to be independent of the actual data representation, and the same strategy may be used to deal with the dependence of model-fitting procedures on data.
Abstract: Statistical software systems include modules for manipulating data sets, model fitting, and graphics. Because plots display data, and models are fit to data, both the model-fitting and graphics modules depend on the data. Today's statistical environments allow the analyst to choose or even build a suitable data structure for storing the data and to implement new kinds of plots. The multiplicity problem caused by many plot varieties and many data representations is avoided by constructing a plot-data interface. The interface is a convention by which plots communicate with data sets, allowing plots to be independent of the actual data representation. This article describes the components of such a plot-data interface. The same strategy may be used to deal with the dependence of model-fitting procedures on data.
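
An illustrative sketch, not the article's actual interface, of what such a plot-data convention can look like: plots request variables through a small protocol, so the same plot code works for any representation that implements it. All class and method names below are hypothetical.

```python
# Sketch of a plot-data interface: plots talk to data only through a protocol,
# never through a concrete representation. Names are illustrative, not the
# article's; a second representation (say, a database-backed one) could be
# swapped in without touching the plot code.
from typing import Protocol, Sequence

class PlotData(Protocol):
    def variable_names(self) -> Sequence[str]: ...
    def values(self, name: str) -> Sequence[float]: ...
    def num_cases(self) -> int: ...

class DictData:
    """One concrete representation: columns stored in a plain dict."""
    def __init__(self, columns):
        self._columns = columns
    def variable_names(self): return list(self._columns)
    def values(self, name): return self._columns[name]
    def num_cases(self): return len(next(iter(self._columns.values())))

def scatterplot(data: PlotData, x: str, y: str) -> None:
    """A 'plot' that only knows the interface, never the representation."""
    for xi, yi in zip(data.values(x), data.values(y)):
        print(f"point at ({xi}, {yi})")

# Hypothetical variable names, used only to exercise the interface.
data = DictData({"calcium": [1.2, 3.4, 2.2], "acidity": [5.1, 6.3, 5.8]})
scatterplot(data, "calcium", "acidity")
```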

Journal ArticleDOI
TL;DR: In this article, the problem of generating a random hyperrectangle in a unit hypercube such that each point of the hypercube has probability p of being covered is studied, and it is shown that no constant length solution exists.
Abstract: We look at the problem of generating a random hyperrectangle in a unit hypercube such that each point of the hypercube has probability p of being covered. For random intervals of [0, 1], we compare various methods based either on the distribution of the length or on the distribution of the midpoint. It is shown that no constant length solution exists. Nice versatility is achieved when a random interval is chosen from among the spacings defined by a Dirichlet process. A small simulation is included.
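
The "no constant length solution" point is easy to check by simulation: place an interval of fixed length p uniformly inside [0, 1] and estimate the coverage probability at several points. Coverage is not constant in x; it drops toward the endpoints and exceeds p in the middle. (This particular placement rule is an assumed example, chosen only to illustrate the claim.)

```python
# Monte Carlo sketch: a fixed-length interval placed uniformly in [0, 1] does not
# cover every point with the same probability, illustrating the abstract's claim
# that no constant-length construction achieves constant coverage p.
import numpy as np

rng = np.random.default_rng(6)
p = 0.3                                        # nominal coverage target
n_rep = 200_000
mid = rng.uniform(p / 2, 1 - p / 2, size=n_rep)   # uniform midpoint, fixed length p
lo, hi = mid - p / 2, mid + p / 2

for x in [0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]:
    cover = np.mean((lo <= x) & (x <= hi))
    print(f"x = {x:4.2f}: estimated coverage {cover:.3f}  (target {p})")
```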


Journal ArticleDOI
TL;DR: A method for the computer calculation of Edgeworth expansions for a smooth function model, accurate in the O(n⁻¹) term, is described and implemented for the particular case of the studentized mean.
Abstract: In this article I describe, in detail, a method for the computer calculation of Edgeworth expansions for a smooth function model accurate in the O(n⁻¹) term. For such models, these expansions are an important tool for the analysis of normalizing transformations, the correction of an approximately normally distributed quantity for skewness, and the comparison of bootstrap inference procedures. The method is straightforward and is efficient in a sense described in the article. The implementation of the method in general is clear from its implementation in the Mathematica program (available through StatLib) for the particular case of the studentized mean.

Journal ArticleDOI
TL;DR: This article by Tukey addresses both the selection of information and methods of display, but the emphasis is on the latter, and the points made apply, at least broadly, to the other display methods discussed by Tukey.
Abstract: A graphical method for data analysis has two components. The first component is information to be displayed, a statistic that is sometimes just the data themselves. The second component is a visual part, a selection of a display method that encodes the information by position, geometric magnitude, and other rendering attributes such as color. For a boxplot (Tukey 1977) the statistical information is a five-number summary of the empirical distribution together with end values; the display method is the familiar box with a line or filled circle inside, with two appendages, and with plotting symbols to show end values. This article by Tukey addresses both the selection of information and methods of display, but the emphasis is on the latter. It continues a current of thinking begun earlier by Tukey (1990), and so we will discuss both articles. Because they are so rich in ideas, and because the discussion must as a practical matter be limited, we will confine our comments to the display method of boxplots. Devoting all of the available space to the boxplot is entirely reasonable because it has become a standard of statistical graphics. And the points we make apply, at least broadly, to the other display methods discussed by Tukey. We will strike two themes: 1. The success of a display method depends on the effectiveness of the visual decoding of the encoded information. Thus to create display methods with maximal "impact," as Tukey would put it, we must study visual perception. Success cannot be assessed without explicit attention to visual perception. For example, to determine whether on a boxplot it is better to encode the median by a horizontal line segment or a filled circle, we must appeal to visual perception. 2. As a practical matter, it is healthy to regard a display method as incomplete until software exists that implements it for the full spectrum of data types to which it is meant to apply. The reason is that code provides a complete specification; issues of visual perception that are easy to gloss over in a discussion with a few examples loom large when it comes time to write and test code that is meant to be general. One example is Tukey's suggestion that numbers go inside the bars of bar charts. As we will argue later, this is easy to suggest and tricky to fully specify.

Journal ArticleDOI
TL;DR: This article applies Bill Cleveland's thinking to graphical composites (reference grids, plotting symbols, and aspect ratios) and examines a variety of approaches to evaluating the use of these composites.
Abstract: Social psychologist Susan Fiske has shown that common sense notions about human behavior are as often wrong as right. For example, the popular maxim "opposites attract" is generally false. Numerous experiments by another psychologist, Amos Tversky, have demonstrated that we, novices and Bayesian statisticians alike, are poor judges of quantity and chance. Finally, perceptual experiments, including some by Bill Cleveland himself, have shown that statisticians and other humans succumb to visual illusions when viewing statistical graphs. Martin Gardner, Richard Feynman, and the Amazing Randi have all shown how easy it is for scientists to fool themselves. Because we all think we are expert psychologists, we are at greatest risk when we study ourselves and our perceptions. Bill Cleveland has done a service to statisticians by grounding discussions about graphics in experimentation. His ideas on graphical elements, based on a series of experiments (Cleveland 1985), have influenced statistics packages (e.g., S+, STATA, and SYSTAT) and have inspired further experimentation in graphical perception (see Spence and Lewandowsky 1990). An experimental viewpoint should not diminish the work of graphic designers and others who have creative instincts for good displays. Bertin and Tufte, for example, showed that effective graphs need not be dull or lack style. We must be suspicious of any design prescription that is not supported by experimental results, however. Good design does not always lead to effective statistical graphics. Most of Cleveland's early experiments concerned graphical elements: lines, angles, areas, volumes, and colors. This article applies his thinking to graphical composites: reference grids, plotting symbols, and aspect ratios. Although its title is "A Model for Studying Display Methods of Statistical Graphics," it is really more concerned with a variety of approaches to evaluating the use of these composites. While I cannot disagree with most of his conclusions about good usage, I find the model itself somewhat restrictive. Cleveland wishes to stay grounded in perceptual psychology, but the topics he discusses also involve areas of higher cognition. Cognitive psychologists such as Pinker (1990), Simken and Hastie (1987), and Kosslyn (1989) have proposed process models for graphical perceptual processing. Statisticians, on the other hand, like to think of the meaning of a graph as predefined: Construct

Journal ArticleDOI
Susan Holmes
TL;DR: The left-brain/right-brain metaphor is used to stress that the brain does two (at least) different things at the same time and in a completely different way: table look-up and pattern recognition characterize two separate brain functions that I will label, for historical reasons, left-brain and right-brain functions.
Abstract: I find Bill Cleveland's paper interesting because it argues in favor of a dichotomous framework for collection of visual information that we can develop to improve methods for graphical data analysis. We would all like to believe in statistical holography; that is, that there is some instrument for "looking at data" (microscope, macroscope, tweezers, telescope ...) that would enable us to reconstruct the whole picture from a fragment or a distorted one, just as a piece of a holographic plate contains (in harder to read, fuzzier form) the whole, or original picture. The instrument we have at hand for casting some useful light on the data is a system formed by a computer screen with several graphical windows: a command window that accesses a statistical toolbox, a keyboard and a mouse in front of which lies the key to the system, and the best neural net available: a brain. It is important to establish how to optimize the whole system, and taking into account interfaces and some things we know about the brain helps. Table look-up and pattern recognition characterize two separate brain functions that I will label, for historical reasons, left-brain and right-brain functions. Although today's picture of the brain is a mixture between patchwork and network, I will use the left-brain/right-brain metaphor, as illustrated in Figure 1, to stress that the brain does two (at least) different things at the same time and in a completely different way:

Journal ArticleDOI
TL;DR: In this article, the authors introduce an approach for characterizing the classes of empirical distributions that satisfy certain positive dependence notions, along with graph-theoretic algorithms for constructively generating and enumerating the elements of various of these subsets of SN.
Abstract: This article introduces an approach for characterizing the classes of empirical distributions that satisfy certain positive dependence notions. Mathematically, this can be expressed as studying certain subsets of the class SN of permutations of 1, …, N, where each subset corresponds to some positive dependence notions. Explicit techniques for iteratively characterizing subsets of SN that satisfy certain positive dependence concepts are obtained and various counting formulas are given. Based on these techniques, graph-theoretic methods are used to introduce new and more efficient algorithms for constructively generating and enumerating the elements of various of these subsets of SN. For example, the class of positively quadrant dependent permutations in SN is characterized in this fashion.
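
As a small concrete companion, here is a brute-force, definition-based check of positive quadrant dependence for the empirical distribution attached to a permutation (pairs (k, pi(k)), each with mass 1/N), plus an enumeration over small N. This naive sketch is for illustration only; the article's graph-theoretic algorithms generate these classes far more efficiently.

```python
# Sketch: check PQD for the empirical distribution of a permutation pi of 1..N.
# PQD requires the empirical joint cdf to dominate the product of the (uniform)
# marginal cdfs: N * #{k <= i : pi(k) <= j} >= i * j for all i, j.
import math
from itertools import permutations

def is_pqd(pi):
    n = len(pi)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            joint = sum(1 for k in range(i) if pi[k] <= j)   # N * H(i, j)
            if n * joint < i * j:                            # H < F * G somewhere
                return False
    return True

for n in range(2, 6):
    count = sum(is_pqd(pi) for pi in permutations(range(1, n + 1)))
    print(f"N = {n}: {count} of {math.factorial(n)} permutations are PQD")
```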

Journal ArticleDOI
TL;DR: John Tukey's driveway curves sharply off Arreton Drive before swooping past some yews tightly wrapped in chain-link fence; then it pauses briefly by a traditional white house before rushing to a halt, finally, in a clatter by the front of his workshop, situated in the rear.
Abstract: John Tukey's driveway curves sharply off Arreton Drive before swooping past some yews tightly wrapped in chain-link fence. Then it pauses briefly by a traditional white house before rushing to a halt, finally, in a clatter by the front of his workshop, situated in the rear. It has been almost 20 years to the day since I last visited here (Wainer 1974) and I noticed some changes as I got out of my car and walked up to the door. The basic structure of Tukey's workshop seemed the same. The roof was still bell-shaped, although the top seemed lower than I remembered and the walls farther apart. The greenhouse was still attached on the right side, and I could see all of John's stems and leaves growing there in profusion, neatly categorized in their respective boxplots. They were of a different design than I remembered, appearing to have been cinched in at the middle. Otherwise, things looked more or less the same. In front of the workshop was the same old refuse container, overflowing with outlying points that had been trimmed away. I rang the doorbell and, as I waited for John to answer, I noticed for the first time that a large cube-shaped addition with windows looking out in all directions had been built on the left side of the workshop. "I wonder what John's doing with the extra space?" The door opened and I was greeted by Elizabeth Tukey, who told me that John was in his new study working on four-dimensional displays. "Ah hah," I thought, "the new study must be the addition." She led me past John's worktable on which sat his trustworthy jackknife, open and ready for use. We squeezed by a ladder of transformations that leaned against one wall and went alongside a shelf full of hinges and associated midspreads of various sizes. Finally, we reached a doorway that led to the new addition. Elizabeth opened it and we entered a large room with windows on all four sides. It was almost completely empty. Elizabeth said something about the addition having only just been completed and consequently they hadn't had time to furnish it fully. I remarked on how handsome it looked and she replied, "John designed it. He calls it 'Tess's Act.' I don't know what

Journal ArticleDOI
TL;DR: The goal is not to contribute to the science of visual information processing but rather, as the author states, "an engineering one": to derive rationales for the design of better display methods.
Abstract: In his opening section Cleveland writes provocatively: "It is not true that the selection of statistical information is the only critical part and that virtually any method of display suffices." This is so aptly demonstrated that even a reader who only browses the article benefits from looking at the sample graphs. The "good/bad" examples show how important properties of time curves are disclosed by selecting the right aspect ratio, and how grid lines can greatly facilitate comparisons between displays. Another example shows how data from an experimental design with as many as three factors can be shown in a simple two-dimensional display so that one sees the main effects and interactions at work before turning to abstract p values and F ratios based on often doubtful assumptions. Why do some graphs work and others do not? There are scientific reasons for it that have to do with the properties of the human visual system, and these reasons are the focus of the article. However, the goal is not to contribute to the science of visual information processing, but rather, as the author states, it is "an engineering one": It is to derive rationales for the design of better display methods. 1. PARALLEL AND SERIAL VISUAL PROCESSES Like many authors in vision research, Cleveland feels that there are two profoundly different modes of visual processing. He calls them "pattern perception" and "table lookup." Pattern perception consists of detection of graphical elements, grouping them so as to detect structure and compare quantitative physical aspects of different elements. Table look-up means to actually decode quantitative or categorical information in the data, mainly by relating them to the scales given in the display. This distinction is reminiscent of, but not identical to, a common distinction in vision research: There is much empirical evidence to support a primary, fast, parallel visual process that instantaneously integrates information over the whole visual field that has been called preattentive vision (Julesz 1981; 1984). There is also evidence that supports a secondary visual process that can only process a small part of the visual field at a time, and that therefore scans the field in a slow and serial way. This process has been called