
Showing papers in "Statistics and Computing in 1995"


Journal ArticleDOI
TL;DR: In this paper, the authors provide simulation algorithms for one-sided and two-sided truncated normal distributions, which are then used to simulate multivariate normal variables with convex restricted parameter space for any covariance structure.
Abstract: We provide simulation algorithms for one-sided and two-sided truncated normal distributions. These algorithms are then used to simulate multivariate normal variables with convex restricted parameter space for any covariance structure.
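A minimal sketch of the kind of rejection algorithm involved is given below, assuming a standard normal truncated to [a, ∞): for a non-positive truncation point, plain rejection from N(0, 1) suffices, while for a positive truncation point a translated exponential proposal with a suitably chosen rate is used. The function name and structure are our own; this illustrates the general construction rather than reproducing the paper's algorithms.

```python
import numpy as np

def rtnorm_lower(a, size=1, rng=None):
    """Draw from a standard N(0, 1) truncated to [a, inf).
    Sketch of the translated-exponential rejection idea for a > 0."""
    rng = np.random.default_rng() if rng is None else rng
    lam = (a + np.sqrt(a * a + 4.0)) / 2.0    # a common choice of exponential rate
    out = np.empty(size)
    for i in range(size):
        while True:
            if a <= 0:                        # easy case: plain rejection from N(0, 1)
                x = rng.standard_normal()
                if x >= a:
                    break
            else:                             # proposal x = a + Exp(lam)
                x = a + rng.exponential(1.0 / lam)
                if rng.random() <= np.exp(-0.5 * (x - lam) ** 2):
                    break
        out[i] = x
    return out
```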

438 citations


Journal ArticleDOI
TL;DR: This paper reviews recent literature on techniques for obtaining random samples from databases, and describes sampling for estimation of aggregates (e.g. the size of query results).
Abstract: This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+-trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g. the size of query results). Here we discuss both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are reviewed.
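Reservoir sampling, one of the basic techniques reviewed above, is compact enough to sketch directly; the version below is the classic 'Algorithm R' for drawing a uniform sample of k items from a stream of unknown length, offered purely as an illustration.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = rng.randint(0, i)             # uniform index in 0..i inclusive
            if j < k:
                reservoir[j] = item           # replace with probability k/(i+1)
    return reservoir
```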

161 citations


Journal ArticleDOI
TL;DR: This work investigates the implementation of univariate PLS algorithms within FORTRAN and the Matlab and Splus environments, and investigates the merits of using the orthogonal invariance of PLS regression to ‘improve’ the algorithms.
Abstract: Partial least squares (PLS) regression has been proposed as an alternative regression technique to more traditional approaches such as principal components regression and ridge regression. A number of algorithms have appeared in the literature which have been shown to be equivalent. Someone wishing to implement PLS regression in a programming language or within a statistical package must choose which algorithm to use. We investigate the implementation of univariate PLS algorithms within FORTRAN and the Matlab (1993) and Splus (1992) environments, comparing theoretical measures of execution speed based on flop counts with their observed execution times. We also comment on the ease with which the algorithms may be implemented in the different environments. Finally, we investigate the merits of using the orthogonal invariance of PLS regression to ‘improve’ the algorithms.
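For orientation, a generic univariate PLS (PLS1) routine of the kind being compared is sketched below in Python with NumPy; it is a plain NIPALS-style implementation, not one of the specific FORTRAN, Matlab or Splus versions timed in the paper.

```python
import numpy as np

def pls1(X, y, n_components):
    """Univariate PLS regression (PLS1), NIPALS-style sketch.
    Returns regression coefficients for centred X and y."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_components))   # weight vectors
    P = np.zeros((p, n_components))   # X loadings
    q = np.zeros(n_components)        # y loadings
    Xk, yk = X.copy(), y.copy()
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        P[:, a] = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk -= np.outer(t, P[:, a])    # deflate X and y
        yk -= q[a] * t
        W[:, a] = w
    return W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original X space
```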

38 citations


Journal ArticleDOI
TL;DR: A Bayesian approach to mixture modelling, implemented through the Gibbs sampler, together with a method based on the predictive distribution for determining the number of components in the mixture.
Abstract: This paper describes a Bayesian approach to mixture modelling and a method based on predictive distribution to determine the number of components in the mixtures. The implementation is done through the use of the Gibbs sampler. The method is described through the mixtures of normal and gamma distributions. Analysis is presented in one simulated and one real data example. The Bayesian results are then compared with the likelihood approach for the two examples.
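To give a flavour of the implementation, a stripped-down Gibbs sampler for a K-component normal mixture is sketched below, assuming a known common variance, a vague normal prior on the means and a symmetric Dirichlet prior on the weights; this is a simplification of the normal and gamma mixture models actually treated in the paper.

```python
import numpy as np

def gibbs_normal_mixture(y, K=2, n_iter=2000, sigma=1.0, rng=None):
    """Gibbs sampler for a K-component normal mixture with known variance."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    m0, s0sq = 0.0, 100.0                       # prior: mu_k ~ N(m0, s0sq)
    mu = rng.choice(y, K)                       # initialise means at data points
    w = np.full(K, 1.0 / K)
    draws = []
    for _ in range(n_iter):
        # 1. sample allocations z_i given means and weights
        logp = np.log(w) - 0.5 * ((y[:, None] - mu) / sigma) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=pi) for pi in p])
        # 2. sample component means given allocations
        for k in range(K):
            yk = y[z == k]
            prec = len(yk) / sigma**2 + 1.0 / s0sq
            mean = (yk.sum() / sigma**2 + m0 / s0sq) / prec
            mu[k] = rng.normal(mean, 1.0 / np.sqrt(prec))
        # 3. sample weights given allocations
        w = rng.dirichlet(1.0 + np.bincount(z, minlength=K))
        draws.append((mu.copy(), w.copy()))
    return draws
```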

38 citations


Journal ArticleDOI
TL;DR: This paper presents an efficient approach to leave-one-out cross-validation of principal components that exploits the regular nature of leave-one-out principal component eigenvalue downdating, derives influence statistics, and considers the application to principal component regression.
Abstract: The cross-validation of principal components is a problem that occurs in many applications of statistics. The naive approach of omitting each observation in turn and repeating the principal component calculations is computationally costly. In this paper we present an efficient approach to leave-one-out cross-validation of principal components. This approach exploits the regular nature of leave-one-out principal component eigenvalue downdating. We derive influence statistics and consider the application to principal component regression.

27 citations


Journal ArticleDOI
TL;DR: This paper deals with techniques for obtaining random point samples from spatial databases by discussing two fundamental approaches to sampling with spatial predicates, depending on whether the authors sample first or evaluate the predicate first.
Abstract: This paper deals with techniques for obtaining random point samples from spatial databases. We seek random points from a continuous domain (usually ℝ2) which satisfy a spatial predicate that is represented in the database as a collection of polygons. Several applications of spatial sampling (e.g. environmental monitoring, agronomy, forestry, etc) are described. Sampling problems are characterized in terms of two key parameters: coverage (selectivity), and expected stabbing number (overlap). We discuss two fundamental approaches to sampling with spatial predicates, depending on whether we sample first or evaluate the predicate first. The approaches are described in the context of both quadtrees and R-trees, detailing the sample first, acceptance/rejection tree, and partial area tree algorithms. A sequential algorithm, the one-pass spatial reservoir algorithm is also described. The relative performance of the various sampling algorithms is compared and choice of preferred algorithms is suggested. We conclude with a short discussion of possible extensions.
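The 'sample first' approach can be illustrated without any spatial index: draw uniform points in the bounding box of the polygons, then evaluate the predicate by a point-in-polygon test and keep only accepted points. The sketch below does just that (plain ray-casting, no quadtree or R-tree), so it shows the acceptance/rejection idea rather than the tree-based algorithms of the paper; its efficiency degrades as coverage decreases, which is exactly the trade-off the paper analyses.

```python
import numpy as np

def point_in_polygon(px, py, poly):
    """Ray-casting point-in-polygon test; poly is a sequence of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def sample_first(polygons, n, rng=None):
    """Uniform points in the bounding box, accepted if inside some polygon."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.concatenate([np.asarray(p, dtype=float) for p in polygons])
    (xmin, ymin), (xmax, ymax) = pts.min(axis=0), pts.max(axis=0)
    accepted = []
    while len(accepted) < n:
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if any(point_in_polygon(x, y, poly) for poly in polygons):
            accepted.append((x, y))
    return np.array(accepted)
```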

27 citations


Journal ArticleDOI
TL;DR: In this paper, the K principal points of a p-variate random variable X are defined as those points ξ1, ..., ξK which minimize the expected squared distance of X from the nearest of the ξk.
Abstract: The K principal points of a p-variate random variable X are defined as those points ξ1, ..., ξK which minimize the expected squared distance of X from the nearest of the ξk. This paper reviews some of the theory of principal points and presents a method of determining principal points of univariate continuous distributions. The method is applied to the uniform distribution, to the normal distribution and to the exponential distribution.
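In symbols, the definition reads:

```latex
(\xi_1, \dots, \xi_K) \;=\; \operatorname*{arg\,min}_{\xi_1, \dots, \xi_K \in \mathbb{R}^p}
\; \mathbb{E}\!\left[ \min_{1 \le k \le K} \lVert X - \xi_k \rVert^2 \right]
```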

27 citations


Journal ArticleDOI
TL;DR: The simplest construction of bootstrap likelihoods involves two levels of bootstrapping, kernel density estimation and non-parametric curve-smoothing; more accurate and efficient constructions are described, based on smoothing at the first level of nested bootstraps and saddlepoint approximation to remove second-level bootstrap variation.
Abstract: The simplest construction of bootstrap likelihoods involves two levels of bootstrapping, kernel density estimation, and non-parametric curve-smoothing. We describe more accurate and efficient constructions, based on smoothing at the first level of nested bootstraps and saddlepoint approximation to remove second-level bootstrap variation. Detailed illustrations are given.

22 citations


Journal ArticleDOI
Rose Baker1
TL;DR: The F-ratio test for equality of dispersion in two samples is by no means robust, while non-parametric tests either assume a common median, or are not very powerful.
Abstract: The F-ratio test for equality of dispersion in two samples is by no means robust, while non-parametric tests either assume a common median or are not very powerful. Two new permutation tests are presented which do not suffer from either of these problems. Algorithms for Monte Carlo calculation of P values and confidence intervals are given, and the performance of the tests is studied and compared using Monte Carlo simulations for a range of distributional types. The methods used to speed up Monte Carlo calculations, e.g. stratification, are of wider applicability.
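For readers unfamiliar with the mechanics, a generic Monte Carlo permutation P-value skeleton is sketched below with a simple dispersion statistic; it is not the paper's tests (which are constructed specifically to avoid the common-median assumption and use devices such as stratification for speed), only an illustration of how such P values are estimated by simulation.

```python
import numpy as np

def mc_permutation_pvalue(x, y, statistic, n_perm=9999, rng=None):
    """Monte Carlo permutation P-value for a two-sample statistic
    (large values taken as evidence against the null)."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([x, y])
    n_x = len(x)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if statistic(perm[:n_x], perm[n_x:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)      # add-one correction

def mad_ratio(x, y):
    """Example dispersion statistic: ratio of mean absolute deviations
    from each sample's own median."""
    return np.mean(np.abs(x - np.median(x))) / np.mean(np.abs(y - np.median(y)))
```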

15 citations


Journal ArticleDOI
TL;DR: In this paper, an image reconstruction problem and the computational difficulties arising in determining the maximum a posteriori (MAP) estimate are described, and two algorithms, iterated conditional modes (ICM) and simulated annealing, are usually applied pixel by pixel.
Abstract: We describe an image reconstruction problem and the computational difficulties arising in determining the maximum a posteriori (MAP) estimate. Two algorithms for tackling the problem, iterated conditional modes (ICM) and simulated annealing, are usually applied pixel by pixel. The performance of this strategy can be poor, particularly for heavily degraded images, and as a potential improvement Jubb and Jennison (1991) suggest the cascade algorithm in which ICM is initially applied to coarser images formed by blocking squares of pixels. In this paper we attempt to resolve certain criticisms of cascade and present a version of the algorithm extended in definition and implementation. As an illustration we apply our new method to a synthetic aperture radar (SAR) image. We also carry out a study of simulated annealing, with and without cascade, applied to a more tractable minimization problem from which we gain insight into the properties of cascade algorithms.
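A bare-bones pixel-by-pixel ICM sweep for a binary image observed with Gaussian noise and an Ising-type smoothing prior is sketched below; it illustrates the baseline strategy that cascade is designed to improve, not the cascade algorithm itself, and beta and sigma are user-chosen parameters.

```python
import numpy as np

def icm_binary(y, beta=1.0, sigma=1.0, n_sweeps=10):
    """Iterated conditional modes for a binary image with Gaussian noise."""
    x = (y > 0.5).astype(int)                 # start from a simple threshold
    rows, cols = y.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                nbrs = []
                if i > 0:        nbrs.append(x[i - 1, j])
                if i < rows - 1: nbrs.append(x[i + 1, j])
                if j > 0:        nbrs.append(x[i, j - 1])
                if j < cols - 1: nbrs.append(x[i, j + 1])
                best, best_energy = x[i, j], np.inf
                for v in (0, 1):              # local energy: data term + prior term
                    energy = (y[i, j] - v) ** 2 / (2 * sigma ** 2) \
                             + beta * sum(v != nb for nb in nbrs)
                    if energy < best_energy:
                        best, best_energy = v, energy
                x[i, j] = best
    return x
```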

13 citations


Journal ArticleDOI
TL;DR: An extension of the existing models, which removes this constraint, is proposed and the resulting model is semi-parametric and requires computationally intensive techniques for likelihood evaluation.
Abstract: Threshold methods for multivariate extreme values are based on the use of asymptotically justified approximations of both the marginal distributions and the dependence structure in the joint tail. Models derived from these approximations are fitted to a region of the observed joint tail which is determined by suitably chosen high thresholds. A drawback of the existing methods is the necessity for the same thresholds to be taken for the convergence of both marginal and dependence aspects, which can result in inefficient estimation. In this paper an extension of the existing models, which removes this constraint, is proposed. The resulting model is semi-parametric and requires computationally intensive techniques for likelihood evaluation. The methods are illustrated using a coastal engineering application.

Journal ArticleDOI
TL;DR: The computer language [B/D] (an acronym for beliefs adjusted by data), which implements Bayes linear methods, incorporates a natural graphical representation of the analysis, providing a powerful way of thinking about the process of knowledge formulation and criticism which is also accessible to non-technical users.
Abstract: We demonstrate how Bayes linear methods, based on partial prior specifications, bring us quickly to the heart of otherwise complex problems, giving us natural and systematic tools for evaluating our analyses which are not readily available in the usual Bayes formalism. We illustrate the approach using an example concerning problems of prediction in a large brewery. We describe the computer language [B/D] (an acronym for beliefs adjusted by data), which implements the approach. [B/D] incorporates a natural graphical representation of the analysis, providing a powerful way of thinking about the process of knowledge formulation and criticism which is also accessible to non-technical users.

Journal ArticleDOI
TL;DR: In this paper, a Monte Carlo exact conditional test of quasi-independence in two-way incomplete contingency tables is proposed, which depends only on the counts in the cells of interest and not on the remaining cells.
Abstract: A Monte Carlo exact conditional test of quasi-independence in two-way incomplete contingency tables is proposed. The null distribution of a random table under quasi-independence is derived. This distribution depends only on the counts in the cells of interest and not on the counts in the remaining cells. This result is used to improve the efficiency of a proposed simulate-and-reject Monte Carlo procedure for estimating the attained significance level.



Journal ArticleDOI
TL;DR: In this paper, the authors describe a method due to Lindsey (1974a) for fitting different exponential family distributions for a single population to the same data, using Poisson log-linear modelling of the density or mass function.
Abstract: This paper describes a method due to Lindsey (1974a) for fitting different exponential family distributions for a single population to the same data, using Poisson log-linear modelling of the density or mass function. The method is extended to Efron's (1986) double exponential family, giving exact ML estimation of the two parameters not easily achievable directly. The problem of comparing the fit of the non-nested models is addressed by both Bayes and posterior Bayes factors (Aitkin, 1991). The latter allow direct comparisons of deviances from the fitted distributions.
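Lindsey's device can be illustrated for a normal fit: bin the data and fit a Poisson log-linear model whose linear predictor is quadratic in the bin midpoints, with an offset for the bin width and sample size. The sketch below uses statsmodels for the GLM fit purely for convenience; the binning choice and helper name are ours, and the double exponential family extension is not reproduced.

```python
import numpy as np
import statsmodels.api as sm

def lindsey_normal_fit(x, n_bins=30):
    """Fit a normal density by Poisson log-linear modelling of bin counts
    (Lindsey's method); log f(x) is quadratic in x for the normal."""
    counts, edges = np.histogram(x, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    design = sm.add_constant(np.column_stack([mids, mids ** 2]))
    offset = np.full(n_bins, np.log(len(x) * width))   # E(count) = n * width * f(mid)
    fit = sm.GLM(counts, design, family=sm.families.Poisson(), offset=offset).fit()
    _, b1, b2 = fit.params
    sigma2 = -1.0 / (2.0 * b2)        # coefficient of x^2 is -1/(2 sigma^2)
    mu = b1 * sigma2                  # coefficient of x is mu / sigma^2
    return mu, np.sqrt(sigma2)
```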

Journal ArticleDOI
TL;DR: A uniform method of testing randomness in binary strings is described based on using the binary derivative and it is shown that the new tests are faster and more powerful than several of the well-established tests for randomness.
Abstract: The binary derivative has been used to measure the randomness of a binary string formed by a pseudorandom number generator for use in cipher systems. In this paper we develop statistical properties of the binary derivative and show that certain types of randomness testing in binary derivatives are equivalent to well-established tests for randomness in the original string. A uniform method of testing randomness in binary strings is described based on using the binary derivative. We show that the new tests are faster and more powerful than several of the well-established tests for randomness.
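The binary derivative itself is simply the XOR of adjacent bits, producing a string one bit shorter; a minimal version is sketched below together with a basic frequency (monobit) statistic that could be applied to successive derivatives. This illustrates the object of study only, not the paper's specific tests.

```python
import numpy as np

def binary_derivative(bits):
    """XOR of each adjacent pair of bits: the binary derivative."""
    bits = np.asarray(bits, dtype=int)
    return bits[1:] ^ bits[:-1]

def monobit_z(bits):
    """Frequency (monobit) z-statistic: roughly N(0, 1) for a random string."""
    bits = np.asarray(bits, dtype=int)
    n = len(bits)
    return (2 * bits.sum() - n) / np.sqrt(n)

# e.g. test randomness of the string, its derivative, its second derivative, ...
```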

Journal ArticleDOI
TL;DR: An alternative method, based on substitution sampling, for random variate generation from D-distributions is given, together with an algorithm for random variate generation from SD-distributions.
Abstract: Laud et al. (1993) describe a method for random variate generation from D-distributions. In this paper an alternative method using substitution sampling is given. An algorithm for the random variate generation from SD-distributions is also given.

Journal ArticleDOI
TL;DR: A self-validating numerical method based on interval analysis for the computation of central and non-central F probabilities and percentiles is reported and can be adapted to approximate the probabilities and percentiles for other commonly used distribution functions.
Abstract: A self-validating numerical method based on interval analysis for the computation of central and non-central F probabilities and percentiles is reported. The major advantage of this approach is that there are guaranteed error bounds associated with the computed values (or intervals), i.e. the computed values satisfy the user-specified accuracy requirements. The methodology reported in this paper can be adapted to approximate the probabilities and percentiles for other commonly used distribution functions.

Journal ArticleDOI
TL;DR: A review of a number of data models for aggregate statistical data which have appeared in the computer science literature in the last ten years is given.
Abstract: The paper gives a review of a number of data models for aggregate statistical data which have appeared in the computer science literature in the last ten years. After a brief introduction to the data model in general, the fundamental concepts of statistical data are introduced. These are called statistical objects because they are complex data structures (vectors, matrices, relations, time series, etc) which may have different possible representations (e.g. tables, relations, vectors, pie-charts, bar-charts, graphs, and so on). For this reason a statistical object is defined by two different types of attribute (a summary attribute, with its own summary type and with its own instances, called summary data, and the set of category attributes, which describe the summary attribute). Some conceptual models of statistical data (CSM, SDM4S), some semantic models of statistical data (SCM, SAM*, OSAM*), and some graphical models of statistical data (SUBJECT, GRASS, STORM) are also discussed.

Journal ArticleDOI
TL;DR: In this paper, a simple, automatic and adaptive bivariate density estimator is proposed based on the estimation of marginal and conditional densities, and guidance to practical application of the method is given.
Abstract: The standard approach to non-parametric bivariate density estimation is to use a kernel density estimator. Practical performance of this estimator is hindered by the fact that the estimator is not adaptive (in the sense that the level of smoothing is not sensitive to local properties of the density). In this paper a simple, automatic and adaptive bivariate density estimator is proposed based on the estimation of marginal and conditional densities. Asymptotic properties of the estimator are examined, and guidance to practical application of the method is given. Application to two examples illustrates the usefulness of the estimator as an exploratory tool, particularly in situations where the local behaviour of the density varies widely. The proposed estimator is also appropriate for use as a ‘pilot’ estimate for an adaptive kernel estimate, since it is relatively inexpensive to calculate.
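The general idea of building the bivariate estimate from a marginal and a conditional density can be sketched with Gaussian kernels, as below; this is only a rough illustration of the decomposition f(x, y) = f(x) f(y | x) with fixed, user-chosen bandwidths, not the authors' adaptive estimator.

```python
import numpy as np

def marginal_conditional_density(x, y, hx, hy, grid_x, grid_y):
    """Bivariate density estimate as marginal(x) * conditional(y | x),
    both parts built from Gaussian kernels."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    dens = np.empty((len(grid_x), len(grid_y)))
    for i, gx in enumerate(grid_x):
        wx = phi((gx - x) / hx)               # kernel weights in the x direction
        f_x = wx.mean() / hx                  # marginal estimate at gx
        w = wx / wx.sum()                     # weights for the conditional part
        for j, gy in enumerate(grid_y):
            dens[i, j] = f_x * np.sum(w * phi((gy - y) / hy)) / hy
    return dens
```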

Journal ArticleDOI
TL;DR: In this article, a methodology for reducing the bias inherent in density estimation can be developed for models with an additive error component, and an illustrative example involving a stochastic model of molecular fragmentation and measurement is given.
Abstract: Some statistical models defined in terms of a generating stochastic mechanism have intractable distribution theory, which renders parameter estimation difficult. However, a Monte Carlo estimate of the log-likelihood surface for such a model can be obtained via computation of nonparametric density estimates from simulated realizations of the model. Unfortunately, the bias inherent in density estimation can cause bias in the resulting log-likelihood estimate that alters the location of its maximizer. In this paper a methodology for radically reducing this bias is developed for models with an additive error component. An illustrative example involving a stochastic model of molecular fragmentation and measurement is given.
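The naive simulated log-likelihood that the paper starts from can be written in a few lines: simulate realizations from the model at a given parameter value, form a kernel density estimate, and sum its log over the observed data. The bias-reduction machinery for additive-error models is not reproduced here, and `simulate` is a hypothetical user-supplied model simulator.

```python
import numpy as np
from scipy.stats import gaussian_kde

def simulated_loglik(theta, data, simulate, n_sim=5000, rng=None):
    """Naive Monte Carlo log-likelihood via a kernel density estimate of
    simulated realizations; biased because the KDE is biased."""
    rng = np.random.default_rng() if rng is None else rng
    sims = simulate(theta, n_sim, rng)        # draw n_sim realizations at theta
    kde = gaussian_kde(sims)
    return np.sum(np.log(kde(np.asarray(data))))
```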

Journal ArticleDOI
TL;DR: Diagnostic biplots on their own do not seem to provide a sufficient means for model selection for genotype by environment tables, but in combination with other methods they certainly can provide extra insight into the structure of the data.
Abstract: Popular rank-2 and rank-3 models for two-way tables have geometrical properties which can be used as diagnostic keys in screening for an appropriate model. Row and column levels of two-way tables are represented by points in two or three dimensional space, whereupon collinearity and coplanarity of row and column points provide diagnostic keys for informal model choice. Coordinates are obtained from a factorization of the two-way table Y in the matrix product UV T. The rows of U then contain row-point coordinates and the rows of V column-point coordinates. Illustrations of applications of diagnostic biplots in the literature were restricted to data from chemistry and physics with little or no noise. In plant breeding, two-way tables containing substantial amounts of noise regularly arise in the form of genotype by environment tables. To investigate the usefulness of diagnostic biplots for model screening for genotype by environment tables, data tables were generated from a range of two-way models under the addition of various amounts of noise. Chances for correct diagnosis of the generating model depended on the type of model. Diagnostic biplots on their own do not seem to provide a sufficient means for model selection for genotype by environment tables, but in combination with other methods they certainly can provide extra insight into the structure of the data.
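The biplot coordinates can be obtained from a singular value decomposition; a minimal sketch for the rank-2 case is given below, splitting the singular values symmetrically between row and column points (one common convention among several). Diagnosis then rests on visual inspection of (near-)collinearity of the plotted points.

```python
import numpy as np

def rank2_biplot_coords(Y):
    """Rank-2 factorisation Y ~ U V^T via the SVD; returns row- and
    column-point coordinates for a diagnostic biplot."""
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    row_coords = U[:, :2] * np.sqrt(s[:2])    # one point per row of Y
    col_coords = Vt[:2].T * np.sqrt(s[:2])    # one point per column of Y
    return row_coords, col_coords
```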

Journal ArticleDOI
TL;DR: The new heuristic and relaxed heuristic algorithms proposed in this paper are shown to find computationally more efficient elimination orderings than previously proposed heuristic algorithms.
Abstract: The computational cost, in both storage requirements and calculation, of performing an elimination ordering is considered as a function of the order in which the vertices of a graph are eliminated. Several different heuristic and relaxed heuristic algorithms for finding low cost elimination orderings are described and compared. The new heuristic and relaxed heuristic algorithms proposed in this paper are shown to find computationally more efficient elimination orderings than previously proposed heuristic algorithms.
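As a point of reference, the classic greedy minimum-degree heuristic (a standard baseline in this literature, not necessarily one of the new algorithms proposed in the paper) can be sketched as follows; eliminating a vertex adds fill edges among its remaining neighbours, which is what drives the cost being minimized.

```python
def min_degree_ordering(adj):
    """Greedy minimum-degree elimination ordering.
    adj maps each vertex to the set of its neighbours."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}    # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))        # vertex of minimum degree
        nbrs = adj.pop(v)
        for u in nbrs:                                 # connect remaining neighbours
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
        order.append(v)
    return order

# e.g. min_degree_ordering({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}})
```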

Journal ArticleDOI
TL;DR: A general idea based on variate transformations, which can be tailored to standard black-box sampling methods and increases the efficiency of the Gibbs sampler, is presented, together with a simple technique to assess convergence and illustrative examples.
Abstract: In the non-conjugate Gibbs sampler, the required sampling from the full conditional densities needs the adoption of black-box sampling methods. Recent suggestions include rejection sampling, adaptive rejection sampling, generalized ratio of uniforms, and the Griddy-Gibbs sampler. This paper describes a general idea based on variate transformations which can be tailored in all the above methods and increase the Gibbs sampler efficiency. Moreover, a simple technique to assess convergence is suggested and illustrative examples are presented.

Journal ArticleDOI
TL;DR: An algorithm is presented for simulating disease data in pedigrees, incorporating variable age at onset and genetic and environmental effects, and is computationally efficient, making multi-dataset simulation studies feasible.
Abstract: The field of genetic epidemiology is growing rapidly with the realization that many important diseases are influenced by both genetic and environmental factors. For this reason, pedigree data are becoming increasingly valuable as a means of studying patterns of disease occurrence. Analysis of pedigree data is complicated by the lack of independence among family members and by the non-random sampling schemes used to ascertain families. An additional complicating factor is the variability in age at disease onset from one person to another. In developing statistical methods for analysing pedigree data, analytic results are often intractable, making simulation studies imperative for assessing the performance of proposed methods and estimators. In this paper, an algorithm is presented for simulating disease data in pedigrees, incorporating variable age at onset and genetic and environmental effects. Computational formulas are developed in the context of a proportional hazards model and assuming single ascertainment of families, but the methods can be easily generalized to alternative models. The algorithm is computationally efficient, making multi-dataset simulation studies feasible. Numerical examples are provided to demonstrate the methods.

Journal ArticleDOI
TL;DR: The types of metadata encountered and the problems associated with dealing with them are discussed, and an alternative approach based on textual markup rather than, for example, the relational model is described.
Abstract: With many types of scientific data, the amount of descriptive and qualifying information associated with the data values is quite variable and potentially large compared with the number of actual data values. This problem has been found to be particularly acute when dealing with data about the nutrient composition of foods, and a system—based on textual markup rather than, for example, the relational model—has been developed to deal with it. This paper discusses the types of metadata encountered and the problems associated with dealing with them, and then describes this alternative approach. The approach described has been installed in several locations around the world, and is in preliminary use as a tool for interchanging data among different databases as well as local database management.

Journal ArticleDOI
TL;DR: The discussion has shown, I fear, that the formulation of linear models, one of the basic tools of the authors' trade, is in an unsatisfactory state; also that the argument is not just one between Nelder and the rest of the world.
Abstract: The discussion has shown, I fear, that the formulation of linear models, one of the basic tools of our trade, is in an unsatisfactory state; also that the argument is not just one between Nelder and the rest of the world. Why the sorry state? I believe that the problem is that statistical theory is driven too often by mathematics rather than by the requirements of scientific inference. None of the hypotheses I have described as uninteresting are mathematically wrong; rather I contend that they do not correspond to any inferential quantities of interest. Searle points out that in the first example of Section 5 the original authors did not specify a nested structure of the type I discussed. However, without such a prior nested structure the hypotheses being discussed are, to me, pointless. They are mathematically specifiable but do not correspond to anything of inferential interest. Even with the prior structure two of the three remain uninteresting. The contributions of Aitkin and van Eeuwijk show what a mess the applied literature is in, as the result of some of the confusions I discussed. This confusion is not something that statisticians can or should ignore.


Journal ArticleDOI
TL;DR: This paper considers data compression through the use of wavelet analysis followed by a thresholding of small coefficients in the resulting multiresolution decomposition, together with an empirical modification that shows promise in better preserving certain structures in the particular sound data considered.
Abstract: In this paper we consider data on underwater sounds of differing types. Our objective is to filter background noise and achieve an acceptable level of reduction in the raw data, whilst at the same time maintaining the main features of the original signal. In particular, we consider data compression through the use of wavelet analysis followed by a thresholding of small coefficients in the resulting multiresolution decomposition. Various methods to threshold the wavelet representation are discussed and compared using recordings of dolphin sounds. An empirical modification to one of them is also proposed which shows promise in better preserving certain structures in our particular sound data.
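A generic wavelet-thresholding compressor, using the universal threshold with a robust noise estimate from the finest-scale coefficients, is sketched below; it assumes the PyWavelets package is available and is not the empirical modification proposed in the paper.

```python
import numpy as np
import pywt   # PyWavelets, assumed available

def compress_signal(signal, wavelet="db4", level=5):
    """Soft-threshold small detail coefficients in a multiresolution
    decomposition and reconstruct the signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # robust noise scale estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))  # universal threshold
    new_coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                                for c in coeffs[1:]]
    return pywt.waverec(new_coeffs, wavelet)[:len(signal)]
```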